Databricks Certified Data Engineer Professional Practice Exam
Practice across all ten exam domains — from production-grade PySpark and SQL development to streaming, cost and performance optimization, security and compliance, governance, observability, and deployment with the Databricks CLI, REST API, and Asset Bundles. Get immediate feedback in Learn mode and a full 120-minute simulation in Exam mode. Start with a 24-hour free trial.
Exam at a glance
- Exam: Databricks Certified Data Engineer Professional
- Format: Multiple choice, proctored (online or test center)
- Scored questions: 59 (additional unscored items may appear)
- Time limit: 120 minutes
- Registration fee: $200 USD, plus applicable local taxes
- Prerequisites: None; related training highly recommended
- Recommended experience: Production data engineering on Databricks; the Associate first if newer
- Passing standard: Databricks does not publish a fixed numeric passing score
- Code languages: Primarily Python and SQL
- Validity: 2 years; recertify by taking the current exam
- Languages: English, Japanese, Portuguese (BR), Korean
Source: Databricks — Certified Data Engineer Professional · Exam Guide PDF (Nov 2025)
About this certification
The Data Engineer Professional is Databricks’ advanced data-engineering credential. It validates that you can build, optimize, and maintain production-grade solutions on the Databricks Data Intelligence Platform: designing secure, reliable, cost-effective ETL pipelines; processing complex data with Python and SQL; implementing streaming workloads; and applying best practices in schema management, observability, governance, and performance optimization. It assumes real command of Delta Lake, Unity Catalog, Auto Loader, Lakeflow Spark Declarative Pipelines, serverless compute, Lakeflow Jobs, and the Medallion architecture, plus DevOps and CI/CD with the Databricks CLI, REST API, and Asset Bundles.
The exam is hard and scenario-heavy: questions present production situations — a skewed streaming job, a CDC requirement, a cost-versus-latency trade-off, a compliance constraint — and ask for the best design, often with a Python or SQL snippet to read or reason about. It is meaningfully deeper than the Associate, which focuses on foundational pipelines. For foundational reading on advanced data engineering with Databricks, see the Data Engineering Learning Hub guide.
Exam domains and weights
The exam is divided into ten domains. Weights are taken directly from the official Databricks exam page; approximate question counts are derived from the 59 scored questions and rounded.
- Developing Code for Data Processing (22%): The largest domain. Writing production-grade PySpark and SQL for complex data processing, the core skill the whole exam leans on.
- Ingestion: Advanced ingestion with Auto Loader, streaming sources, and acquiring data from diverse systems into governed tables.
- Transformation & Quality (10%): Deduplication, CDC, quality checks, and reliable Silver/Gold transformations across batch and streaming.
- Data Sharing & Federation: Sharing data securely (e.g. Delta Sharing) and querying federated sources without copying data.
- Monitoring & Alerting (10%): Observability for production pipelines, including metrics, logs, system tables, and alerting on pipeline health.
- Cost & Performance Optimisation (13%): Tuning Spark and Delta, including diagnosing skew and shuffle, Liquid Clustering, predictive optimization, and right-sizing compute for cost.
- Security & Compliance (10%): PII handling, masking and anonymization, encryption, and consistent compliance across batch and streaming pipelines.
- Governance: Unity Catalog governance at scale, covering privileges, lineage, row/column controls, and managed vs. external tables.
- Debugging & Deploying (10%): DevOps and CI/CD, covering debugging production issues and deploying with the Databricks CLI, REST API, and Asset Bundles across environments.
- Data Modelling: Modeling for the lakehouse, covering Medallion layers, slowly changing dimensions (Type 1 vs. Type 2), and analytics-ready structures.
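As a rough illustration of how approximate question counts follow from domain weights, the sketch below multiplies each weight by the 59 scored questions and rounds. It uses only the weights this page states elsewhere (22%, 13%, and four domains at 10% each); the dictionary and function names are made up for the example, and the results are estimates, not official counts.

```python
# Weights are the ones this page states; the remaining four domains are not
# given individual weights here, so they are left out of the calculation.
SCORED_QUESTIONS = 59

stated_weights = {
    "Developing Code for Data Processing": 0.22,
    "Cost & Performance Optimisation": 0.13,
    "Transformation & Quality": 0.10,
    "Monitoring & Alerting": 0.10,
    "Security & Compliance": 0.10,
    "Debugging & Deploying": 0.10,
}


def approx_questions(weight: float, total: int = SCORED_QUESTIONS) -> int:
    """Round weight * total to the nearest whole question."""
    return round(weight * total)


approx_counts = {name: approx_questions(w) for name, w in stated_weights.items()}

for name, count in approx_counts.items():
    print(f"{name}: ~{count} questions")
```

Under these assumptions the six listed domains account for about 45 questions, which leaves roughly 14 of the 59 for the four domains whose weights are not stated here.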
Who this exam is for
This credential targets experienced data engineers, senior and lead engineers, and analytics or platform engineers who run production pipelines on Databricks. There are no formal prerequisites, so anyone can register, but in practice this exam assumes you have built real production systems and typically have around a year or more of hands-on Databricks experience. Deep fluency in both PySpark and SQL, plus solid command of Delta Lake, Unity Catalog, streaming, and deployment tooling, is effectively expected.
If you are newer to the platform or still learning the fundamentals, start with the Data Engineer Associate — it covers the foundational pipeline, ingestion, and governance concepts this exam builds on. Attempting the Professional without that base often leads to a difficult, expensive sitting. For role-by-role salary ranges and senior-engineer paths, see the Career Hub — Data Engineer role guide.
What this practice exam delivers
Learn mode
Answer one question at a time with the explanation revealed immediately — ideal for the code-development and optimization domains, where reasoning through a production scenario and a code snippet is the whole point.
Exam mode
59 questions against a 120-minute timer — the real exam format. Build the stamina and pacing the long, scenario-heavy Professional exam demands before test day.
Source-linked explanations
Every answer cites the Databricks documentation it derives from — Structured Streaming, Delta Lake, Unity Catalog, Asset Bundles — so you can verify the reasoning and dig deeper.
Score by exam domain
Results break down across all ten domains, so practice tells you exactly which area — code, optimization, security, deployment, modeling — to study next.
Sample practice questions
Ten free questions spanning the ten exam domains, each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.
A nightly batch job merges a new batch of records into an events Delta table, where event_id is a unique key. New records that share an event_id with an existing row should update it, and brand-new event_ids should be inserted. Which operation expresses this directly?
- A. INSERT INTO events SELECT * FROM new_events
- B. MERGE INTO events USING new_events ON events.event_id = new_events.event_id WHEN MATCHED THEN UPDATE … WHEN NOT MATCHED THEN INSERT …
- C. CREATE OR REPLACE TABLE events AS SELECT * FROM new_events
- D. DELETE FROM events; INSERT INTO events SELECT * FROM new_events
Show answer & explanation
Correct: B — MERGE. A Delta MERGE on the event_id key performs an upsert: matched rows are updated and unmatched rows inserted, exactly the described behavior, atomically.
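Why not the others: plain INSERT (A) appends rows, so matching keys are duplicated instead of updated; CREATE OR REPLACE (C) overwrites the table with only the new batch, dropping every existing row that is not in it; DELETE-then-INSERT (D) runs as two separate statements rather than one atomic operation, and it also loses the rows that are not in the new batch. MERGE is the upsert primitive.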
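To make the difference between an upsert and a plain append concrete, here is a small plain-Python model of the two behaviors. It is not the Delta Lake API: the table is represented as a dictionary keyed by event_id, and the function names and sample rows are invented for illustration.

```python
def merge_upsert(target: dict, source_rows: list, key: str = "event_id") -> dict:
    """Return a new table: matched keys are updated, unmatched keys inserted."""
    merged = dict(target)          # leave the original table untouched
    for row in source_rows:
        # Same assignment covers both branches of the upsert:
        # an existing key is overwritten, a new key is added.
        merged[row[key]] = row
    return merged


def plain_insert(target_rows: list, source_rows: list) -> list:
    """Model of INSERT INTO: rows are appended with no key check."""
    return target_rows + source_rows


events = {
    1: {"event_id": 1, "status": "open"},
    2: {"event_id": 2, "status": "open"},
}
# Assumes each event_id appears at most once in the incoming batch.
new_events = [
    {"event_id": 2, "status": "closed"},
    {"event_id": 3, "status": "open"},
]

merged = merge_upsert(events, new_events)
appended = plain_insert(list(events.values()), new_events)

print(len(merged), len(appended))
```

In this model the upsert ends with three rows and event 2 updated, while the append ends with four rows and two copies of event 2, which is the duplication the explanation describes for option A.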
Source: Databricks — Delta MERGE →
A long-running nightly batch ETL processes very large JSON volumes into Delta. Performance and reliable completion are the top priorities, with cost a secondary concern. Which compute choice fits best?
- A. A job cluster that autoscales across multiple workers for the duration of the run
- B. An all-purpose cluster kept always-on for low-latency startup
- C. A single-node cluster with minimal workers to cut cost
- D. A high-concurrency cluster designed for interactive SQL
Show answer & explanation
Correct: A — an autoscaling job cluster. Job clusters spin up for the run and tear down after, and autoscaling adds workers to handle the large volume reliably and quickly — the right balance of performance and cost for scheduled batch ETL.
Why not the others: an always-on all-purpose cluster (B) wastes money between runs and is meant for interactive work; a single small node (C) sacrifices the performance and reliability the scenario prioritizes; a high-concurrency cluster (D) targets many interactive SQL users, not one heavy batch job.
Source: Databricks — compute & cluster types → Further reading: PowerKram — Cost & Performance Optimization →
Predictive Optimization is enabled by default for Unity Catalog managed tables to keep Delta tables performant. Which two maintenance operations does it run on your behalf? (Choose two.)
- A. OPTIMIZE (compaction)
- B. PARTITION BY
- C. BUCKETING
- D. VACUUM
Show answer & explanation
Correct: A and D — OPTIMIZE and VACUUM. Predictive Optimization automatically runs file compaction (OPTIMIZE) and stale-file cleanup (VACUUM) on managed tables, removing the need to schedule them manually.
Why not the others: PARTITION BY (B) and BUCKETING (C) are physical layout choices made at table design time, not automated maintenance operations. Predictive Optimization handles OPTIMIZE and VACUUM.
Source: Databricks — Predictive Optimization →
A streaming pipeline reads a source that occasionally redelivers the same record. The engineer must drop duplicates within the stream based on a unique key while keeping state bounded over time. Which approach fits best?
- A. Collect the full stream to the driver and call distinct() in memory
- B. dropDuplicates on the key combined with a watermark to bound state
- C. Disable checkpointing so old keys are forgotten automatically
- D. Write all records and deduplicate manually once a year
Show answer & explanation
Correct: B. Streaming deduplication uses dropDuplicates on the key, and a watermark bounds how long state is retained so it does not grow without limit — the standard pattern for exactly-once-style dedup in Structured Streaming.
Why not the others: collecting to the driver (A) does not scale and breaks streaming; disabling checkpointing (C) loses progress and fault tolerance rather than managing state; deferring dedup to a yearly batch (D) fails the requirement to drop duplicates in-stream. Watermarked dropDuplicates is correct.
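The idea of bounding deduplication state with a watermark can be sketched in plain Python. This is a simplified model, not Spark's implementation: it advances the watermark on every record (Structured Streaming does so between micro-batches), and the class name, field names, and integer timestamps are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkedDeduplicator:
    """Keeps one record per key and forgets keys older than the watermark."""

    delay: int                                  # allowed lateness, in seconds
    max_event_time: int = 0
    seen: dict = field(default_factory=dict)    # key -> event time first seen

    def process(self, key: str, event_time: int) -> bool:
        """Return True if the record is emitted, False if it is dropped."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay
        # Evict state for keys that have fallen behind the watermark.
        self.seen = {k: t for k, t in self.seen.items() if t >= watermark}
        if event_time < watermark or key in self.seen:
            return False                        # too late, or a duplicate
        self.seen[key] = event_time
        return True


dedup = WatermarkedDeduplicator(delay=10)
results = [
    dedup.process("a", 100),   # first sighting of "a"
    dedup.process("a", 101),   # redelivered "a" inside the window
    dedup.process("b", 105),   # first sighting of "b"
    dedup.process("c", 200),   # moves the watermark past "a" and "b"
]
print(results, sorted(dedup.seen))
```

The point of the model is the eviction step: once event time moves far enough ahead, old keys leave the state, so memory tracks the width of the lateness window rather than the full history of the stream.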
Source: Databricks — streaming watermarks → Further reading: PowerKram — Structured Streaming & Quality →
PII (email, phone, IP) arrives in both daily batch files and a real-time stream. It must be masked before storage, handled consistently across both pipelines, and remain auditable and reproducible. Which design meets all requirements?
- A. Store PII unmasked in Bronze for lineage, then mask only in Gold reporting tables
- B. Apply a shared, version-controlled masking/anonymization transformation in both batch and streaming paths before writing, governed and audited via Unity Catalog
- C. Mask only the batch path, since streaming data is transient
- D. Email a masked extract to analysts and keep raw PII everywhere else
Show answer & explanation
Correct: B. A single shared masking transformation applied in both pipelines before write guarantees consistent handling; version control makes it reproducible; and Unity Catalog governance/lineage makes it auditable — satisfying every stated requirement.
Why not the others: storing unmasked PII in Bronze (A) violates "masked before storage"; masking only batch (C) leaves streaming non-compliant; emailing extracts while keeping raw PII (D) fails masking and auditability. Consistent, governed, version-controlled masking is the compliant design.
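One way to picture a shared masking transformation is a single function that both the batch and the streaming path import and call before writing. The sketch below uses keyed hashing (HMAC-SHA-256) from the Python standard library as one possible pseudonymization technique; it is not a Unity Catalog feature, and the field names, version tag, and demo key are hypothetical.

```python
import hashlib
import hmac

MASKING_VERSION = "v1"   # hypothetical version tag kept in source control


def mask_pii(value: str, secret_key: bytes) -> str:
    """Replace a PII value with a keyed, deterministic token."""
    normalized = value.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return f"{MASKING_VERSION}:{digest.hexdigest()}"


def mask_record(record: dict, pii_fields: tuple, secret_key: bytes) -> dict:
    """Return a copy of the record with every listed PII field masked."""
    return {
        name: mask_pii(val, secret_key)
        if name in pii_fields and val is not None
        else val
        for name, val in record.items()
    }


PII_FIELDS = ("email", "phone", "ip")
KEY = b"demo-key-only"   # placeholder; a real key would come from a secret store

# The same logical record arriving through two different paths.
batch_row = {"id": 1, "email": "Ana@Example.com", "phone": "555-0100", "ip": "203.0.113.7"}
stream_row = {"id": 1, "email": "ana@example.com ", "phone": "555-0100", "ip": "203.0.113.7"}

masked_batch = mask_record(batch_row, PII_FIELDS, KEY)
masked_stream = mask_record(stream_row, PII_FIELDS, KEY)

print(masked_batch == masked_stream)
```

Because both paths call the same function with the same key and normalization, the same input yields the same token in either pipeline, and the version prefix records which revision of the logic produced it.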
Source: Databricks — column masks & row filters → Further reading: PowerKram — Data Security & Compliance →
A team needs a modular, version-controlled way to package and promote Lakeflow Jobs and pipelines across dev, test, and prod through automated CI/CD. Which Databricks tooling supports this directly?
- A. Manually exporting and importing notebooks per environment
- B. Databricks Asset Bundles defined in code, versioned in Git, validated and deployed via the Databricks CLI in CI/CD
- C. Relying on notebook revision history as version control
- D. Copying jobs through the workspace UI by hand
Show answer & explanation
Correct: B — Asset Bundles + Databricks CLI. Asset Bundles define jobs, pipelines, and config as code, versioned in Git and deployed through the CLI in automated CI/CD — the supported, repeatable promotion mechanism across environments.
Why not the others: manual export/import (A) and UI copying (D) are error-prone and unversioned; notebook revision history (C) is not real packaging or environment promotion. Bundles are the deployment unit the exam expects.
Source: Databricks — Asset Bundles → Further reading: PowerKram — CI/CD & Deployment →
An engineer must track historical query and job behavior across the workspace — who ran what, how long it took, and how costs trend — to build alerting on regressions. Which Databricks feature is the governed, queryable source for this?
- A. Screenshots of the Spark UI saved manually
- B. System tables (the system catalog) in Unity Catalog
- C. Printing logs to notebook output
- D. Asking each engineer to self-report
Show answer & explanation
Correct: B — system tables. Unity Catalog system tables expose governed, queryable operational data (billing/usage, query history, job runs, lineage), which you can query with SQL and build alerts on — the right foundation for monitoring and cost-trend analysis.
Why not the others: manual Spark UI screenshots (A) and notebook log prints (C) are point-in-time and not queryable at scale; self-reporting (D) is not a data source. System tables are the governed observability layer.
Source: Databricks — system tables →
For auditing, a team must retain the full history of how each customer’s address changed over time. An engineer argues a Type 1 dimension plus Delta time travel is enough; another prefers Type 2. Which fact is most critical to the decision?
- A. Type 1 with time travel is the simplest and always sufficient for long-term history
- B. Delta time travel is not designed as a long-term versioning/audit solution — retention and cost/latency make it unsuitable, so a Type 2 dimension is the durable way to keep full history
- C. Type 2 dimensions cannot represent historical changes
- D. Time travel retains all versions forever at no cost
Show answer & explanation
Correct: B. Delta time travel is bounded by retention settings and is not meant for indefinite auditing; a Type 2 slowly changing dimension explicitly stores every historical version with effective dates, making it the durable choice for long-term history.
Why not the others: Type 1 overwrites and relies on time travel that expires (A); Type 2 is precisely what represents historical change (C); and time travel does not retain everything forever for free (D). The critical fact is time travel’s unsuitability for long-term versioning.
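The way a Type 2 dimension keeps every historical version can be sketched in plain Python: instead of overwriting the current row, the change closes it out with an end date and appends a new current row. This is an illustrative model only; the column names (start_date, end_date, is_current) and the helper function are assumptions, not a prescribed schema.

```python
from datetime import date


def apply_scd2(dimension: list, customer_id: int, new_address: str, effective: date) -> list:
    """Return a new Type 2 dimension with the address change applied."""
    updated = []
    changed = True
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                changed = False            # same address, nothing to record
                updated.append(row)
            else:
                # Close out the current version instead of overwriting it.
                updated.append({**row, "end_date": effective, "is_current": False})
        else:
            updated.append(row)
    if changed:
        updated.append({
            "customer_id": customer_id,
            "address": new_address,
            "start_date": effective,
            "end_date": None,
            "is_current": True,
        })
    return updated


dim = [{
    "customer_id": 7,
    "address": "1 Old Rd",
    "start_date": date(2023, 1, 1),
    "end_date": None,
    "is_current": True,
}]
dim = apply_scd2(dim, 7, "9 New Ave", date(2024, 6, 1))

for row in dim:
    print(row["address"], row["start_date"], row["end_date"], row["is_current"])
```

After the change the old address is still present with its effective date range, so the history lives in the table itself and does not depend on how long older table versions are retained.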
Source: Databricks — Delta time travel & history → Further reading: PowerKram — Dimensional Modeling on the Lakehouse →
An organization must share live Delta data with an external partner who uses a different platform, without copying the data or building a custom API. Which Databricks capability is designed for this?
- A. Emailing nightly CSV exports
- B. Delta Sharing
- C. Granting the partner full workspace admin access
- D. Mounting the partner’s laptop to the cluster
Show answer & explanation
Correct: B — Delta Sharing. Delta Sharing is the open protocol for securely sharing live data with recipients on any platform without copying it or building bespoke APIs — exactly the cross-platform, zero-copy requirement.
Why not the others: CSV exports (A) are stale, copied snapshots; granting workspace admin (C) is a severe over-privilege and not a sharing mechanism; mounting a laptop (D) is nonsensical. Delta Sharing is purpose-built for governed external sharing.
Source: Databricks — Delta Sharing →
An engineer must restrict an analytics group so it can read a customers table but cannot see the rows for EU customers, enforced centrally and reusably across multiple tables. Which Unity Catalog capability fits best?
- A. Make a manual filtered copy of the table for the group
- B. A row filter / ABAC policy in Unity Catalog applied to the relevant tables
- C. Tell the group not to query EU rows
- D. Revoke all access and email results on request
Show answer & explanation
Correct: B. Unity Catalog row filters — and ABAC policies for centrally managed, reusable enforcement — restrict which rows a group sees at query time across tables, without duplicating data. That matches "enforced centrally and reusably."
Why not the others: a manual filtered copy (A) duplicates data and drifts; relying on the group not to query rows (C) is not a control; revoking access and emailing results (D) defeats self-service and governance. Row filters/ABAC are the governed mechanism.
Source: Databricks — Unity Catalog ABAC →
Keep going: Learning & Career resources
This certification pays off fastest when it sits on top of real platform skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.
Deep dive: exam structure, scoring, study path & recertification
Exam structure and how it’s scored
The exam delivers 59 scored multiple-choice questions in 120 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are advanced and scenario-heavy, frequently presenting a production situation or a Python/SQL snippet and asking for the best design or fix; code is primarily in Python and SQL. Read the exam-format deep dive →
What the ten domains actually test, and what changed
Developing Code for Data Processing (22%) is by far the largest domain, followed by Cost & Performance Optimisation (13%); Transformation/Quality, Monitoring, Security & Compliance, and Debugging & Deploying sit at 10% each, with Ingestion, Governance, Data Modelling, and Data Sharing & Federation rounding out the rest. Older materials describe "6 domains," "60 questions," or a fixed "70%" pass mark; those reflect a previous version. The current exam emphasizes Lakeflow Spark Declarative Pipelines, serverless compute, streaming, and deployment with the CLI, REST API, and Asset Bundles. Read the advanced Databricks toolchain guide →
Realistic study path
Most candidates need real production experience plus focused review. A workable path: earn (or be solid on) the Data Engineer Associate first, then take Databricks Academy’s Advanced Data Engineering content and the streaming, performance-optimization, data-privacy, and automated-deployment modules. Build an end-to-end production-style project — streaming ingestion with Auto Loader, CDC and dedup with MERGE and watermarks, Medallion modeling with Type 2 dimensions, governance and masking in Unity Catalog, observability via system tables, and deployment with Asset Bundles — and practice reading the Spark UI to diagnose skew and spill. Read the study plan →
Cost, scheduling, and delivery
The registration fee is $200 USD plus applicable local taxes, and there are no free retakes. The exam is proctored and can be taken online or at a test center, in English, Japanese, Portuguese (Brazil), and Korean. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official page before booking. Databricks’ official certification page →
Recertification
The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires — there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track platform changes (the current version reorganized the domains and emphasizes Lakeflow, serverless, streaming, and Asset Bundles), recertifying also keeps your validated skills current. Read the recertification guide →
Career outlook
The Professional credential is a senior-engineer differentiator: it signals you can design secure, cost-effective, scalable pipelines, troubleshoot complex production issues, and apply governance and compliance at scale. It maps to senior/lead data engineer, data architect, and principal engineer roles, which command premium compensation. The credential is most valuable paired with demonstrable production work. For salary ranges and role-specific paths, see the Career Hub. Career Hub — Data Engineer →
Frequently asked questions
Is the Databricks Data Engineer Professional exam hard?
Yes — it is an advanced exam and noticeably harder than the Associate. The 59 questions are scenario-heavy and assume real production experience: you will reason through streaming, CDC, optimization, security, and deployment situations, often with a Python or SQL snippet to interpret. Developing Code for Data Processing alone is 22% of the exam, so deep PySpark and SQL fluency is essential, not optional.
What is the passing score?
Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance across all questions. You may see "70%" (or "closer to 80%") quoted on third-party sites, but those figures are not confirmed by Databricks, so treat them as unofficial and aim to be strong across every domain rather than targeting a specific percentage.
How many questions and domains are on the exam? I’ve seen different numbers.
The current exam has 59 scored multiple-choice questions in 120 minutes, organized into ten domains, with Developing Code for Data Processing (22%) and Cost & Performance Optimisation (13%) the heaviest. Older materials cite "6 domains" and "60 questions," and some report "65"; those reflect a previous version. Study against the current ten-domain structure on the official exam page and in the latest exam guide.
Should I take the Associate first?
For most people, yes. The Professional assumes you already know the foundational pipeline, ingestion, and governance concepts covered by the Data Engineer Associate, and it builds well beyond them. If you have substantial production Databricks experience (roughly a year or more) you can attempt the Professional directly, but you should still review Associate-level topics to avoid gaps. If you are newer to the platform, start with the Associate.
What topics carry the most weight?
Prioritize Developing Code for Data Processing (22%) — production PySpark and SQL — and Cost & Performance Optimisation (13%). After those, Transformation & Quality, Monitoring & Alerting, Security & Compliance, and Debugging & Deploying are 10% each. Make sure you can handle streaming with Auto Loader and watermarks, MERGE/CDC, Delta optimization, Unity Catalog governance and masking, observability via system tables, and deployment with Asset Bundles.
How does it differ from the Data Engineer Associate?
The Associate validates foundational pipeline building — ingestion, PySpark/SQL transformations, orchestration, basic CI/CD, and governance — in a 45-question, 90-minute exam. The Professional is a 59-question, 120-minute exam focused on production-grade design: advanced code, streaming, cost and performance optimization, security and compliance, observability, and deployment at scale. Choose the Professional once you operate real production pipelines and want a senior-level credential.
Start your free 24-hour practice trial
Full access to the question bank, both study modes, and domain-level scoring across all ten exam areas. No credit card required.