Databricks · Practice Exam · Updated for 2026

Databricks Certified Data Engineer Professional Practice Exam

Practice across all ten exam domains — from production-grade PySpark and SQL development to streaming, cost and performance optimization, security and compliance, governance, observability, and deployment with the Databricks CLI, REST API, and Asset Bundles. Get immediate feedback in Learn mode and a full 120-minute simulation in Exam mode. Start with a 24-hour free trial.

500+ practice questions
10 exam domains covered
2 study modes
24h free trial access

Exam at a glance

Exam: Databricks Certified Data Engineer Professional
Format: Multiple choice, proctored (online or test center)
Scored questions: 59 (additional unscored items may appear)
Time limit: 120 minutes
Registration fee: $200 USD, plus applicable local taxes
Prerequisites: None; related training highly recommended
Recommended experience: Production data engineering on Databricks; take the Associate first if you are newer to the platform
Passing standard: Databricks does not publish a fixed numeric passing score
Code languages: Primarily Python and SQL
Validity: 2 years; recertify by taking the current exam
Languages: English, Japanese, Portuguese (BR), Korean

Source: Databricks — Certified Data Engineer Professional · Exam Guide PDF (Nov 2025)

About this certification

The Data Engineer Professional is Databricks’ advanced data-engineering credential. It validates that you can build, optimize, and maintain production-grade solutions on the Databricks Data Intelligence Platform: designing secure, reliable, cost-effective ETL pipelines; processing complex data with Python and SQL; implementing streaming workloads; and applying best practices in schema management, observability, governance, and performance optimization. It assumes real command of Delta Lake, Unity Catalog, Auto Loader, Lakeflow Spark Declarative Pipelines, serverless compute, Lakeflow Jobs, and the Medallion architecture, plus DevOps and CI/CD with the Databricks CLI, REST API, and Asset Bundles.

The exam is hard and scenario-heavy: questions present production situations — a skewed streaming job, a CDC requirement, a cost-versus-latency trade-off, a compliance constraint — and ask for the best design, often with a Python or SQL snippet to read or reason about. It is meaningfully deeper than the Associate, which focuses on foundational pipelines. For foundational reading on advanced data engineering with Databricks, see the Data Engineering Learning Hub guide.

Exam domains and weights

The exam is divided into ten domains. Weights are taken directly from the official Databricks exam page and sum to 100%. Approximate question counts are each domain's share of the 59 scored questions, rounded to the nearest whole number, so they total 60 rather than 59.

Developing Code for Data Processing (Python & SQL)

The largest domain. Writing production-grade PySpark and SQL for complex data processing — the core skill the whole exam leans on.

22% · ~13 questions

Data Ingestion & Acquisition

Advanced ingestion — Auto Loader, streaming sources, and acquiring data from diverse systems into governed tables.

7% · ~4 questions

Data Transformation, Cleansing, and Quality

Deduplication, CDC, quality checks, and reliable Silver/Gold transformations across batch and streaming.

10% · ~6 questions

Data Sharing and Federation

Sharing data securely (e.g. Delta Sharing) and querying federated sources without copying data.

5% · ~3 questions

Monitoring and Alerting

Observability for production pipelines — metrics, logs, system tables, and alerting on pipeline health.

10% · ~6 questions

Cost & Performance Optimisation

Tuning Spark and Delta — diagnosing skew and shuffle, Liquid Clustering, predictive optimization, and right-sizing compute for cost.

13% · ~8 questions

Ensuring Data Security and Compliance

PII handling, masking and anonymization, encryption, and consistent compliance across batch and streaming pipelines.

10% · ~6 questions

Data Governance

Unity Catalog governance at scale — privileges, lineage, row/column controls, and managed vs. external tables.

7% · ~4 questions

Debugging and Deploying

DevOps and CI/CD — debugging production issues and deploying with the Databricks CLI, REST API, and Asset Bundles across environments.

10% · ~6 questions

Data Modelling

Modeling for the lakehouse — Medallion layers, slowly changing dimensions (Type 1 vs. Type 2), and analytics-ready structures.

6% · ~4 questions
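
The approximate question counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes simple rounding of each domain's share of the 59 scored questions; Databricks does not publish per-domain counts, so the function name and the rounding rule are illustrative only.

```python
# Domain weights (percent) as listed above; 59 scored questions per the exam guide.
WEIGHTS = {
    "Developing Code for Data Processing": 22,
    "Data Ingestion & Acquisition": 7,
    "Data Transformation, Cleansing, and Quality": 10,
    "Data Sharing and Federation": 5,
    "Monitoring and Alerting": 10,
    "Cost & Performance Optimisation": 13,
    "Ensuring Data Security and Compliance": 10,
    "Data Governance": 7,
    "Debugging and Deploying": 10,
    "Data Modelling": 6,
}
SCORED_QUESTIONS = 59


def approx_counts(weights, total):
    """Round each domain's share of `total` to the nearest whole question."""
    return {name: round(total * pct / 100) for name, pct in weights.items()}


counts = approx_counts(WEIGHTS, SCORED_QUESTIONS)
for name, n in counts.items():
    print(f"{WEIGHTS[name]:>3}%  ~{n:>2}  {name}")

print("sum of weights:", sum(WEIGHTS.values()))         # 100
print("sum of rounded counts:", sum(counts.values()))   # 60, one more than 59
```

Because each count is rounded independently, the total comes to 60, which is why the figures are best read as a guide to relative emphasis rather than an exact blueprint.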

Who this exam is for

This credential targets experienced data engineers, senior and lead engineers, and analytics or platform engineers who run production pipelines on Databricks. There are no formal prerequisites, so anyone can register, but in practice this exam assumes you have built real production systems and typically have around a year or more of hands-on Databricks experience. Deep fluency in both PySpark and SQL, plus solid command of Delta Lake, Unity Catalog, streaming, and deployment tooling, is effectively expected.

If you are newer to the platform or still learning the fundamentals, start with the Data Engineer Associate — it covers the foundational pipeline, ingestion, and governance concepts this exam builds on. Attempting the Professional without that base often leads to a difficult, expensive sitting. For role-by-role salary ranges and senior-engineer paths, see the Career Hub — Data Engineer role guide.

What this practice exam delivers

Learn mode

Answer one question at a time with the explanation revealed immediately — ideal for the code-development and optimization domains, where reasoning through a production scenario and a code snippet is the whole point.

Exam mode

59 questions against a 120-minute timer — the real exam format. Build the stamina and pacing the long, scenario-heavy Professional exam demands before test day.

Source-linked explanations

Every answer cites the Databricks documentation it derives from — Structured Streaming, Delta Lake, Unity Catalog, Asset Bundles — so you can verify the reasoning and dig deeper.

Score by exam domain

Results break down across all ten domains, so practice tells you exactly which area — code, optimization, security, deployment, modeling — to study next.

Sample practice questions

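Ten free questions covering nine of the ten exam domains (Cost & Performance Optimisation appears twice; Data Ingestion & Acquisition is not sampled here), each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.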

Question 1 · Developing Code for Data Processing

A nightly batch job merges a new batch of records into an events Delta table, where event_id is a unique key. New records that share an event_id with an existing row should update it, and brand-new event_ids should be inserted. Which operation expresses this directly?

INSERT INTO events SELECT * FROM new_events
MERGE INTO events USING new_events ON events.event_id = new_events.event_id WHEN MATCHED THEN UPDATE … WHEN NOT MATCHED THEN INSERT …
CREATE OR REPLACE TABLE events AS SELECT * FROM new_events
DELETE FROM events; INSERT INTO events SELECT * FROM new_events

Correct: B — MERGE. A Delta MERGE on the event_id key performs an upsert: matched rows are updated and unmatched rows inserted, exactly the described behavior, atomically.

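Why not the others: plain INSERT (A) appends a second row for every matching key instead of updating it; CREATE OR REPLACE (C) replaces the table's contents with only the new batch, so existing rows absent from it are lost; DELETE-then-INSERT (D) runs as two separate statements rather than one atomic operation and likewise drops every existing row that is not in the new batch. MERGE is the upsert primitive.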

Source: Databricks — Delta MERGE →
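
The difference between options A and B in Question 1 can be seen in a small plain-Python model. This is a sketch of the semantics only, not the Delta Lake API: the function names and sample rows are hypothetical, and the model assumes event_id is unique within the incoming batch.

```python
def insert_only(target, batch):
    """Model of INSERT INTO: append every incoming row, even if the key exists."""
    return target + batch


def merge_upsert(target, batch, key="event_id"):
    """Model of MERGE: update rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in target}
    for row in batch:
        merged[row[key]] = row  # matched: update; not matched: insert
    return list(merged.values())


# Hypothetical sample data for illustration.
events = [{"event_id": 1, "status": "open"}, {"event_id": 2, "status": "open"}]
new_events = [{"event_id": 2, "status": "closed"}, {"event_id": 3, "status": "open"}]

appended = insert_only(events, new_events)
upserted = merge_upsert(events, new_events)

print(len(appended))  # 4 rows: event_id 2 now appears twice
print(upserted)       # 3 rows: event_id 2 updated, event_id 3 inserted
```

The append path ends with two rows for event_id 2, while the upsert path keeps one row per key, which is the behavior the question asks for.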

Question 2 · Cost & Performance Optimisation

A long-running nightly batch ETL processes very large JSON volumes into Delta. Performance and reliable completion are the top priorities, with cost a secondary concern. Which compute choice fits best?

A job cluster that autoscales across multiple workers for the duration of the run
An all-purpose cluster kept always-on for low-latency startup
A single-node cluster (driver only, no workers) to cut cost
A high-concurrency cluster designed for interactive SQL

Correct: A — an autoscaling job cluster. Job clusters spin up for the run and tear down after, and autoscaling adds workers to handle the large volume reliably and quickly — the right balance of performance and cost for scheduled batch ETL.

Why not the others: an always-on all-purpose cluster (B) wastes money between runs and is meant for interactive work; a single small node (C) sacrifices the performance and reliability the scenario prioritizes; a high-concurrency cluster (D) targets many interactive SQL users, not one heavy batch job.

Source: Databricks — compute & cluster types → Further reading: PowerKram — Cost & Performance Optimization →

Question 3 · Cost & Performance Optimisation

Predictive Optimization is enabled by default for Unity Catalog managed tables to keep Delta tables performant. Which two maintenance operations does it run on your behalf? (Choose two.)

OPTIMIZE (compaction)
PARTITION BY
BUCKETING
VACUUM

Correct: A and D — OPTIMIZE and VACUUM. Predictive Optimization automatically runs file compaction (OPTIMIZE) and stale-file cleanup (VACUUM) on managed tables, removing the need to schedule them manually.

Why not the others: PARTITION BY (B) and BUCKETING (C) are physical layout choices made at table design time, not automated maintenance operations. Predictive Optimization handles OPTIMIZE and VACUUM.

Source: Databricks — Predictive Optimization →

Question 4 · Data Transformation, Cleansing, and Quality

A streaming pipeline reads a source that occasionally redelivers the same record. The engineer must drop duplicates within the stream based on a unique key while keeping state bounded over time. Which approach fits best?

Collect the full stream to the driver and call distinct() in memory
dropDuplicates on the key combined with a watermark to bound state
Disable checkpointing so old keys are forgotten automatically
Write all records and deduplicate manually once a year

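Correct: B. Streaming deduplication uses dropDuplicates on the key, and a watermark bounds how long state is retained so it does not grow without limit. For the watermark to expire state, the event-time column must be part of the deduplication columns, or the stream can use dropDuplicatesWithinWatermark (Spark 3.5+). This is the standard pattern for exactly-once-style dedup in Structured Streaming.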

Why not the others: collecting to the driver (A) does not scale and breaks streaming; disabling checkpointing (C) loses progress and fault tolerance rather than managing state; deferring dedup to a yearly batch (D) fails the requirement to drop duplicates in-stream. Watermarked dropDuplicates is correct.

Source: Databricks — streaming watermarks → Further reading: PowerKram — Structured Streaming & Quality →
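
How a watermark keeps deduplication state bounded, as described in Question 4, can be sketched in plain Python. This is a simplified model rather than Structured Streaming itself: it advances the watermark per record instead of per micro-batch, and the function name, integer timestamps, and sample events are all assumptions made for illustration.

```python
def dedup_with_watermark(events, delay):
    """Return first-seen events and the largest state size reached.

    `events` is an iterable of (key, event_time) pairs; keys older than
    (max event time seen - delay) are forgotten.
    """
    seen = {}              # key -> event time of first occurrence (the state)
    max_event_time = None
    emitted, max_state = [], 0
    for key, event_time in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        # Evict state the watermark has passed, so state stays bounded.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_time < watermark:
            continue       # too late: dropped rather than tracked
        if key not in seen:
            seen[key] = event_time
            emitted.append((key, event_time))
        max_state = max(max_state, len(seen))
    return emitted, max_state


# Hypothetical stream: "a" and "c" are each redelivered once.
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 20), ("d", 21), ("c", 22)]
emitted, max_state = dedup_with_watermark(stream, delay=5)
print(emitted)    # [('a', 1), ('b', 2), ('c', 20), ('d', 21)]
print(max_state)  # 2
```

Six records arrive with four distinct keys, yet the state never holds more than two keys at once, because the older keys are evicted once the watermark moves past them.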

Question 5 · Ensuring Data Security and Compliance

PII (email, phone, IP) arrives in both daily batch files and a real-time stream. It must be masked before storage, handled consistently across both pipelines, and remain auditable and reproducible. Which design meets all requirements?

Store PII unmasked in Bronze for lineage, then mask only in Gold reporting tables
Apply a shared, version-controlled masking/anonymization transformation in both batch and streaming paths before writing, governed and audited via Unity Catalog
Mask only the batch path, since streaming data is transient
Email a masked extract to analysts and keep raw PII everywhere else

Correct: B. A single shared masking transformation applied in both pipelines before write guarantees consistent handling; version control makes it reproducible; and Unity Catalog governance/lineage makes it auditable — satisfying every stated requirement.

Why not the others: storing unmasked PII in Bronze (A) violates "masked before storage"; masking only batch (C) leaves streaming non-compliant; emailing extracts while keeping raw PII (D) fails masking and auditability. Consistent, governed, version-controlled masking is the compliant design.

Source: Databricks — column masks & row filters → Further reading: PowerKram — Data Security & Compliance →
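
The "single shared transformation" idea in Question 5 can be illustrated with one masking function called from both paths. This is a sketch under stated assumptions: keyed hashing (HMAC-SHA256) is just one possible pseudonymization technique that the question does not prescribe, the field names come from the scenario, and the secret shown is a placeholder that would normally come from a secret store.

```python
import hashlib
import hmac


def mask_pii(record, secret, fields=("email", "phone", "ip")):
    """Return a copy of `record` with the PII fields replaced by keyed hashes."""
    masked = dict(record)
    for field in fields:
        value = masked.get(field)
        if value is not None:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
    return masked


def batch_path(records, secret):
    """Batch pipeline: mask every record before it is written."""
    return [mask_pii(r, secret) for r in records]


def streaming_path(record_iter, secret):
    """Streaming pipeline: mask each record as it arrives, using the same function."""
    for r in record_iter:
        yield mask_pii(r, secret)


SECRET = b"demo-only-secret"  # hypothetical placeholder, not a real key
row = {"user_id": 7, "email": "a@example.com", "phone": "555-0100", "ip": "203.0.113.9"}

batch_out = batch_path([row], SECRET)
stream_out = list(streaming_path(iter([row]), SECRET))
print(batch_out == stream_out)  # True: both paths mask identically
```

Because both paths import the same version-controlled function, a change to the masking rule lands in batch and streaming at once, which is what makes the handling consistent and reproducible.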

Question 6 · Debugging and Deploying

A team needs a modular, version-controlled way to package and promote Lakeflow Jobs and pipelines across dev, test, and prod through automated CI/CD. Which Databricks tooling supports this directly?

Manually exporting and importing notebooks per environment
Databricks Asset Bundles defined in code, versioned in Git, validated and deployed via the Databricks CLI in CI/CD
Relying on notebook revision history as version control
Copying jobs through the workspace UI by hand

Correct: B — Asset Bundles + Databricks CLI. Asset Bundles define jobs, pipelines, and config as code, versioned in Git and deployed through the CLI in automated CI/CD — the supported, repeatable promotion mechanism across environments.

Why not the others: manual export/import (A) and UI copying (D) are error-prone and unversioned; notebook revision history (C) is not real packaging or environment promotion. Bundles are the deployment unit the exam expects.

Source: Databricks — Asset Bundles → Further reading: PowerKram — CI/CD & Deployment →

Question 7 · Monitoring and Alerting

An engineer must track historical query and job behavior across the workspace — who ran what, how long it took, and how costs trend — to build alerting on regressions. Which Databricks feature is the governed, queryable source for this?

Screenshots of the Spark UI saved manually
System tables (the system catalog) in Unity Catalog
Printing logs to notebook output
Asking each engineer to self-report

Correct: B — system tables. Unity Catalog system tables expose governed, queryable operational data (billing/usage, query history, job runs, lineage), which you can query with SQL and build alerts on — the right foundation for monitoring and cost-trend analysis.

Why not the others: manual Spark UI screenshots (A) and notebook log prints (C) are point-in-time and not queryable at scale; self-reporting (D) is not a data source. System tables are the governed observability layer.

Source: Databricks — system tables →

Question 8 · Data Modelling

For auditing, a team must retain the full history of how each customer’s address changed over time. An engineer argues a Type 1 dimension plus Delta time travel is enough; another prefers Type 2. Which fact is most critical to the decision?

Type 1 with time travel is the simplest and always sufficient for long-term history
Delta time travel is not designed as a long-term versioning/audit solution — retention and cost/latency make it unsuitable, so a Type 2 dimension is the durable way to keep full history
Type 2 dimensions cannot represent historical changes
Time travel retains all versions forever at no cost

Correct: B. Delta time travel is bounded by retention settings and is not meant for indefinite auditing; a Type 2 slowly changing dimension explicitly stores every historical version with effective dates, making it the durable choice for long-term history.

Why not the others: Type 1 overwrites and relies on time travel that expires (A); Type 2 is precisely what represents historical change (C); and time travel does not retain everything forever for free (D). The critical fact is time travel’s unsuitability for long-term versioning.

Source: Databricks — Delta time travel & history → Further reading: PowerKram — Dimensional Modeling on the Lakehouse →
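
The Type 1 versus Type 2 contrast in Question 8 is easier to see with rows in hand. The following plain-Python sketch models the two behaviors; the column names (start_date, end_date, is_current) are a common convention assumed here, not something the question specifies, and a real implementation would typically use MERGE on a Delta table.

```python
from datetime import date


def apply_scd1(current_rows, customer_id, new_address):
    """Type 1: overwrite in place, so the previous value is gone from the table."""
    current_rows[customer_id] = new_address
    return current_rows


def apply_scd2(history, customer_id, new_address, change_date):
    """Type 2: close the current row (if the address changed) and append a new one."""
    current = next(
        (r for r in history if r["customer_id"] == customer_id and r["is_current"]),
        None,
    )
    if current is not None:
        if current["address"] == new_address:
            return history  # no change, nothing to record
        current["end_date"] = change_date
        current["is_current"] = False
    history.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })
    return history


# Hypothetical customer and addresses for illustration.
type1 = {}
apply_scd1(type1, 42, "1 Old Road")
apply_scd1(type1, 42, "9 New Street")

type2 = []
apply_scd2(type2, 42, "1 Old Road", date(2023, 1, 1))
apply_scd2(type2, 42, "9 New Street", date(2024, 6, 1))

print(type1)       # {42: '9 New Street'}: only the latest address survives
print(len(type2))  # 2: both addresses are kept, with effective dates
```

In the Type 2 table the old address remains as an ordinary row with its own date range, so the audit history does not depend on how long time travel data is retained.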

Question 9 · Data Sharing and Federation

An organization must share live Delta data with an external partner who uses a different platform, without copying the data or building a custom API. Which Databricks capability is designed for this?

Emailing nightly CSV exports
Delta Sharing
Granting the partner full workspace admin access
Mounting the partner’s laptop to the cluster

Correct: B — Delta Sharing. Delta Sharing is the open protocol for securely sharing live data with recipients on any platform without copying it or building bespoke APIs — exactly the cross-platform, zero-copy requirement.

Why not the others: CSV exports (A) are stale, copied snapshots; granting workspace admin (C) is a severe over-privilege and not a sharing mechanism; mounting a laptop (D) is nonsensical. Delta Sharing is purpose-built for governed external sharing.

Source: Databricks — Delta Sharing →

Question 10 · Data Governance

An engineer must restrict an analytics group so it can read a customers table but cannot see the rows for EU customers, enforced centrally and reusably across multiple tables. Which Unity Catalog capability fits best?

Make a manual filtered copy of the table for the group
A row filter / ABAC policy in Unity Catalog applied to the relevant tables
Tell the group not to query EU rows
Revoke all access and email results on request

Correct: B. Unity Catalog row filters — and ABAC policies for centrally managed, reusable enforcement — restrict which rows a group sees at query time across tables, without duplicating data. That matches "enforced centrally and reusably."

Why not the others: a manual filtered copy (A) duplicates data and drifts; relying on the group not to query rows (C) is not a control; revoking access and emailing results (D) defeats self-service and governance. Row filters/ABAC are the governed mechanism.

Source: Databricks — Unity Catalog ABAC →
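
The "define once, enforce on many tables" idea behind Question 10 can be modeled in plain Python. This is a conceptual sketch only and does not use Unity Catalog syntax: the group names, the region column, and the sample tables are hypothetical.

```python
def hide_eu_rows(row):
    """Predicate defined once: True when the row may be shown."""
    return row["region"] != "EU"


# Hypothetical central policy registry: group -> predicate.
ROW_POLICIES = {"analytics": hide_eu_rows}


def read_table(rows, group):
    """Apply the group's policy, if any, at read time instead of keeping a filtered copy."""
    policy = ROW_POLICIES.get(group)
    return [r for r in rows if policy is None or policy(r)]


customers = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
orders = [{"order_id": 10, "region": "EU"}, {"order_id": 11, "region": "APAC"}]

print(read_table(customers, "analytics"))    # only the US customer
print(read_table(orders, "analytics"))       # only the APAC order
print(len(read_table(customers, "admins")))  # 2: no policy for this group
```

The same predicate governs both tables, and the underlying data is stored once, which is the property that a manually filtered copy cannot offer.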

Keep going: Learning & Career resources

This certification pays off fastest when it sits on top of real platform skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.

📚 Learning Hub: Data Engineering. Deep guides on PySpark, Structured Streaming, Delta Lake optimization, Unity Catalog governance, Asset Bundles, and observability. These cover the concepts behind every Professional exam domain, not just the answers. Explore data engineering guides →

💼 Career Hub: Data Engineer. Senior engineering paths (senior/lead data engineer, data architect, principal engineer) with salary ranges, expected skills, and how this Professional credential fits each path. See data engineer career paths →

Deep dive: exam structure, scoring, study path & recertification

Exam structure and how it’s scored

The exam delivers 59 scored multiple-choice questions in 120 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are advanced and scenario-heavy, frequently presenting a production situation or a Python/SQL snippet and asking for the best design or fix; code is primarily in Python and SQL. Read the exam-format deep dive →
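
For pacing, the format above works out to roughly two minutes per scored question. The sketch below does that arithmetic; the optional review buffer is an assumption for illustration, and any unscored items (whose number is not published) would change the figures.

```python
def seconds_per_question(total_minutes, questions, review_minutes=0):
    """Average seconds available per question after reserving review time."""
    if questions <= 0:
        raise ValueError("questions must be positive")
    if not 0 <= review_minutes < total_minutes:
        raise ValueError("review_minutes must be within the time limit")
    return (total_minutes - review_minutes) * 60 / questions


print(round(seconds_per_question(120, 59), 1))                     # 122.0
print(round(seconds_per_question(120, 59, review_minutes=10), 1))  # 111.9
```

Holding back ten minutes for review, a hypothetical choice, leaves a little under two minutes per question.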

What the ten domains actually test, and what changed

Developing Code for Data Processing (22%) is by far the largest domain, followed by Cost & Performance Optimisation (13%); Transformation/Quality, Monitoring, Security & Compliance, and Debugging & Deploying sit at 10% each, with Ingestion, Governance, Data Modelling, and Data Sharing & Federation rounding out the rest. Older materials describe "6 domains," "60 questions," or a fixed "70%" pass mark; those reflect a previous version. The current exam emphasizes Lakeflow Spark Declarative Pipelines, serverless compute, streaming, and deployment with the CLI, REST API, and Asset Bundles. Read the advanced Databricks toolchain guide →

Realistic study path

Most candidates need real production experience plus focused review. A workable path: earn (or be solid on) the Data Engineer Associate first, then take Databricks Academy’s Advanced Data Engineering content and the streaming, performance-optimization, data-privacy, and automated-deployment modules. Build an end-to-end production-style project — streaming ingestion with Auto Loader, CDC and dedup with MERGE and watermarks, Medallion modeling with Type 2 dimensions, governance and masking in Unity Catalog, observability via system tables, and deployment with Asset Bundles — and practice reading the Spark UI to diagnose skew and spill. Read the study plan →

Cost, scheduling, and delivery

The registration fee is $200 USD plus applicable local taxes, and there are no free retakes. The exam is proctored and can be taken online or at a test center, in English, Japanese, Portuguese (Brazil), and Korean. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official page before booking. Databricks’ official certification page →

Recertification

The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires — there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track platform changes (the current version reorganized the domains and emphasizes Lakeflow, serverless, streaming, and Asset Bundles), recertifying also keeps your validated skills current. Read the recertification guide →

Career outlook

The Professional credential is a senior-engineer differentiator: it signals you can design secure, cost-effective, scalable pipelines, troubleshoot complex production issues, and apply governance and compliance at scale. It maps to senior/lead data engineer, data architect, and principal engineer roles, which command premium compensation. The credential is most valuable paired with demonstrable production work. For salary ranges and role-specific paths, see the Career Hub. Career Hub — Data Engineer →

Frequently asked questions

Is the Databricks Data Engineer Professional exam hard?

Yes — it is an advanced exam and noticeably harder than the Associate. The 59 questions are scenario-heavy and assume real production experience: you will reason through streaming, CDC, optimization, security, and deployment situations, often with a Python or SQL snippet to interpret. Developing Code for Data Processing alone is 22% of the exam, so deep PySpark and SQL fluency is essential, not optional.

What is the passing score?

Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance across all questions. You may see "70%" (or "closer to 80%") quoted on third-party sites, but those figures are not confirmed by Databricks, so treat them as unofficial and aim to be strong across every domain rather than targeting a specific percentage.

How many questions and domains are on the exam? I’ve seen different numbers.

The current exam has 59 scored multiple-choice questions in 120 minutes, organized into ten domains, with Developing Code for Data Processing (22%) and Cost & Performance Optimisation (13%) the heaviest. Older materials cite "6 domains" and "60 questions," and some report "65"; those reflect a previous version. Study against the current ten-domain structure on the official exam page and in the latest exam guide.

Should I take the Associate first?

For most people, yes. The Professional assumes you already know the foundational pipeline, ingestion, and governance concepts covered by the Data Engineer Associate, and it builds well beyond them. If you have substantial production Databricks experience (roughly a year or more) you can attempt the Professional directly, but you should still review Associate-level topics to avoid gaps. If you are newer to the platform, start with the Associate.

What topics carry the most weight?

Prioritize Developing Code for Data Processing (22%) — production PySpark and SQL — and Cost & Performance Optimisation (13%). After those, Transformation & Quality, Monitoring & Alerting, Security & Compliance, and Debugging & Deploying are 10% each. Make sure you can handle streaming with Auto Loader and watermarks, MERGE/CDC, Delta optimization, Unity Catalog governance and masking, observability via system tables, and deployment with Asset Bundles.

How does it differ from the Data Engineer Associate?

The Associate validates foundational pipeline building — ingestion, PySpark/SQL transformations, orchestration, basic CI/CD, and governance — in a 45-question, 90-minute exam. The Professional is a 59-question, 120-minute exam focused on production-grade design: advanced code, streaming, cost and performance optimization, security and compliance, observability, and deployment at scale. Choose the Professional once you operate real production pipelines and want a senior-level credential.

Start your free 24-hour practice trial

Full access to the question bank, both study modes, and domain-level scoring across all ten exam areas. No credit card required.
