Databricks · Practice Exam · Updated for 2026

Databricks Certified Data Engineer Associate Practice Exam

Practice across all seven exam sections — from data ingestion with Auto Loader and Lakeflow Connect to PySpark transformations, Lakeflow Jobs orchestration, CI/CD with Automation Bundles, and Unity Catalog governance. Get immediate feedback in Learn mode and a full 90-minute simulation in Exam mode. Start with a 24-hour free trial.

500+ practice questions
7 exam sections covered
2 study modes
24-hour free trial access

Exam at a glance

Exam: Databricks Certified Data Engineer Associate
Format: Multiple choice, proctored (online or test center)
Scored questions: 45 (additional unscored items may appear)
Time limit: 90 minutes
Registration fee: $200 USD, plus applicable local taxes
Prerequisites: None; related training highly recommended
Recommended experience: 6+ months of hands-on data engineering on Databricks
Passing standard: Databricks does not publish a fixed numeric passing score
Validity: 2 years; recertify by taking the current exam
Languages: English, Japanese, Portuguese (BR), Korean
Blueprint edition: Data Engineer Associate Exam Guide (May 2026 edition)

Source: Databricks — Certified Data Engineer Associate · Exam Guide PDF (May 2026)

About this certification

The Data Engineer Associate is Databricks’ entry-level data-engineering credential. It validates that you can perform foundational engineering tasks on the Databricks Data Intelligence Platform: ingesting data with Auto Loader, COPY INTO, and Lakeflow Connect; transforming and modeling data with PySpark and SQL through the Medallion (bronze/silver/gold) architecture; orchestrating pipelines with Lakeflow Jobs; promoting code across environments with CI/CD and Automation Bundles; and governing data with Unity Catalog. The emphasis is on the Databricks way of doing things — Delta Lake, Unity Catalog, and Lakeflow — rather than generic Spark.

The exam is applied and scenario-based: questions present a realistic situation (a skewed Spark stage, an ingestion choice, a deployment requirement) and ask for the best approach, often with a code snippet to read or debug. Data-manipulation code is shown in SQL where possible and otherwise in Python. The platform has renamed several features (Delta Live Tables to Lakeflow Spark Declarative Pipelines, Asset Bundles to Automation Bundles, Repos to Git Folders), so current terminology matters. For foundational reading on data engineering with Databricks, see the Data Engineering Learning Hub guide.

Exam sections and weights

The exam is divided into seven sections. Weights are taken directly from the official Databricks Exam Guide (May 2026 edition); approximate question counts are derived from the 45 scored questions and rounded.

Databricks Intelligence Platform (6%, ~3 questions)

Platform architecture, Delta Lake, Unity Catalog fundamentals, and choosing the right compute service and cost model for a workload.

Data Ingestion and Loading (21%, ~9 questions)

Batch, streaming, and incremental ingestion with Auto Loader, COPY INTO, and Lakeflow Connect into Unity Catalog–governed tables; choosing the right method by volume and frequency.

Data Transformation and Modeling (22%, ~10 questions)

The largest section. PySpark/SQL cleaning, joins, aggregations, and deduplication; Medallion layers; and building gold objects like views, materialized views, and streaming tables.

Working with Lakeflow Jobs (16%, ~7 questions)

Orchestrating pipelines with Lakeflow Jobs: control flow, task dependencies on a DAG, schedules, and choosing time-based vs. data-driven triggers.

Implementing CI/CD (10%, ~5 questions)

Databricks Git Folders workflow, environment-specific config, and deploying Automation Bundles across dev/test/prod, including the Databricks CLI.

Troubleshooting, Monitoring & Optimization (10%, ~5 questions)

Reading Lakeflow Jobs run history and the Spark UI to diagnose skew, shuffle, and OOM issues; Liquid Clustering and predictive optimization.

Governance and Security (15%, ~7 questions)

Managed vs. external tables, GRANT/REVOKE/DENY access controls, column masking and row-level security, and Unity Catalog ABAC policies.
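
The approximate question counts can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary and function names are illustrative, the weights are copied from the section list, and rounding halves up is an assumption about how the counts were derived.

```python
# Section weights in percent, copied from the section list.
SECTION_WEIGHTS = {
    "Databricks Intelligence Platform": 6,
    "Data Ingestion and Loading": 21,
    "Data Transformation and Modeling": 22,
    "Working with Lakeflow Jobs": 16,
    "Implementing CI/CD": 10,
    "Troubleshooting, Monitoring & Optimization": 10,
    "Governance and Security": 15,
}
SCORED_QUESTIONS = 45


def approx_question_counts(weights, total):
    """Round each section's share of `total` to the nearest whole question.

    Integer arithmetic rounds halves up (4.5 becomes 5) and avoids the
    round-half-to-even behaviour of Python's built-in round().
    """
    return {name: (pct * total + 50) // 100 for name, pct in weights.items()}


counts = approx_question_counts(SECTION_WEIGHTS, SCORED_QUESTIONS)
for name, n in counts.items():
    print(f"{name}: ~{n}")
print("Sum of rounded counts:", sum(counts.values()))
```

Because each section is rounded on its own, the rounded counts add up to 46 rather than 45, which is why they should be read as approximations.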

Who this exam is for

This credential fits data engineers, ETL developers, analytics engineers, and platform-focused professionals who build data pipelines on Databricks. There are no formal prerequisites, so anyone can register; in practice Databricks recommends around six months of hands-on experience with the tasks in the exam guide. Comfort with PySpark DataFrames and Spark SQL, Delta Lake operations, and Unity Catalog basics is effectively expected, along with familiarity with the current Lakeflow tooling.

If your work centers on querying and dashboards rather than building pipelines, the Data Analyst Associate is a closer fit; once you operate production pipelines and want to go deeper into optimization, streaming, and deployment, the Data Engineer Professional is the next step. For role-by-role salary ranges and career paths, see the Career Hub — Data Engineer role guide.

What this practice exam delivers

Learn mode

Answer one question at a time with the explanation revealed immediately — ideal for the transformation and ingestion sections, where reading a code snippet and picking the correct Databricks approach is the whole point.

Exam mode

45 questions against a 90-minute timer — the real exam format. Build the pacing the scenario-based, code-heavy questions demand before test day.

Source-linked explanations

Every answer cites the Databricks documentation it derives from — Auto Loader, Delta Lake, Lakeflow Jobs, Unity Catalog — so you can verify the reasoning and dig deeper.

Score by exam section

Results break down across all seven sections, so practice tells you exactly which area — ingestion, transformation, orchestration, CI/CD, governance — to study next.

Sample practice questions

Ten free questions spanning the seven exam sections, each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.

Question 1 · Databricks Intelligence Platform

A team needs rapid pipeline iteration with reliable rollbacks after bad ingests, audit trails for compliance, and a single source of truth for both AI and BI. Which strategy meets these requirements?

  A. DBFS CSV storage with manual file versioning and nightly copies for rollback
  B. Delta Lake ACID transactions and time travel, governed by Unity Catalog for consistent access and lineage
  C. Cloud object storage only, with ad hoc SQL queries for recovery and governance
  D. Ephemeral in-memory DataFrames for audit trails and BI distribution

Correct: B. Delta Lake provides ACID transactions and time travel for safe rollbacks and audit history, and Unity Catalog governs consistent, lineage-tracked access for both AI and BI — the platform combination the requirements describe.

Why not the others: manual CSV versioning (A) is fragile and has no ACID guarantees; raw object storage with ad hoc queries (C) offers no transactional rollback or governance; in-memory DataFrames (D) are ephemeral and cannot serve as an audit trail or shared source of truth.

Source: Databricks — Delta Lake →

Question 2 · Data Ingestion and Loading

New JSON files land continuously in cloud storage and must be ingested incrementally, with automatic schema inference and checkpointing so files are processed exactly once. Which capability fits best?

  A. A one-time manual upload through the UI
  B. Auto Loader, which incrementally processes new files with schema inference and checkpointing
  C. Re-reading the entire directory on every run
  D. Exporting the files to a spreadsheet

Correct: B — Auto Loader. Auto Loader incrementally detects and processes new files as they arrive, with built-in schema inference/evolution and checkpointing so already-ingested files are not reprocessed — the standard pattern for continuous file ingestion.

Why not the others: a manual UI upload (A) does not handle continuously arriving files; re-reading the whole directory each run (C) is inefficient and reprocesses old data; exporting to a spreadsheet (D) is not ingestion. Auto Loader is purpose-built here.

Source: Databricks — Auto Loader →
Further reading: PowerKram — Data Ingestion on Databricks →
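
For readers who want to see what the Auto Loader pattern in Question 2 looks like in PySpark, here is a minimal sketch. The function name, paths, and table name are hypothetical; it assumes a Databricks environment where a `spark` session is available and passed in.

```python
def ingest_new_json_files(spark, source_path, target_table, state_path):
    """Incrementally load newly arrived JSON files into a table.

    All arguments are illustrative; `spark` is the active SparkSession.
    """
    stream = (
        spark.readStream.format("cloudFiles")  # "cloudFiles" selects Auto Loader
        .option("cloudFiles.format", "json")
        # Where the inferred schema is stored and evolved between runs.
        .option("cloudFiles.schemaLocation", f"{state_path}/schema")
        .load(source_path)
    )
    return (
        stream.writeStream
        # The checkpoint records which files have already been processed.
        .option("checkpointLocation", f"{state_path}/checkpoint")
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook this might be called as `ingest_new_json_files(spark, "/Volumes/main/landing/events", "main.bronze.events", "/Volumes/main/landing/_state/events")`. Using `availableNow=True` is one design choice for scheduled incremental runs; a continuously running stream would use a different trigger.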
Question 3 · Data Ingestion and Loading

An engineer wants to idempotently and incrementally load a known set of files from cloud object storage into a Unity Catalog table using a single SQL command. Which command is designed for this?

  A. INSERT INTO with a full table scan each run
  B. COPY INTO
  C. CREATE TABLE AS SELECT, re-run each time
  D. DROP TABLE then reload

Correct: B — COPY INTO. COPY INTO incrementally and idempotently loads files from cloud storage into a Delta/Unity Catalog table, tracking which files it has already ingested so re-running does not duplicate data.

Why not the others: plain INSERT INTO (A) does not track loaded files and can duplicate; re-running CTAS (C) rebuilds the whole table rather than loading incrementally; dropping and reloading (D) is wasteful and loses history. COPY INTO is the SQL incremental-load command.

Source: Databricks — COPY INTO →
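
As an illustration of the single SQL command in Question 3, the sketch below builds a basic COPY INTO statement as a Python string. The helper name, table, and path are hypothetical, and only the core clauses are shown.

```python
# File formats accepted by this illustrative helper.
SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "TEXT", "BINARYFILE"}


def build_copy_into(table, source_path, file_format="JSON"):
    """Return a basic COPY INTO statement for the given table and source."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format}")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {fmt}"
    )


statement = build_copy_into("main.bronze.orders", "/Volumes/main/landing/orders")
print(statement)
```

On Databricks the resulting string could be run with `spark.sql(statement)` or pasted into a SQL editor. Re-running the same statement is expected to skip files that were already loaded, which is the idempotent behavior the question describes.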
Question 4 · Data Transformation and Modeling

In a Medallion architecture, an engineer reads raw bronze data, cleans nulls, standardizes data types, and writes a refined table for downstream use. Which layer does the cleaned table represent?

  A. Bronze
  B. Silver
  C. Gold
  D. Platinum

Correct: B — Silver. The Medallion pattern stages raw data in bronze, cleaned and conformed data in silver, and business-level aggregates in gold. Cleaning nulls and standardizing types on bronze produces a silver table.

Why not the others: bronze (A) is the raw landing layer before cleaning; gold (C) is the curated, aggregated layer for BI/analytics built from silver; "platinum" (D) is not a standard Medallion layer. The described step yields silver.

Source: Databricks — Medallion architecture →
Further reading: PowerKram — Medallion & PySpark Transformations →

Question 5 · Data Transformation and Modeling

A gold-layer object must always reflect precomputed, automatically refreshed aggregates for a BI team, without re-running the query on every read. Which object best fits?

  A. A standard view
  B. A materialized view
  C. A temporary view
  D. A CSV extract

Correct: B — a materialized view. A materialized view stores precomputed results and refreshes them, so BI reads hit cached aggregates rather than recomputing — ideal for a gold-layer serving object.

Why not the others: a standard view (A) recomputes its query on every read; a temporary view (C) exists only for the session; a CSV extract (D) is a static snapshot that goes stale. Materialized views provide refreshed, precomputed gold data.

Source: Databricks — materialized views →

Question 6 · Working with Lakeflow Jobs

A pipeline should run only when new files arrive in a monitored location, rather than on a fixed clock schedule. Which trigger type should the engineer choose in Lakeflow Jobs?

  A. A time-based (cron) schedule
  B. A file-arrival (data-driven) trigger
  C. Manual runs only
  D. A continuous always-on cluster with no trigger

Correct: B. A file-arrival trigger is data-driven: the job runs when new files land, which matches "run only when new data arrives" and avoids unnecessary empty runs on a clock.

Why not the others: a cron schedule (A) runs on time regardless of whether data arrived; manual runs (C) are not automated; an always-on cluster with no trigger (D) wastes compute and does not respond to file arrival. Data-driven triggers match data availability.

Source: Databricks — job triggers →
Further reading: PowerKram — Orchestrating with Lakeflow Jobs →
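
To make the contrast in Question 6 concrete, the sketch below builds the two kinds of job settings as plain Python dictionaries. The helper names and the storage URL are hypothetical, and the field names reflect my reading of the Jobs API settings, so verify them against the current API reference before relying on them.

```python
import json


def file_arrival_trigger(url):
    """Data-driven: run when new files land under `url`."""
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {"url": url},
        }
    }


def cron_schedule(quartz_expression, timezone_id="UTC"):
    """Time-based: run on a clock, whether or not new data has arrived."""
    return {
        "schedule": {
            "quartz_cron_expression": quartz_expression,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        }
    }


data_driven = file_arrival_trigger("/Volumes/main/landing/orders/")
time_based = cron_schedule("0 0 2 * * ?")  # every day at 02:00
print(json.dumps(data_driven, indent=2))
print(json.dumps(time_based, indent=2))
```

Either dictionary would be merged into a job's settings. The file-arrival form is the one that matches "run only when new files arrive".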
Question 7 · Implementing CI/CD

A team wants a modular, versioned way to package, configure, and promote Lakeflow Jobs and pipelines across dev, test, and prod through automated CI/CD. Which feature supports this?

  A. Storing transformation logic in a Volume-mounted notebook and relying on revision history for versioning
  B. Databricks Asset Bundles / Automation Bundles, defined in code, versioned in Git, and promoted via CI/CD
  C. Registering each ETL job as a model in Unity Catalog with aliases
  D. Manually copying notebooks between workspaces

Correct: B. Automation Bundles (formerly Databricks Asset Bundles) define jobs, pipelines, and config as code, version them in Git, and promote consistent deployments across environments through CI/CD — exactly the modular, repeatable mechanism described.

Why not the others: notebook revision history (A) is not real version control or packaging; model aliases (C) govern ML model versions, not ETL job deployment; manual copying (D) is error-prone and unversioned. Bundles are the supported CI/CD packaging unit.

Source: Databricks — Asset Bundles →
Further reading: PowerKram — CI/CD with Automation Bundles →

Question 8 · Troubleshooting, Monitoring & Optimization

A batch job slowed sharply. In the Spark UI, most tasks in the longest stage finish in seconds, but one task runs far longer, with max shuffle read many times the median. What is the most likely cause and fix?

  A. Healthy parallelism; simply add executors to finish the slow task faster
  B. Data skew on the join key; enable adaptive query execution with skew-join handling (or salt the key) to split the oversized partition
  C. Too few partitions; always reduce shuffle partitions to one
  D. A network outage; restart the cluster

Correct: B. One task far slower than the rest with a much larger shuffle read is the classic signature of data skew. Adaptive query execution with skew-join handling splits the oversized partition at runtime; salting the key is the manual alternative.

Why not the others: adding executors (A) does not help a single skewed task that one machine must process; collapsing to one shuffle partition (C) worsens parallelism; a network outage (D) would not produce this specific per-task skew signature. Diagnose skew from the stage metrics.

Source: Databricks — adaptive query execution →
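
The skew signature in Question 8 (one task whose shuffle read is many times the median) can be checked mechanically. The sketch below is plain Python rather than Spark: the per-task sizes are made-up numbers standing in for what the Spark UI reports, and the factor of 5 is an illustrative threshold, not a claim about any particular cluster's configuration.

```python
from statistics import median


def find_skewed_tasks(shuffle_read_mb, factor=5.0):
    """Return indexes of tasks whose shuffle read exceeds factor x median."""
    typical = median(shuffle_read_mb)
    if typical <= 0:
        return []
    return [i for i, size in enumerate(shuffle_read_mb) if size > factor * typical]


def salted_key(key, row_id, buckets):
    """Manual salting: spread one hot key across `buckets` sub-keys."""
    return f"{key}_{row_id % buckets}"


# Hypothetical per-task shuffle read sizes (MB) for the slow stage.
task_reads_mb = [120, 110, 130, 125, 4000, 115]
print(find_skewed_tasks(task_reads_mb))  # [4]

# One hot key "US" becomes four distinct join keys.
print(sorted({salted_key("US", row_id, 4) for row_id in range(8)}))
# ['US_0', 'US_1', 'US_2', 'US_3']
```

With salting, the other side of the join has to be replicated once per salt value so that every sub-key still finds its match. Adaptive query execution avoids that manual work by splitting the oversized partition at runtime.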
Question 9 · Governance and Security

A data engineer drops a Unity Catalog table and expects the underlying data files to be deleted automatically. For which table type is this true?

  A. An external table
  B. A managed table
  C. A temporary view
  D. Any table, regardless of type

Correct: B — a managed table. Unity Catalog manages both the metadata and the data files for a managed table, so dropping it removes the underlying files. Knowing managed vs. external behavior is a core governance objective.

Why not the others: dropping an external table (A) removes only the metadata and leaves the files in their external location; a temporary view (C) has no underlying managed data files; the behavior is not the same for all types (D), which is the whole point of the distinction.

Source: Databricks — managed vs. external tables →
Further reading: PowerKram — Unity Catalog Governance →

Question 10 · Governance and Security

An engineer must give an analytics group read access to a table while hiding a sensitive salary column from them. Which Unity Catalog features achieve this?

  A. Email the group a filtered CSV export
  B. GRANT SELECT to the group, plus a column mask (or column-level security) on the salary column
  C. Make the table public and ask the group not to look at salary
  D. Give the group the metastore admin role

Correct: B. Granting SELECT to the group provides read access, and a column mask or column-level security on the salary column hides that field from them — the governed, least-privilege way to meet the requirement in Unity Catalog.

Why not the others: a CSV export (A) bypasses governance and creates an uncontrolled copy; relying on people not to look (C) is not a control; granting metastore admin (D) massively over-privileges the group. Scoped GRANT plus column masking is correct.

Source: Databricks — column masks →
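
The combination in Question 10 can be sketched as three SQL statements, assembled here as Python strings. All object names (table, groups, mask function) are hypothetical, and returning NULL to non-privileged readers is just one possible masking rule.

```python
def column_mask_statements(table, column, column_type, reader_group,
                           privileged_group, mask_function):
    """Return SQL that grants read access and masks one column."""
    return [
        # 1. Read access to the table for the analytics group.
        f"GRANT SELECT ON TABLE {table} TO `{reader_group}`",
        # 2. A SQL function that decides what each caller sees.
        f"CREATE OR REPLACE FUNCTION {mask_function}({column} {column_type}) "
        f"RETURN CASE WHEN is_account_group_member('{privileged_group}') "
        f"THEN {column} ELSE NULL END",
        # 3. Attach the function to the column as its mask.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_function}",
    ]


statements = column_mask_statements(
    table="main.hr.employees",
    column="salary",
    column_type="DECIMAL(10,2)",
    reader_group="analytics",
    privileged_group="hr_admins",
    mask_function="main.hr.salary_mask",
)
for stmt in statements:
    print(stmt + ";")
```

On Databricks each statement could be run with `spark.sql(stmt)`. The reader group also needs USE CATALOG and USE SCHEMA on the parent catalog and schema before the SELECT grant takes effect.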

Keep going: Learning & Career resources

This certification pays off fastest when it sits on top of real platform skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.

Deep dive: exam structure, scoring, study path & recertification

Exam structure and how it’s scored

The exam delivers 45 scored multiple-choice questions in 90 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are applied and scenario-based, frequently presenting a code snippet or a system situation and asking for the best Databricks approach; data-manipulation code is shown in SQL where possible and otherwise in Python. Read the exam-format deep dive →

What the seven sections actually test, and what changed

The current guide (May 2026) has seven sections: Data Transformation and Modeling (22%) and Data Ingestion and Loading (21%) are the two heaviest and together make up 43% of the exam, followed by Working with Lakeflow Jobs (16%), Governance and Security (15%), and three smaller sections: Implementing CI/CD (10%), Troubleshooting/Monitoring/Optimization (10%), and the Databricks Intelligence Platform (6%). Older materials that describe a five-domain split and "52 questions" are out of date. Note the renames: Delta Live Tables is now Lakeflow Spark Declarative Pipelines, Asset Bundles are Automation Bundles, and Repos are Git Folders. Read the Databricks data-engineering toolchain guide →

Realistic study path

Plan roughly four to eight weeks depending on Spark background. A workable path: the Databricks Academy data-engineering learning path (Lakeflow Connect ingestion, Lakeflow Jobs, DevOps essentials, Unity Catalog governance, Lakeflow Spark Declarative Pipelines), then build an end-to-end project — ingest with Auto Loader and COPY INTO, transform through bronze/silver/gold with PySpark, orchestrate with Lakeflow Jobs, deploy with Automation Bundles, and govern with Unity Catalog. Practice reading the Spark UI for skew and shuffle, since troubleshooting questions are scenario-based. Read the study plan →

Cost, scheduling, and delivery

The registration fee is $200 USD plus applicable local taxes. The exam is proctored and can be taken online or at a test center, and is offered in English, Japanese, Portuguese (Brazil), and Korean. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official page before booking. Databricks’ official certification page →

Recertification

The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires — there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track platform changes (the recent update emphasizes Lakeflow, Automation Bundles, and Unity Catalog and reorganized the sections), recertifying also keeps your validated skills current. Read the recertification guide →

Career outlook

Data engineers with lakehouse-platform expertise are in strong demand as organizations modernize their data infrastructure, and a platform-specific associate credential signals practical competence with Delta Lake, Lakeflow, and Unity Catalog rather than generic Spark. The credential is most valuable paired with demonstrable pipeline work and, over time, with the Data Engineer Professional for production-scale roles. For salary ranges and role-specific paths, see the Career Hub. Career Hub — Data Engineer →

Frequently asked questions

Is the Databricks Data Engineer Associate exam hard?

It is an associate-level exam and most candidates with hands-on Databricks experience find it fair, but it is applied rather than theoretical. The 45 questions are scenario-based and often include a code snippet to read or debug, and they test the Databricks way — Delta Lake, Auto Loader, Lakeflow, Unity Catalog — not generic Spark. Ingestion and transformation together are over 40% of the exam, so weak PySpark/SQL or Auto Loader knowledge shows quickly.

What is the passing score?

Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance. You may see "70%" quoted on third-party sites, but that figure is not confirmed by Databricks, so treat it as unofficial and aim to be comfortable across every section rather than targeting a specific percentage.

How many questions and sections are on the exam? I’ve seen different numbers.

The current exam (May 2026 guide) has 45 scored multiple-choice questions in 90 minutes, organized into seven sections: Databricks Intelligence Platform (6%), Data Ingestion and Loading (21%), Data Transformation and Modeling (22%), Working with Lakeflow Jobs (16%), Implementing CI/CD (10%), Troubleshooting/Monitoring/Optimization (10%), and Governance and Security (15%). Older materials cite a five-domain split and "52 questions"; that reflects a previous version. Study against the current seven-section structure in the official exam guide.

Do I need experience or prerequisites to take it?

There are no formal prerequisites, so anyone can register. Databricks recommends about six months of hands-on experience with the data-engineering tasks in the exam guide. Comfort with PySpark DataFrames and Spark SQL, Delta Lake operations, and Unity Catalog basics closes most of the gap. If you are new to Spark, plan more study time and daily hands-on practice writing PySpark.

What changed in the recent exam update?

The exam was revised (around mid-2025 and again in May 2026) to emphasize Lakeflow (Delta Live Tables is now Lakeflow Spark Declarative Pipelines), Automation Bundles (formerly Databricks Asset Bundles), Git Folders (formerly Repos), and scenario-based reasoning, and it was reorganized into seven sections. If your study materials predate these changes, they will use old names and a different domain split, so confirm against the current official exam guide.

How does it differ from the Data Analyst Associate and Data Engineer Professional exams?

The Data Engineer Associate is about building foundational pipelines — ingestion, PySpark/SQL transformations, orchestration, CI/CD, and governance. The Data Analyst Associate focuses on querying, dashboards, and BI with Databricks SQL rather than pipelines. The Data Engineer Professional goes deeper into production-scale design, streaming, optimization, and advanced deployment. Choose the engineer associate if you build pipelines and want a foundational credential.

Start your free 24-hour practice trial

Full access to the question bank, both study modes, and section-level scoring across all seven exam areas. No credit card required.
