Databricks Certified Associate Developer for Apache Spark Practice Exam
Practice across all seven exam domains — from Spark architecture and the DataFrame API to Spark SQL, Structured Streaming, Spark Connect, and the Pandas API on Spark. All questions are Python, mirroring the real exam. Get immediate feedback in Learn mode and a full 90-minute simulation in Exam mode. Start with a 24-hour free trial.
Exam at a glance
- Exam: Databricks Certified Associate Developer for Apache Spark
- Format: Multiple choice, proctored (online or test center)
- Scored questions: 45 (additional unscored items may appear)
- Time limit: 90 minutes
- Registration fee: $200 USD, plus applicable local taxes
- Code language: Python only (all code snippets are in Python)
- Spark version: Current edition is Apache Spark 3.5
- Prerequisites: None; related training highly recommended
- Recommended experience: 6+ months of hands-on Spark
- Passing standard: Databricks does not publish a fixed numeric passing score
- Validity: 2 years; recertify by taking the current exam
- Languages: English
Source: Databricks — Certified Associate Developer for Apache Spark · Exam Guide PDF (Oct 2025)
About this certification
The Associate Developer for Apache Spark validates that you understand Apache Spark’s architecture and can use the Spark DataFrame API to perform core data-manipulation tasks in a Spark session: selecting, renaming, and manipulating columns; filtering, dropping, sorting, and aggregating rows; handling missing data; reading, writing, and partitioning DataFrames with schemas; and working with UDFs and Spark SQL functions. It also covers Spark fundamentals — execution and deployment modes, the execution hierarchy, fault tolerance, lazy evaluation, shuffling, actions, broadcasting — plus Structured Streaming, Spark Connect, and common troubleshooting and tuning.
Unlike the other Databricks certifications, this is a pure Spark exam rather than a platform exam: all code snippets are in Python, and the current edition targets Apache Spark 3.5. It suits developers and engineers who write Spark code directly. For foundational reading on Apache Spark and the DataFrame API, see the Apache Spark Learning Hub guide.
Exam domains and weights
The exam is divided into seven domains. Weights are taken directly from the official Databricks exam page; approximate question counts are derived from the 45 scored questions (weight × 45), so they are estimates rather than published figures.
- Spark Architecture (20%, about 9 questions): Driver, executors, cluster manager, the execution hierarchy, deployment modes, fault tolerance, lazy evaluation, shuffling, actions vs. transformations, and broadcasting.
- Using Spark SQL (20%, about 9 questions): Querying structured data with Spark SQL (selecting, filtering, joining, and aggregating) and the relationship between SQL and the DataFrame API.
- Developing DataFrame/DataSet API Applications (30%, about 13 to 14 questions): The largest domain. Column and row operations, missing-data handling, schemas, reading/writing/partitioning, UDFs, and built-in functions in Python.
- Troubleshooting & Tuning (10%, about 4 to 5 questions): Diagnosing skew and shuffle, reading execution plans, caching strategies, partition and shuffle configuration, broadcast joins, and Adaptive Query Execution.
- Structured Streaming (10%, about 4 to 5 questions): Streaming fundamentals, including output modes, triggers, checkpointing, watermarks and late data, windowing, and fault-tolerance/exactly-once semantics.
- Spark Connect (5%, about 2 questions): The Spark Connect client-server model, including connecting to a remote Spark cluster and the implications for application deployment.
- Pandas API on Spark (5%, about 2 questions): Working with large-scale data using familiar pandas syntax on Spark, and how it differs from single-node pandas.
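The approximate question counts follow from simple arithmetic on the domain weights quoted on this page. The sketch below shows that calculation in plain Python; the helper name `expected_questions` is hypothetical, and the results are unrounded estimates, not figures published by Databricks.

```python
# Domain weights (percent) as quoted on this page; 45 scored questions.
SCORED_QUESTIONS = 45
DOMAIN_WEIGHTS = {
    "Spark Architecture": 20,
    "Using Spark SQL": 20,
    "Developing DataFrame/DataSet API Applications": 30,
    "Troubleshooting & Tuning": 10,
    "Structured Streaming": 10,
    "Spark Connect": 5,
    "Pandas API on Spark": 5,
}


def expected_questions(weights, total):
    """Return the unrounded expected question count per domain."""
    return {name: pct * total / 100 for name, pct in weights.items()}


counts = expected_questions(DOMAIN_WEIGHTS, SCORED_QUESTIONS)
for name, value in counts.items():
    print(f"{name}: ~{value:g} questions")
```

Because several domains land on fractional values, any rounded per-domain count is only a rough guide to where the questions will fall.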
Who this exam is for
This credential fits developers, data engineers, and big-data professionals who write Apache Spark code and want to validate core Spark skills. There are no formal prerequisites, so anyone can register; in practice Databricks recommends around six months of hands-on Spark experience. Solid Python and a working understanding of the DataFrame API, Spark SQL, and Spark’s execution model are effectively expected.
Because this is a Spark-focused exam rather than a platform exam, it pairs naturally with the platform-oriented Databricks certifications: if your work centers on building pipelines on the Databricks platform, the Data Engineer Associate is a closer fit; if it centers on SQL analytics, the Data Analyst Associate is. For role-by-role salary ranges and engineering career paths, see the Career Hub — Data Engineer role guide.
What this practice exam delivers
Learn mode
Answer one question at a time with the explanation revealed immediately — ideal for the DataFrame API domain, where reading a Python snippet and picking the correct transformation is the whole point.
Exam mode
45 questions against a 90-minute timer — the real exam format. Build the pacing the code-driven, Python-based questions demand before test day.
Source-linked explanations
Every answer cites the Apache Spark or Databricks documentation it derives from — DataFrame API, Spark SQL, Structured Streaming, Spark Connect — so you can verify the reasoning and dig deeper.
Score by exam domain
Results break down across all seven domains, so practice tells you exactly which area — architecture, DataFrame API, SQL, streaming, tuning — to study next.
Sample practice questions
Ten free questions spanning the seven exam domains, all in Python, each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.
A developer chains several select and filter calls on a DataFrame, then finally calls count(). When does Spark actually execute the work?
- A. Each select and filter runs immediately as it is called
- B. Nothing runs until count(), because transformations are lazy and the action triggers the job
- C. Work runs only if caching is enabled
- D. Spark executes everything when the SparkSession is created
Correct: B. select and filter are transformations and are lazy — Spark builds a logical plan but does no computation until an action like count() forces execution of the whole chain.
Why not the others: transformations do not run eagerly (A); caching affects reuse, not whether lazy transformations execute (C); creating the SparkSession does not run any job (D). Lazy evaluation plus an action is the core execution model.
Source: Apache Spark (SQL/DataFrame guide)
Which of the following is an action (not a transformation) that triggers execution and returns data to the driver?
- A. df.select("col")
- B. df.filter(df.col > 10)
- C. df.collect()
- D. df.withColumn("x", df.col + 1)
Correct: C — collect(). collect() is an action: it materializes the result and returns it to the driver, triggering the job. (Be cautious with it on large data, since it pulls everything to the driver.)
Why not the others: select (A), filter (B), and withColumn (D) are all lazy transformations that only extend the plan. Only collect() forces execution and returns data.
Source: Apache Spark (transformations & actions). Further reading: PowerKram (Spark Execution Model)
A DataFrame df has a column price. The developer wants a new DataFrame with an added column price_with_tax equal to price * 1.08. Which expression is correct?
- A. df.withColumn("price_with_tax", df.price * 1.08)
- B. df.addColumn("price_with_tax", df.price * 1.08)
- C. df.select(price * 1.08)
- D. df.price_with_tax = df.price * 1.08
Correct: A — withColumn. withColumn(name, expression) returns a new DataFrame with the derived column added (DataFrames are immutable). df.price * 1.08 is a valid column expression.
Why not the others: there is no addColumn method (B); price alone is undefined and select needs a column object (C); DataFrames are immutable, so attribute assignment (D) does not add a column. withColumn is the idiom.
A developer must drop only rows where every column is null, keeping rows that have at least one non-null value. Which call does this?
- A. df.dropna(how="any")
- B. df.dropna(how="all")
- C. df.fillna(0)
- D. df.distinct()
Correct: B — dropna(how="all"). With how="all", a row is dropped only if all of its columns are null, which keeps any row with at least one value — exactly the requirement.
Why not the others: how="any" (A) drops a row if any column is null, which is too aggressive; fillna (C) replaces nulls instead of dropping rows; distinct (D) removes duplicates, unrelated to nulls. how="all" matches.
When reading a large set of JSON files repeatedly in production, a developer wants to avoid the cost and unpredictability of schema inference. What is the best practice?
- A. Always rely on inferSchema so Spark figures it out each run
- B. Provide an explicit schema (StructType) when reading the data
- C. Convert everything to strings and parse later
- D. Read one file at a time to keep inference cheap
Correct: B. Supplying an explicit StructType schema avoids an extra inference pass over the data, is faster, and guarantees consistent typing across runs — the production best practice.
Why not the others: relying on inference (A) costs an extra scan and can drift; stringifying everything (C) loses types and pushes work downstream; reading one file at a time (D) cripples parallelism. An explicit schema is correct.
Source: Apache Spark (JSON data source & schemas)
A developer wants to run a SQL query against a DataFrame using spark.sql("SELECT … FROM sales"). What must they do first so the name sales resolves?
- A. Nothing; any DataFrame variable is automatically a SQL table
- B. Register the DataFrame as a temporary view, e.g. df.createOrReplaceTempView("sales")
- C. Export the DataFrame to CSV named sales.csv
- D. Rename the Python variable to sales
Correct: B. createOrReplaceTempView("sales") registers the DataFrame under a name in the session catalog so Spark SQL can reference it as a table in spark.sql(...).
Why not the others: a Python variable is not automatically a SQL-visible table (A); exporting to CSV (C) does not create a queryable view; renaming the Python variable (D) has no effect on SQL name resolution. A temp view is required.
Source: Apache Spark (Spark SQL getting started). Further reading: PowerKram (Spark SQL & Temp Views)
A join between a very large fact table and a small lookup table is slow due to a shuffle. The lookup table easily fits in memory. What is the most effective optimization?
- A. Increase spark.sql.shuffle.partitions to a very large number
- B. Use a broadcast join so the small table is sent to every executor, avoiding the shuffle
- C. Cache the large fact table before the join
- D. Convert both tables to RDDs first
Correct: B — broadcast join. Broadcasting the small table to all executors lets each one join locally without shuffling the large table, which is the standard fix for a large-table/small-table join. Adaptive Query Execution can also choose this automatically.
Why not the others: bumping shuffle partitions (A) does not remove the shuffle; caching the large table (C) does not address the join-shuffle cost; converting to RDDs (D) discards Catalyst optimizations. Broadcast is the right tool.
Source: Apache Spark (performance tuning: broadcast, AQE). Further reading: PowerKram (Tuning Spark Joins & Shuffle)
A streaming aggregation must handle late-arriving events but cannot keep state forever. Which mechanism bounds how long Spark waits for late data and lets it drop old state?
- A. A watermark on the event-time column
- B. Setting the trigger to once
- C. Increasing the number of shuffle partitions
- D. Disabling checkpointing
Correct: A (a watermark). A watermark on the event-time column tells Spark how late data may arrive. Events whose event time falls behind the watermark may be dropped, and the state for windows older than the watermark can be cleaned up, which bounds state in stateful streaming aggregations.
Why not the others: a once trigger (B) controls batch cadence, not late-data state; shuffle partitions (C) affect parallelism, not watermarking; disabling checkpointing (D) breaks fault tolerance. The watermark bounds late-data state.
Source: Apache Spark (Structured Streaming: watermarks). Further reading: PowerKram (Structured Streaming Fundamentals)
A developer wants to run PySpark code from a local IDE or a lightweight client against a remote Spark cluster, using a thin client that sends unresolved logical plans to the server. Which capability enables this?
- A. Spark Connect
- B. Copying the whole cluster runtime to the laptop
- C. Running everything in local mode only
- D. Submitting a JAR by email to the cluster admin
Correct: A — Spark Connect. Spark Connect introduces a client-server architecture where a thin client sends unresolved logical plans to a remote Spark server for execution, enabling remote development from IDEs and lightweight clients.
Why not the others: copying the runtime locally (B) is not how remote execution works; local mode (C) does not use the remote cluster; emailing a JAR (D) is not a deployment mechanism. Spark Connect is the client-server feature being described.
Source: Apache Spark (Spark Connect overview)
A data scientist has large data and wants familiar pandas-style code that scales across a Spark cluster instead of being limited to a single machine’s memory. Which option fits?
- A. Plain pandas with df = pandas.read_csv(...) on the driver
- B. The Pandas API on Spark (e.g. import pyspark.pandas as ps)
- C. Exporting to Excel and using formulas
- D. Looping over rows in pure Python
Correct: B — the Pandas API on Spark. pyspark.pandas offers pandas-compatible syntax that executes on Spark, so familiar pandas code scales across the cluster rather than being bound to one node’s memory.
Why not the others: plain pandas on the driver (A) is single-node and will run out of memory on large data; Excel (C) and row-by-row Python loops (D) do not scale at all. The Pandas API on Spark is purpose-built for this.
Source: Apache Spark (Pandas API on Spark)
Keep going: Learning & Career resources
This certification pays off fastest when it sits on top of real Spark skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.
Deep dive: exam structure, scoring, study path & recertification
Exam structure and how it’s scored
The exam delivers 45 scored multiple-choice questions in 90 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are code-driven and conceptual, and all code snippets are in Python: you will read DataFrame and Spark SQL code and choose the correct API, output, or fix.
What the seven domains actually test, and what changed
Developing DataFrame/DataSet API Applications (30%) is the largest domain, followed by Spark Architecture (20%) and Spark SQL (20%); Troubleshooting & Tuning and Structured Streaming are 10% each, and Spark Connect and the Pandas API on Spark are 5% each. The current edition targets Apache Spark 3.5 and is Python-only. Older "Spark 3.0" study guides and a circulating "35% DataFrame API" weight are out of date, and Spark Connect and the Pandas API are relatively new dedicated domains.
Realistic study path
Plan roughly four to eight weeks depending on Spark background. A workable path: a Spark fundamentals course, then consistent hands-on PySpark. Write column and row operations, define explicit schemas, build joins and aggregations, register temp views and run Spark SQL, experiment with caching and broadcast joins, build a small Structured Streaming job with a watermark, try Spark Connect from a local client, and run a Pandas-API-on-Spark example. Read execution plans to understand shuffles, since tuning questions are scenario-based.
Cost, scheduling, and delivery
The registration fee is $200 USD plus applicable local taxes. The exam is proctored and can be taken online or at a test center, in English, and all code is presented in Python. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official certification page before booking.
Recertification
The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires; there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track Spark releases (the current edition moved to Spark 3.5 and added Spark Connect and Pandas-API domains), recertifying also keeps your validated skills current.
Career outlook
Apache Spark remains a foundational skill across data engineering and big-data roles, and a vendor-backed Spark credential signals practical command of the DataFrame API, Spark SQL, and the execution model. Because it is framework-focused rather than platform-specific, it complements both the Databricks platform certifications and broader engineering roles. The credential is most valuable paired with real Spark project work. For salary ranges and role-specific paths, see the Career Hub.
Frequently asked questions
Is the Spark Developer Associate exam in Python or Scala?
The current exam is Python-only — every code snippet you see is in Python (PySpark). Databricks previously offered a Scala variant, but the current edition standardizes on Python, so prepare with PySpark. You should be comfortable reading and reasoning about DataFrame API and Spark SQL code in Python.
What is the passing score?
Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance across all questions. You may see "70%" quoted on third-party sites, but that figure is not confirmed by Databricks, so treat it as unofficial and aim to be comfortable across every domain rather than targeting a specific percentage.
Which Spark version does the exam cover?
The current edition targets Apache Spark 3.5, and the exam guide is dated October 2025. Some study materials still reference "Spark 3.0," which is out of date — for example, the current exam includes dedicated domains for Spark Connect and the Pandas API on Spark that older guides do not emphasize. Study against the current Spark 3.5 exam guide.
How many questions and domains are on the exam?
The exam has 45 scored multiple-choice questions in 90 minutes, organized into seven domains: Spark Architecture (20%), Using Spark SQL (20%), Developing DataFrame/DataSet API Applications (30%), Troubleshooting & Tuning (10%), Structured Streaming (10%), Spark Connect (5%), and Pandas API on Spark (5%). Additional unscored items may appear. The DataFrame API domain is the heaviest, so prioritize it.
Do I need experience or prerequisites to take it?
There are no formal prerequisites, so anyone can register. Databricks recommends about six months of hands-on Spark experience. Solid Python plus a working grasp of the DataFrame API, Spark SQL, and Spark’s execution model (lazy evaluation, shuffles, actions vs. transformations) closes most of the gap. If you are new to Spark, plan a few weeks of daily hands-on PySpark practice.
How does it differ from the Databricks platform certifications?
This is a framework exam about Apache Spark itself — the DataFrame API, Spark SQL, the execution model, streaming, and Spark Connect — rather than a Databricks-platform exam. The Data Engineer Associate and Professional focus on building and operating pipelines on the Databricks platform (Delta Lake, Unity Catalog, Lakeflow), and the Data Analyst Associate focuses on Databricks SQL analytics. Choose this one to validate core Spark coding skills.
Start your free 24-hour practice trial
Full access to the question bank, both study modes, and domain-level scoring across all seven exam areas. No credit card required.