Databricks · Practice Exam · Spark 3.5 · Updated for 2026

Databricks Certified Associate Developer for Apache Spark Practice Exam

Practice across all seven exam domains — from Spark architecture and the DataFrame API to Spark SQL, Structured Streaming, Spark Connect, and the Pandas API on Spark. All questions are Python, mirroring the real exam. Get immediate feedback in Learn mode and a full 90-minute simulation in Exam mode. Start with a 24-hour free trial.

500+ practice questions · 7 exam domains covered · 2 study modes · 24h free trial access

Exam at a glance

Exam: Databricks Certified Associate Developer for Apache Spark
Format: Multiple choice, proctored (online or test center)
Scored questions: 45 (additional unscored items may appear)
Time limit: 90 minutes
Registration fee: $200 USD, plus applicable local taxes
Code language: Python only (all code snippets are in Python)
Spark version: Current edition is Apache Spark 3.5
Prerequisites: None; related training highly recommended
Recommended experience: 6+ months of hands-on Spark
Passing standard: Databricks does not publish a fixed numeric passing score
Validity: 2 years; recertify by taking the current exam
Languages: English

Source: Databricks — Certified Associate Developer for Apache Spark · Exam Guide PDF (Oct 2025)

About this certification

The Associate Developer for Apache Spark validates that you understand Apache Spark’s architecture and can use the Spark DataFrame API to perform core data-manipulation tasks in a Spark session: selecting, renaming, and manipulating columns; filtering, dropping, sorting, and aggregating rows; handling missing data; reading, writing, and partitioning DataFrames with schemas; and working with UDFs and Spark SQL functions. It also covers Spark fundamentals — execution and deployment modes, the execution hierarchy, fault tolerance, lazy evaluation, shuffling, actions, broadcasting — plus Structured Streaming, Spark Connect, and common troubleshooting and tuning.

Unlike the other Databricks certifications, this is a pure Spark exam rather than a platform exam: all code snippets are in Python, and the current edition targets Apache Spark 3.5. It suits developers and engineers who write Spark code directly. For foundational reading on Apache Spark and the DataFrame API, see the Apache Spark Learning Hub guide.

Exam domains and weights

The exam is divided into seven domains. Weights are taken directly from the official Databricks exam page; approximate question counts are derived from the 45 scored questions and rounded.

Apache Spark Architecture and Components

Driver, executors, cluster manager, the execution hierarchy, deployment modes, fault tolerance, lazy evaluation, shuffling, actions vs. transformations, and broadcasting.

20% · ~9 questions
Using Spark SQL

Querying structured data with Spark SQL — selecting, filtering, joining, and aggregating, and the relationship between SQL and the DataFrame API.

20% · ~9 questions
Developing Spark DataFrame/DataSet API Applications

The largest domain. Column and row operations, missing-data handling, schemas, reading/writing/partitioning, UDFs, and built-in functions in Python.

30% · ~14 questions
Troubleshooting and Tuning DataFrame API Applications

Diagnosing skew and shuffle, reading execution plans, caching strategies, partition and shuffle configuration, broadcast joins, and Adaptive Query Execution.

10% · ~5 questions
Structured Streaming

Streaming fundamentals — output modes, triggers, checkpointing, watermarks and late data, windowing, and fault-tolerance/exactly-once semantics.

10% · ~5 questions
Using Spark Connect to Deploy Applications

The Spark Connect client-server model — connecting to a remote Spark cluster and the implications for application deployment.

5% · ~2 questions
Using Pandas API on Apache Spark

Working with large-scale data using familiar pandas syntax on Spark, and how it differs from single-node pandas.

5% · ~2 questions
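
The approximate counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the counts are each domain's weight applied to the 45 scored questions and rounded half up; the rounding rule and the helper name approx_questions are illustrative assumptions, not something Databricks publishes.

```python
# Domain weights (percent) as listed above; 45 scored questions per the exam facts.
SCORED_QUESTIONS = 45

weights = {
    "Apache Spark Architecture and Components": 20,
    "Using Spark SQL": 20,
    "Developing Spark DataFrame/DataSet API Applications": 30,
    "Troubleshooting and Tuning DataFrame API Applications": 10,
    "Structured Streaming": 10,
    "Using Spark Connect to Deploy Applications": 5,
    "Using Pandas API on Apache Spark": 5,
}


def approx_questions(weight_pct, total=SCORED_QUESTIONS):
    # Hypothetical helper: round half up using integer arithmetic.
    # Python's built-in round() rounds halves to the nearest even number,
    # which would turn 4.5 into 4 instead of 5.
    return (total * weight_pct + 50) // 100


counts = {domain: approx_questions(w) for domain, w in weights.items()}

for domain, n in counts.items():
    print(f"{weights[domain]:>3}%  ~{n:<2} {domain}")
```

Because each count is rounded separately, the seven figures add up to 46 rather than 45, which is why they are labeled approximate.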

Who this exam is for

This credential fits developers, data engineers, and big-data professionals who write Apache Spark code and want to validate core Spark skills. There are no formal prerequisites, so anyone can register; in practice Databricks recommends around six months of hands-on Spark experience. Solid Python and a working understanding of the DataFrame API, Spark SQL, and Spark’s execution model are effectively expected.

Because this is a Spark-focused exam rather than a platform exam, it pairs naturally with the platform-oriented Databricks certifications: if your work centers on building pipelines on the Databricks platform, the Data Engineer Associate is a closer fit; if it centers on SQL analytics, the Data Analyst Associate is. For role-by-role salary ranges and engineering career paths, see the Career Hub — Data Engineer role guide.

What this practice exam delivers

Learn mode

Answer one question at a time with the explanation revealed immediately — ideal for the DataFrame API domain, where reading a Python snippet and picking the correct transformation is the whole point.

Exam mode

45 questions against a 90-minute timer, matching the real exam format. Build the pacing that the code-driven, Python-based questions demand before test day.

Source-linked explanations

Every answer cites the Apache Spark or Databricks documentation it derives from — DataFrame API, Spark SQL, Structured Streaming, Spark Connect — so you can verify the reasoning and dig deeper.

Score by exam domain

Results break down across all seven domains, so practice tells you exactly which area — architecture, DataFrame API, SQL, streaming, tuning — to study next.

Sample practice questions

Ten free questions spanning the seven exam domains, all in Python, each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.

Question 1 · Spark Architecture and Components

A developer chains several select and filter calls on a DataFrame, then finally calls count(). When does Spark actually execute the work?

  A. Each select and filter runs immediately as it is called
  B. Nothing runs until count(), because transformations are lazy and the action triggers the job
  C. Work runs only if caching is enabled
  D. Spark executes everything when the SparkSession is created

Correct: B. select and filter are transformations and are lazy — Spark builds a logical plan but does no computation until an action like count() forces execution of the whole chain.

Why not the others: transformations do not run eagerly (A); caching affects reuse, not whether lazy transformations execute (C); creating the SparkSession does not run any job (D). Lazy evaluation plus an action is the core execution model.

Source: Apache Spark — SQL/DataFrame guide →
Question 2 · Spark Architecture and Components

Which of the following is an action (not a transformation) that triggers execution and returns data to the driver?

  A. df.select("col")
  B. df.filter(df.col > 10)
  C. df.collect()
  D. df.withColumn("x", df.col + 1)

Correct: C — collect(). collect() is an action: it materializes the result and returns it to the driver, triggering the job. (Be cautious with it on large data, since it pulls everything to the driver.)

Why not the others: select (A), filter (B), and withColumn (D) are all lazy transformations that only extend the plan. Only collect() forces execution and returns data.

Source: Apache Spark — transformations & actions → Further reading: PowerKram — Spark Execution Model →
Question 3 · Developing DataFrame API Applications

A DataFrame df has a column price. The developer wants a new DataFrame with an added column price_with_tax equal to price * 1.08. Which expression is correct?

  A. df.withColumn("price_with_tax", df.price * 1.08)
  B. df.addColumn("price_with_tax", df.price * 1.08)
  C. df.select(price * 1.08)
  D. df.price_with_tax = df.price * 1.08

Correct: A — withColumn. withColumn(name, expression) returns a new DataFrame with the derived column added (DataFrames are immutable). df.price * 1.08 is a valid column expression.

Why not the others: there is no addColumn method (B); the bare name price is undefined in Python, and select expects column names or Column expressions such as df.price * 1.08 (C); DataFrames are immutable, so attribute assignment (D) does not add a column. withColumn is the idiom.

Source: PySpark — DataFrame.withColumn → Further reading: PowerKram — DataFrame API in Python →
Question 4 · Developing DataFrame API Applications

A developer must drop only rows where every column is null, keeping rows that have at least one non-null value. Which call does this?

  A. df.dropna(how="any")
  B. df.dropna(how="all")
  C. df.fillna(0)
  D. df.distinct()

Correct: B — dropna(how="all"). With how="all", a row is dropped only if all of its columns are null, which keeps any row with at least one value — exactly the requirement.

Why not the others: how="any" (A) drops a row if any column is null, which is too aggressive; fillna (C) replaces nulls instead of dropping rows; distinct (D) removes duplicates, unrelated to nulls. how="all" matches.

Source: PySpark — DataFrame.dropna →
Question 5 · Developing DataFrame API Applications

When reading a large set of JSON files repeatedly in production, a developer wants to avoid the cost and unpredictability of schema inference. What is the best practice?

  A. Always rely on schema inference so Spark works out the types on each run
  B. Provide an explicit schema (StructType) when reading the data
  C. Convert everything to strings and parse later
  D. Read one file at a time to keep inference cheap

Correct: B. Supplying an explicit StructType schema avoids an extra inference pass over the data, is faster, and guarantees consistent typing across runs — the production best practice.

Why not the others: relying on inference (A) costs an extra scan and can drift; stringifying everything (C) loses types and pushes work downstream; reading one file at a time (D) cripples parallelism. An explicit schema is correct.

Source: Apache Spark — JSON data source & schemas →
Question 6 · Using Spark SQL

A developer wants to run a SQL query against a DataFrame using spark.sql("SELECT … FROM sales"). What must they do first so the name sales resolves?

  A. Nothing; any DataFrame variable is automatically a SQL table
  B. Register the DataFrame as a temporary view, e.g. df.createOrReplaceTempView("sales")
  C. Export the DataFrame to CSV named sales.csv
  D. Rename the Python variable to sales

Correct: B. createOrReplaceTempView("sales") registers the DataFrame under a name in the session catalog so Spark SQL can reference it as a table in spark.sql(...).

Why not the others: a Python variable is not automatically a SQL-visible table (A); exporting to CSV (C) does not create a queryable view; renaming the Python variable (D) has no effect on SQL name resolution. A temp view is required.

Source: Apache Spark — Spark SQL getting started → Further reading: PowerKram — Spark SQL & Temp Views →
Question 7 · Troubleshooting and Tuning

A join between a very large fact table and a small lookup table is slow due to a shuffle. The lookup table easily fits in memory. What is the most effective optimization?

  A. Increase spark.sql.shuffle.partitions to a very large number
  B. Use a broadcast join so the small table is sent to every executor, avoiding the shuffle
  C. Cache the large fact table before the join
  D. Convert both tables to RDDs first

Correct: B — broadcast join. Broadcasting the small table to all executors lets each one join locally without shuffling the large table, which is the standard fix for a large-table/small-table join. Adaptive Query Execution can also choose this automatically.

Why not the others: bumping shuffle partitions (A) does not remove the shuffle; caching the large table (C) does not address the join-shuffle cost; converting to RDDs (D) discards Catalyst optimizations. Broadcast is the right tool.

Source: Apache Spark — performance tuning (broadcast, AQE) → Further reading: PowerKram — Tuning Spark Joins & Shuffle →
Question 8 · Structured Streaming

A streaming aggregation must handle late-arriving events but cannot keep state forever. Which mechanism bounds how long Spark waits for late data and lets it drop old state?

  A. A watermark on the event-time column
  B. Setting the trigger to once
  C. Increasing the number of shuffle partitions
  D. Disabling checkpointing

Correct: A, a watermark. A watermark on the event-time column tells Spark how late data may arrive; events that arrive later than the watermark threshold can be dropped, and state for windows older than the watermark can be cleaned up, which bounds state in stateful streaming aggregations.

Why not the others: a once trigger (B) controls batch cadence, not late-data state; shuffle partitions (C) affect parallelism, not watermarking; disabling checkpointing (D) breaks fault tolerance. The watermark bounds late-data state.

Source: Apache Spark — Structured Streaming (watermarks) → Further reading: PowerKram — Structured Streaming Fundamentals →
Question 9 · Using Spark Connect

A developer wants to run PySpark code from a local IDE or a lightweight client against a remote Spark cluster, using a thin client that sends unresolved logical plans to the server. Which capability enables this?

  A. Spark Connect
  B. Copying the whole cluster runtime to the laptop
  C. Running everything in local mode only
  D. Submitting a JAR by email to the cluster admin

Correct: A — Spark Connect. Spark Connect introduces a client-server architecture where a thin client sends unresolved logical plans to a remote Spark server for execution, enabling remote development from IDEs and lightweight clients.

Why not the others: copying the runtime locally (B) is not how remote execution works; local mode (C) does not use the remote cluster; emailing a JAR (D) is not a deployment mechanism. Spark Connect is the client-server feature being described.

Source: Apache Spark — Spark Connect overview →
Question 10 · Using Pandas API on Apache Spark

A data scientist has large data and wants familiar pandas-style code that scales across a Spark cluster instead of being limited to a single machine’s memory. Which option fits?

  A. Plain pandas with df = pandas.read_csv(...) on the driver
  B. The Pandas API on Spark (e.g. import pyspark.pandas as ps)
  C. Exporting to Excel and using formulas
  D. Looping over rows in pure Python

Correct: B — the Pandas API on Spark. pyspark.pandas offers pandas-compatible syntax that executes on Spark, so familiar pandas code scales across the cluster rather than being bound to one node’s memory.

Why not the others: plain pandas on the driver (A) is single-node and will run out of memory on large data; Excel (C) and row-by-row Python loops (D) do not scale at all. The Pandas API on Spark is purpose-built for this.

Source: Apache Spark — Pandas API on Spark →

Keep going: Learning & Career resources

This certification pays off fastest when it sits on top of real Spark skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.

Deep dive: exam structure, scoring, study path & recertification

Exam structure and how it’s scored

The exam delivers 45 scored multiple-choice questions in 90 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are code-driven and conceptual, and all code snippets are in Python — you will read DataFrame and Spark SQL code and choose the correct API, output, or fix. Read the exam-format deep dive →

What the seven domains actually test, and what changed

Developing DataFrame/DataSet API Applications (30%) is the largest domain, followed by Spark Architecture (20%) and Spark SQL (20%); Troubleshooting & Tuning and Structured Streaming are 10% each, and Spark Connect and the Pandas API on Spark are 5% each. The current edition targets Apache Spark 3.5 and is Python-only. Older "Spark 3.0" study guides and a circulating "35% DataFrame API" weight are out of date, and Spark Connect and the Pandas API are relatively new dedicated domains. Read the Spark API guide →

Realistic study path

Plan roughly four to eight weeks depending on Spark background. A workable path: a Spark fundamentals course, then consistent hands-on PySpark — write column and row operations, define explicit schemas, build joins and aggregations, register temp views and run Spark SQL, experiment with caching and broadcast joins, build a small Structured Streaming job with a watermark, try Spark Connect from a local client, and run a Pandas-API-on-Spark example. Read execution plans to understand shuffles, since tuning questions are scenario-based. Read the study plan →

Cost, scheduling, and delivery

The registration fee is $200 USD plus applicable local taxes. The exam is proctored and can be taken online or at a test center, in English, and all code is presented in Python. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official page before booking. Databricks’ official certification page →

Recertification

The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires — there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track Spark releases (the current edition moved to Spark 3.5 and added Spark Connect and Pandas-API domains), recertifying also keeps your validated skills current. Read the recertification guide →

Career outlook

Apache Spark remains a foundational skill across data engineering and big-data roles, and a vendor-backed Spark credential signals practical command of the DataFrame API, Spark SQL, and the execution model. Because it is framework-focused rather than platform-specific, it complements both the Databricks platform certifications and broader engineering roles. The credential is most valuable paired with real Spark project work. For salary ranges and role-specific paths, see the Career Hub. Career Hub — Data Engineer →

Frequently asked questions

Is the Spark Developer Associate exam in Python or Scala?

The current exam is Python-only — every code snippet you see is in Python (PySpark). Databricks previously offered a Scala variant, but the current edition standardizes on Python, so prepare with PySpark. You should be comfortable reading and reasoning about DataFrame API and Spark SQL code in Python.

What is the passing score?

Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance across all questions. You may see "70%" quoted on third-party sites, but that figure is not confirmed by Databricks, so treat it as unofficial and aim to be comfortable across every domain rather than targeting a specific percentage.

Which Spark version does the exam cover?

The current edition targets Apache Spark 3.5, and the exam guide is dated October 2025. Some study materials still reference "Spark 3.0," which is out of date — for example, the current exam includes dedicated domains for Spark Connect and the Pandas API on Spark that older guides do not emphasize. Study against the current Spark 3.5 exam guide.

How many questions and domains are on the exam?

The exam has 45 scored multiple-choice questions in 90 minutes, organized into seven domains: Spark Architecture (20%), Using Spark SQL (20%), Developing DataFrame/DataSet API Applications (30%), Troubleshooting & Tuning (10%), Structured Streaming (10%), Spark Connect (5%), and Pandas API on Spark (5%). Additional unscored items may appear. The DataFrame API domain is the heaviest, so prioritize it.

Do I need experience or prerequisites to take it?

There are no formal prerequisites, so anyone can register. Databricks recommends about six months of hands-on Spark experience. Solid Python plus a working grasp of the DataFrame API, Spark SQL, and Spark’s execution model (lazy evaluation, shuffles, actions vs. transformations) closes most of the gap. If you are new to Spark, plan a few weeks of daily hands-on PySpark practice.

How does it differ from the Databricks platform certifications?

This is a framework exam about Apache Spark itself — the DataFrame API, Spark SQL, the execution model, streaming, and Spark Connect — rather than a Databricks-platform exam. The Data Engineer Associate and Professional focus on building and operating pipelines on the Databricks platform (Delta Lake, Unity Catalog, Lakeflow), and the Data Analyst Associate focuses on Databricks SQL analytics. Choose this one to validate core Spark coding skills.

Start your free 24-hour practice trial

Full access to the question bank, both study modes, and domain-level scoring across all seven exam areas. No credit card required.

Start free trial →