Databricks · Practice Exam · Updated for 2026

Databricks Certified Machine Learning Associate Practice Exam

Practice across all four exam domains — the Databricks ML platform, ML workflows, model development, and model deployment — with AutoML, MLflow, Feature Store, and Spark ML scenarios. Get immediate feedback in Learn mode and a full 90-minute simulation in Exam mode. Start with a 24-hour free trial.

500+ practice questions · 4 exam domains covered · 2 study modes · 24h free trial access

Exam at a glance

Exam: Databricks Certified Machine Learning Associate
Format: Multiple choice, proctored (online or test center)
Scored questions: 48 (additional unscored items may appear)
Time limit: 90 minutes
Registration fee: $200 USD, plus applicable local taxes
Prerequisites: None; related training highly recommended
Recommended experience: 6+ months of hands-on ML work on Databricks
Passing standard: Databricks does not publish a fixed numeric passing score
Validity: 2 years; recertify by taking the current exam
Languages: English, Japanese, Portuguese (BR), Korean
Blueprint edition: ML Associate Exam Guide (March 2025 edition)

Source: Databricks — Certified Machine Learning Associate · Exam Guide PDF

About this certification

The Machine Learning Associate is Databricks’ entry-level ML credential. It validates that you can carry out the everyday tasks of a machine learning practitioner on the Databricks platform: navigating Databricks Machine Learning and the ML Runtime, running AutoML, exploring data and engineering features with the Feature Store, training and tuning models with Spark ML, and tracking and deploying them with MLflow and Unity Catalog. It is aimed at people who already do basic ML work and want a platform-specific credential, rather than at researchers designing novel models.

The exam is practical and scenario-based rather than theory-heavy: questions describe a realistic workflow situation (a cluster choice, an AutoML result, an MLflow registry step) and ask for the correct Databricks action or function. All machine learning code in the exam is Python, though some non-ML data manipulation may be shown in SQL. For foundational reading on the ML lifecycle on Databricks, see the Machine Learning Learning Hub guide.

Exam domains and weights

The exam is divided into four domains. Weights are taken directly from the official Databricks exam page; approximate question counts are derived from the 48 scored questions and rounded.

Databricks Machine Learning

The largest domain. The Databricks ML environment: clusters and Repos, Databricks Runtime for ML, AutoML, Feature Store basics, and MLflow tracking, models, and the Model Registry.

Weight: 38% (~18 questions)

ML Workflows

Exploratory data analysis, feature engineering, training/validation/test splitting, and the decisions that shape a sound ML workflow.

Weight: 19% (~9 questions)

Model Development

Building models with Spark ML and scikit-learn, hyperparameter tuning (including Hyperopt), and model evaluation and selection.

Weight: 31% (~15 questions)

Model Deployment

Batch, streaming, and real-time deployment patterns, plus moving registered models through stages for production use.

Weight: 12% (~6 questions)
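
The approximate question counts listed for each domain can be reproduced with a few lines of arithmetic. The sketch below only illustrates the rounding of each weight against the 48 scored questions; the exact number of questions per domain on any given exam form may differ.

```python
# Domain weights as listed above (percent of the 48 scored questions).
SCORED_QUESTIONS = 48
WEIGHTS = {
    "Databricks Machine Learning": 38,
    "ML Workflows": 19,
    "Model Development": 31,
    "Model Deployment": 12,
}


def approx_question_counts(weights, total=SCORED_QUESTIONS):
    """Round each domain's share of the scored questions to a whole number."""
    return {domain: round(total * pct / 100) for domain, pct in weights.items()}


counts = approx_question_counts(WEIGHTS)
for domain, n in counts.items():
    print(f"{domain}: ~{n} questions")
```

With these weights the rounded counts happen to sum back to 48, which is not guaranteed for arbitrary weightings.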

Who this exam is for

This credential fits data scientists, ML practitioners, analysts moving into ML, and data engineers who run machine learning workloads on Databricks. There are no formal prerequisites, so anyone can register; in practice Databricks recommends around six months of hands-on experience with the ML tasks in the exam guide, basic Python and data-science fundamentals, and comfort with Delta Lake and SQL for data manipulation.

If your work is more focused on building data pipelines than on modeling, the Data Engineer Associate is a closer fit; if you already operate production ML and want to go deeper into MLOps and monitoring, the ML Professional is the next step up. For role-by-role salary ranges and career paths, see the Career Hub — Machine Learning Engineer role guide.

What this practice exam delivers

Learn mode

Answer one question at a time with the explanation revealed immediately — ideal for the Databricks Machine Learning domain, where knowing the exact tool, page, or function name is the whole point.

Exam mode

48 questions against a 90-minute timer — the real exam format. Build the pacing the scenario-based questions demand before test day.

Source-linked explanations

Every answer cites the Databricks documentation it derives from — AutoML, MLflow, Feature Store, Spark ML — so you can verify the reasoning and dig deeper.

Score by exam domain

Results break down across all four domains — platform, workflows, development, and deployment — so practice tells you exactly which area to study next.

Sample practice questions

Ten free questions spanning the four exam domains, each with a full explanation of why the other answers are wrong. The complete bank is available with the 24-hour trial.

Question 1 · Databricks Machine Learning

An engineer wants their notebooks to load the MLflow library without installing it on every cluster. What is the recommended approach?

  A. Add a line enabling Databricks Runtime ML to a cluster init script
  B. Select a Databricks Runtime ML version from the runtime dropdown when creating the cluster
  C. Set a runtime-version variable to "ml" in the Spark session
  D. Install MLflow into the workspace once and share it globally

Correct: B. Choosing a Databricks Runtime for Machine Learning version when creating the cluster pre-installs MLflow and the common ML libraries, so no per-cluster install is needed.

Why not the others: there is no init-script flag that "enables" Runtime ML (A); there is no runtime-version Spark variable that switches to ML (C); workspace-wide library sharing (D) is not how the ML Runtime is provided. The runtime is selected at cluster creation.

Source: Databricks — Databricks Runtime for ML →
Question 2 · Databricks Machine Learning

A data scientist uses AutoML for a regression problem. Which step must they perform outside the AutoML experiment itself?

  A. Model tuning
  B. Model evaluation
  C. Model deployment
  D. Exploratory data analysis

Correct: C — Model deployment. AutoML automates data exploration, feature processing, model training, tuning, and evaluation, and surfaces the best model with a notebook — but deploying that model is a separate step the practitioner performs afterward.

Why not the others: tuning (A), evaluation (B), and exploratory data analysis (D), which AutoML covers with a generated data-exploration notebook, all happen inside the AutoML experiment. Deployment is the one task left to the user.

Source: Databricks — AutoML → Further reading: PowerKram — AutoML on Databricks →
Question 3 · Databricks Machine Learning

After many runs in an MLflow experiment, an engineer wants to programmatically find the run_id with the lowest RMSE. Which approach works?

  A. Open each run in the UI and compare manually
  B. Use mlflow.search_runs() with an order_by on the RMSE metric and take the top row
  C. Delete all but the most recent run
  D. Re-run the experiment until RMSE is lowest

Correct: B. mlflow.search_runs() returns runs as a dataframe and accepts an order_by on a metric such as RMSE, so the best run_id is the first row of the sorted result.

Why not the others: manual UI comparison (A) is not programmatic; deleting runs (C) destroys history and does not identify the best one; re-running (D) does not query existing results. Searching runs by metric is the intended API.
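
The search-and-sort approach in the correct answer might look like the following minimal sketch. It assumes the metric was logged under the key "rmse" and that MLflow is installed (as it is on Databricks Runtime ML); the helper names `order_by_clause` and `best_run_id` are hypothetical, not part of MLflow.

```python
def order_by_clause(metric="rmse", ascending=True):
    """Build an MLflow order_by entry such as 'metrics.rmse ASC'."""
    direction = "ASC" if ascending else "DESC"
    return f"metrics.{metric} {direction}"


def best_run_id(experiment_names, metric="rmse"):
    """Return the run_id of the run with the lowest `metric`, or None if no runs exist."""
    import mlflow  # imported lazily so order_by_clause works without MLflow installed

    runs = mlflow.search_runs(
        experiment_names=experiment_names,   # list of experiment names or paths
        order_by=[order_by_clause(metric)],  # sort ascending so the lowest RMSE is first
        max_results=1,                       # only the top row is needed
    )
    return None if runs.empty else runs.iloc[0]["run_id"]
```

In a notebook this would be called with the experiment's own name, for example `best_run_id(["/Shared/my-experiment"])`, where the path is a placeholder.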

Source: Databricks — MLflow tracking → Further reading: PowerKram — MLflow Tracking & Experiments →
Question 4 · ML Workflows

During exploratory data analysis, a numeric feature has a few extreme values far outside the typical range. Which action is the most appropriate first step?

  A. Immediately delete every row containing any high value
  B. Investigate the outliers to decide whether they are errors or genuine signal before deciding how to handle them
  C. Replace all values with the column mean
  D. Ignore them; outliers never affect models

Correct: B. Sound EDA investigates outliers first — they may be data-entry errors to remove, or legitimate rare events that carry signal — and only then chooses removal, capping, or transformation.

Why not the others: blanket deletion (A) can discard valid data and bias the model; mean-replacing every value (C) destroys real variation; ignoring outliers (D) is false, since many models are sensitive to them. Understanding before acting is the workflow principle.

Source: Databricks — data exploration in ML →
Question 5 · ML Workflows

A team wants reusable, governed features shared across multiple models and available for both training and batch scoring. Which Databricks capability fits best?

  A. A one-off pandas dataframe saved in the notebook
  B. Databricks Feature Store / Feature Engineering in Unity Catalog
  C. A CSV file in cloud storage
  D. A temporary view recreated each run

Correct: B. The Feature Store (Feature Engineering in Unity Catalog) is built to register, version, and govern features for reuse across models, supplying the same definitions to training and to batch or online scoring.

Why not the others: a notebook dataframe (A), a CSV (C), and a temporary view (D) are all ad hoc, ungoverned, and not shareable or versioned across models and serving — the exact gap the Feature Store fills.

Source: Databricks — Feature Store → Further reading: PowerKram — Feature Engineering on Databricks →
Question 6 · Model Development

An engineer wants efficient, parallelized hyperparameter search over a large space rather than exhaustively trying every combination. Which tool is designed for this on Databricks?

  A. A manual for-loop over hand-picked values
  B. Hyperopt with its search algorithms (e.g., Tree of Parzen Estimators)
  C. Increasing the cluster size only
  D. Training once with default parameters

Correct: B. Hyperopt performs informed hyperparameter search (such as TPE) and parallelizes trials on Databricks, making it the intended tool for tuning over a large space efficiently.

Why not the others: a manual loop over hand-picked values (A) is effectively a small manual grid search and scales poorly; a bigger cluster (C) adds compute but no search strategy; default parameters (D) skip tuning entirely. The exam expects informed search, not brute force.
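
As a rough illustration of the pattern, the sketch below runs a TPE search with trials distributed through SparkTrials. It assumes a cluster where the hyperopt package is installed, and it uses a toy quadratic in place of a real validation loss; the function names `objective` and `tune` and the search range are illustrative choices only.

```python
def objective(params):
    """Toy objective with its minimum at x = 3; stands in for a validation loss."""
    x = params["x"]
    # "ok" is the value of hyperopt.STATUS_OK, written literally so this
    # function can be exercised without hyperopt installed.
    return {"loss": (x - 3.0) ** 2, "status": "ok"}


def tune(max_evals=32, parallelism=4):
    """Run a TPE search, evaluating trials in parallel on Spark workers."""
    from hyperopt import SparkTrials, fmin, hp, tpe  # lazy import; needs hyperopt and Spark

    space = {"x": hp.uniform("x", -10.0, 10.0)}  # continuous search range for x
    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,                             # Tree of Parzen Estimators
        max_evals=max_evals,                          # total number of trials
        trials=SparkTrials(parallelism=parallelism),  # trials run concurrently
    )
```

`tune()` returns a dictionary of the best parameter values found. Higher parallelism finishes sooner but gives TPE fewer completed trials to learn from before it proposes each new one.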

Source: Databricks — Hyperparameter tuning → Further reading: PowerKram — Hyperparameter Tuning with Hyperopt →
Question 7 · Model Development

A data scientist uses Spark SQL to load data, then trains a model with Spark ML on a large distributed dataset. Which compute configuration is most appropriate?

  A. A Single Node cluster
  B. A multi-node (standard) cluster so Spark ML can distribute the work
  C. A SQL warehouse only
  D. A local laptop runtime

Correct: B. Spark ML is a distributed library, so a multi-node cluster lets training scale across workers — the right fit for large datasets processed with Spark.

Why not the others: a Single Node cluster (A) suits small-data, single-machine libraries like scikit-learn, not distributed Spark ML at scale; a SQL warehouse (C) serves SQL analytics, not model training; a laptop runtime (D) cannot handle the distributed workload. Match compute to the library and data size.

Source: Databricks — train models / compute →
Question 8 · Model Development

An engineer has the best run's run_id and the logged model name "model", and wants to register it in the MLflow Model Registry as "best_model". Which approach is correct?

  A. Copy the model files manually into a new folder named best_model
  B. Call mlflow.register_model() with the model URI runs:/<run_id>/model and the name "best_model"
  C. Rename the experiment to best_model
  D. Re-train the model and save it locally as best_model.pkl

Correct: B. Registering a logged model uses its URI — runs:/<run_id>/model — passed to mlflow.register_model() (or the registry API) with the desired registered name, creating a versioned entry.

Why not the others: manually copying files (A) bypasses the registry and its versioning; renaming the experiment (C) does not register a model; re-training to a local pickle (D) abandons MLflow lineage and the registry. The registry is the system of record.
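
A minimal sketch of that registration call follows. It assumes MLflow is installed and that the model was logged under the artifact path "model", as in the question; `run_model_uri` and `register_best_model` are hypothetical helper names. In a Unity Catalog registry the registered name would typically be a three-level name rather than the bare "best_model" used here.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged in a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(run_id, registered_name="best_model"):
    """Register the run's logged model, creating a new version under registered_name."""
    import mlflow  # imported lazily so run_model_uri works without MLflow installed

    return mlflow.register_model(
        model_uri=run_model_uri(run_id),
        name=registered_name,
    )
```

The call returns a model version object, so the new version number can be read from the result instead of being looked up separately.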

Source: Databricks — MLflow Model Registry →
Question 9 · Model Deployment

A model must score a large Delta table on a nightly schedule, with results written back to a table — low-latency real-time responses are not required. Which deployment pattern fits best?

  A. A real-time Model Serving REST endpoint
  B. Batch inference, applying the registered model to the table on a schedule
  C. An always-on streaming endpoint per request
  D. Embedding the model in a mobile app

Correct: B. Scheduled scoring of a large table with no latency requirement is the textbook case for batch inference — load the registered model, score the Delta table on a job schedule, and write results back.

Why not the others: a real-time endpoint (A) and a per-request streaming endpoint (C) add cost and complexity that a nightly batch does not need; embedding in a mobile app (D) does not match a server-side table-scoring job. Match the serving pattern to the latency requirement.
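
The load, score, and write-back steps could be sketched as below. This assumes an active SparkSession, MLflow installed, and a model whose inputs match the listed feature columns; the table names, column names, and the helpers `registered_model_uri` and `score_table` are placeholders. Scheduling is left to a Databricks job and is not shown.

```python
def registered_model_uri(name, version):
    """URI for a specific registered model version, e.g. 'models:/best_model/3'."""
    return f"models:/{name}/{version}"


def score_table(spark, model_uri, source_table, target_table, feature_cols):
    """Apply a registered model to every row of a table and overwrite the output table."""
    import mlflow.pyfunc  # lazy import; needs MLflow and a running Spark session

    # Wrap the model as a Spark UDF so scoring is distributed across the cluster.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
    scored = spark.table(source_table).withColumn("prediction", predict(*feature_cols))
    scored.write.mode("overwrite").saveAsTable(target_table)
    return scored
```

A nightly job might call `score_table(spark, registered_model_uri("best_model", 3), "sales.features", "sales.predictions", ["f1", "f2"])`, where every argument is illustrative.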

Source: Databricks — model inference (batch) → Further reading: PowerKram — Model Deployment Patterns →
Question 10 · Model Deployment

A new model version registered in the MLflow Model Registry has passed all tests, and the team wants to promote it for production use. What is the correct registry action?

  A. Delete the previous version and re-upload the new one
  B. Transition the model version to the Production stage (or assign the production alias) in the Model Registry
  C. Rename the model to include "prod" in the title
  D. Export the model and email it to the ops team

Correct: B. Promotion is done inside the registry by transitioning the version to the Production stage (or, in the Unity Catalog model registry, assigning a production alias), which keeps lineage and lets downstream jobs reference the production version.

Why not the others: deleting and re-uploading (A) loses version history; renaming with "prod" (C) is cosmetic and not how stages work; emailing the model file (D) bypasses governance entirely. Stage/alias transitions are the supported mechanism.
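
Both promotion mechanisms can be sketched with the MLflow client. The example assumes MLflow is installed; the alias name "production" and the helpers `alias_uri` and `promote` are illustrative choices. Stage transitions apply to the workspace registry, while aliases are the mechanism described above for Unity Catalog.

```python
def alias_uri(name, alias):
    """URI that resolves to whichever version currently holds the alias."""
    return f"models:/{name}@{alias}"


def promote(name, version, use_alias=True, alias="production"):
    """Promote a model version by assigning an alias or by a stage transition."""
    from mlflow.tracking import MlflowClient  # lazy import; needs MLflow installed

    client = MlflowClient()
    if use_alias:
        # Point the alias at this version; downstream jobs load alias_uri(name, alias).
        client.set_registered_model_alias(name, alias, str(version))
    else:
        # Workspace-registry style stage transition.
        client.transition_model_version_stage(
            name=name, version=str(version), stage="Production"
        )
```

Downstream jobs that load `alias_uri(name, "production")` pick up a newly promoted version without any code change.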

Source: Databricks — Model Registry stages →

Keep going: Learning & Career resources

This certification pays off fastest when it sits on top of real platform skills and a clear sense of where the role leads. Two PowerKram hubs back this exam up.

Deep dive: exam structure, scoring, study path & recertification

Exam structure and how it’s scored

The exam delivers 48 scored multiple-choice questions in 90 minutes; additional unscored items may appear for calibration, with extra time factored in. Databricks does not publish a fixed numeric passing score on the official exam page, and your result is reported as pass or fail. Questions are scenario-based — they describe a workflow situation and ask for the correct Databricks tool, page, or function — so familiarity with the actual UI and APIs matters as much as ML theory. Read the exam-format deep dive →

What the four domains actually test

Databricks Machine Learning (38%) is the heaviest domain and centers on the platform itself — clusters and the ML Runtime, AutoML, Feature Store basics, and MLflow tracking and the Model Registry. Model Development (31%) covers building, tuning (Hyperopt), and evaluating models with Spark ML and scikit-learn. ML Workflows (19%) covers EDA, feature engineering, and data splitting. Model Deployment (12%) is the smallest but still tested through batch, streaming, and real-time patterns and registry stage transitions. Read the Databricks ML toolchain guide →

Realistic study path

Most candidates report a few weeks of preparation, scaled by hands-on experience. A workable plan: take the Databricks Academy ML learning path, then build a small end-to-end project — explore and feature-engineer a dataset, run AutoML, train and tune a Spark ML or scikit-learn model, log everything to MLflow, and register and deploy the best model. Watch for renamed or evolving services (for example Feature Store versus Feature Engineering in Unity Catalog) and practice in the actual UI, since some questions are UI-navigation specific. Read the study plan →

Cost, scheduling, and delivery

The registration fee is $200 USD plus applicable local taxes. The exam is proctored, can be taken online or at a test center, and is offered in English, Japanese, Portuguese (Brazil), and Korean. Online delivery requires a quiet private space and a system check through the proctoring provider. Databricks periodically offers discount vouchers through learning events. Verify current fees and scheduling on Databricks’ official certification page before booking.

Recertification

The certification is valid for two years. To stay certified you retake and pass the current version of the exam before it expires — there is no continuing-education-credit alternative. Because Databricks refreshes the exam to track platform changes (service renames and new capabilities appear between editions), recertifying also keeps your validated skills current. Read the recertification guide →

Career outlook

Machine learning on Databricks is widely used across industries that run large-scale data and ML workloads, and a platform-specific associate credential signals practical, hands-on competence rather than only theory. The credential is most valuable when paired with demonstrable project work and, over time, with the ML Professional certification for those moving into production MLOps. For salary ranges and role-specific paths, see the Career Hub's Machine Learning Engineer role guide.

Frequently asked questions

Is the Databricks Machine Learning Associate exam hard?

It is an associate-level exam and most candidates find it fair with focused preparation, but it is not a pure-theory test. The 48 questions are scenario-based and often ask for the specific Databricks tool, UI page, or function name for a task, so candidates who only study ML concepts in the abstract tend to struggle on the platform-specific Databricks Machine Learning domain, which is the largest at 38%.

What is the passing score?

Databricks does not publish a fixed numeric passing score on the official exam page; results are reported as pass or fail based on overall performance across all questions. You may see "70%" quoted on third-party sites, but that figure is not confirmed by Databricks, so treat it as unofficial and aim to be comfortable across every domain rather than targeting a specific percentage.

Do I need experience or prerequisites to take it?

There are no formal prerequisites, so anyone can register. Databricks recommends about six months of hands-on experience with the ML tasks in the exam guide, along with basic Python and data-science fundamentals and comfort with Delta Lake and SQL. Building one small end-to-end project on Databricks closes most of the gap if you are newer to the platform.

Which tools and topics should I focus on most?

Prioritize the Databricks Machine Learning platform itself — the ML Runtime, AutoML, Feature Store, and especially MLflow tracking and the Model Registry — since that domain is 38% of the exam. Then Model Development (31%): Spark ML, scikit-learn, and hyperparameter tuning with Hyperopt. ML Workflows (19%) and Model Deployment (12%) round it out. Practicing in the actual Databricks UI helps, because some questions test UI navigation.

How does it differ from the ML Professional and Data Engineer Associate exams?

The ML Associate covers basic, hands-on ML tasks across the lifecycle at an entry level. The ML Professional goes deeper into MLOps, monitoring, and production deployment and is the natural next step. The Data Engineer Associate focuses on ETL pipelines, Delta Lake, and workflows rather than modeling, so it is a better fit if your work is pipeline-centric rather than ML-centric.

Is the certification worth it?

For data scientists and ML practitioners working on Databricks, it is a credible, platform-specific signal of hands-on competence and a common stepping stone toward the ML Professional credential. Its value is highest when paired with demonstrable project work. With a $200 registration fee, the main cost is study time, and the certification is valid for two years before you need to recertify.

Start your free 24-hour practice trial

Full access to the question bank, both study modes, and domain-level scoring across all four exam areas. No credit card required.

Start free trial →