Mastering Data Preparation and Feature Engineering for Machine Learning

A Cross-Vendor Training Guide | by Synchronized Software L.L.C. | 1/22/2026

Certification alignment: AWS ML Specialty, Google ML Engineer, Azure DP-100, CompTIA Data+, Salesforce Agentforce Specialist

Introduction

Data preparation and feature engineering often consume 60–80% of a machine learning project’s time. Yet this work is what separates successful ML projects from failures. As the saying goes: “Garbage in, garbage out.”

This guide covers essential techniques for preparing data and engineering features that will dramatically improve your model performance across all major cloud platforms. Whether you are preparing for certification or building production ML systems, the techniques in this article are universally applicable.

 

  ⚡ Why This Matters

  According to a 2024 Anaconda survey, data scientists spend an average of 45% of their time on data preparation alone. Organizations that invest in systematic data preparation practices see 3–5x improvements in model accuracy compared to those that skip directly to model training.

Business use case: E-commerce personalization at scale

Consider a mid-size e-commerce company with 2 million customers and 50,000 SKUs. Their raw data includes transaction logs, clickstream events, product catalog data, customer profiles, and seasonal marketing campaigns. Before any recommendation model can be built, the data team must:

1. Deduplicate customer records across mobile app and web sessions (data cleaning)
2. Impute missing product categories where catalog data is incomplete (missing data handling)
3. Normalize price fields across multiple currencies (data transformation)
4. Engineer features like “average basket size”, “days since last purchase”, and “category affinity score” (feature engineering)
5. Select the top 50 features from 300+ candidates to avoid overfitting (feature selection)

Without this preparation pipeline, the recommendation model would produce irrelevant suggestions, damaging customer trust and reducing conversion rates.

 

The Data Pipeline

Understanding the ML data flow

The ML data pipeline follows a predictable sequence from raw data to model-ready features. Understanding each stage helps you plan your work and estimate timelines accurately.

 

| Stage | Challenge | Key Activities | Typical Time |
|---|---|---|---|
| Collection | Data scattered across sources | ETL, API integration, streaming | 10–15% |
| Cleaning | Missing values, errors, duplicates | Imputation, validation, deduplication | 20–30% |
| Transformation | Wrong format, scale, encoding | Type conversion, scaling, encoding | 15–20% |
| Feature Engineering | Raw data ≠ useful features | Creation, selection, extraction | 20–30% |
| Validation | Quality assurance | Distribution checks, schema tests | 5–10% |

 

Vendor data pipeline services

| Vendor | Service | Purpose | Best For |
|---|---|---|---|
| AWS | Glue + SageMaker Data Wrangler | ETL, visual data preparation | Large-scale batch processing |
| Google | Dataflow + Vertex AI Pipelines | Stream/batch processing | Real-time ML pipelines |
| Microsoft | Azure Data Factory + Azure ML | Data integration, preparation | Enterprise data estates |
| Salesforce | Data Cloud + MuleSoft | CRM data integration | Customer 360 and AI-powered CRM |
| Databricks | Delta Lake + Feature Store | Unified analytics | Lakehouse architecture |

 

  🏢 Business use case: Financial fraud detection

  A regional bank processes 5 million transactions daily. Their fraud detection pipeline ingests transaction data from core banking, card processor feeds, and customer behavior logs. AWS Glue handles ETL from 12 source systems, SageMaker Data Wrangler provides visual data quality analysis, and the Feature Store serves real-time features like “transaction velocity” and “geographic anomaly score” to the fraud model with sub-100ms latency.

 

Exploratory Data Analysis (EDA)

Before any data preparation, understand your data through EDA. Skipping EDA is the most common mistake made by junior data scientists — it leads to wasted effort on irrelevant transformations and missed insights about data quality.

Key questions to answer

1. What is the shape of the data? — Rows, columns, memory usage
2. What are the data types? — Numeric, categorical, datetime, text
3. What is the distribution? — Mean, median, mode, variance, skewness
4. Are there missing values? — Count, percentage, patterns (MCAR, MAR, MNAR)
5. Are there outliers? — Extreme values that could affect modeling
6. What are the relationships? — Correlations, interactions, multicollinearity
7. What is the target distribution? — Balanced or imbalanced classes
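Most of these questions can be answered with a few lines of pandas. The sketch below uses a small hypothetical customer table; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical customer dataset for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [40_000, 52_000, 88_000, 61_000, np.nan, 75_000],
    "segment": ["A", "B", "A", "A", "C", "B"],
    "churned": [0, 0, 1, 0, 1, 0],
})

print(df.shape)                                    # 1. rows, columns
print(df.dtypes)                                   # 2. data types
print(df.describe())                               # 3. distribution summary
print(df.isna().mean())                            # 4. fraction missing per column
print(df.select_dtypes("number").corr())           # 6. correlations
print(df["churned"].value_counts(normalize=True))  # 7. target balance
```

From here, any surprising number (for example, a high missing fraction or a near-perfect correlation) becomes the starting point for the deeper checks described below.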

 

Statistical summary for numerical features

| Statistic | What It Tells You | When to Worry |
|---|---|---|
| Count | Number of non-null values | Differs significantly between features |
| Mean | Central tendency (sensitive to outliers) | Very different from median |
| Median | Central tendency (robust to outliers) | Doesn’t match business expectations |
| Std Dev | Spread of values | Very large relative to mean |
| Min/Max | Range of values | Physically impossible values |
| Quartiles | Distribution shape | Large gap between Q3 and Max |
| Skewness | Asymmetry of distribution | \|Skewness\| > 2 (highly skewed) |

Visualization techniques

Univariate analysis (single variable)

- Histograms: Distribution shape, bin width selection, normality assessment
- Box plots: Median, quartiles, outlier identification in one view
- Bar charts: Category frequencies, class imbalance detection
- KDE plots: Smooth density estimation for continuous variables

 

Bivariate analysis (two variables)

- Scatter plots: Linear and non-linear relationships, clusters
- Correlation heatmaps: Feature-to-feature and feature-to-target relationships
- Grouped bar charts: Categorical comparisons across segments
- Violin plots: Distribution comparison across categories

 

Multivariate analysis

- Pair plots: All pairwise relationships at a glance
- Parallel coordinates: High-dimensional pattern discovery
- t-SNE / UMAP: Non-linear dimensionality reduction for cluster visualization
- Andrews curves: Multivariate data represented as curves for pattern recognition

 

  🏥 Business use case: Healthcare patient readmission prediction

  A hospital network analyzed EDA results on 250,000 patient discharge records and discovered that 34% of “length of stay” values were missing. Further investigation revealed the missingness was MAR — patients transferred to other facilities had systematically missing stay durations. This insight led the team to use transfer status as a predictor variable and apply regression-based imputation, improving their readmission model’s AUC from 0.72 to 0.81.

Vendor EDA tools

| Vendor | Tool | Key Capability |
|---|---|---|
| AWS | SageMaker Data Wrangler | 300+ built-in analyses, bias detection |
| Google | Vertex AI Workbench | Jupyter-native, BigQuery integration |
| Microsoft | Azure ML Studio | Automated profiling, drift detection |
| Salesforce | Tableau + Einstein Discovery | Visual analytics with AI-powered insights |
| Open Source | Pandas Profiling / ydata-profiling | One-line comprehensive EDA reports |

Handling Missing Data

Missing values are ubiquitous in real-world data and must be addressed before modeling. The strategy you choose should be informed by why the data is missing, not just how much is missing.

 

Types of missing data

Missing Completely at Random (MCAR)

Missingness is independent of all variables. Example: Random sensor failures in IoT data. Safe to use any imputation method; listwise deletion is valid.

Missing at Random (MAR)

Missingness depends on observed variables. Example: Younger survey respondents less likely to report income. Use observed variables to inform imputation; multiple imputation recommended.

Missing Not at Random (MNAR)

Missingness depends on the missing value itself. Example: High earners choosing not to report income. Most challenging; may need domain expertise, sensitivity analysis, or specialized models.

 

Decision framework for missing data

| % Missing | Recommendation | Methods | Risk Level |
|---|---|---|---|
| < 5% | Usually safe to impute | Mean, median, mode | Low |
| 5–15% | Impute + create missing indicator feature | KNN, regression, MICE | Medium |
| 15–30% | Advanced imputation; validate impact | Multiple imputation, MICE | High |
| > 30% | Consider dropping or domain-specific approach | Domain rules, model-based | Very High |

 

Simple imputation methods

| Method | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Mean | Numerical, symmetric | Simple, preserves mean | Reduces variance, distorts distribution |
| Median | Numerical, skewed | Robust to outliers | May not capture relationships |
| Mode | Categorical | Preserves most common value | May create artificial peak |
| Constant | Domain-specific | Explicit, interpretable | Requires domain knowledge |
| Forward/Back Fill | Time series | Preserves temporal order | Can propagate errors |

 

Advanced imputation methods

| Method | Description | When to Use | Complexity |
|---|---|---|---|
| KNN Imputation | Impute based on K similar records | Multivariate relationships matter | Medium |
| Regression | Predict missing from other features | Strong linear relationships | Medium |
| Multiple Imputation | Create M imputed datasets, pool results | Uncertainty quantification needed | High |
| MICE | Iterative chained equations | Multiple columns with missing values | High |
| Deep Learning | Autoencoders, GANs for imputation | Complex non-linear patterns | Very High |
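As a sketch of how simple and multivariate imputation differ in practice, the example below compares median imputation with KNN imputation using scikit-learn on a tiny hypothetical matrix. In this toy case both land on the same value; the point is that KNN uses similarity across features, while the median ignores them.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical feature matrix with a missing value (np.nan)
X = np.array([
    [1.0, 100.0],
    [2.0, 110.0],
    [3.0, np.nan],   # value to impute
    [4.0, 130.0],
    [100.0, 140.0],  # dissimilar row: KNN will ignore it
])

# Median imputation: one global statistic per column
median_imp = SimpleImputer(strategy="median")
X_median = median_imp.fit_transform(X)

# KNN imputation: average of the 2 most similar rows' values
knn_imp = KNNImputer(n_neighbors=2)
X_knn = knn_imp.fit_transform(X)

print(X_median[2, 1])  # median of the observed column-1 values
print(X_knn[2, 1])     # mean of the 2 nearest neighbors in column 1
```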

 

  🚗 Business use case: Automotive insurance risk scoring

  An insurance company discovered that 22% of their “vehicle mileage” field was missing. Using MICE imputation — incorporating vehicle age, urban/rural indicator, and policy type as predictors — they recovered this critical risk variable. The imputed mileage feature became the third most important predictor in their risk model, improving loss ratio predictions by 12% compared to simply dropping the records.

 

Handling Outliers

Outliers are data points significantly different from other observations. They can be legitimate extreme values, data errors, or indicators of rare but important phenomena.

 

Detecting outliers

Z-Score method

Formula: z = (x – μ) / σ — Flag as outlier if |z| > 3. Assumes normal distribution; sensitive to extreme outliers themselves.

Interquartile Range (IQR) method

Formula: IQR = Q3 – Q1 — Outlier if x < Q1 – 1.5×IQR or x > Q3 + 1.5×IQR. Robust to extreme values; works for skewed distributions.

Isolation Forest (advanced)

An unsupervised ensemble method that isolates outliers by randomly partitioning the feature space. Effective for high-dimensional data where simple statistical methods fail.
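A minimal sketch of the two statistical detectors on a hypothetical sample. It also demonstrates the caveat noted above: the z-score method can miss an outlier because the outlier itself inflates the standard deviation, while the IQR method stays robust.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 14, 120.0])  # 120 is a planted outlier

# Z-score method: flag |z| > 3 (assumes roughly normal data)
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)    # empty: the outlier inflated the std dev, so |z| < 3
print(iqr_outliers)  # [120.]: quartiles are unaffected by the extreme value
```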

 

Handling strategies

| Strategy | Description | When to Use | Impact |
|---|---|---|---|
| Remove | Delete outlier rows | Confirmed errors or different population | Reduces dataset size |
| Cap / Winsorize | Replace with boundary values (e.g., 1st/99th percentile) | Reduce influence while keeping data | Preserves record count |
| Transform | Log, square root, Box-Cox | Reduce skewness | Changes distribution shape |
| Bin | Convert to categories | When exact value is less important | Loses granularity |
| Separate Model | Build specific model for outlier segment | Outliers represent a valid sub-population | More complex pipeline |
| Keep | Do nothing | Valid data; using robust algorithms (trees) | No action needed |

 

Data Transformation

Transform data into formats suitable for machine learning algorithms. The choice of transformation depends on your algorithm and the characteristics of your features.

 

Scaling numerical features

Most ML algorithms perform better when features are on similar scales. Distance-based and gradient-based algorithms are particularly sensitive to feature scale.

 

| Technique | Formula | Range | Best For |
|---|---|---|---|
| Min-Max Scaling | (x – min) / (max – min) | [0, 1] | Neural networks, KNN, image data |
| Standardization (Z-score) | (x – mean) / std | ~[–3, 3] | Linear models, SVM, PCA |
| Robust Scaling | (x – median) / IQR | Varies | Data with outliers |
| Log Transform | log(x + 1) | Varies | Right-skewed distributions |
| Power Transform (Box-Cox) | Automatic λ selection | Varies | Any non-normal distribution |
| Quantile Transform | Map to uniform/normal | [0, 1] or ~[–3, 3] | Non-parametric normalization |

 

When to scale (algorithm cheat sheet)

| Algorithm | Scaling Required? | Recommended Method | Reason |
|---|---|---|---|
| Linear / Logistic Regression | Yes | Standardization | Gradient descent convergence |
| SVM | Yes | Standardization | Distance-based kernel |
| K-Means, KNN | Yes | Either | Distance-based |
| Decision Trees / Random Forest | No | N/A | Split-based (scale invariant) |
| Gradient Boosting (XGBoost) | No | N/A | Tree-based |
| Neural Networks | Yes | Min-Max or Standardization | Activation function ranges |
| PCA | Yes | Standardization | Variance-based decomposition |

 

  ⚠️ Critical rule

  Always fit the scaler on training data only, then transform both training and test sets using the training scaler. Fitting on all data causes data leakage — the scaler “sees” test set statistics, artificially inflating performance metrics.
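A sketch of this rule with scikit-learn's `StandardScaler` on synthetic data: the scaler is fitted on the training partition only and then applied to both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix for illustration
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Correct order: fit on the training split, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses TRAINING mean/std only

# The scaler never saw the test rows, so there is no leakage
print(scaler.mean_)  # statistics derived from X_train alone
```

Calling `fit` (or `fit_transform`) on the full dataset before splitting is the leaky version of this pipeline.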

 

Encoding Categorical Variables

Machine learning algorithms require numerical input. Categorical variables must be encoded, and the choice of encoding method significantly impacts model performance.

 

Types of categorical variables

- Nominal — No inherent order. Examples: colors (Red, Blue, Green), country, product category.
- Ordinal — Natural order exists. Examples: size (S < M < L), education level, satisfaction rating.

 

Encoding techniques comparison

| Technique | How It Works | Best For | Watch Out For |
|---|---|---|---|
| One-Hot | Binary column per category | Nominal, < 15 categories | High dimensionality (curse of dimensionality) |
| Label Encoding | Integer per category | Ordinal, tree-based models | Implies false ordinality for nominal vars |
| Target (Mean) Encoding | Replace with target mean | High cardinality (100+) | Overfitting, data leakage risk |
| Frequency Encoding | Replace with category count | When frequency is meaningful | Different categories with same count |
| Binary Encoding | Label → binary digits | Medium cardinality (10–100) | Less interpretable |
| Embedding | Learned dense vectors | Very high cardinality (1000+) | Requires neural network training |
| Hash Encoding | Hash function to fixed dims | Extremely high cardinality | Hash collisions |

 

Encoding decision guide

| Cardinality | Variable Type | Recommended Encoding | Example |
|---|---|---|---|
| 2 (binary) | Any | Binary (0/1) | Gender, Yes/No flags |
| 3–10 | Nominal | One-Hot Encoding | Color, Region, Department |
| 3–10 | Ordinal | Label Encoding (ordered) | Size, Rating, Education |
| 11–100 | Any | Target or Binary Encoding | City, Product sub-category |
| 100–1000 | Any | Target, Frequency, or Hashing | ZIP code, Company name |
| 1000+ | Any | Embedding or Hash Encoding | User ID, URL, free-text category |
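As a rough illustration of two rows of this guide, the snippet below one-hot encodes a low-cardinality column with pandas and applies a hand-rolled target (mean) encoding to a higher-cardinality one. The column names and values are hypothetical; in production, target means should be computed inside cross-validation folds to avoid leakage.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # low cardinality: one-hot
    "city":  ["NYC", "LA", "NYC", "Chicago"],  # imagine 100s of cities: target encoding
    "sales": [10.0, 20.0, 14.0, 8.0],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each city with its mean target value
city_means = df.groupby("city")["sales"].mean()
df["city_encoded"] = df["city"].map(city_means)

print(one_hot.columns.tolist())
print(df["city_encoded"].tolist())  # NYC → 12.0, LA → 20.0, Chicago → 8.0
```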

 

  🛒 Business use case: Retail demand forecasting

  A national grocery chain needed to predict weekly demand for 45,000 SKUs across 800 stores. The “store_id” field had 800 categories and “sku_id” had 45,000. One-hot encoding would have created 45,800 new columns. Instead, the team used target encoding for store_id (mean weekly sales per store) and learned embeddings for sku_id via a neural network. This reduced the feature space by 99.5% while capturing meaningful relationships, cutting training time from 14 hours to 45 minutes.

 

Feature Engineering

Feature engineering creates new features from existing data to improve model performance. A well-engineered feature can be worth 10× more than a sophisticated algorithm. This is where domain knowledge meets data science.

 

Numerical feature engineering

Mathematical transformations

| Transformation | Formula | Use Case | Business Example |
|---|---|---|---|
| Log | log(x + 1) | Right-skewed data | Income, transaction amounts, page views |
| Square Root | √x | Count data, reduce skew | Number of support tickets, defect counts |
| Square | x² | Amplify differences | Distance calculations, penalty terms |
| Reciprocal | 1/x | Rate conversions | Speed → time, frequency → period |
| Box-Cox | Automatic λ | Normalize any distribution | Any non-normal continuous feature |

 

Binning (discretization)

Convert continuous variables to categories. Methods include equal-width bins, equal-frequency (quantile) bins, and domain-driven bins. Particularly useful for creating interpretable features for business stakeholders.

 

Polynomial features

Create interaction and power terms: x₁, x₂ → x₁, x₂, x₁², x₂², x₁×x₂. Captures non-linear relationships but creates exponential growth in features. Use with regularization (L1/L2) to prevent overfitting.

 

Date/time feature engineering

Temporal features are among the most powerful in business applications. Extract meaningful components from timestamps:

 

| Feature | Example | Captures | Business Application |
|---|---|---|---|
| Year | 2024 | Long-term trends | Revenue forecasting, market growth |
| Month | 6 | Seasonality | Retail demand, energy consumption |
| Day of Week | Monday | Weekly patterns | Call center staffing, ad performance |
| Hour | 14 | Daily patterns | Website traffic, trading volume |
| Is Weekend | True/False | Behavioral differences | E-commerce conversion, support volume |
| Is Holiday | True/False | Special events | Shipping delays, sales spikes |
| Days Since Event | 45 | Recency effects | Customer churn, campaign response |
| Quarter End | True/False | Business cycles | Sales pipeline, financial reporting |

 

Cyclical encoding tip: For periodic features like month or day of week, use sine/cosine encoding: sin_month = sin(2π × month / 12), cos_month = cos(2π × month / 12). This makes December (12) close to January (1) in feature space.
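The tip above can be verified in a few lines; the helper function here is illustrative, not a library API:

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g., month 1-12) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

dec = cyclical_encode(12, 12)  # December
jan = cyclical_encode(1, 12)   # January

# Euclidean distance in (sin, cos) space: December and January are neighbors,
# even though their raw values differ by 11
dist = math.dist(dec, jan)
print(round(dist, 3))  # ~0.518
```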

 

Text feature engineering

| Technique | How It Works | Complexity | Best For |
|---|---|---|---|
| Basic Statistics | Character count, word count, avg word length | Low | Quick signal extraction |
| Bag of Words | Count each word occurrence | Low | Simple text classification |
| TF-IDF | Frequency balanced by uniqueness | Medium | Standard text classification, search |
| Word2Vec / GloVe | Dense word vectors from co-occurrence | Medium | Semantic similarity, analogy tasks |
| FastText | Sub-word embeddings | Medium | Handling typos, rare words |
| BERT / Transformers | Contextual sentence embeddings | High | State-of-the-art NLP tasks |
| LLM Embeddings | GPT-4, Claude embeddings | High | Zero-shot, few-shot classification |

 

Aggregation features

Summarize groups of records to create entity-level features. These are critical in customer analytics, fraud detection, and any domain with transactional data.

 

| Aggregation | Example | Business Signal |
|---|---|---|
| Count | Number of transactions per customer | Engagement level |
| Sum | Total spend per customer per quarter | Customer lifetime value |
| Mean | Average order value | Purchase behavior |
| Min / Max | Largest and smallest purchase | Spending range, premium behavior |
| Std Dev | Variability in transaction amounts | Consistency of behavior |
| Time Since Last | Days since last purchase | Churn risk indicator |
| Trend | Month-over-month spend change | Growth or decline trajectory |
| Ratio | Returns / Total orders | Product satisfaction proxy |
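These aggregations map directly onto a pandas `groupby`. The toy transaction table below is hypothetical:

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.0, 50.0, 200.0, 180.0],
})

# Entity-level aggregation features: one row per customer
features = tx.groupby("customer_id")["amount"].agg(
    tx_count="count",    # engagement level
    total_spend="sum",   # lifetime value proxy
    avg_order="mean",    # purchase behavior
    spend_std="std",     # consistency of behavior
)

print(features)
```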

 

  ✈️ Business use case: Airline customer loyalty prediction

  An airline’s data science team engineered 120 features from raw booking and flight data. The top 5 features by importance were all aggregation-based: “total miles flown (12 months)”, “days since last flight”, “ratio of upgrades to total bookings”, “std dev of booking lead time”, and “count of international segments.” These five features alone explained 73% of the variance in loyalty tier transitions, demonstrating the power of thoughtful aggregation over raw transactional data.

 

Feature Selection

Not all features improve models. Feature selection identifies the most valuable ones to reduce overfitting, improve accuracy, decrease training time, and enhance interpretability. In production systems, fewer features also mean lower serving costs and latency.

 

Filter methods

Statistical tests independent of the model. Fast to compute but ignore feature interactions.

- Variance Threshold: Remove features with near-zero variance (no signal)
- Correlation Analysis: Remove one of two highly correlated features (r > 0.90)
- Chi-Square Test: Test independence between categorical features and target
- Mutual Information: Measure non-linear shared information between feature and target
- ANOVA F-test: Test whether means of target differ across feature levels

 

Wrapper methods

Use model performance to select features. More accurate but computationally expensive.

- Forward Selection: Start empty, add the feature that most improves performance each step
- Backward Elimination: Start with all features, remove the least impactful each step
- Recursive Feature Elimination (RFE): Train model, remove least important feature(s), repeat
- Exhaustive Search: Try all possible subsets (only feasible for small feature sets)

 

Embedded methods

Feature selection built into the training algorithm. Balances accuracy with efficiency.

- L1 Regularization (Lasso): Drives coefficients to exactly zero, performing automatic selection
- Tree Feature Importance: Rank features by their contribution to split quality (Gini / information gain)
- SHAP Values: Model-agnostic importance with theoretical guarantees from game theory
- Permutation Importance: Measure performance drop when a feature’s values are shuffled
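A sketch of embedded selection via L1 regularization: on synthetic data where only two of five features carry signal, Lasso drives the noise coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Target depends only on features 0 and 2; features 1, 3, 4 are pure noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# L1 penalty shrinks irrelevant coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

print(lasso.coef_.round(2))
print(selected)  # indices of the surviving features
```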

 

Feature selection decision matrix

| Scenario | Recommended Approach | Rationale |
|---|---|---|
| < 20 features | Backward elimination or RFE | Small enough for wrapper methods |
| 20–100 features | Filter + embedded (Lasso or tree importance) | Balance speed and accuracy |
| 100–1000 features | Filter first, then embedded | Filter removes obvious noise quickly |
| 1000+ features | Variance threshold → correlation → L1 | Aggressive reduction needed |
| Interpretability required | SHAP + domain expert review | Explainability is the priority |

 

Data Splitting

Properly splitting data is critical for unbiased model evaluation. The goal is to simulate how your model will perform on truly unseen data.

 

Standard splitting strategies

| Strategy | How It Works | When to Use |
|---|---|---|
| Train / Test (80/20) | Single random split | Large datasets, quick experiments |
| Train / Val / Test (60/20/20) | Three-way split | Hyperparameter tuning + final evaluation |
| K-Fold Cross-Validation | K splits, rotate test fold | Moderate datasets, robust estimation |
| Stratified K-Fold | K-Fold preserving class ratios | Imbalanced classification |
| Time-Based Split | Train on past, test on future | Time series, forecasting |
| Group K-Fold | Keep related records together | Multi-record entities (customers, patients) |

 

Splitting best practices

1. Maintain target distribution: Use stratified splitting for classification tasks
2. Respect temporal order: Never randomly split time series data; always use chronological splits
3. Keep groups together: All records for a single customer or patient must be in the same split
4. Hold out test set early: Never touch the test set until final evaluation
5. Document your split: Record random seeds and split logic for reproducibility
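The first practice, stratified splitting, can be sketched with scikit-learn on a synthetic imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced binary target: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 90/10 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())  # both 0.10
```

Without `stratify`, a random 20% sample could easily contain too few (or zero) minority-class examples, skewing the evaluation.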

 

Data leakage prevention

Data leakage is the single most common cause of unrealistically high model performance. It occurs when information from outside the training set influences the model.

 

  ⚠️ Common leakage sources

  1) Fitting scalers/encoders on all data before splitting. 2) Using future information for time series predictions. 3) Target encoding without proper cross-validation folds. 4) Features that directly encode the target (e.g., “approval_date” leaking into a loan approval model). 5) Aggregate features computed across train+test combined.

  🏦 Business use case: Credit scoring model leakage

  A consumer lending company built a credit risk model with 98% accuracy — suspiciously high. Investigation revealed that a “loan_grade” feature was included in training, which was itself a downstream output of the approval process. Removing this leaked feature dropped accuracy to 81%, which was the model’s true predictive power. The team then focused on engineering legitimate features from application-time data, ultimately reaching 85% accuracy without leakage.

 

Feature Stores

Feature stores are centralized repositories for ML features that solve critical production challenges: duplicated feature engineering code, training/serving skew, lack of feature discoverability, and inconsistent feature definitions across teams.

 

Why feature stores matter in production

- Consistency: Same feature definition used in training and real-time serving
- Reusability: Teams share features instead of rebuilding from scratch
- Point-in-time correctness: Historical features are computed as of the prediction time, preventing leakage
- Discoverability: Central catalog of available features with documentation and lineage
- Low-latency serving: Pre-computed features served in milliseconds for real-time predictions

 

Vendor feature store comparison

| Vendor | Service | Offline Store | Online Store | Key Differentiator |
|---|---|---|---|---|
| AWS | SageMaker Feature Store | S3 + Glue | DynamoDB | Deep SageMaker integration |
| Google | Vertex AI Feature Store | BigQuery | Bigtable | BigQuery SQL-native |
| Microsoft | Azure ML Feature Store | ADLS | Redis/Cosmos | MLflow integration |
| Databricks | Feature Store (Unity Catalog) | Delta Lake | Databricks SQL | Lakehouse-native |
| Open Source | Feast | Any data warehouse | Redis / DynamoDB | Vendor-agnostic, portable |

 

  🚀 Business use case: Real-time recommendation engine

  A streaming media company serves 50 million recommendations per day. Their feature store pre-computes 200+ user features (watch history aggregates, genre preferences, session behavior) and 80+ content features (popularity trends, freshness scores, engagement metrics). At serving time, the recommendation model fetches user and content features from the online store in < 10ms, combines them, and returns personalized rankings. Without the feature store, this would require re-computing features from raw event logs for each request — a 30-second operation compressed to 10 milliseconds.

 

End-to-End Pipeline Reference

This section provides a consolidated view of the complete data preparation pipeline, from raw data to model-ready features, with decision points at each stage.

 

| Step | Key Decision | Methods | Output |
|---|---|---|---|
| 1. Collect & Profile | What data do I have? | EDA, profiling reports | Data quality assessment |
| 2. Handle Missing | Why is it missing? | MCAR/MAR/MNAR analysis → imputation strategy | Complete dataset |
| 3. Handle Outliers | Error or valid extreme? | Z-score, IQR, Isolation Forest | Clean or transformed values |
| 4. Transform | What scale do I need? | Normalize, standardize, log-transform | Scaled features |
| 5. Encode | What type and cardinality? | One-hot, label, target, embedding | Numerical features |
| 6. Engineer | What signal is hidden? | Aggregations, temporal, text, interactions | New feature candidates |
| 7. Select | Which features help? | Filter, wrapper, embedded methods | Reduced feature set |
| 8. Split & Validate | Is my evaluation fair? | Stratified, temporal, grouped splits | Train/val/test sets |
| 9. Store & Serve | How to operationalize? | Feature store, pipeline orchestration | Production-ready pipeline |

 

Key Takeaways

1. Data preparation consumes 60–80% of ML project time but determines success or failure
2. EDA is non-negotiable — understand your data deeply before transforming it
3. Handle missing data based on why it’s missing, not just how much is missing
4. Scale features based on your algorithm’s requirements — and always fit on training data only
5. Match encoding strategy to cardinality and variable type — one-hot is not always the answer
6. Feature engineering is where domain knowledge creates competitive advantage
7. Feature selection prevents overfitting and reduces production serving costs
8. Data splitting done wrong invalidates your entire evaluation — especially watch for leakage
9. Feature stores bridge the gap between experimentation and production ML
10. The best practitioners iterate: EDA → transform → model → analyze errors → re-engineer features

 

Additional Learning Resources

Official vendor documentation

- AWS: docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html
- Google: cloud.google.com/architecture/data-preprocessing-for-ml-with-tf-transform-pt1
- Microsoft: learn.microsoft.com/azure/machine-learning/concept-data
- Salesforce: help.salesforce.com/s/articleView?id=sf.c360_a_data_cloud.htm
- Databricks: docs.databricks.com/machine-learning/feature-store/

 

Certification preparation

- AWS ML Specialty: aws.amazon.com/certification/certified-machine-learning-specialty/
- Google ML Engineer: cloud.google.com/learn/certification/machine-learning-engineer
- Azure DP-100: learn.microsoft.com/certifications/exams/dp-100
- CompTIA Data+: comptia.org/certifications/data
- Salesforce AI Associate: trailhead.salesforce.com/credentials/ai-associate

 

 

Article 3 of 5  |  AI/ML Foundations Training Series  |  Level: Beginner to Intermediate  |  Estimated Reading Time: 45 minutes  |  Last Updated: March 2026

 

Check Your Knowledge

A data scientist discovers that 22% of the “vehicle_mileage” field is missing in an automotive insurance dataset. Investigation shows that the missingness correlates with policy type — commercial policies are far less likely to have mileage recorded than personal policies. Which type of missing data does this represent, and what is the MOST appropriate imputation strategy?

A) MCAR — use mean imputation since the missingness is random.

B) MAR — use MICE imputation incorporating policy type and other observed variables as predictors.

C) MNAR — drop all records with missing mileage since the data cannot be recovered.

D) MCAR — use listwise deletion since any imputation method is valid.

Correct answer: B – Explanation: When missingness depends on observed variables (policy type), the data is Missing at Random (MAR). MICE (Multiple Imputation by Chained Equations) is ideal here because it leverages relationships between observed variables — such as vehicle age, urban/rural indicator, and policy type — to impute missing values iteratively. Mean imputation (A, D) ignores these relationships and distorts the distribution. Dropping all records (C) wastes valuable data and incorrectly classifies the missingness as MNAR.

A retail company is building a demand forecasting model. Their dataset includes a “store_id” field with 800 unique values and a “sku_id” field with 45,000 unique values. The team needs to encode both categorical variables for a gradient boosting model.

Which encoding approach is MOST appropriate?

A) One-hot encode both fields to preserve all categorical information.

B) Use label encoding for both fields since gradient boosting is tree-based and scale-invariant.

C) Use target encoding for store_id and learned embeddings for sku_id.

D) Apply frequency encoding to both fields since category counts capture meaningful signal.

Correct answer: C – Explanation: One-hot encoding (A) would create 45,800 new columns, causing extreme dimensionality and memory issues. Label encoding (B) technically works for tree-based models but wastes the opportunity to inject meaningful signal into the features. Target encoding replaces store_id with mean sales per store — a compact, informative representation for medium cardinality (800). For very high cardinality (45,000 SKUs), learned embeddings capture complex relationships in dense vector form. Frequency encoding (D) risks information loss when different categories share the same count.

A machine learning team is preparing features for a K-Nearest Neighbors (KNN) classification model. The dataset contains “annual_income” (range: $20,000–$500,000) and “age” (range: 18–85). The team fits a StandardScaler on the entire dataset before splitting into training and test sets.

What is the PRIMARY problem with this approach?

A) StandardScaler is the wrong choice for KNN — Min-Max scaling should be used instead.

B) The features should not be scaled at all since KNN is a distance-based algorithm that benefits from raw magnitudes.

C) Fitting the scaler on all data before splitting introduces data leakage — test set statistics influence the training transformation.

D) The income feature is right-skewed, so a log transform should be applied before any scaling.

Correct answer: C – Explanation: The critical rule is to fit scalers on training data only, then transform both training and test sets using the training-fitted scaler. When you fit on all data, the scaler’s mean and standard deviation incorporate test set values, allowing information from the test set to “leak” into the training pipeline. This artificially inflates performance metrics. While KNN does require scaling (ruling out B), and both StandardScaler and Min-Max work for KNN (A is not the primary issue), and log transforms may help skewed data (D is secondary), the data leakage from fitting before splitting is the most consequential error.
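The leak-free pattern can be sketched as follows: split first, then fit the scaler on the training partition only. The data here is synthetic, with ranges matching the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20_000, 500_000, 200),  # annual_income
    rng.uniform(18, 85, 200),           # age
])

# Split BEFORE any fitting so test rows cannot influence the scaler.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics
```

Wrapping the scaler and model in a scikit-learn `Pipeline` enforces this ordering automatically, including inside cross-validation.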

A data engineering team is building a real-time fraud detection system that processes 5 million transactions daily. They need features like “transaction velocity” (transactions per hour per customer) and “geographic anomaly score” available at prediction time with sub-100ms latency. Currently, these features are computed during batch model training from raw event logs.

Which infrastructure component BEST solves the training-serving consistency and latency requirements?

A) A batch ETL pipeline that recomputes all features nightly and stores them in a data warehouse.

B) A feature store with both an offline store for training and an online store for low-latency serving.

C) An in-memory cache that stores the most recent model predictions for each customer.

D) A streaming pipeline that recomputes features from raw logs for every incoming transaction at serving time.

Correct answer: B – Explanation: A feature store solves both problems simultaneously. The offline store (backed by data lake or warehouse storage such as S3 or BigQuery) provides point-in-time correct features for training, while the online store (backed by a low-latency database like DynamoDB or Redis) serves pre-computed features in milliseconds at prediction time. This ensures the same feature definitions are used in training and serving, eliminating training-serving skew. A nightly batch pipeline (A) cannot deliver sub-100ms latency. Caching predictions (C) does not address feature computation. Recomputing features from raw logs per request (D) would take seconds, not milliseconds, at this transaction volume.
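The offline/online split can be illustrated with a toy sketch — this is not a real feature-store API, just the core idea that one shared feature definition populates both a training table and a low-latency lookup keyed by customer.

```python
import time

def txns_per_hour(events, window_hours=1):
    """Shared feature definition used by both training and serving."""
    cutoff = time.time() - window_hours * 3600
    return sum(1 for t in events if t >= cutoff)

# Offline path: materialize features over the event log for training.
event_log = {
    "cust_1": [time.time() - 60, time.time() - 120],   # 2 recent txns
    "cust_2": [time.time() - 7200],                     # outside the window
}
offline_table = {cid: txns_per_hour(ev) for cid, ev in event_log.items()}

# Online path: the same materialized values are pushed to a key-value
# store (DynamoDB/Redis in production) for sub-millisecond reads.
online_store = dict(offline_table)

def get_features(customer_id):
    return online_store.get(customer_id, 0)
```

Because both paths read the same `txns_per_hour` definition, the feature a model sees at serving time matches what it saw in training.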

A data scientist is performing Exploratory Data Analysis on patient discharge records. They notice that the mean of the “length_of_stay” feature is 6.8 days, but the median is only 3.2 days. The skewness value is 4.1.

Which combination of observations and next steps is MOST accurate?

A) The distribution is left-skewed; apply square transformation to normalize it before modeling.

B) The distribution is approximately normal; the mean and median difference is within acceptable range.

C) The distribution is heavily right-skewed; consider a log transform and investigate outliers using the IQR method.

D) The skewness indicates data quality issues; drop all records above the mean to correct the distribution.

Correct answer: C – Explanation: When the mean is substantially higher than the median, the distribution is right-skewed — a small number of very long hospital stays are pulling the mean upward. A skewness of 4.1 (well above the common |skewness| > 2 threshold for heavy skew) confirms this. The appropriate next steps are to apply a log transform (which compresses right-skewed distributions toward normality) and investigate potential outliers using the IQR method (which is robust to extreme values, unlike Z-scores that assume normality). Left-skewed (A) would show mean < median. The distribution is clearly not normal (B). Dropping records above the mean (D) would discard a large share of legitimate long-stay records and is not a valid statistical technique.
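The diagnose-and-fix sequence from the explanation can be sketched with synthetic right-skewed length-of-stay data (the lognormal parameters are illustrative, chosen only to produce a heavy right tail):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed length-of-stay data (days).
rng = np.random.default_rng(42)
los = pd.Series(rng.lognormal(mean=1.2, sigma=0.9, size=1000))

print(los.skew())  # strongly positive -> right-skewed

# IQR fences are robust because the quartiles themselves are
# barely affected by the extreme values being flagged.
q1, q3 = los.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = los[(los < q1 - 1.5 * iqr) | (los > q3 + 1.5 * iqr)]

# log1p compresses the right tail toward normality.
los_log = np.log1p(los)
print(los_log.skew())  # much closer to 0
```

After the transform, the skewness drops sharply, and flagged outliers can be reviewed individually rather than dropped wholesale.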

Choose Your AI Certification Path

Whether you’re exploring AI on Google Cloud, Azure, Salesforce, AWS, or Databricks, PowerKram gives you vendor‑aligned practice exams built from real exam objectives — not dumps.

Start with a free 24‑hour trial for the vendor that matches your goals.
