Mastering Data Preparation and Feature Engineering for Machine Learning
A Cross-Vendor Training Guide | by Synchronized Software L.L.C. | 1/22/2026
Certification Alignment: AWS ML Specialty, Google ML Engineer, Azure DP-100, CompTIA Data+, Salesforce Agentforce Specialist
Introduction
Data preparation and feature engineering often consume 60–80% of a machine learning project’s time. Yet this work is what separates successful ML projects from failures. As the saying goes: “Garbage in, garbage out.”
This guide covers essential techniques for preparing data and engineering features that will dramatically improve your model performance across all major cloud platforms. Whether you are preparing for certification or building production ML systems, the techniques in this article are universally applicable.
⚡ Why This Matters
According to a 2024 Anaconda survey, data scientists spend an average of 45% of their time on data preparation alone. Organizations that invest in systematic data preparation practices see 3–5x improvements in model accuracy compared to those that skip directly to model training.
Business use case: E-commerce personalization at scale
Consider a mid-size e-commerce company with 2 million customers and 50,000 SKUs. Their raw data includes transaction logs, clickstream events, product catalog data, customer profiles, and seasonal marketing campaigns. Before any recommendation model can be built, the data team must:
1. Deduplicate customer records across mobile app and web sessions (data cleaning)
2. Impute missing product categories where catalog data is incomplete (missing data handling)
3. Normalize price fields across multiple currencies (data transformation)
4. Engineer features like “average basket size”, “days since last purchase”, and “category affinity score” (feature engineering)
5. Select the top 50 features from 300+ candidates to avoid overfitting (feature selection)
Without this preparation pipeline, the recommendation model would produce irrelevant suggestions, damaging customer trust and reducing conversion rates.
The Data Pipeline
Understanding the ML data flow
The ML data pipeline follows a predictable sequence from raw data to model-ready features. Understanding each stage helps you plan your work and estimate timelines accurately.
| Stage | Challenge | Key Activities | Typical Time |
| --- | --- | --- | --- |
| Collection | Data scattered across sources | ETL, API integration, streaming | 10–15% |
| Cleaning | Missing values, errors, duplicates | Imputation, validation, deduplication | 20–30% |
| Transformation | Wrong format, scale, encoding | Type conversion, scaling, encoding | 15–20% |
| Feature Engineering | Raw data ≠ useful features | Creation, selection, extraction | 20–30% |
| Validation | Quality assurance | Distribution checks, schema tests | 5–10% |
Vendor data pipeline services
| Vendor | Service | Purpose | Best For |
| --- | --- | --- | --- |
| AWS | Glue + SageMaker Data Wrangler | ETL, visual data preparation | Large-scale batch processing |
| Google | Dataflow + Vertex AI Pipelines | Stream/batch processing | Real-time ML pipelines |
| Microsoft | Azure Data Factory + Azure ML | Data integration, preparation | Enterprise data estates |
| Salesforce | Data Cloud + MuleSoft | CRM data integration | Customer 360 and AI-powered CRM |
| Databricks | Delta Lake + Feature Store | Unified analytics | Lakehouse architecture |
🏢 Business use case: Financial fraud detection
A regional bank processes 5 million transactions daily. Their fraud detection pipeline ingests transaction data from core banking, card processor feeds, and customer behavior logs. AWS Glue handles ETL from 12 source systems, SageMaker Data Wrangler provides visual data quality analysis, and the Feature Store serves real-time features like “transaction velocity” and “geographic anomaly score” to the fraud model with sub-100ms latency.
Exploratory Data Analysis (EDA)
Before any data preparation, understand your data through EDA. Skipping EDA is the most common mistake made by junior data scientists — it leads to wasted effort on irrelevant transformations and missed insights about data quality.
Key questions to answer
1. What is the shape of the data? — Rows, columns, memory usage
2. What are the data types? — Numeric, categorical, datetime, text
3. What is the distribution? — Mean, median, mode, variance, skewness
4. Are there missing values? — Count, percentage, patterns (MCAR, MAR, MNAR)
5. Are there outliers? — Extreme values that could affect modeling
6. What are the relationships? — Correlations, interactions, multicollinearity
7. What is the target distribution? — Balanced or imbalanced classes
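As a quick sketch, most of this checklist can be answered with a few pandas calls. The tiny DataFrame below is an invented stand-in for your own data:

```python
import pandas as pd
import numpy as np

# Illustrative dataset; in practice this would be your own DataFrame
df = pd.DataFrame({
    "price": [10.0, 12.5, np.nan, 9.9, 250.0],
    "category": ["a", "b", "b", None, "a"],
})

shape = df.shape                          # 1. rows and columns
dtypes = df.dtypes                        # 2. data types per column
summary = df["price"].describe()          # 3. mean, quartiles, spread
missing_pct = df.isna().mean() * 100      # 4. % missing per column
skew = df["price"].skew()                 # 5. asymmetry hint (outliers, transforms)
corr = df.select_dtypes("number").corr()  # 6. numeric correlations
```

For a one-line comprehensive report, ydata-profiling (mentioned in the tools table below) automates most of these checks.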
Statistical summary for numerical features
| Statistic | What It Tells You | When to Worry |
| --- | --- | --- |
| Count | Number of non-null values | Differs significantly between features |
| Mean | Central tendency (sensitive to outliers) | Very different from median |
| Median | Central tendency (robust to outliers) | Doesn’t match business expectations |
| Std Dev | Spread of values | Very large relative to mean |
| Min/Max | Range of values | Physically impossible values |
| Quartiles | Distribution shape | Large gap between Q3 and Max |
| Skewness | Asymmetry of distribution | \|Skewness\| > 2 (highly skewed) |
Visualization techniques
Univariate analysis (single variable)
• Histograms: Distribution shape, bin width selection, normality assessment
• Box plots: Median, quartiles, outlier identification in one view
• Bar charts: Category frequencies, class imbalance detection
• KDE plots: Smooth density estimation for continuous variables
Bivariate analysis (two variables)
• Scatter plots: Linear and non-linear relationships, clusters
• Correlation heatmaps: Feature-to-feature and feature-to-target relationships
• Grouped bar charts: Categorical comparisons across segments
• Violin plots: Distribution comparison across categories
Multivariate analysis
• Pair plots: All pairwise relationships at a glance
• Parallel coordinates: High-dimensional pattern discovery
• t-SNE / UMAP: Non-linear dimensionality reduction for cluster visualization
• Andrews curves: Multivariate data represented as curves for pattern recognition
🏥 Business use case: Healthcare patient readmission prediction
A hospital network analyzed EDA results on 250,000 patient discharge records and discovered that 34% of “length of stay” values were missing. Further investigation revealed the missingness was MAR — patients transferred to other facilities had systematically missing stay durations. This insight led the team to use transfer status as a predictor variable and apply regression-based imputation, improving their readmission model’s AUC from 0.72 to 0.81.
Vendor EDA tools
| Vendor | Tool | Key Capability |
| --- | --- | --- |
| AWS | SageMaker Data Wrangler | 300+ built-in analyses, bias detection |
| Google | Vertex AI Workbench | Jupyter-native, BigQuery integration |
| Microsoft | Azure ML Studio | Automated profiling, drift detection |
| Salesforce | Tableau + Einstein Discovery | Visual analytics with AI-powered insights |
| Open Source | Pandas Profiling / ydata-profiling | One-line comprehensive EDA reports |
Handling Missing Data
Missing values are ubiquitous in real-world data and must be addressed before modeling. The strategy you choose should be informed by why the data is missing, not just how much is missing.
Types of missing data
Missing Completely at Random (MCAR)
Missingness is independent of all variables. Example: Random sensor failures in IoT data. Safe to use any imputation method; listwise deletion is valid.
Missing at Random (MAR)
Missingness depends on observed variables. Example: Younger survey respondents less likely to report income. Use observed variables to inform imputation; multiple imputation recommended.
Missing Not at Random (MNAR)
Missingness depends on the missing value itself. Example: High earners choosing not to report income. Most challenging; may need domain expertise, sensitivity analysis, or specialized models.
Decision framework for missing data
| % Missing | Recommendation | Methods | Risk Level |
| --- | --- | --- | --- |
| < 5% | Usually safe to impute | Mean, median, mode | Low |
| 5–15% | Impute + create missing indicator feature | KNN, regression, MICE | Medium |
| 15–30% | Advanced imputation; validate impact | Multiple imputation, MICE | High |
| > 30% | Consider dropping or domain-specific approach | Domain rules, model-based | Very High |
Simple imputation methods
| Method | When to Use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Mean | Numerical, symmetric | Simple, preserves mean | Reduces variance, distorts distribution |
| Median | Numerical, skewed | Robust to outliers | May not capture relationships |
| Mode | Categorical | Preserves most common value | May create artificial peak |
| Constant | Domain-specific | Explicit, interpretable | Requires domain knowledge |
| Forward/Back Fill | Time series | Preserves temporal order | Can propagate errors |
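A minimal scikit-learn sketch of simple imputation: a median fill (robust to the outlier in the toy data), plus the missing-indicator column recommended for the 5–15% band:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Median fill: the median of [1, 2, 100] is 2.0, unaffected by the outlier
median_imp = SimpleImputer(strategy="median")
X_filled = median_imp.fit_transform(X)

# add_indicator=True appends a binary "was originally missing" column
indicator_imp = SimpleImputer(strategy="median", add_indicator=True)
X_with_flag = indicator_imp.fit_transform(X)
```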
Advanced imputation methods
| Method | Description | When to Use | Complexity |
| --- | --- | --- | --- |
| KNN Imputation | Impute based on K similar records | Multivariate relationships matter | Medium |
| Regression | Predict missing from other features | Strong linear relationships | Medium |
| Multiple Imputation | Create M imputed datasets, pool results | Uncertainty quantification needed | High |
| MICE | Iterative chained equations | Multiple columns with missing values | High |
| Deep Learning | Autoencoders, GANs for imputation | Complex non-linear patterns | Very High |
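A sketch of KNN and MICE-style imputation with scikit-learn. Note that `IterativeImputer` (sklearn's chained-equations imputer) still sits behind an experimental enable flag; the synthetic data is constructed so one column is predictable from another, which is exactly the situation where these methods beat a simple fill:

```python
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer (a MICE-style imputer) requires this experimental enable import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # column 2 predictable from column 0
X[::10, 2] = np.nan                                      # knock out 10% of column 2

X_knn = KNNImputer(n_neighbors=5).fit_transform(X)
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```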
🚗 Business use case: Automotive insurance risk scoring
An insurance company discovered that 22% of their “vehicle mileage” field was missing. Using MICE imputation — incorporating vehicle age, urban/rural indicator, and policy type as predictors — they recovered this critical risk variable. The imputed mileage feature became the third most important predictor in their risk model, improving loss ratio predictions by 12% compared to simply dropping the records.
Handling Outliers
Outliers are data points significantly different from other observations. They can be legitimate extreme values, data errors, or indicators of rare but important phenomena.
Detecting outliers
Z-Score method
Formula: z = (x – μ) / σ — Flag as outlier if |z| > 3. Assumes normal distribution; sensitive to extreme outliers themselves.
Interquartile Range (IQR) method
Formula: IQR = Q3 – Q1 — Outlier if x < Q1 – 1.5×IQR or x > Q3 + 1.5×IQR. Robust to extreme values; works for skewed distributions.
Isolation Forest (advanced)
An unsupervised ensemble method that isolates outliers by randomly partitioning the feature space. Effective for high-dimensional data where simple statistical methods fail.
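The two statistical detectors above are a few lines of NumPy, using the thresholds from the formulas in this section:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10, 11, 12, 10, 11, 12, 11, 10, 11, 500.0])
iqr_mask = iqr_outliers(x)   # robust: flags only the 500.0
z_mask = zscore_outliers(x)  # the extreme value inflates sigma, so it can slip through
```

The toy array illustrates the z-score weakness noted above: a single extreme value inflates both the mean and the standard deviation, masking itself.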
Handling strategies
| Strategy | Description | When to Use | Impact |
| --- | --- | --- | --- |
| Remove | Delete outlier rows | Confirmed errors or different population | Reduces dataset size |
| Cap / Winsorize | Replace with boundary values (e.g., 1st/99th percentile) | Reduce influence while keeping data | Preserves record count |
| Transform | Log, square root, Box-Cox | Reduce skewness | Changes distribution shape |
| Bin | Convert to categories | When exact value is less important | Loses granularity |
| Separate Model | Build specific model for outlier segment | Outliers represent a valid sub-population | More complex pipeline |
| Keep | Do nothing | Valid data; using robust algorithms (trees) | No action needed |
Data Transformation
Transform data into formats suitable for machine learning algorithms. The choice of transformation depends on your algorithm and the characteristics of your features.
Scaling numerical features
Most ML algorithms perform better when features are on similar scales. Distance-based and gradient-based algorithms are particularly sensitive to feature scale.
| Technique | Formula | Range | Best For |
| --- | --- | --- | --- |
| Min-Max Scaling | (x – min) / (max – min) | [0, 1] | Neural networks, KNN, image data |
| Standardization (Z-score) | (x – mean) / std | ~[–3, 3] | Linear models, SVM, PCA |
| Robust Scaling | (x – median) / IQR | Varies | Data with outliers |
| Log Transform | log(x + 1) | Varies | Right-skewed distributions |
| Power Transform (Box-Cox) | Automatic λ selection | Varies | Any non-normal distribution |
| Quantile Transform | Map to uniform/normal | [0, 1] or ~[–3, 3] | Non-parametric normalization |
When to scale (algorithm cheat sheet)
| Algorithm | Scaling Required? | Recommended Method | Reason |
| --- | --- | --- | --- |
| Linear / Logistic Regression | Yes | Standardization | Gradient descent convergence |
| SVM | Yes | Standardization | Distance-based kernel |
| K-Means, KNN | Yes | Either | Distance-based |
| Decision Trees / Random Forest | No | N/A | Split-based (scale invariant) |
| Gradient Boosting (XGBoost) | No | N/A | Tree-based |
| Neural Networks | Yes | Min-Max or Standardization | Activation function ranges |
| PCA | Yes | Standardization | Variance-based decomposition |
⚠️ Critical rule
Always fit the scaler on training data only, then transform both training and test sets using the training scaler. Fitting on all data causes data leakage — the scaler “sees” test set statistics, artificially inflating performance metrics.
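A leakage-safe sketch of that rule with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data ONLY
X_test_s = scaler.transform(X_test)        # reuse the training-set mean/std
```

The test set is transformed with the training set's statistics, so its values may fall outside the usual standardized range; that is expected and correct.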
Encoding Categorical Variables
Machine learning algorithms require numerical input. Categorical variables must be encoded, and the choice of encoding method significantly impacts model performance.
Types of categorical variables
• Nominal — No inherent order. Examples: colors (Red, Blue, Green), country, product category.
• Ordinal — Natural order exists. Examples: size (S < M < L), education level, satisfaction rating.
Encoding techniques comparison
| Technique | How It Works | Best For | Watch Out For |
| --- | --- | --- | --- |
| One-Hot | Binary column per category | Nominal, < 15 categories | High dimensionality (curse of dimensionality) |
| Label Encoding | Integer per category | Ordinal, tree-based models | Implies false ordinality for nominal vars |
| Target (Mean) Encoding | Replace with target mean | High cardinality (100+) | Overfitting, data leakage risk |
| Frequency Encoding | Replace with category count | When frequency is meaningful | Different categories with same count |
| Binary Encoding | Label → binary digits | Medium cardinality (10–100) | Less interpretable |
| Embedding | Learned dense vectors | Very high cardinality (1000+) | Requires neural network training |
| Hash Encoding | Hash function to fixed dims | Extremely high cardinality | Hash collisions |
Encoding decision guide
| Cardinality | Variable Type | Recommended Encoding | Example |
| --- | --- | --- | --- |
| 2 (binary) | Any | Binary (0/1) | Gender, Yes/No flags |
| 3–10 | Nominal | One-Hot Encoding | Color, Region, Department |
| 3–10 | Ordinal | Label Encoding (ordered) | Size, Rating, Education |
| 11–100 | Any | Target or Binary Encoding | City, Product sub-category |
| 100–1000 | Any | Target, Frequency, or Hashing | ZIP code, Company name |
| 1000+ | Any | Embedding or Hash Encoding | User ID, URL, free-text category |
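To make the guide concrete, a small pandas sketch (column names invented): one-hot for a low-cardinality nominal column, and a deliberately naive target encoding. In practice, target means must be computed within cross-validation folds to avoid the leakage risk flagged in the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],  # low cardinality: one-hot
    "store_id": ["s1", "s2", "s1", "s3"],           # higher cardinality: target encode
    "sales": [100.0, 200.0, 120.0, 300.0],
})

# One binary column per region value
onehot = pd.get_dummies(df["region"], prefix="region")

# Target (mean) encoding: replace each store with its mean sales
store_means = df.groupby("store_id")["sales"].mean()
df["store_id_te"] = df["store_id"].map(store_means)
```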
🛒 Business use case: Retail demand forecasting
A national grocery chain needed to predict weekly demand for 45,000 SKUs across 800 stores. The “store_id” field had 800 categories and “sku_id” had 45,000. One-hot encoding would have created 45,800 new columns. Instead, the team used target encoding for store_id (mean weekly sales per store) and learned embeddings for sku_id via a neural network. This reduced the feature space by 99.5% while capturing meaningful relationships, cutting training time from 14 hours to 45 minutes.
Feature Engineering
Feature engineering creates new features from existing data to improve model performance. A well-engineered feature can be worth 10× more than a sophisticated algorithm. This is where domain knowledge meets data science.
Numerical feature engineering
Mathematical transformations
| Transformation | Formula | Use Case | Business Example |
| --- | --- | --- | --- |
| Log | log(x + 1) | Right-skewed data | Income, transaction amounts, page views |
| Square Root | √x | Count data, reduce skew | Number of support tickets, defect counts |
| Square | x² | Amplify differences | Distance calculations, penalty terms |
| Reciprocal | 1/x | Rate conversions | Speed → time, frequency → period |
| Box-Cox | Automatic λ | Normalize any distribution | Any non-normal continuous feature |
Binning (discretization)
Convert continuous variables to categories. Methods include equal-width bins, equal-frequency (quantile) bins, and domain-driven bins. Particularly useful for creating interpretable features for business stakeholders.
Polynomial features
Create interaction and power terms: x₁, x₂ → x₁, x₂, x₁², x₂², x₁×x₂. Captures non-linear relationships but creates exponential growth in features. Use with regularization (L1/L2) to prevent overfitting.
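A one-line check of that expansion with scikit-learn's `PolynomialFeatures`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
```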
Date/time feature engineering
Temporal features are among the most powerful in business applications. Extract meaningful components from timestamps:
| Feature | Example | Captures | Business Application |
| --- | --- | --- | --- |
| Year | 2024 | Long-term trends | Revenue forecasting, market growth |
| Month | 6 | Seasonality | Retail demand, energy consumption |
| Day of Week | Monday | Weekly patterns | Call center staffing, ad performance |
| Hour | 14 | Daily patterns | Website traffic, trading volume |
| Is Weekend | True/False | Behavioral differences | E-commerce conversion, support volume |
| Is Holiday | True/False | Special events | Shipping delays, sales spikes |
| Days Since Event | 45 | Recency effects | Customer churn, campaign response |
| Quarter End | True/False | Business cycles | Sales pipeline, financial reporting |
Cyclical encoding tip: For periodic features like month or day of week, use sine/cosine encoding: sin_month = sin(2π × month / 12), cos_month = cos(2π × month / 12). This makes December (12) close to January (1) in feature space.
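That tip in code, confirming that December lands next to January in (sin, cos) space while June stays far away:

```python
import numpy as np

months = np.array([1, 6, 12])
sin_month = np.sin(2 * np.pi * months / 12)
cos_month = np.cos(2 * np.pi * months / 12)
points = np.column_stack([sin_month, cos_month])  # one (sin, cos) point per month

jan, jun, dec = points
dist_dec_jan = np.linalg.norm(dec - jan)  # small: adjacent months on the circle
dist_dec_jun = np.linalg.norm(dec - jun)  # large: half a year apart
```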
Text feature engineering
| Technique | How It Works | Complexity | Best For |
| --- | --- | --- | --- |
| Basic Statistics | Character count, word count, avg word length | Low | Quick signal extraction |
| Bag of Words | Count each word occurrence | Low | Simple text classification |
| TF-IDF | Frequency balanced by uniqueness | Medium | Standard text classification, search |
| Word2Vec / GloVe | Dense word vectors from co-occurrence | Medium | Semantic similarity, analogy tasks |
| FastText | Sub-word embeddings | Medium | Handling typos, rare words |
| BERT / Transformers | Contextual sentence embeddings | High | State-of-the-art NLP tasks |
| LLM Embeddings | GPT-4, Claude embeddings | High | Zero-shot, few-shot classification |
Aggregation features
Summarize groups of records to create entity-level features. These are critical in customer analytics, fraud detection, and any domain with transactional data.
| Aggregation | Example | Business Signal |
| --- | --- | --- |
| Count | Number of transactions per customer | Engagement level |
| Sum | Total spend per customer per quarter | Customer lifetime value |
| Mean | Average order value | Purchase behavior |
| Min / Max | Largest and smallest purchase | Spending range, premium behavior |
| Std Dev | Variability in transaction amounts | Consistency of behavior |
| Time Since Last | Days since last purchase | Churn risk indicator |
| Trend | Month-over-month spend change | Growth or decline trajectory |
| Ratio | Returns / Total orders | Product satisfaction proxy |
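Several of these aggregations come out of a single pandas groupby. A toy transaction table (columns invented):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2"],
    "amount": [50.0, 150.0, 20.0, 30.0, 25.0],
})

# Entity-level features: one row per customer
features = (
    tx.groupby("customer_id")["amount"]
    .agg(tx_count="count", total_spend="sum", avg_order="mean", spend_std="std")
    .reset_index()
)
```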
✈️ Business use case: Airline customer loyalty prediction
An airline’s data science team engineered 120 features from raw booking and flight data. The top 5 features by importance were all aggregation-based: “total miles flown (12 months)”, “days since last flight”, “ratio of upgrades to total bookings”, “std dev of booking lead time”, and “count of international segments.” These five features alone explained 73% of the variance in loyalty tier transitions, demonstrating the power of thoughtful aggregation over raw transactional data.
Feature Selection
Not all features improve models. Feature selection identifies the most valuable ones to reduce overfitting, improve accuracy, decrease training time, and enhance interpretability. In production systems, fewer features also mean lower serving costs and latency.
Filter methods
Statistical tests independent of the model. Fast to compute but ignore feature interactions.
• Variance Threshold: Remove features with near-zero variance (no signal)
• Correlation Analysis: Remove one of two highly correlated features (r > 0.90)
• Chi-Square Test: Test independence between categorical features and target
• Mutual Information: Measure non-linear shared information between feature and target
• ANOVA F-test: Test whether means of target differ across feature levels
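The first two filter checks are cheap to sketch; the 0.90 correlation cutoff below follows the rule of thumb above, and the toy columns are constructed to trip each filter:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({"constant": np.ones(100), "a": rng.normal(size=100)})
df["a_copy"] = df["a"] + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"

# Variance threshold: identify features with zero variance (no signal)
vt = VarianceThreshold(threshold=0.0).fit(df)
low_variance = [c for c, keep in zip(df.columns, vt.get_support()) if not keep]

# Correlation filter: find pairs with |r| > 0.90 and drop one of each pair
corr = df.corr().abs()
high_corr_pairs = [
    (corr.columns[i], corr.columns[j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if corr.iloc[i, j] > 0.90
]
```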
Wrapper methods
Use model performance to select features. More accurate but computationally expensive.
• Forward Selection: Start empty, add the feature that most improves performance each step
• Backward Elimination: Start with all features, remove the least impactful each step
• Recursive Feature Elimination (RFE): Train model, remove least important feature(s), repeat
• Exhaustive Search: Try all possible subsets (only feasible for small feature sets)
Embedded methods
Feature selection built into the training algorithm. Balances accuracy with efficiency.
• L1 Regularization (Lasso): Drives coefficients to exactly zero, performing automatic selection
• Tree Feature Importance: Rank features by their contribution to split quality (Gini / information gain)
• SHAP Values: Model-agnostic importance with theoretical guarantees from game theory
• Permutation Importance: Measure performance drop when a feature’s values are shuffled
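A sketch of embedded selection with L1: on synthetic data where only the first two columns carry signal, Lasso shrinks the noise coefficients to (near) zero, performing selection as a side effect of training:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # features 2-4 are noise

# Standardize first so the L1 penalty treats all coefficients on the same scale
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 0.05)  # surviving features
```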
Feature selection decision matrix
| Scenario | Recommended Approach | Rationale |
| --- | --- | --- |
| < 20 features | Backward elimination or RFE | Small enough for wrapper methods |
| 20–100 features | Filter + embedded (Lasso or tree importance) | Balance speed and accuracy |
| 100–1000 features | Filter first, then embedded | Filter removes obvious noise quickly |
| 1000+ features | Variance threshold → correlation → L1 | Aggressive reduction needed |
| Interpretability required | SHAP + domain expert review | Explainability is the priority |
Data Splitting
Properly splitting data is critical for unbiased model evaluation. The goal is to simulate how your model will perform on truly unseen data.
Standard splitting strategies
| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Train / Test (80/20) | Single random split | Large datasets, quick experiments |
| Train / Val / Test (60/20/20) | Three-way split | Hyperparameter tuning + final evaluation |
| K-Fold Cross-Validation | K splits, rotate test fold | Moderate datasets, robust estimation |
| Stratified K-Fold | K-Fold preserving class ratios | Imbalanced classification |
| Time-Based Split | Train on past, test on future | Time series, forecasting |
| Group K-Fold | Keep related records together | Multi-record entities (customers, patients) |
Splitting best practices
1. Maintain target distribution: Use stratified splitting for classification tasks
2. Respect temporal order: Never randomly split time series data; always use chronological splits
3. Keep groups together: All records for a single customer or patient must be in the same split
4. Hold out test set early: Never touch the test set until final evaluation
5. Document your split: Record random seeds and split logic for reproducibility
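Two of these practices in scikit-learn: a stratified split that preserves the class ratio, and a group-aware split that keeps each (synthetic) customer's records on one side only:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 10% positive class

# Stratified split: both sides keep the 90/10 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Group split: 20 "customers" with 5 records each; no customer straddles the split
groups = np.repeat(np.arange(20), 5)
tr_idx, te_idx = next(
    GroupShuffleSplit(test_size=0.2, random_state=0).split(X, y, groups)
)
shared_customers = set(groups[tr_idx]) & set(groups[te_idx])  # empty by construction
```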
Data leakage prevention
Data leakage is the single most common cause of unrealistically high model performance. It occurs when information from outside the training set influences the model.
⚠️ Common leakage sources
1. Fitting scalers/encoders on all data before splitting
2. Using future information for time series predictions
3. Target encoding without proper cross-validation folds
4. Features that directly encode the target (e.g., “approval_date” leaking into a loan approval model)
5. Aggregate features computed across train and test combined
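The first leakage source is exactly what scikit-learn Pipelines prevent: wrapping the scaler and model together means cross-validation re-fits the scaler inside each fold, so held-out data never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is fit on each fold's training portion only; the held-out
# fold is transformed with those statistics, never fit on
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```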
🏦 Business use case: Credit scoring model leakage
A consumer lending company built a credit risk model with 98% accuracy — suspiciously high. Investigation revealed that a “loan_grade” feature was included in training, which was itself a downstream output of the approval process. Removing this leaked feature dropped accuracy to 81%, which was the model’s true predictive power. The team then focused on engineering legitimate features from application-time data, ultimately reaching 85% accuracy without leakage.
Feature Stores
Feature stores are centralized repositories for ML features that solve critical production challenges: duplicated feature engineering code, training/serving skew, lack of feature discoverability, and inconsistent feature definitions across teams.
Why feature stores matter in production
• Consistency: Same feature definition used in training and real-time serving
• Reusability: Teams share features instead of rebuilding from scratch
• Point-in-time correctness: Historical features are computed as of the prediction time, preventing leakage
• Discoverability: Central catalog of available features with documentation and lineage
• Low-latency serving: Pre-computed features served in milliseconds for real-time predictions
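Point-in-time correctness is essentially an as-of join. A hypothetical pandas sketch (table and column names invented): each prediction event may only see the latest feature value computed at or before its own timestamp.

```python
import pandas as pd

# Feature values as they became available over time (hypothetical feature table)
features = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "avg_basket": [50.0, 60.0, 70.0],
})

# Prediction events to build training rows for (must be sorted by timestamp)
events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-15", "2024-02-20"]),
})

# Backward as-of join: no event sees a feature computed after its timestamp
joined = pd.merge_asof(events, features, on="ts", direction="backward")
```

Production feature stores implement the same semantics at scale, alongside the online store for low-latency lookups.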
Vendor feature store comparison
| Vendor | Service | Offline Store | Online Store | Key Differentiator |
| --- | --- | --- | --- | --- |
| AWS | SageMaker Feature Store | S3 + Glue | DynamoDB | Deep SageMaker integration |
| Google | Vertex AI Feature Store | BigQuery | Bigtable | BigQuery SQL-native |
| Microsoft | Azure ML Feature Store | ADLS | Redis/Cosmos | MLflow integration |
| Databricks | Feature Store (Unity Catalog) | Delta Lake | Databricks SQL | Lakehouse-native |
| Open Source | Feast | Any data warehouse | Redis / DynamoDB | Vendor-agnostic, portable |
🚀 Business use case: Real-time recommendation engine
A streaming media company serves 50 million recommendations per day. Their feature store pre-computes 200+ user features (watch history aggregates, genre preferences, session behavior) and 80+ content features (popularity trends, freshness scores, engagement metrics). At serving time, the recommendation model fetches user and content features from the online store in < 10ms, combines them, and returns personalized rankings. Without the feature store, this would require re-computing features from raw event logs for each request — a 30-second operation compressed to 10 milliseconds.
End-to-End Pipeline Reference
This section provides a consolidated view of the complete data preparation pipeline, from raw data to model-ready features, with decision points at each stage.
| Step | Key Decision | Methods | Output |
| --- | --- | --- | --- |
| 1. Collect & Profile | What data do I have? | EDA, profiling reports | Data quality assessment |
| 2. Handle Missing | Why is it missing? | MCAR/MAR/MNAR analysis → imputation strategy | Complete dataset |
| 3. Handle Outliers | Error or valid extreme? | Z-score, IQR, Isolation Forest | Clean or transformed values |
| 4. Transform | What scale do I need? | Normalize, standardize, log-transform | Scaled features |
| 5. Encode | What type and cardinality? | One-hot, label, target, embedding | Numerical features |
| 6. Engineer | What signal is hidden? | Aggregations, temporal, text, interactions | New feature candidates |
| 7. Select | Which features help? | Filter, wrapper, embedded methods | Reduced feature set |
| 8. Split & Validate | Is my evaluation fair? | Stratified, temporal, grouped splits | Train/val/test sets |
| 9. Store & Serve | How to operationalize? | Feature store, pipeline orchestration | Production-ready pipeline |
Key Takeaways
1. Data preparation consumes 60–80% of ML project time but determines success or failure
2. EDA is non-negotiable — understand your data deeply before transforming it
3. Handle missing data based on why it’s missing, not just how much is missing
4. Scale features based on your algorithm’s requirements — and always fit on training data only
5. Match encoding strategy to cardinality and variable type — one-hot is not always the answer
6. Feature engineering is where domain knowledge creates competitive advantage
7. Feature selection prevents overfitting and reduces production serving costs
8. Data splitting done wrong invalidates your entire evaluation — especially watch for leakage
9. Feature stores bridge the gap between experimentation and production ML
10. The best practitioners iterate: EDA → transform → model → analyze errors → re-engineer features
Additional Learning Resources
Official vendor documentation
• AWS: docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html
• Google: cloud.google.com/architecture/data-preprocessing-for-ml-with-tf-transform-pt1
• Microsoft: learn.microsoft.com/azure/machine-learning/concept-data
• Salesforce: help.salesforce.com/s/articleView?id=sf.c360_a_data_cloud.htm
• Databricks: docs.databricks.com/machine-learning/feature-store/
Certification preparation
• AWS ML Specialty: aws.amazon.com/certification/certified-machine-learning-specialty/
• Google ML Engineer: cloud.google.com/learn/certification/machine-learning-engineer
• Azure DP-100: learn.microsoft.com/certifications/exams/dp-100
• CompTIA Data+: comptia.org/certifications/data
• Salesforce AI Associate: trailhead.salesforce.com/credentials/ai-associate
Article 3 of 5 | AI/ML Foundations Training Series | Level: Beginner to Intermediate | Estimated Reading Time: 45 minutes | Last Updated: March 2026
Check Your Knowledge
Question #1
A data scientist discovers that 22% of the “vehicle_mileage” field is missing in an automotive insurance dataset. Investigation shows that the missingness correlates with policy type — commercial policies are far less likely to have mileage recorded than personal policies. Which type of missing data does this represent, and what is the MOST appropriate imputation strategy?
A) MCAR — use mean imputation since the missingness is random.
B) MAR — use MICE imputation incorporating policy type and other observed variables as predictors.
C) MNAR — drop all records with missing mileage since the data cannot be recovered.
D) MCAR — use listwise deletion since any imputation method is valid.
Solution
Correct answer: B – Explanation: When missingness depends on observed variables (policy type), the data is Missing at Random (MAR). MICE (Multiple Imputation by Chained Equations) is ideal here because it leverages relationships between observed variables — such as vehicle age, urban/rural indicator, and policy type — to impute missing values iteratively. Mean imputation (A, D) ignores these relationships and distorts the distribution. Dropping all records (C) wastes valuable data and incorrectly classifies the missingness as MNAR.
Question #2
A retail company is building a demand forecasting model. Their dataset includes a “store_id” field with 800 unique values and a “sku_id” field with 45,000 unique values. The team needs to encode both categorical variables for a gradient boosting model.
Which encoding approach is MOST appropriate?
A) One-hot encode both fields to preserve all categorical information.
B) Use label encoding for both fields since gradient boosting is tree-based and scale-invariant.
C) Use target encoding for store_id and learned embeddings for sku_id.
D) Apply frequency encoding to both fields since category counts capture meaningful signal.
Solution
Correct answer: C – Explanation: One-hot encoding (A) would create 45,800 new columns, causing extreme dimensionality and memory issues. Label encoding (B) technically works for tree-based models but wastes the opportunity to inject meaningful signal into the features. Target encoding replaces store_id with mean sales per store — a compact, informative representation for medium cardinality (800). For very high cardinality (45,000 SKUs), learned embeddings capture complex relationships in dense vector form. Frequency encoding (D) risks information loss when different categories share the same count.
Question #3
A machine learning team is preparing features for a K-Nearest Neighbors (KNN) classification model. The dataset contains “annual_income” (range: $20,000–$500,000) and “age” (range: 18–85). The team fits a StandardScaler on the entire dataset before splitting into training and test sets.
What is the PRIMARY problem with this approach?
A) StandardScaler is the wrong choice for KNN — Min-Max scaling should be used instead.
B) The features should not be scaled at all since KNN is a distance-based algorithm that benefits from raw magnitudes.
C) Fitting the scaler on all data before splitting introduces data leakage — test set statistics influence the training transformation.
D) The income feature is right-skewed, so a log transform should be applied before any scaling.
Solution
Correct answer: C – Explanation: The critical rule is to fit scalers on training data only, then transform both training and test sets using the training-fitted scaler. When you fit on all data, the scaler’s mean and standard deviation incorporate test set values, allowing information from the test set to “leak” into the training pipeline. This artificially inflates performance metrics. While KNN does require scaling (ruling out B), and both StandardScaler and Min-Max work for KNN (A is not the primary issue), and log transforms may help skewed data (D is secondary), the data leakage from fitting before splitting is the most consequential error.
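The leak-free pattern from the explanation above is short enough to show directly: split first, fit the scaler on the training partition only, then reuse those training statistics to transform the test set. The synthetic income/age data below is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20_000, 500_000, 500),  # annual_income
    rng.uniform(18, 85, 500),           # age
])
y = rng.integers(0, 2, 500)

# 1. Split BEFORE any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on training data only — test rows never touch .fit().
scaler = StandardScaler().fit(X_train)

# 3. Transform both sets with the training-derived mean and std.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Wrapping the scaler and the KNN estimator in a scikit-learn `Pipeline` enforces this ordering automatically, including inside cross-validation.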
Question #4
A data engineering team is building a real-time fraud detection system that processes 5 million transactions daily. They need features like “transaction velocity” (transactions per hour per customer) and “geographic anomaly score” available at prediction time with sub-100ms latency. Currently, these features are computed during batch model training from raw event logs.
Which infrastructure component BEST solves the training-serving consistency and latency requirements?
A) A batch ETL pipeline that recomputes all features nightly and stores them in a data warehouse.
B) A feature store with both an offline store for training and an online store for low-latency serving.
C) An in-memory cache that stores the most recent model predictions for each customer.
D) A streaming pipeline that recomputes features from raw logs for every incoming transaction at serving time.
Solution
Correct answer: B – Explanation: A feature store solves both problems simultaneously. The offline store (backed by a data warehouse like S3 or BigQuery) provides point-in-time correct features for training, while the online store (backed by a low-latency database like DynamoDB or Redis) serves pre-computed features in milliseconds at prediction time. This ensures the same feature definitions are used in training and serving, eliminating training-serving skew. A nightly batch pipeline (A) cannot deliver sub-100ms latency. Caching predictions (C) does not address feature computation. Recomputing features from raw logs per request (D) would take seconds, not milliseconds, at this transaction volume.
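The core feature-store idea in the explanation — one shared feature definition feeding both an offline training table and an online serving lookup — can be sketched in a few lines. This toy in-memory version is purely illustrative: real deployments back the offline store with something like S3 or BigQuery and the online store with DynamoDB or Redis, and the `transaction_velocity` function and event shapes here are assumptions:

```python
def transaction_velocity(events: list[dict], customer_id: str, window_hours: float = 1.0) -> float:
    """The single shared feature definition: transactions per hour per customer.
    Because both stores call this same function, training and serving
    cannot drift apart (no training-serving skew)."""
    count = sum(1 for e in events if e["customer"] == customer_id)
    return count / window_hours

events = [{"customer": "c1"}, {"customer": "c1"}, {"customer": "c2"}]
customers = {"c1", "c2"}

# Offline store: features materialized in batch for model training.
offline_table = {c: transaction_velocity(events, c) for c in customers}

# Online store: the same pre-computed values, kept in a low-latency
# key-value layer so prediction-time lookup is a single read.
online_store = dict(offline_table)

serving_feature = online_store["c1"]  # O(1) lookup, no recomputation per request
```

The point of the sketch is the symmetry: serving reads a pre-computed value rather than re-deriving it from raw logs, which is what makes sub-100ms latency feasible at 5 million transactions per day.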
Question #5
A data scientist is performing Exploratory Data Analysis on patient discharge records. They notice that the mean of the “length_of_stay” feature is 6.8 days, but the median is only 3.2 days. The skewness value is 4.1.
Which combination of observations and next steps is MOST accurate?
A) The distribution is left-skewed; apply square transformation to normalize it before modeling.
B) The distribution is approximately normal; the mean and median difference is within acceptable range.
C) The distribution is heavily right-skewed; consider a log transform and investigate outliers using the IQR method.
D) The skewness indicates data quality issues; drop all records above the mean to correct the distribution.
Solution
Correct answer: C – Explanation: When the mean is substantially higher than the median, the distribution is right-skewed — a small number of very long hospital stays are pulling the mean upward. A skewness of 4.1 (well above the |2| threshold) confirms this is heavily right-skewed. The appropriate next steps are to apply a log transform (which compresses right-skewed distributions toward normality) and investigate potential outliers using the IQR method (which is robust to extreme values, unlike Z-scores that assume normality). Left-skewed (A) would show mean < median. The distribution is clearly not normal (B). Dropping records above the mean (D) would discard nearly half the dataset and is not a valid statistical technique.
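The diagnosis-and-fix sequence above — compare mean to median, confirm with a skewness statistic, log-transform, and fence outliers with the IQR — can be run end to end on synthetic data. The lognormal sample below is an illustrative stand-in for length-of-stay records, and the moment-based `skewness` helper is a standard definition, not a function from the article:

```python
import numpy as np

def skewness(x: np.ndarray) -> float:
    """Third standardized moment: positive for right-skewed distributions."""
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

rng = np.random.default_rng(1)
los = rng.lognormal(mean=1.2, sigma=0.9, size=1000)  # synthetic length-of-stay, days

mean, median = los.mean(), np.median(los)
skew = skewness(los)
# Right skew signature: a long upper tail pulls the mean above the median.

# Log transform compresses the right tail toward symmetry.
log_los = np.log(los)

# IQR fence for outliers — robust to the skew, unlike Z-scores,
# which assume the data are roughly normal.
q1, q3 = np.percentile(los, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = los[los > upper_fence]
```

After the log transform the skewness collapses toward zero, while the IQR fence flags the extreme stays for investigation rather than silently distorting the model.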
Choose Your AI Certification Path
Whether you’re exploring AI on Google Cloud, Azure, Salesforce, AWS, or Databricks, PowerKram gives you vendor‑aligned practice exams built from real exam objectives — not dumps.
Start with a free 24‑hour trial for the vendor that matches your goals.
- All
- AWS
- Microsoft
- Databricks
- Salesforce




