Mastering Data Preparation and Feature Engineering for Machine Learning
A Cross-Vendor Training Guide | by Synchronized Software L.L.C. | 1/22/2026
Certification Alignment: AWS ML Specialty, Google ML Engineer, Azure DP-100, CompTIA Data+, Salesforce Agentforce Specialist
Introduction
Data preparation and feature engineering often consume 60–80% of a machine learning project’s time. Yet this work is what separates successful ML projects from failures. As the saying goes: “Garbage in, garbage out.”
This guide covers essential techniques for preparing data and engineering features that will dramatically improve your model performance across all major cloud platforms. Whether you are preparing for certification or building production ML systems, the techniques in this article are universally applicable.
⚡ Why This Matters
According to a 2024 Anaconda survey, data scientists spend an average of 45% of their time on data preparation alone. Organizations that invest in systematic data preparation practices see 3–5x improvements in model accuracy compared to those that skip directly to model training.
Business use case: E-commerce personalization at scale
Consider a mid-size e-commerce company with 2 million customers and 50,000 SKUs. Their raw data includes transaction logs, clickstream events, product catalog data, customer profiles, and seasonal marketing campaigns. Before any recommendation model can be built, the data team must:
1. Deduplicate customer records across mobile app and web sessions (data cleaning)
2. Impute missing product categories where catalog data is incomplete (missing data handling)
3. Normalize price fields across multiple currencies (data transformation)
4. Engineer features like “average basket size”, “days since last purchase”, and “category affinity score” (feature engineering)
5. Select the top 50 features from 300+ candidates to avoid overfitting (feature selection)
Without this preparation pipeline, the recommendation model would produce irrelevant suggestions, damaging customer trust and reducing conversion rates.
The Data Pipeline
Understanding the ML data flow
The ML data pipeline follows a predictable sequence from raw data to model-ready features. Understanding each stage helps you plan your work and estimate timelines accurately.
| Stage | Challenge | Key Activities | Typical Time |
| --- | --- | --- | --- |
| Collection | Data scattered across sources | ETL, API integration, streaming | 10–15% |
| Cleaning | Missing values, errors, duplicates | Imputation, validation, deduplication | 20–30% |
| Transformation | Wrong format, scale, encoding | Type conversion, scaling, encoding | 15–20% |
| Feature Engineering | Raw data ≠ useful features | Creation, selection, extraction | 20–30% |
| Validation | Quality assurance | Distribution checks, schema tests | 5–10% |
Vendor data pipeline services
| Vendor | Service | Purpose | Best For |
| --- | --- | --- | --- |
| AWS | Glue + SageMaker Data Wrangler | ETL, visual data preparation | Large-scale batch processing |
| Google | Dataflow + Vertex AI Pipelines | Stream/batch processing | Real-time ML pipelines |
| Microsoft | Azure Data Factory + Azure ML | Data integration, preparation | Enterprise data estates |
| Salesforce | Data Cloud + MuleSoft | CRM data integration | Customer 360 and AI-powered CRM |
| Databricks | Delta Lake + Feature Store | Unified analytics | Lakehouse architecture |
🏢 Business use case: Financial fraud detection
A regional bank processes 5 million transactions daily. Their fraud detection pipeline ingests transaction data from core banking, card processor feeds, and customer behavior logs. AWS Glue handles ETL from 12 source systems, SageMaker Data Wrangler provides visual data quality analysis, and the Feature Store serves real-time features like “transaction velocity” and “geographic anomaly score” to the fraud model with sub-100ms latency.
Exploratory Data Analysis (EDA)
Before any data preparation, understand your data through EDA. Skipping EDA is the most common mistake made by junior data scientists — it leads to wasted effort on irrelevant transformations and missed insights about data quality.
Key questions to answer
1. What is the shape of the data? — Rows, columns, memory usage
2. What are the data types? — Numeric, categorical, datetime, text
3. What is the distribution? — Mean, median, mode, variance, skewness
4. Are there missing values? — Count, percentage, patterns (MCAR, MAR, MNAR)
5. Are there outliers? — Extreme values that could affect modeling
6. What are the relationships? — Correlations, interactions, multicollinearity
7. What is the target distribution? — Balanced or imbalanced classes
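As a quick sketch, most of this checklist can be answered with a few pandas calls. The tiny DataFrame below is an invented stand-in for your own data:

```python
import pandas as pd
import numpy as np

# Illustrative dataset; in practice this would be your own DataFrame
df = pd.DataFrame({
    "price": [10.0, 12.5, np.nan, 9.9, 250.0],
    "category": ["a", "b", "b", None, "a"],
})

shape = df.shape                          # 1. rows and columns
dtypes = df.dtypes                        # 2. data types per column
summary = df["price"].describe()          # 3. mean, quartiles, spread
missing_pct = df.isna().mean() * 100      # 4. % missing per column
skew = df["price"].skew()                 # 5. asymmetry hint (outliers, transforms)
corr = df.select_dtypes("number").corr()  # 6. numeric correlations
```

For a one-line comprehensive report, ydata-profiling (mentioned in the tools table below) automates most of these checks.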
Statistical summary for numerical features
| Statistic | What It Tells You | When to Worry |
| --- | --- | --- |
| Count | Number of non-null values | Differs significantly between features |
| Mean | Central tendency (sensitive to outliers) | Very different from median |
| Median | Central tendency (robust to outliers) | Doesn’t match business expectations |
| Std Dev | Spread of values | Very large relative to mean |
| Min/Max | Range of values | Physically impossible values |
| Quartiles | Distribution shape | Large gap between Q3 and Max |
| Skewness | Asymmetry of distribution | \|Skewness\| > 2 (highly skewed) |
Visualization techniques
Univariate analysis (single variable)
• Histograms: Distribution shape, bin width selection, normality assessment
• Box plots: Median, quartiles, outlier identification in one view
• Bar charts: Category frequencies, class imbalance detection
• KDE plots: Smooth density estimation for continuous variables
Bivariate analysis (two variables)
• Scatter plots: Linear and non-linear relationships, clusters
• Correlation heatmaps: Feature-to-feature and feature-to-target relationships
• Grouped bar charts: Categorical comparisons across segments
• Violin plots: Distribution comparison across categories
Multivariate analysis
• Pair plots: All pairwise relationships at a glance
• Parallel coordinates: High-dimensional pattern discovery
• t-SNE / UMAP: Non-linear dimensionality reduction for cluster visualization
• Andrews curves: Multivariate data represented as curves for pattern recognition
🏥 Business use case: Healthcare patient readmission prediction
A hospital network analyzed EDA results on 250,000 patient discharge records and discovered that 34% of “length of stay” values were missing. Further investigation revealed the missingness was MAR — patients transferred to other facilities had systematically missing stay durations. This insight led the team to use transfer status as a predictor variable and apply regression-based imputation, improving their readmission model’s AUC from 0.72 to 0.81.
Vendor EDA tools
| Vendor | Tool | Key Capability |
| --- | --- | --- |
| AWS | SageMaker Data Wrangler | 300+ built-in analyses, bias detection |
| Google | Vertex AI Workbench | Jupyter-native, BigQuery integration |
| Microsoft | Azure ML Studio | Automated profiling, drift detection |
| Salesforce | Tableau + Einstein Discovery | Visual analytics with AI-powered insights |
| Open Source | Pandas Profiling / ydata-profiling | One-line comprehensive EDA reports |
Handling Missing Data
Missing values are ubiquitous in real-world data and must be addressed before modeling. The strategy you choose should be informed by why the data is missing, not just how much is missing.
Types of missing data
Missing Completely at Random (MCAR)
Missingness is independent of all variables. Example: Random sensor failures in IoT data. Safe to use any imputation method; listwise deletion is valid.
Missing at Random (MAR)
Missingness depends on observed variables. Example: Younger survey respondents less likely to report income. Use observed variables to inform imputation; multiple imputation recommended.
Missing Not at Random (MNAR)
Missingness depends on the missing value itself. Example: High earners choosing not to report income. Most challenging; may need domain expertise, sensitivity analysis, or specialized models.
Decision framework for missing data
| % Missing | Recommendation | Methods | Risk Level |
| --- | --- | --- | --- |
| < 5% | Usually safe to impute | Mean, median, mode | Low |
| 5–15% | Impute + create missing indicator feature | KNN, regression, MICE | Medium |
| 15–30% | Advanced imputation; validate impact | Multiple imputation, MICE | High |
| > 30% | Consider dropping or domain-specific approach | Domain rules, model-based | Very High |
Simple imputation methods
| Method | When to Use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Mean | Numerical, symmetric | Simple, preserves mean | Reduces variance, distorts distribution |
| Median | Numerical, skewed | Robust to outliers | May not capture relationships |
| Mode | Categorical | Preserves most common value | May create artificial peak |
| Constant | Domain-specific | Explicit, interpretable | Requires domain knowledge |
| Forward/Back Fill | Time series | Preserves temporal order | Can propagate errors |
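A minimal scikit-learn sketch of simple imputation: a median fill (robust to the outlier in the toy data), plus the missing-indicator column recommended for the 5–15% band:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Median fill: the median of [1, 2, 100] is 2.0, unaffected by the outlier
median_imp = SimpleImputer(strategy="median")
X_filled = median_imp.fit_transform(X)

# add_indicator=True appends a binary "was originally missing" column
indicator_imp = SimpleImputer(strategy="median", add_indicator=True)
X_with_flag = indicator_imp.fit_transform(X)
```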
Advanced imputation methods
| Method | Description | When to Use | Complexity |
| --- | --- | --- | --- |
| KNN Imputation | Impute based on K similar records | Multivariate relationships matter | Medium |
| Regression | Predict missing from other features | Strong linear relationships | Medium |
| Multiple Imputation | Create M imputed datasets, pool results | Uncertainty quantification needed | High |
| MICE | Iterative chained equations | Multiple columns with missing values | High |
| Deep Learning | Autoencoders, GANs for imputation | Complex non-linear patterns | Very High |
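A sketch of KNN and MICE-style imputation with scikit-learn. Note that `IterativeImputer` (sklearn's chained-equations imputer) still sits behind an experimental enable flag; the synthetic data is constructed so one column is predictable from another, which is exactly the situation where these methods beat a simple fill:

```python
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer (a MICE-style imputer) requires this experimental enable import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # column 2 predictable from column 0
X[::10, 2] = np.nan                                      # knock out 10% of column 2

X_knn = KNNImputer(n_neighbors=5).fit_transform(X)
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```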
🚗 Business use case: Automotive insurance risk scoring
An insurance company discovered that 22% of their “vehicle mileage” field was missing. Using MICE imputation — incorporating vehicle age, urban/rural indicator, and policy type as predictors — they recovered this critical risk variable. The imputed mileage feature became the third most important predictor in their risk model, improving loss ratio predictions by 12% compared to simply dropping the records.
Handling Outliers
Outliers are data points significantly different from other observations. They can be legitimate extreme values, data errors, or indicators of rare but important phenomena.
Detecting outliers
Z-Score method
Formula: z = (x – μ) / σ — Flag as outlier if |z| > 3. Assumes normal distribution; sensitive to extreme outliers themselves.
Interquartile Range (IQR) method
Formula: IQR = Q3 – Q1 — Outlier if x < Q1 – 1.5×IQR or x > Q3 + 1.5×IQR. Robust to extreme values; works for skewed distributions.
Isolation Forest (advanced)
An unsupervised ensemble method that isolates outliers by randomly partitioning the feature space. Effective for high-dimensional data where simple statistical methods fail.
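The two statistical detectors above are a few lines of NumPy, using the thresholds from the formulas in this section:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10, 11, 12, 10, 11, 12, 11, 10, 11, 500.0])
iqr_mask = iqr_outliers(x)   # robust: flags only the 500.0
z_mask = zscore_outliers(x)  # the extreme value inflates sigma, so it can slip through
```

The toy array illustrates the z-score weakness noted above: a single extreme value inflates both the mean and the standard deviation, masking itself.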
Handling strategies
| Strategy | Description | When to Use | Impact |
| --- | --- | --- | --- |
| Remove | Delete outlier rows | Confirmed errors or different population | Reduces dataset size |
| Cap / Winsorize | Replace with boundary values (e.g., 1st/99th percentile) | Reduce influence while keeping data | Preserves record count |
| Transform | Log, square root, Box-Cox | Reduce skewness | Changes distribution shape |
| Bin | Convert to categories | When exact value is less important | Loses granularity |
| Separate Model | Build specific model for outlier segment | Outliers represent a valid sub-population | More complex pipeline |
| Keep | Do nothing | Valid data; using robust algorithms (trees) | No action needed |
Data Transformation
Transform data into formats suitable for machine learning algorithms. The choice of transformation depends on your algorithm and the characteristics of your features.
Scaling numerical features
Most ML algorithms perform better when features are on similar scales. Distance-based and gradient-based algorithms are particularly sensitive to feature scale.
| Technique | Formula | Range | Best For |
| --- | --- | --- | --- |
| Min-Max Scaling | (x – min) / (max – min) | [0, 1] | Neural networks, KNN, image data |
| Standardization (Z-score) | (x – mean) / std | ~[–3, 3] | Linear models, SVM, PCA |
| Robust Scaling | (x – median) / IQR | Varies | Data with outliers |
| Log Transform | log(x + 1) | Varies | Right-skewed distributions |
| Power Transform (Box-Cox) | Automatic λ selection | Varies | Any non-normal distribution |
| Quantile Transform | Map to uniform/normal | [0, 1] or ~[–3, 3] | Non-parametric normalization |
When to scale (algorithm cheat sheet)
| Algorithm | Scaling Required? | Recommended Method | Reason |
| --- | --- | --- | --- |
| Linear / Logistic Regression | Yes | Standardization | Gradient descent convergence |
| SVM | Yes | Standardization | Distance-based kernel |
| K-Means, KNN | Yes | Either | Distance-based |
| Decision Trees / Random Forest | No | N/A | Split-based (scale invariant) |
| Gradient Boosting (XGBoost) | No | N/A | Tree-based |
| Neural Networks | Yes | Min-Max or Standardization | Activation function ranges |
| PCA | Yes | Standardization | Variance-based decomposition |
⚠️ Critical rule
Always fit the scaler on training data only, then transform both training and test sets using the training scaler. Fitting on all data causes data leakage — the scaler “sees” test set statistics, artificially inflating performance metrics.
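A leakage-safe sketch of that rule with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data ONLY
X_test_s = scaler.transform(X_test)        # reuse the training-set mean/std
```

The test set is transformed with the training set's statistics, so its values may fall outside the usual standardized range; that is expected and correct.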
Encoding Categorical Variables
Machine learning algorithms require numerical input. Categorical variables must be encoded, and the choice of encoding method significantly impacts model performance.
Types of categorical variables
• Nominal — No inherent order. Examples: colors (Red, Blue, Green), country, product category.
• Ordinal — Natural order exists. Examples: size (S < M < L), education level, satisfaction rating.
Encoding techniques comparison
| Technique | How It Works | Best For | Watch Out For |
| --- | --- | --- | --- |
| One-Hot | Binary column per category | Nominal, < 15 categories | High dimensionality (curse of dimensionality) |
| Label Encoding | Integer per category | Ordinal, tree-based models | Implies false ordinality for nominal vars |
| Target (Mean) Encoding | Replace with target mean | High cardinality (100+) | Overfitting, data leakage risk |
| Frequency Encoding | Replace with category count | When frequency is meaningful | Different categories with same count |
| Binary Encoding | Label → binary digits | Medium cardinality (10–100) | Less interpretable |
| Embedding | Learned dense vectors | Very high cardinality (1000+) | Requires neural network training |
| Hash Encoding | Hash function to fixed dims | Extremely high cardinality | Hash collisions |
Encoding decision guide
| Cardinality | Variable Type | Recommended Encoding | Example |
| --- | --- | --- | --- |
| 2 (binary) | Any | Binary (0/1) | Gender, Yes/No flags |
| 3–10 | Nominal | One-Hot Encoding | Color, Region, Department |
| 3–10 | Ordinal | Label Encoding (ordered) | Size, Rating, Education |
| 11–100 | Any | Target or Binary Encoding | City, Product sub-category |
| 100–1000 | Any | Target, Frequency, or Hashing | ZIP code, Company name |
| 1000+ | Any | Embedding or Hash Encoding | User ID, URL, free-text category |
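To make the guide concrete, a small pandas sketch (column names invented): one-hot for a low-cardinality nominal column, and a deliberately naive target encoding. In practice, target means must be computed within cross-validation folds to avoid the leakage risk flagged in the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],  # low cardinality: one-hot
    "store_id": ["s1", "s2", "s1", "s3"],           # higher cardinality: target encode
    "sales": [100.0, 200.0, 120.0, 300.0],
})

# One binary column per region value
onehot = pd.get_dummies(df["region"], prefix="region")

# Target (mean) encoding: replace each store with its mean sales
store_means = df.groupby("store_id")["sales"].mean()
df["store_id_te"] = df["store_id"].map(store_means)
```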
🛒 Business use case: Retail demand forecasting
A national grocery chain needed to predict weekly demand for 45,000 SKUs across 800 stores. The “store_id” field had 800 categories and “sku_id” had 45,000. One-hot encoding would have created 45,800 new columns. Instead, the team used target encoding for store_id (mean weekly sales per store) and learned embeddings for sku_id via a neural network. This reduced the feature space by 99.5% while capturing meaningful relationships, cutting training time from 14 hours to 45 minutes.
Feature Engineering
Feature engineering creates new features from existing data to improve model performance. A well-engineered feature can be worth 10× more than a sophisticated algorithm. This is where domain knowledge meets data science.
Numerical feature engineering
Mathematical transformations
| Transformation | Formula | Use Case | Business Example |
| --- | --- | --- | --- |
| Log | log(x + 1) | Right-skewed data | Income, transaction amounts, page views |
| Square Root | √x | Count data, reduce skew | Number of support tickets, defect counts |
| Square | x² | Amplify differences | Distance calculations, penalty terms |
| Reciprocal | 1/x | Rate conversions | Speed → time, frequency → period |
| Box-Cox | Automatic λ | Normalize any distribution | Any non-normal continuous feature |
Binning (discretization)
Convert continuous variables to categories. Methods include equal-width bins, equal-frequency (quantile) bins, and domain-driven bins. Particularly useful for creating interpretable features for business stakeholders.
Polynomial features
Create interaction and power terms: x₁, x₂ → x₁, x₂, x₁², x₂², x₁×x₂. Captures non-linear relationships but creates exponential growth in features. Use with regularization (L1/L2) to prevent overfitting.
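A one-line check of that expansion with scikit-learn's `PolynomialFeatures`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
```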
Date/time feature engineering
Temporal features are among the most powerful in business applications. Extract meaningful components from timestamps:
| Feature | Example | Captures | Business Application |
| --- | --- | --- | --- |
| Year | 2024 | Long-term trends | Revenue forecasting, market growth |
| Month | 6 | Seasonality | Retail demand, energy consumption |
| Day of Week | Monday | Weekly patterns | Call center staffing, ad performance |
| Hour | 14 | Daily patterns | Website traffic, trading volume |
| Is Weekend | True/False | Behavioral differences | E-commerce conversion, support volume |
| Is Holiday | True/False | Special events | Shipping delays, sales spikes |
| Days Since Event | 45 | Recency effects | Customer churn, campaign response |
| Quarter End | True/False | Business cycles | Sales pipeline, financial reporting |
Cyclical encoding tip: For periodic features like month or day of week, use sine/cosine encoding: sin_month = sin(2π × month / 12), cos_month = cos(2π × month / 12). This makes December (12) close to January (1) in feature space.
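That tip in code, confirming that December lands next to January in (sin, cos) space while June stays far away:

```python
import numpy as np

months = np.array([1, 6, 12])
sin_month = np.sin(2 * np.pi * months / 12)
cos_month = np.cos(2 * np.pi * months / 12)
points = np.column_stack([sin_month, cos_month])  # one (sin, cos) point per month

jan, jun, dec = points
dist_dec_jan = np.linalg.norm(dec - jan)  # small: adjacent months on the circle
dist_dec_jun = np.linalg.norm(dec - jun)  # large: half a year apart
```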
Text feature engineering
| Technique | How It Works | Complexity | Best For |
| --- | --- | --- | --- |
| Basic Statistics | Character count, word count, avg word length | Low | Quick signal extraction |
| Bag of Words | Count each word occurrence | Low | Simple text classification |
| TF-IDF | Frequency balanced by uniqueness | Medium | Standard text classification, search |
| Word2Vec / GloVe | Dense word vectors from co-occurrence | Medium | Semantic similarity, analogy tasks |
| FastText | Sub-word embeddings | Medium | Handling typos, rare words |
| BERT / Transformers | Contextual sentence embeddings | High | State-of-the-art NLP tasks |
| LLM Embeddings | GPT-4, Claude embeddings | High | Zero-shot, few-shot classification |
Aggregation features
Summarize groups of records to create entity-level features. These are critical in customer analytics, fraud detection, and any domain with transactional data.
| Aggregation | Example | Business Signal |
| --- | --- | --- |
| Count | Number of transactions per customer | Engagement level |
| Sum | Total spend per customer per quarter | Customer lifetime value |
| Mean | Average order value | Purchase behavior |
| Min / Max | Largest and smallest purchase | Spending range, premium behavior |
| Std Dev | Variability in transaction amounts | Consistency of behavior |
| Time Since Last | Days since last purchase | Churn risk indicator |
| Trend | Month-over-month spend change | Growth or decline trajectory |
| Ratio | Returns / Total orders | Product satisfaction proxy |
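Several of these aggregations come out of a single pandas groupby. A toy transaction table (columns invented):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2"],
    "amount": [50.0, 150.0, 20.0, 30.0, 25.0],
})

# Entity-level features: one row per customer
features = (
    tx.groupby("customer_id")["amount"]
    .agg(tx_count="count", total_spend="sum", avg_order="mean", spend_std="std")
    .reset_index()
)
```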
✈️ Business use case: Airline customer loyalty prediction
An airline’s data science team engineered 120 features from raw booking and flight data. The top 5 features by importance were all aggregation-based: “total miles flown (12 months)”, “days since last flight”, “ratio of upgrades to total bookings”, “std dev of booking lead time”, and “count of international segments.” These five features alone explained 73% of the variance in loyalty tier transitions, demonstrating the power of thoughtful aggregation over raw transactional data.
Feature Selection
Not all features improve models. Feature selection identifies the most valuable ones to reduce overfitting, improve accuracy, decrease training time, and enhance interpretability. In production systems, fewer features also mean lower serving costs and latency.
Filter methods
Statistical tests independent of the model. Fast to compute but ignore feature interactions.
• Variance Threshold: Remove features with near-zero variance (no signal)
• Correlation Analysis: Remove one of two highly correlated features (r > 0.90)
• Chi-Square Test: Test independence between categorical features and target
• Mutual Information: Measure non-linear shared information between feature and target
• ANOVA F-test: Test whether means of target differ across feature levels
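The first two filter checks are cheap to sketch; the 0.90 correlation cutoff below follows the rule of thumb above, and the toy columns are constructed to trip each filter:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({"constant": np.ones(100), "a": rng.normal(size=100)})
df["a_copy"] = df["a"] + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"

# Variance threshold: identify features with zero variance (no signal)
vt = VarianceThreshold(threshold=0.0).fit(df)
low_variance = [c for c, keep in zip(df.columns, vt.get_support()) if not keep]

# Correlation filter: find pairs with |r| > 0.90 and drop one of each pair
corr = df.corr().abs()
high_corr_pairs = [
    (corr.columns[i], corr.columns[j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if corr.iloc[i, j] > 0.90
]
```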
Wrapper methods
Use model performance to select features. More accurate but computationally expensive.
• Forward Selection: Start empty, add the feature that most improves performance each step
• Backward Elimination: Start with all features, remove the least impactful each step
• Recursive Feature Elimination (RFE): Train model, remove least important feature(s), repeat
• Exhaustive Search: Try all possible subsets (only feasible for small feature sets)
Embedded methods
Feature selection built into the training algorithm. Balances accuracy with efficiency.
• L1 Regularization (Lasso): Drives coefficients to exactly zero, performing automatic selection
• Tree Feature Importance: Rank features by their contribution to split quality (Gini / information gain)
• SHAP Values: Model-agnostic importance with theoretical guarantees from game theory
• Permutation Importance: Measure performance drop when a feature’s values are shuffled
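A sketch of embedded selection with L1: on synthetic data where only the first two columns carry signal, Lasso shrinks the noise coefficients to (near) zero, performing selection as a side effect of training:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # features 2-4 are noise

# Standardize first so the L1 penalty treats all coefficients on the same scale
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 0.05)  # surviving features
```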
Feature selection decision matrix
| Scenario | Recommended Approach | Rationale |
| --- | --- | --- |
| < 20 features | Backward elimination or RFE | Small enough for wrapper methods |
| 20–100 features | Filter + embedded (Lasso or tree importance) | Balance speed and accuracy |
| 100–1000 features | Filter first, then embedded | Filter removes obvious noise quickly |
| 1000+ features | Variance threshold → correlation → L1 | Aggressive reduction needed |
| Interpretability required | SHAP + domain expert review | Explainability is the priority |
Data Splitting
Properly splitting data is critical for unbiased model evaluation. The goal is to simulate how your model will perform on truly unseen data.
Standard splitting strategies
| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Train / Test (80/20) | Single random split | Large datasets, quick experiments |
| Train / Val / Test (60/20/20) | Three-way split | Hyperparameter tuning + final evaluation |
| K-Fold Cross-Validation | K splits, rotate test fold | Moderate datasets, robust estimation |
| Stratified K-Fold | K-Fold preserving class ratios | Imbalanced classification |
| Time-Based Split | Train on past, test on future | Time series, forecasting |
| Group K-Fold | Keep related records together | Multi-record entities (customers, patients) |
Splitting best practices
1. Maintain target distribution: Use stratified splitting for classification tasks
2. Respect temporal order: Never randomly split time series data; always use chronological splits
3. Keep groups together: All records for a single customer or patient must be in the same split
4. Hold out test set early: Never touch the test set until final evaluation
5. Document your split: Record random seeds and split logic for reproducibility
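Two of these practices in scikit-learn: a stratified split that preserves the class ratio, and a group-aware split that keeps each (synthetic) customer's records on one side only:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 10% positive class

# Stratified split: both sides keep the 90/10 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Group split: 20 "customers" with 5 records each; no customer straddles the split
groups = np.repeat(np.arange(20), 5)
tr_idx, te_idx = next(
    GroupShuffleSplit(test_size=0.2, random_state=0).split(X, y, groups)
)
shared_customers = set(groups[tr_idx]) & set(groups[te_idx])  # empty by construction
```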
Data leakage prevention
Data leakage is the single most common cause of unrealistically high model performance. It occurs when information from outside the training set influences the model.
⚠️ Common leakage sources
1. Fitting scalers/encoders on all data before splitting
2. Using future information for time series predictions
3. Target encoding without proper cross-validation folds
4. Features that directly encode the target (e.g., “approval_date” leaking into a loan approval model)
5. Aggregate features computed across train and test combined
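The first leakage source is exactly what scikit-learn Pipelines prevent: wrapping the scaler and model together means cross-validation re-fits the scaler inside each fold, so held-out data never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is fit on each fold's training portion only; the held-out
# fold is transformed with those statistics, never fit on
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```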
🏦 Business use case: Credit scoring model leakage
A consumer lending company built a credit risk model with 98% accuracy — suspiciously high. Investigation revealed that a “loan_grade” feature was included in training, which was itself a downstream output of the approval process. Removing this leaked feature dropped accuracy to 81%, which was the model’s true predictive power. The team then focused on engineering legitimate features from application-time data, ultimately reaching 85% accuracy without leakage.
Feature Stores
Feature stores are centralized repositories for ML features that solve critical production challenges: duplicated feature engineering code, training/serving skew, lack of feature discoverability, and inconsistent feature definitions across teams.
Why feature stores matter in production
• Consistency: Same feature definition used in training and real-time serving
• Reusability: Teams share features instead of rebuilding from scratch
• Point-in-time correctness: Historical features are computed as of the prediction time, preventing leakage
• Discoverability: Central catalog of available features with documentation and lineage
• Low-latency serving: Pre-computed features served in milliseconds for real-time predictions
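Point-in-time correctness is essentially an as-of join. A hypothetical pandas sketch (table and column names invented): each prediction event may only see the latest feature value computed at or before its own timestamp.

```python
import pandas as pd

# Feature values as they became available over time (hypothetical feature table)
features = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "avg_basket": [50.0, 60.0, 70.0],
})

# Prediction events to build training rows for (must be sorted by timestamp)
events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-15", "2024-02-20"]),
})

# Backward as-of join: no event sees a feature computed after its timestamp
joined = pd.merge_asof(events, features, on="ts", direction="backward")
```

Production feature stores implement the same semantics at scale, alongside the online store for low-latency lookups.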
Vendor feature store comparison
| Vendor | Service | Offline Store | Online Store | Key Differentiator |
| --- | --- | --- | --- | --- |
| AWS | SageMaker Feature Store | S3 + Glue | DynamoDB | Deep SageMaker integration |
| Google | Vertex AI Feature Store | BigQuery | Bigtable | BigQuery SQL-native |
| Microsoft | Azure ML Feature Store | ADLS | Redis/Cosmos | MLflow integration |
| Databricks | Feature Store (Unity Catalog) | Delta Lake | Databricks SQL | Lakehouse-native |
| Open Source | Feast | Any data warehouse | Redis / DynamoDB | Vendor-agnostic, portable |
🚀 Business use case: Real-time recommendation engine
A streaming media company serves 50 million recommendations per day. Their feature store pre-computes 200+ user features (watch history aggregates, genre preferences, session behavior) and 80+ content features (popularity trends, freshness scores, engagement metrics). At serving time, the recommendation model fetches user and content features from the online store in < 10ms, combines them, and returns personalized rankings. Without the feature store, this would require re-computing features from raw event logs for each request — a 30-second operation compressed to 10 milliseconds.
End-to-End Pipeline Reference
This section provides a consolidated view of the complete data preparation pipeline, from raw data to model-ready features, with decision points at each stage.
| Step | Key Decision | Methods | Output |
| --- | --- | --- | --- |
| 1. Collect & Profile | What data do I have? | EDA, profiling reports | Data quality assessment |
| 2. Handle Missing | Why is it missing? | MCAR/MAR/MNAR analysis → imputation strategy | Complete dataset |
| 3. Handle Outliers | Error or valid extreme? | Z-score, IQR, Isolation Forest | Clean or transformed values |
| 4. Transform | What scale do I need? | Normalize, standardize, log-transform | Scaled features |
| 5. Encode | What type and cardinality? | One-hot, label, target, embedding | Numerical features |
| 6. Engineer | What signal is hidden? | Aggregations, temporal, text, interactions | New feature candidates |
| 7. Select | Which features help? | Filter, wrapper, embedded methods | Reduced feature set |
| 8. Split & Validate | Is my evaluation fair? | Stratified, temporal, grouped splits | Train/val/test sets |
| 9. Store & Serve | How to operationalize? | Feature store, pipeline orchestration | Production-ready pipeline |
Key Takeaways
1. Data preparation consumes 60–80% of ML project time but determines success or failure
2. EDA is non-negotiable — understand your data deeply before transforming it
3. Handle missing data based on why it’s missing, not just how much is missing
4. Scale features based on your algorithm’s requirements — and always fit on training data only
5. Match encoding strategy to cardinality and variable type — one-hot is not always the answer
6. Feature engineering is where domain knowledge creates competitive advantage
7. Feature selection prevents overfitting and reduces production serving costs
8. Data splitting done wrong invalidates your entire evaluation — especially watch for leakage
9. Feature stores bridge the gap between experimentation and production ML
10. The best practitioners iterate: EDA → transform → model → analyze errors → re-engineer features
Additional Learning Resources
Official vendor documentation
• AWS: docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html
• Google: cloud.google.com/architecture/data-preprocessing-for-ml-with-tf-transform-pt1
• Microsoft: learn.microsoft.com/azure/machine-learning/concept-data
• Salesforce: help.salesforce.com/s/articleView?id=sf.c360_a_data_cloud.htm
• Databricks: docs.databricks.com/machine-learning/feature-store/
Certification preparation
• AWS ML Specialty: aws.amazon.com/certification/certified-machine-learning-specialty/
• Google ML Engineer: cloud.google.com/learn/certification/machine-learning-engineer
• Azure DP-100: learn.microsoft.com/certifications/exams/dp-100
• CompTIA Data+: comptia.org/certifications/data
• Salesforce AI Associate: trailhead.salesforce.com/credentials/ai-associate
Article 3 of 5 | AI/ML Foundations Training Series | Level: Beginner to Intermediate | Estimated Reading Time: 45 minutes | Last Updated: March 2026
Check Your Knowledge
Question #1
A data scientist discovers that 22% of the “vehicle_mileage” field is missing in an automotive insurance dataset. Investigation shows that the missingness correlates with policy type — commercial policies are far less likely to have mileage recorded than personal policies. Which type of missing data does this represent, and what is the MOST appropriate imputation strategy?
A) MCAR — use mean imputation since the missingness is random.
B) MAR — use MICE imputation incorporating policy type and other observed variables as predictors.
C) MNAR — drop all records with missing mileage since the data cannot be recovered.
D) MCAR — use listwise deletion since any imputation method is valid.
Solution
Correct answer: B – Explanation: When missingness depends on observed variables (policy type), the data is Missing at Random (MAR). MICE (Multiple Imputation by Chained Equations) is ideal here because it leverages relationships between observed variables — such as vehicle age, urban/rural indicator, and policy type — to impute missing values iteratively. Mean imputation (A, D) ignores these relationships and distorts the distribution. Dropping all records (C) wastes valuable data and incorrectly classifies the missingness as MNAR.
Question #2
A retail company is building a demand forecasting model. Their dataset includes a “store_id” field with 800 unique values and a “sku_id” field with 45,000 unique values. The team needs to encode both categorical variables for a gradient boosting model.
Which encoding approach is MOST appropriate?
A) One-hot encode both fields to preserve all categorical information.
B) Use label encoding for both fields since gradient boosting is tree-based and scale-invariant.
C) Use target encoding for store_id and learned embeddings for sku_id.
D) Apply frequency encoding to both fields since category counts capture meaningful signal.
Solution
Correct answer: C – Explanation: One-hot encoding (A) would create 45,800 new columns, causing extreme dimensionality and memory issues. Label encoding (B) technically works for tree-based models but wastes the opportunity to inject meaningful signal into the features. Target encoding replaces store_id with mean sales per store — a compact, informative representation for medium cardinality (800). For very high cardinality (45,000 SKUs), learned embeddings capture complex relationships in dense vector form. Frequency encoding (D) risks information loss when different categories share the same count.
Question #3
A machine learning team is preparing features for a K-Nearest Neighbors (KNN) classification model. The dataset contains “annual_income” (range: $20,000–$500,000) and “age” (range: 18–85). The team fits a StandardScaler on the entire dataset before splitting into training and test sets.
What is the PRIMARY problem with this approach?
A) StandardScaler is the wrong choice for KNN — Min-Max scaling should be used instead.
B) The features should not be scaled at all since KNN is a distance-based algorithm that benefits from raw magnitudes.
C) Fitting the scaler on all data before splitting introduces data leakage — test set statistics influence the training transformation.
D) The income feature is right-skewed, so a log transform should be applied before any scaling.
Solution
Correct answer: C – Explanation: The critical rule is to fit scalers on training data only, then transform both training and test sets using the training-fitted scaler. When you fit on all data, the scaler’s mean and standard deviation incorporate test set values, allowing information from the test set to “leak” into the training pipeline. This artificially inflates performance metrics. While KNN does require scaling (ruling out B), and both StandardScaler and Min-Max work for KNN (A is not the primary issue), and log transforms may help skewed data (D is secondary), the data leakage from fitting before splitting is the most consequential error.
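The leak-free pattern from the explanation above is short enough to show directly: split first, fit the scaler on the training partition only, then reuse those training statistics to transform the test set. The synthetic income/age data below is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20_000, 500_000, 500),  # annual_income
    rng.uniform(18, 85, 500),           # age
])
y = rng.integers(0, 2, 500)

# 1. Split BEFORE any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on training data only — test rows never touch .fit().
scaler = StandardScaler().fit(X_train)

# 3. Transform both sets with the training-derived mean and std.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Wrapping the scaler and the KNN estimator in a scikit-learn `Pipeline` enforces this ordering automatically, including inside cross-validation.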
Question #4
A data engineering team is building a real-time fraud detection system that processes 5 million transactions daily. They need features like “transaction velocity” (transactions per hour per customer) and “geographic anomaly score” available at prediction time with sub-100ms latency. Currently, these features are computed during batch model training from raw event logs.
Which infrastructure component BEST solves the training-serving consistency and latency requirements?
A) A batch ETL pipeline that recomputes all features nightly and stores them in a data warehouse.
B) A feature store with both an offline store for training and an online store for low-latency serving.
C) An in-memory cache that stores the most recent model predictions for each customer.
D) A streaming pipeline that recomputes features from raw logs for every incoming transaction at serving time.
Solution
Correct answer: B – Explanation: A feature store solves both problems simultaneously. The offline store (backed by a data warehouse like S3 or BigQuery) provides point-in-time correct features for training, while the online store (backed by a low-latency database like DynamoDB or Redis) serves pre-computed features in milliseconds at prediction time. This ensures the same feature definitions are used in training and serving, eliminating training-serving skew. A nightly batch pipeline (A) cannot deliver sub-100ms latency. Caching predictions (C) does not address feature computation. Recomputing features from raw logs per request (D) would take seconds, not milliseconds, at this transaction volume.
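The core feature-store idea in the explanation — one shared feature definition feeding both an offline training table and an online serving lookup — can be sketched in a few lines. This toy in-memory version is purely illustrative: real deployments back the offline store with something like S3 or BigQuery and the online store with DynamoDB or Redis, and the `transaction_velocity` function and event shapes here are assumptions:

```python
def transaction_velocity(events: list[dict], customer_id: str, window_hours: float = 1.0) -> float:
    """The single shared feature definition: transactions per hour per customer.
    Because both stores call this same function, training and serving
    cannot drift apart (no training-serving skew)."""
    count = sum(1 for e in events if e["customer"] == customer_id)
    return count / window_hours

events = [{"customer": "c1"}, {"customer": "c1"}, {"customer": "c2"}]
customers = {"c1", "c2"}

# Offline store: features materialized in batch for model training.
offline_table = {c: transaction_velocity(events, c) for c in customers}

# Online store: the same pre-computed values, kept in a low-latency
# key-value layer so prediction-time lookup is a single read.
online_store = dict(offline_table)

serving_feature = online_store["c1"]  # O(1) lookup, no recomputation per request
```

The point of the sketch is the symmetry: serving reads a pre-computed value rather than re-deriving it from raw logs, which is what makes sub-100ms latency feasible at 5 million transactions per day.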
Question #5
A data scientist is performing Exploratory Data Analysis on patient discharge records. They notice that the mean of the “length_of_stay” feature is 6.8 days, but the median is only 3.2 days. The skewness value is 4.1.
Which combination of observations and next steps is MOST accurate?
A) The distribution is left-skewed; apply square transformation to normalize it before modeling.
B) The distribution is approximately normal; the mean and median difference is within acceptable range.
C) The distribution is heavily right-skewed; consider a log transform and investigate outliers using the IQR method.
D) The skewness indicates data quality issues; drop all records above the mean to correct the distribution.
Solution
Correct answer: C – Explanation: When the mean is substantially higher than the median, the distribution is right-skewed — a small number of very long hospital stays are pulling the mean upward. A skewness of 4.1 (well above the |2| threshold) confirms this is heavily right-skewed. The appropriate next steps are to apply a log transform (which compresses right-skewed distributions toward normality) and investigate potential outliers using the IQR method (which is robust to extreme values, unlike Z-scores that assume normality). Left-skewed (A) would show mean < median. The distribution is clearly not normal (B). Dropping records above the mean (D) would discard nearly half the dataset and is not a valid statistical technique.
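The diagnosis-and-fix sequence above — compare mean to median, confirm with a skewness statistic, log-transform, and fence outliers with the IQR — can be run end to end on synthetic data. The lognormal sample below is an illustrative stand-in for length-of-stay records, and the moment-based `skewness` helper is a standard definition, not a function from the article:

```python
import numpy as np

def skewness(x: np.ndarray) -> float:
    """Third standardized moment: positive for right-skewed distributions."""
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

rng = np.random.default_rng(1)
los = rng.lognormal(mean=1.2, sigma=0.9, size=1000)  # synthetic length-of-stay, days

mean, median = los.mean(), np.median(los)
skew = skewness(los)
# Right skew signature: a long upper tail pulls the mean above the median.

# Log transform compresses the right tail toward symmetry.
log_los = np.log(los)

# IQR fence for outliers — robust to the skew, unlike Z-scores,
# which assume the data are roughly normal.
q1, q3 = np.percentile(los, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = los[los > upper_fence]
```

After the log transform the skewness collapses toward zero, while the IQR fence flags the extreme stays for investigation rather than silently distorting the model.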
Choose Your AI Certification Path
Whether you’re exploring AI on Google Cloud, Azure, Salesforce, AWS, or Databricks, PowerKram gives you vendor‑aligned practice exams built from real exam objectives — not dumps.
Start with a free 24‑hour trial for the vendor that matches your goals.
- All
- AWS
- Microsoft
- Databricks
- Salesforce




