Data Fundamentals
Data types, cleaning, preprocessing, feature engineering, and data pipelines.
Every AI model is only as good as the data it learns from. "Garbage in, garbage out" is the most important principle in machine learning. This module covers the practical fundamentals of working with data — from understanding data types and cleaning messy datasets to engineering useful features and building reliable data pipelines.
Why Data Is the Foundation
In the AI world, there's a common saying: "The team with the best data usually wins, not the team with the best algorithm." A simple model with excellent data will almost always outperform a sophisticated model with poor data. Here's why data quality matters so much:
- Models learn patterns from data. If the data contains errors, the model learns those errors as truth
- Bias in data becomes bias in predictions. If training data underrepresents certain groups, the model will perform poorly on those groups
- Missing context leads to wrong conclusions. A model predicting house prices without location data will give meaningless results
- Data quantity matters, but quality matters more. 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones
Types of Data
Before you can work with data, you need to understand the different forms it takes. The type of data determines what tools and techniques you'll use.
| Type | Description | Examples | Common Formats |
|---|---|---|---|
| Structured | Organized in rows and columns with defined types | Spreadsheets, databases, transaction logs | CSV, SQL databases, Excel |
| Unstructured | No predefined format or schema | Text documents, images, audio, video | PDF, JPG/PNG, MP3, MP4 |
| Semi-structured | Has some organization but isn't fully tabular | JSON API responses, XML feeds, email metadata, log files | JSON, XML, YAML |
Most real-world AI projects combine multiple data types. A customer sentiment analysis project might use structured data (purchase history from a database), unstructured data (support ticket text), and semi-structured data (JSON responses from a review API).
Data Cleaning and Preprocessing
Raw data is almost never ready for machine learning. It contains missing values, outliers, inconsistent formats, and errors. Cleaning and preprocessing is where you fix these issues.
Handling Missing Values
Missing data is ubiquitous. A customer might skip a survey question, a sensor might malfunction, or a form field might be optional. You need a strategy for dealing with it:
| Strategy | When to Use | Risks |
|---|---|---|
| Drop rows | Few missing values (<5%), data is plentiful | May lose valuable data; can introduce bias if missingness is not random |
| Fill with mean/median | Numerical columns, values are roughly normally distributed | Reduces variance; use median if outliers are present |
| Fill with mode | Categorical columns (e.g., country, category) | Can over-represent the most common value |
| Forward/backward fill | Time-series data where adjacent values are relevant | Assumes continuity between time points |
| Flag as missing | Missingness itself might be informative | Adds complexity; model needs to learn what the flag means |
import pandas as pd
import numpy as np
df = pd.read_csv("customers.csv")
# Check for missing values
print(df.isnull().sum())
# Drop rows where target variable is missing
df = df.dropna(subset=["revenue"])
# Fill numerical columns with median
df["age"] = df["age"].fillna(df["age"].median())
# Fill categorical columns with mode
df["country"] = df["country"].fillna(df["country"].mode()[0])
# Create a flag for missing values before filling
df["income_missing"] = df["income"].isnull().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
Handling Outliers
Outliers are data points that are dramatically different from the rest. An age of 150 is clearly an error, but a salary of $10 million might be real (just unusual). The key question is always: is this a data error or a legitimate extreme value?
Detection Methods
- Z-score: Values more than 3 standard deviations from the mean. Simple, but assumes a normal distribution.
- IQR method: Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. More robust to non-normal distributions.
- Visual inspection: Box plots and scatter plots often reveal outliers that statistical methods miss.
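As a sketch, here's how the two statistical methods look on a toy pandas Series (the values are invented for illustration). Note that with a single extreme value in a small sample, the z-score test can miss an outlier that the IQR test catches, because the outlier itself inflates the standard deviation:

```python
import pandas as pd

# Toy data: eight typical values and one clear outlier
s = pd.Series([52, 48, 55, 50, 47, 53, 49, 51, 250])

# Z-score method: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]  # empty here — 250 inflates the std so much its z-score stays below 3

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # catches 250
```

This is exactly why the IQR method is described as more robust: quartiles barely move when one extreme value is added, while the mean and standard deviation move a lot.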
Treatment Options
- Remove: If clearly erroneous (negative ages, impossible values).
- Cap (winsorize): Replace extreme values with a percentile threshold (e.g., cap at the 99th percentile).
- Transform: Apply a log transformation to reduce the impact of extreme values while keeping them.
- Keep: If the outlier is real and informative, leave it. Some models (e.g., tree-based) handle outliers well.
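A minimal sketch of the capping and transform options on a toy column with one extreme value (`np.log1p` computes log(1 + x), so the transform also works when zeros are present):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 9, 11, 10, 500])  # toy data with one extreme value

# Cap (winsorize): clip everything above a percentile threshold
upper = s.quantile(0.95)
capped = s.clip(upper=upper)

# Transform: log1p compresses large values while preserving their ordering
logged = np.log1p(s)
```

Capping changes the extreme value itself; the log transform keeps every row distinct but shrinks the gap between typical and extreme values.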
Normalization and Scaling
Many ML algorithms expect features to be on similar scales. If "age" ranges from 0-100 and "income" ranges from 0-10,000,000, the model will be dominated by income simply because the numbers are larger — not because income is more important.
| Method | What It Does | When to Use |
|---|---|---|
| Min-Max Scaling | Scales to range [0, 1]: (x - min) / (max - min) | Neural networks, algorithms sensitive to feature magnitude |
| Standard Scaling (Z-score) | Centers at 0, unit variance: (x - mean) / std | Most ML algorithms, especially linear models and SVMs |
| Log Transformation | Applies log(x) to compress large ranges | Highly skewed data like income, population, or prices |
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standard scaling (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use same scale as training!
# Min-Max scaling
minmax = MinMaxScaler()
X_train_normalized = minmax.fit_transform(X_train)
X_test_normalized = minmax.transform(X_test)
Feature Engineering
Feature engineering is the art of creating new, informative variables from your raw data. Good features can dramatically improve model performance — often more than switching to a better algorithm. It's where domain knowledge meets data science.
Common Feature Engineering Techniques
Date/Time Features
Extract meaningful components from timestamps:
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["hour"] = df["timestamp"].dt.hour
df["is_business_hours"] = df["hour"].between(9, 17).astype(int)
Aggregation Features
Summarize related data to create context:
# Customer-level features from transaction data
customer_features = transactions.groupby("customer_id").agg({
"amount": ["mean", "sum", "count", "std"],
"date": ["min", "max"],
}).reset_index()
# Time since last purchase
customer_features["days_since_last"] = (
pd.Timestamp.now() - customer_features["date"]["max"]
).dt.days
Encoding Categorical Variables
ML models need numbers. Convert categories into numerical features:
# One-hot encoding: create binary columns for each category
df_encoded = pd.get_dummies(df, columns=["color", "size"])
# color_red, color_blue, color_green, size_S, size_M, size_L
# Label encoding: assign integers (for ordinal data)
size_map = {"S": 1, "M": 2, "L": 3, "XL": 4}
df["size_encoded"] = df["size"].map(size_map)
Interaction Features
Combine features to capture relationships:
# Price per square foot (combines two features meaningfully)
df["price_per_sqft"] = df["price"] / df["square_feet"]
# Ratio features
df["debt_to_income"] = df["total_debt"] / df["annual_income"]
# Polynomial features for non-linear relationships
df["age_squared"] = df["age"] ** 2
Train/Test Splits: Why and How
The most fundamental concept in ML evaluation is the train/test split. You divide your data into two (or three) parts, train your model on one part, and evaluate it on data it has never seen. This is how you measure real-world performance rather than just memorization.
Why Split?
Imagine studying for an exam by memorizing the answer key. You'd get 100% on that specific test — but fail a different test on the same topic. That's what happens when you evaluate a model on its training data: it looks perfect but fails on new data. This is called overfitting. The test set simulates real-world data the model has never seen.
| Split | Typical Size | Purpose |
|---|---|---|
| Training set | 60-80% | Model learns patterns from this data |
| Validation set | 10-20% | Tune hyperparameters and select between models (optional but recommended) |
| Test set | 10-20% | Final, unbiased evaluation — only used once at the end |
from sklearn.model_selection import train_test_split
# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # reproducible split
stratify=y # maintain class proportions
)
# Three-way split: train / validation / test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.18, random_state=42, stratify=y_temp
)
# Results: ~70% train, ~15% validation, ~15% test
Working with Common Data Formats
In practice, data comes from many sources and formats. Here's how to work with the most common ones:
CSV (Comma-Separated Values)
The most common format for tabular data. Simple, readable, and supported everywhere.
import pandas as pd
# Read CSV
df = pd.read_csv("data.csv")
# Handle common issues
df = pd.read_csv("data.csv",
encoding="utf-8", # handle special characters
na_values=["N/A", ""], # treat these as missing
parse_dates=["date"], # auto-parse date columns
dtype={"zip_code": str} # prevent treating zip codes as numbers
)
# Write CSV
df.to_csv("cleaned_data.csv", index=False)
JSON (JavaScript Object Notation)
The standard format for API responses. Nested and flexible, but requires flattening for ML use.
import json
import pandas as pd
# Read JSON
df = pd.read_json("data.json")
# For nested JSON (common with APIs)
with open("nested_data.json") as f:
data = json.load(f)
# Flatten nested structure
df = pd.json_normalize(data["results"], sep="_")
# Fetch from API
import requests
response = requests.get("https://api.example.com/data")
data = response.json()
df = pd.DataFrame(data["items"])
SQL Databases
For production data stored in databases like PostgreSQL, MySQL, or SQLite:
import pandas as pd
import sqlite3
# Connect and query
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM customers WHERE active = 1", conn)
conn.close()
# For larger databases, use SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/dbname")
df = pd.read_sql("SELECT * FROM orders WHERE date > '2025-01-01'", engine)
Data Pipelines
A data pipeline is an automated sequence of steps that takes raw data and transforms it into a format ready for machine learning. Instead of running cleaning steps manually each time, you encode them in a reproducible pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Build a pipeline: impute missing values → scale → train model
pipeline = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
("model", RandomForestClassifier(n_estimators=100))
])
# The pipeline handles everything in one call
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)
Common Data Pitfalls
These mistakes are responsible for most ML failures. Learn to recognize and avoid them:
Data Leakage
Information from the test set or from the future influences training. Examples: scaling before splitting, using future data to predict the past, including the target variable (or a proxy for it) as a feature. This gives unrealistically high accuracy that disappears in production.
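The scaling example is the most common form of leakage in practice. A sketch of the correct order on synthetic data: split first, then fit the scaler on the training portion only, so no test-set statistics influence training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).integers(0, 2, size=100)

# Wrong: StandardScaler().fit_transform(X) before splitting would compute
# the mean and std using test rows — information the model shouldn't have.

# Right: split first, then fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same transformation, never refit
```

Using a scikit-learn Pipeline, as shown earlier in this module, enforces this order automatically during cross-validation.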
Selection Bias
Your data doesn't represent the real world. A model trained on reviews from power users won't generalize to casual users. A model trained on data from one geographic region may fail in another. Always ask: "Is this data representative of the population I'm making predictions for?"
Label Errors
Incorrect labels in training data teach the model wrong answers. If 5% of your "spam" labels are actually legitimate emails, the model learns to occasionally flag real emails as spam. Audit a random sample of your labels for accuracy, especially if labels were generated programmatically or by crowd workers.
Class Imbalance
When one class vastly outnumbers others — e.g., 99% legitimate transactions, 1% fraud. A model that always predicts "legitimate" gets 99% accuracy but catches zero fraud. Solutions include oversampling the minority class, undersampling the majority, using class weights, or evaluating with metrics like F1-score instead of accuracy.
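A quick sketch with synthetic data (all numbers here are illustrative) shows why accuracy is misleading on imbalanced classes, and how scikit-learn's `class_weight="balanced"` option upweights the rare class during training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic fraud-like data: roughly 1% positives
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.01).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 2.0  # positives are shifted

# Baseline that always predicts the majority class:
# very high accuracy, yet it never catches a single positive
baseline = np.zeros(n, dtype=int)
acc = accuracy_score(y, baseline)                  # ~0.99
f1_base = f1_score(y, baseline, zero_division=0)   # 0.0

# class_weight="balanced" reweights examples inversely to class frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)
f1_weighted = f1_score(y, clf.predict(X))          # nonzero — positives are now detected
```

The F1-score exposes the useless baseline immediately, which is why it (or precision/recall) is the right evaluation metric for imbalanced problems.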
Duplicate Data
Duplicates can appear in both training and test sets, causing the model to be evaluated on data it has already seen. Always deduplicate before splitting, and check for near-duplicates (rows that differ only in trivial ways like whitespace or formatting).
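A minimal sketch of the normalize-then-deduplicate step (toy rows and column names; real pipelines would normalize more aggressively):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Bob"],
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
})

# Normalize trivial differences (whitespace, case) so near-duplicates
# become exact duplicates
df["name"] = df["name"].str.strip().str.lower()
df["city"] = df["city"].str.strip().str.lower()

# Then drop exact duplicates BEFORE any train/test split
df = df.drop_duplicates().reset_index(drop=True)
```

Without the normalization step, `drop_duplicates` would keep "Alice" and "alice " as two separate rows, and one of them could end up in the test set.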
Recommended Resources
Kaggle Learn: Data Cleaning
Kaggle
Free hands-on micro-course covering missing values, scaling, and parsing dates — with interactive coding exercises in the browser.
Kaggle Learn: Feature Engineering
Kaggle
Free course on creating powerful features including mutual information, clustering features, and target encoding.
A Few Useful Things to Know About Machine Learning
Pedro Domingos (University of Washington)
Classic paper that explains why data and feature engineering matter more than algorithm choice. Essential reading for ML practitioners.
Data Preprocessing for Machine Learning
StatQuest with Josh Starmer
Clear, visual explanations of preprocessing techniques including normalization, encoding, and handling missing data.
Pandas Documentation
pandas development team
The comprehensive reference for Python's primary data manipulation library. The 10 Minutes to pandas tutorial is a great starting point.
Key Takeaways
1. Data quality is more important than model complexity — a simple model with clean, well-prepared data usually outperforms a sophisticated model with messy data.
2. Data comes in three forms: structured (tables), unstructured (text, images), and semi-structured (JSON, XML). Most real projects combine multiple types.
3. Data cleaning is where the real work happens: handling missing values, detecting outliers, normalizing scales, and encoding categorical variables are essential preprocessing steps.
4. Feature engineering — creating new informative variables from raw data — often has a bigger impact on model performance than choosing a different algorithm.
5. Always split data into train/test sets before any preprocessing, and fit scalers only on training data to prevent data leakage.
6. The most common data pitfalls are data leakage, selection bias, label errors, class imbalance, and duplicate records. Learning to avoid these is critical for building reliable ML systems.