Data Fundamentals
Data types, cleaning, preprocessing, feature engineering, and data pipelines.
Every AI model is only as good as the data it learns from. "Garbage in, garbage out" is the most important principle in machine learning. This module covers the practical fundamentals of working with data — from understanding data types and cleaning messy datasets to engineering useful features and building reliable data pipelines.
Why Data Is the Foundation
In the AI world, there's a common saying: "The team with the best data usually wins, not the team with the best algorithm." A simple model with excellent data will almost always outperform a sophisticated model with poor data. Here's why data quality matters so much:
- Models learn patterns from data. If the data contains errors, the model learns those errors as truth
- Bias in data becomes bias in predictions. If training data underrepresents certain groups, the model will perform poorly on those groups
- Missing context leads to wrong conclusions. A model predicting house prices without location data will give meaningless results
- Data quantity matters, but quality matters more. 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones
Types of Data
Before you can work with data, you need to understand the different forms it takes. The type of data determines what tools and techniques you'll use.
| Type | Description | Examples | Common Formats |
|---|---|---|---|
| Structured | Organized in rows and columns with defined types | Spreadsheets, databases, transaction logs | CSV, SQL databases, Excel |
| Unstructured | No predefined format or schema | Text documents, images, audio, video | PDF, JPG/PNG, MP3, MP4 |
| Semi-structured | Has some organization but isn't fully tabular | JSON API responses, XML feeds, email metadata, log files | JSON, XML, YAML |
Most real-world AI projects combine multiple data types. A customer sentiment analysis project might use structured data (purchase history from a database), unstructured data (support ticket text), and semi-structured data (JSON responses from a review API).
Data Cleaning and Preprocessing
Raw data is almost never ready for machine learning. It contains missing values, outliers, inconsistent formats, and errors. Cleaning and preprocessing is where you fix these issues.
Handling Missing Values
Missing data is ubiquitous. A customer might skip a survey question, a sensor might malfunction, or a form field might be optional. You need a strategy for dealing with it:
| Strategy | When to Use | Risks |
|---|---|---|
| Drop rows | Few missing values (<5%), data is plentiful | May lose valuable data; can introduce bias if missingness is not random |
| Fill with mean/median | Numerical columns, values are roughly normally distributed | Reduces variance; use median if outliers are present |
| Fill with mode | Categorical columns (e.g., country, category) | Can over-represent the most common value |
| Forward/backward fill | Time-series data where adjacent values are relevant | Assumes continuity between time points |
| Flag as missing | Missingness itself might be informative | Adds complexity; model needs to learn what the flag means |
import pandas as pd
import numpy as np
df = pd.read_csv("customers.csv")
# Check for missing values
print(df.isnull().sum())
# Drop rows where target variable is missing
df = df.dropna(subset=["revenue"])
# Fill numerical columns with median
df["age"] = df["age"].fillna(df["age"].median())
# Fill categorical columns with mode
df["country"] = df["country"].fillna(df["country"].mode()[0])
# Create a flag for missing values before filling
df["income_missing"] = df["income"].isnull().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
Handling Outliers
Outliers are data points that are dramatically different from the rest. An age of 150 is clearly an error, but a salary of $10 million might be real (just unusual). The key question is always: is this a data error or a legitimate extreme value?
Detection Methods
- Z-score: Values more than 3 standard deviations from the mean. Simple, but assumes a normal distribution.
- IQR method: Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. More robust to non-normal distributions.
- Visual inspection: Box plots and scatter plots often reveal outliers that statistical methods miss.
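As a sketch, here's how the two statistical methods look on a toy pandas Series (the values are invented for illustration). Note that with a single extreme value in a small sample, the z-score test can miss an outlier that the IQR test catches, because the outlier itself inflates the standard deviation:

```python
import pandas as pd

# Toy data: eight typical values and one clear outlier
s = pd.Series([52, 48, 55, 50, 47, 53, 49, 51, 250])

# Z-score method: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]  # empty here — 250 inflates the std so much its z-score stays below 3

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # catches 250
```

This is exactly why the IQR method is described as more robust: quartiles barely move when one extreme value is added, while the mean and standard deviation move a lot.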
Treatment Options
- Remove: If clearly erroneous (negative ages, impossible values).
- Cap (winsorize): Replace extreme values with a percentile threshold (e.g., cap at the 99th percentile).
- Transform: Apply a log transformation to reduce the impact of extreme values while keeping them.
- Keep: If the outlier is real and informative, leave it. Some models (e.g., tree-based) handle outliers well.
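A minimal sketch of the capping and transform options on a toy column with one extreme value (`np.log1p` computes log(1 + x), so the transform also works when zeros are present):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 9, 11, 10, 500])  # toy data with one extreme value

# Cap (winsorize): clip everything above a percentile threshold
upper = s.quantile(0.95)
capped = s.clip(upper=upper)

# Transform: log1p compresses large values while preserving their ordering
logged = np.log1p(s)
```

Capping changes the extreme value itself; the log transform keeps every row distinct but shrinks the gap between typical and extreme values.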
Normalization and Scaling
Many ML algorithms expect features to be on similar scales. If "age" ranges from 0-100 and "income" ranges from 0-10,000,000, the model will be dominated by income simply because the numbers are larger — not because income is more important.
| Method | What It Does | When to Use |
|---|---|---|
| Min-Max Scaling | Scales to range [0, 1]: (x - min) / (max - min) | Neural networks, algorithms sensitive to feature magnitude |
| Standard Scaling (Z-score) | Centers at 0, unit variance: (x - mean) / std | Most ML algorithms, especially linear models and SVMs |
| Log Transformation | Applies log(x) to compress large ranges | Highly skewed data like income, population, or prices |
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standard scaling (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use same scale as training!
# Min-Max scaling
minmax = MinMaxScaler()
X_train_normalized = minmax.fit_transform(X_train)
X_test_normalized = minmax.transform(X_test)
Feature Engineering
Feature engineering is the art of creating new, informative variables from your raw data. Good features can dramatically improve model performance — often more than switching to a better algorithm. It's where domain knowledge meets data science.
Common Feature Engineering Techniques
Date/Time Features
Extract meaningful components from timestamps:
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["hour"] = df["timestamp"].dt.hour
df["is_business_hours"] = df["hour"].between(9, 17).astype(int)
Aggregation Features
Summarize related data to create context:
# Customer-level features from transaction data
customer_features = transactions.groupby("customer_id").agg({
"amount": ["mean", "sum", "count", "std"],
"date": ["min", "max"],
}).reset_index()
# Time since last purchase
customer_features["days_since_last"] = (
pd.Timestamp.now() - customer_features["date"]["max"]
).dt.days
Encoding Categorical Variables
ML models need numbers. Convert categories into numerical features:
# One-hot encoding: create binary columns for each category
df_encoded = pd.get_dummies(df, columns=["color", "size"])
# color_red, color_blue, color_green, size_S, size_M, size_L
# Label encoding: assign integers (for ordinal data)
size_map = {"S": 1, "M": 2, "L": 3, "XL": 4}
df["size_encoded"] = df["size"].map(size_map)
Interaction Features
Combine features to capture relationships:
# Price per square foot (combines two features meaningfully)
df["price_per_sqft"] = df["price"] / df["square_feet"]
# Ratio features
df["debt_to_income"] = df["total_debt"] / df["annual_income"]
# Polynomial features for non-linear relationships
df["age_squared"] = df["age"] ** 2
Train/Test Splits: Why and How
The most fundamental concept in ML evaluation is the train/test split. You divide your data into two (or three) parts, train your model on one part, and evaluate it on data it has never seen. This is how you measure real-world performance rather than just memorization.
Why Split?
Imagine studying for an exam by memorizing the answer key. You'd get 100% on that specific test — but fail a different test on the same topic. That's what happens when you evaluate a model on its training data: it looks perfect but fails on new data. This is called overfitting. The test set simulates real-world data the model has never seen.
| Split | Typical Size | Purpose |
|---|---|---|
| Training set | 60-80% | Model learns patterns from this data |
| Validation set | 10-20% | Tune hyperparameters and select between models (optional but recommended) |
| Test set | 10-20% | Final, unbiased evaluation — only used once at the end |
from sklearn.model_selection import train_test_split
# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # reproducible split
stratify=y # maintain class proportions
)
# Three-way split: train / validation / test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.18, random_state=42, stratify=y_temp
)
# Results: ~70% train, ~15% validation, ~15% test
Working with Common Data Formats
In practice, data comes from many sources and formats. Here's how to work with the most common ones:
CSV (Comma-Separated Values)
The most common format for tabular data. Simple, readable, and supported everywhere.
import pandas as pd
# Read CSV
df = pd.read_csv("data.csv")
# Handle common issues
df = pd.read_csv("data.csv",
encoding="utf-8", # handle special characters
na_values=["N/A", ""], # treat these as missing
parse_dates=["date"], # auto-parse date columns
dtype={"zip_code": str} # prevent treating zip codes as numbers
)
# Write CSV
df.to_csv("cleaned_data.csv", index=False)
JSON (JavaScript Object Notation)
The standard format for API responses. Nested and flexible, but requires flattening for ML use.
import json
import pandas as pd
# Read JSON
df = pd.read_json("data.json")
# For nested JSON (common with APIs)
with open("nested_data.json") as f:
data = json.load(f)
# Flatten nested structure
df = pd.json_normalize(data["results"], sep="_")
# Fetch from API
import requests
response = requests.get("https://api.example.com/data")
data = response.json()
df = pd.DataFrame(data["items"])
SQL Databases
For production data stored in databases like PostgreSQL, MySQL, or SQLite:
import pandas as pd
import sqlite3
# Connect and query
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM customers WHERE active = 1", conn)
conn.close()
# For larger databases, use SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/dbname")
df = pd.read_sql("SELECT * FROM orders WHERE date > '2025-01-01'", engine)
Data Pipelines
A data pipeline is an automated sequence of steps that takes raw data and transforms it into a format ready for machine learning. Instead of running cleaning steps manually each time, you encode them in a reproducible pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# Build a pipeline: impute missing values → scale → train model
pipeline = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
("model", RandomForestClassifier(n_estimators=100))
])
# The pipeline handles everything in one call
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)
Common Data Pitfalls
These mistakes are responsible for most ML failures. Learn to recognize and avoid them:
Data Leakage
Information from the test set or from the future influences training. Examples: scaling before splitting, using future data to predict the past, including the target variable (or a proxy for it) as a feature. This gives unrealistically high accuracy that disappears in production.
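The scaling example is the most common form of leakage in practice. A sketch of the correct order on synthetic data: split first, then fit the scaler on the training portion only, so no test-set statistics influence training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).integers(0, 2, size=100)

# Wrong: StandardScaler().fit_transform(X) before splitting would compute
# the mean and std using test rows — information the model shouldn't have.

# Right: split first, then fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same transformation, never refit
```

Using a scikit-learn Pipeline, as shown earlier in this module, enforces this order automatically during cross-validation.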
Selection Bias
Your data doesn't represent the real world. A model trained on reviews from power users won't generalize to casual users. A model trained on data from one geographic region may fail in another. Always ask: "Is this data representative of the population I'm making predictions for?"
Label Errors
Incorrect labels in training data teach the model wrong answers. If 5% of your "spam" labels are actually legitimate emails, the model learns to occasionally flag real emails as spam. Audit a random sample of your labels for accuracy, especially if labels were generated programmatically or by crowd workers.
Class Imbalance
When one class vastly outnumbers others — e.g., 99% legitimate transactions, 1% fraud. A model that always predicts "legitimate" gets 99% accuracy but catches zero fraud. Solutions include oversampling the minority class, undersampling the majority, using class weights, or evaluating with metrics like F1-score instead of accuracy.
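A quick sketch with synthetic data (all numbers here are illustrative) shows why accuracy is misleading on imbalanced classes, and how scikit-learn's `class_weight="balanced"` option upweights the rare class during training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic fraud-like data: roughly 1% positives
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.01).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 2.0  # positives are shifted

# Baseline that always predicts the majority class:
# very high accuracy, yet it never catches a single positive
baseline = np.zeros(n, dtype=int)
acc = accuracy_score(y, baseline)                  # ~0.99
f1_base = f1_score(y, baseline, zero_division=0)   # 0.0

# class_weight="balanced" reweights examples inversely to class frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)
f1_weighted = f1_score(y, clf.predict(X))          # nonzero — positives are now detected
```

The F1-score exposes the useless baseline immediately, which is why it (or precision/recall) is the right evaluation metric for imbalanced problems.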
Duplicate Data
Duplicates can appear in both training and test sets, causing the model to be evaluated on data it has already seen. Always deduplicate before splitting, and check for near-duplicates (rows that differ only in trivial ways like whitespace or formatting).
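A minimal sketch of the normalize-then-deduplicate step (toy rows and column names; real pipelines would normalize more aggressively):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Bob"],
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
})

# Normalize trivial differences (whitespace, case) so near-duplicates
# become exact duplicates
df["name"] = df["name"].str.strip().str.lower()
df["city"] = df["city"].str.strip().str.lower()

# Then drop exact duplicates BEFORE any train/test split
df = df.drop_duplicates().reset_index(drop=True)
```

Without the normalization step, `drop_duplicates` would keep "Alice" and "alice " as two separate rows, and one of them could end up in the test set.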
Recommended Resources
Kaggle Learn: Data Cleaning
Kaggle
Free hands-on micro-course covering missing values, scaling, and parsing dates — with interactive coding exercises in the browser.
Kaggle Learn: Feature Engineering
Kaggle
Free course on creating powerful features including mutual information, clustering features, and target encoding.
A Few Useful Things to Know About Machine Learning
Pedro Domingos (University of Washington)
Classic paper that explains why data and feature engineering matter more than algorithm choice. Essential reading for ML practitioners.
Data Preprocessing for Machine Learning
StatQuest with Josh Starmer
Clear, visual explanations of preprocessing techniques including normalization, encoding, and handling missing data.
Pandas Documentation
pandas development team
The comprehensive reference for Python's primary data manipulation library. The 10 Minutes to pandas tutorial is a great starting point.
Key Takeaways
1. Data quality is more important than model complexity — a simple model with clean, well-prepared data usually outperforms a sophisticated model with messy data.
2. Data comes in three forms: structured (tables), unstructured (text, images), and semi-structured (JSON, XML). Most real projects combine multiple types.
3. Data cleaning is where the real work happens: handling missing values, detecting outliers, normalizing scales, and encoding categorical variables are essential preprocessing steps.
4. Feature engineering — creating new informative variables from raw data — often has a bigger impact on model performance than choosing a different algorithm.
5. Always split data into train/test sets before any preprocessing, and fit scalers only on training data to prevent data leakage.
6. The most common data pitfalls are data leakage, selection bias, label errors, class imbalance, and duplicate records. Learning to avoid these is critical for building reliable ML systems.