Intermediate · 45 min · Module 3 of 6

Data Fundamentals

Data types, cleaning, preprocessing, feature engineering, and data pipelines.

Every AI model is only as good as the data it learns from. "Garbage in, garbage out" is the most important principle in machine learning. This module covers the practical fundamentals of working with data — from understanding data types and cleaning messy datasets to engineering useful features and building reliable data pipelines.

Why Data Is the Foundation

In the AI world, there's a common saying: "The team with the best data usually wins, not the team with the best algorithm." A simple model with excellent data will almost always outperform a sophisticated model with poor data. Here's why data quality matters so much:

  • Models learn patterns from data. If the data contains errors, the model learns those errors as truth
  • Bias in data becomes bias in predictions. If training data underrepresents certain groups, the model will perform poorly on those groups
  • Missing context leads to wrong conclusions. A model predicting house prices without location data will give meaningless results
  • Data quantity matters, but quality matters more. 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones
The 80/20 Rule of ML
Data scientists spend roughly 80% of their time on data preparation (cleaning, transforming, feature engineering) and only 20% on actual modeling. Getting the data right is the real work — model selection and training are often the easier part.

Types of Data

Before you can work with data, you need to understand the different forms it takes. The type of data determines what tools and techniques you'll use.

| Type | Description | Examples | Common Formats |
| --- | --- | --- | --- |
| Structured | Organized in rows and columns with defined types | Spreadsheets, databases, transaction logs | CSV, SQL databases, Excel |
| Unstructured | No predefined format or schema | Text documents, images, audio, video | PDF, JPG/PNG, MP3, MP4 |
| Semi-structured | Has some organization but isn't fully tabular | JSON API responses, XML feeds, email metadata, log files | JSON, XML, YAML |

Most real-world AI projects combine multiple data types. A customer sentiment analysis project might use structured data (purchase history from a database), unstructured data (support ticket text), and semi-structured data (JSON responses from a review API).

Data Cleaning and Preprocessing

Raw data is almost never ready for machine learning. It contains missing values, outliers, inconsistent formats, and errors. Cleaning and preprocessing is where you fix these issues.

Handling Missing Values

Missing data is ubiquitous. A customer might skip a survey question, a sensor might malfunction, or a form field might be optional. You need a strategy for dealing with it:

| Strategy | When to Use | Risks |
| --- | --- | --- |
| Drop rows | Few missing values (<5%), data is plentiful | May lose valuable data; can introduce bias if missingness is not random |
| Fill with mean/median | Numerical columns, values are roughly normally distributed | Reduces variance; use median if outliers are present |
| Fill with mode | Categorical columns (e.g., country, category) | Can over-represent the most common value |
| Forward/backward fill | Time-series data where adjacent values are relevant | Assumes continuity between time points |
| Flag as missing | Missingness itself might be informative | Adds complexity; model needs to learn what the flag means |

import pandas as pd
import numpy as np

df = pd.read_csv("customers.csv")

# Check for missing values
print(df.isnull().sum())

# Drop rows where target variable is missing
df = df.dropna(subset=["revenue"])

# Fill numerical columns with median
df["age"] = df["age"].fillna(df["age"].median())

# Fill categorical columns with mode
df["country"] = df["country"].fillna(df["country"].mode()[0])

# Create a flag for missing values before filling
df["income_missing"] = df["income"].isnull().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

Handling Outliers

Outliers are data points that are dramatically different from the rest. An age of 150 is clearly an error, but a salary of $10 million might be real (just unusual). The key question is always: is this a data error or a legitimate extreme value?

Detection Methods

  • Z-score: Values more than 3 standard deviations from the mean. Simple, but assumes a normal distribution.
  • IQR method: Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. More robust to non-normal distributions.
  • Visual inspection: Box plots and scatter plots often reveal outliers that statistical methods miss.
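
As a quick sketch (using made-up salary numbers), the IQR method above can be applied with pandas:

```python
import pandas as pd

# Hypothetical salary column (in $k) with one extreme value
s = pd.Series([42, 48, 51, 55, 58, 61, 64, 950])

# IQR method: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [950]
```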

Treatment Options

  • Remove: If clearly erroneous (negative ages, impossible values).
  • Cap (winsorize): Replace extreme values with a percentile threshold (e.g., cap at the 99th percentile).
  • Transform: Apply a log transformation to reduce the impact of extreme values while keeping them.
  • Keep: If the outlier is real and informative, leave it. Some models (tree-based) handle outliers well.
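
The capping and transform options can be sketched in a few lines of pandas (the income values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical income column (in $k) with one extreme but real value
incomes = pd.Series([30, 45, 52, 61, 75] * 20 + [4000])

# Cap (winsorize): clip everything above the 99th percentile
cap = incomes.quantile(0.99)
capped = incomes.clip(upper=cap)

# Transform: log1p compresses the range while keeping the extreme point
logged = np.log1p(incomes)
```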

Normalization and Scaling

Many ML algorithms expect features to be on similar scales. If "age" ranges from 0-100 and "income" ranges from 0-10,000,000, the model will be dominated by income simply because the numbers are larger — not because income is more important.

| Method | What It Does | When to Use |
| --- | --- | --- |
| Min-Max Scaling | Scales to range [0, 1]: (x - min) / (max - min) | Neural networks, algorithms sensitive to feature magnitude |
| Standard Scaling (Z-score) | Centers at 0, unit variance: (x - mean) / std | Most ML algorithms, especially linear models and SVMs |
| Log Transformation | Applies log(x) to compress large ranges | Highly skewed data like income, population, or prices |

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standard scaling (most common)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use same scale as training!

# Min-Max scaling
minmax = MinMaxScaler()
X_train_normalized = minmax.fit_transform(X_train)
X_test_normalized = minmax.transform(X_test)
Scale After Splitting, Not Before
Always fit your scaler on the training data only, then apply it to the test data. If you scale the entire dataset before splitting, information from the test set "leaks" into the training process through the mean and standard deviation, giving you overly optimistic results. This is called data leakage and it's one of the most common mistakes in ML.

Feature Engineering

Feature engineering is the art of creating new, informative variables from your raw data. Good features can dramatically improve model performance — often more than switching to a better algorithm. It's where domain knowledge meets data science.

Common Feature Engineering Techniques

Date/Time Features

Extract meaningful components from timestamps:

df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["hour"] = df["timestamp"].dt.hour
df["is_business_hours"] = df["hour"].between(9, 17).astype(int)

Aggregation Features

Summarize related data to create context:

# Customer-level features from transaction data
customer_features = transactions.groupby("customer_id").agg({
    "amount": ["mean", "sum", "count", "std"],
    "date": ["min", "max"],
}).reset_index()

# Time since last purchase
customer_features["days_since_last"] = (
    pd.Timestamp.now() - customer_features["date"]["max"]
).dt.days

Encoding Categorical Variables

ML models need numbers. Convert categories into numerical features:

# One-hot encoding: create binary columns for each category
df_encoded = pd.get_dummies(df, columns=["color", "size"])
# color_red, color_blue, color_green, size_S, size_M, size_L

# Label encoding: assign integers (for ordinal data)
size_map = {"S": 1, "M": 2, "L": 3, "XL": 4}
df["size_encoded"] = df["size"].map(size_map)

Interaction Features

Combine features to capture relationships:

# Price per square foot (combines two features meaningfully)
df["price_per_sqft"] = df["price"] / df["square_feet"]

# Ratio features
df["debt_to_income"] = df["total_debt"] / df["annual_income"]

# Polynomial features for non-linear relationships
df["age_squared"] = df["age"] ** 2
Domain Knowledge Is Your Superpower
The best features come from understanding the problem domain. A data scientist working on medical data creates better features by talking to doctors. An e-commerce model improves when someone who understands retail creates features like "days until next holiday" or "items in same category previously purchased." Technical skills plus domain knowledge is the winning combination.

Train/Test Splits: Why and How

The most fundamental concept in ML evaluation is the train/test split. You divide your data into two (or three) parts, train your model on one part, and evaluate it on data it has never seen. This is how you measure real-world performance rather than just memorization.

Why Split?

Imagine studying for an exam by memorizing the answer key. You'd get 100% on that specific test — but fail a different test on the same topic. That's what happens when you evaluate a model on its training data: it looks perfect but fails on new data. This is called overfitting. The test set simulates real-world data the model has never seen.

| Split | Typical Size | Purpose |
| --- | --- | --- |
| Training set | 60-80% | Model learns patterns from this data |
| Validation set | 10-20% | Tune hyperparameters and select between models (optional but recommended) |
| Test set | 10-20% | Final, unbiased evaluation — only used once at the end |

from sklearn.model_selection import train_test_split

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # reproducible split
    stratify=y          # maintain class proportions
)

# Three-way split: train / validation / test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.18, random_state=42, stratify=y_temp
)
# Results: ~70% train, ~15% validation, ~15% test
Time-Series Data: Don't Shuffle!
If your data has a time component (stock prices, sales over time, user behavior logs), never use random splitting. Instead, use a time-based split: train on older data, test on newer data. Random splitting lets the model "see the future," which gives unrealistically good results that won't hold in production.
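
A time-based split is just a positional slice on data sorted by date. A minimal sketch, with a made-up daily sales frame:

```python
import pandas as pd

# Hypothetical daily sales data, already sorted by date (sort first if it isn't)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "sales": range(100),
})

# Time-based split: oldest 80% for training, newest 20% for testing
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every training date precedes every test date, so the model never sees the future
assert train["date"].max() < test["date"].min()
```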

Working with Common Data Formats

In practice, data comes from many sources and formats. Here's how to work with the most common ones:

CSV (Comma-Separated Values)

The most common format for tabular data. Simple, readable, and supported everywhere.

import pandas as pd

# Read CSV
df = pd.read_csv("data.csv")

# Handle common issues
df = pd.read_csv("data.csv",
    encoding="utf-8",        # handle special characters
    na_values=["N/A", ""],   # treat these as missing
    parse_dates=["date"],    # auto-parse date columns
    dtype={"zip_code": str}  # prevent treating zip codes as numbers
)

# Write CSV
df.to_csv("cleaned_data.csv", index=False)

JSON (JavaScript Object Notation)

The standard format for API responses. Nested and flexible, but requires flattening for ML use.

import json
import pandas as pd

# Read JSON
df = pd.read_json("data.json")

# For nested JSON (common with APIs)
with open("nested_data.json") as f:
    data = json.load(f)

# Flatten nested structure
df = pd.json_normalize(data["results"], sep="_")

# Fetch from API
import requests
response = requests.get("https://api.example.com/data")
data = response.json()
df = pd.DataFrame(data["items"])

SQL Databases

For production data stored in databases like PostgreSQL, MySQL, or SQLite:

import pandas as pd
import sqlite3

# Connect and query
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM customers WHERE active = 1", conn)
conn.close()

# For larger databases, use SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/dbname")
df = pd.read_sql("SELECT * FROM orders WHERE date > '2025-01-01'", engine)

Data Pipelines

A data pipeline is an automated sequence of steps that takes raw data and transforms it into a format ready for machine learning. Instead of running cleaning steps manually each time, you encode them in a reproducible pipeline.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Build a pipeline: impute missing values → scale → train model
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100))
])

# The pipeline handles everything in one call
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)
Pipelines Prevent Leakage
Using scikit-learn Pipelines ensures that preprocessing steps (like scaling) are fit only on training data and then applied consistently to test data. This eliminates a whole category of data leakage bugs that are easy to introduce when processing steps are done manually.

Common Data Pitfalls

These mistakes are responsible for most ML failures. Learn to recognize and avoid them:

Data Leakage

Information from the test set or from the future influences training. Examples: scaling before splitting, using future data to predict the past, including the target variable (or a proxy for it) as a feature. This gives unrealistically high accuracy that disappears in production.

Selection Bias

Your data doesn't represent the real world. A model trained on reviews from power users won't generalize to casual users. A model trained on data from one geographic region may fail in another. Always ask: "Is this data representative of the population I'm making predictions for?"

Label Errors

Incorrect labels in training data teach the model wrong answers. If 5% of your "spam" labels are actually legitimate emails, the model learns to occasionally flag real emails as spam. Audit a random sample of your labels for accuracy, especially if labels were generated programmatically or by crowd workers.

Class Imbalance

When one class vastly outnumbers others — e.g., 99% legitimate transactions, 1% fraud. A model that always predicts "legitimate" gets 99% accuracy but catches zero fraud. Solutions include oversampling the minority class, undersampling the majority, using class weights, or evaluating with metrics like F1-score instead of accuracy.
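
As one illustration of the class-weight approach, scikit-learn's "balanced" mode weights each class inversely to its frequency (the labels and features here are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 99% class 0, 1% class 1
y = np.array([0] * 990 + [1] * 10)
X = np.random.default_rng(0).normal(size=(1000, 3))

# "balanced" weights each class inversely to its frequency
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # minority-class errors are weighted ~99x more heavily

# Most scikit-learn classifiers accept class_weight directly
model = LogisticRegression(class_weight="balanced").fit(X, y)
```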

Duplicate Data

Duplicates can appear in both training and test sets, causing the model to be evaluated on data it has already seen. Always deduplicate before splitting, and check for near-duplicates (rows that differ only in trivial ways like whitespace or formatting).
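
A minimal sketch of that workflow, using an invented customer table where two rows differ only in case and whitespace:

```python
import pandas as pd

# Hypothetical customer records with a near-duplicate (case/whitespace)
df = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM ", "b@y.com"],
    "plan":  ["pro", "pro", "free"],
})

# Normalize trivial differences so near-duplicates become exact duplicates
df["email"] = df["email"].str.strip().str.lower()

# Deduplicate BEFORE splitting, so no row appears in both train and test
df = df.drop_duplicates()
print(len(df))  # 2 rows remain
```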

Key Takeaways

  1. Data quality is more important than model complexity — a simple model with clean, well-prepared data usually outperforms a sophisticated model with messy data.
  2. Data comes in three forms: structured (tables), unstructured (text, images), and semi-structured (JSON, XML). Most real projects combine multiple types.
  3. Data cleaning is where the real work happens: handling missing values, detecting outliers, normalizing scales, and encoding categorical variables are essential preprocessing steps.
  4. Feature engineering — creating new informative variables from raw data — often has a bigger impact on model performance than choosing a different algorithm.
  5. Always split data into train/test sets before any preprocessing, and fit scalers only on training data to prevent data leakage.
  6. The most common data pitfalls are data leakage, selection bias, label errors, class imbalance, and duplicate records. Learning to avoid these is critical for building reliable ML systems.

