Introduction to Machine Learning
Supervised vs unsupervised learning. Regression, classification, clustering with scikit-learn.
Machine learning is the engine behind most modern AI. Rather than being explicitly programmed with rules, ML systems learn patterns from data and use those patterns to make predictions or decisions. This module covers the core concepts, algorithms, and practical tools you need to understand how machine learning works and start building your own models.
What Is Machine Learning?
Traditional programming follows a simple formula: you give the computer rules and data, and it produces answers. Machine learning flips this — you give the computer data and answers (labels), and it learns the rules on its own. Those learned rules can then be applied to new, unseen data.
A useful mental model: think of ML as pattern recognition at scale. Humans are great at recognizing patterns — faces, voices, handwriting — but we struggle with millions of data points across hundreds of dimensions. Machine learning excels exactly where human intuition breaks down.
Supervised vs. Unsupervised Learning
The two foundational paradigms of machine learning differ in one key way: whether or not the training data comes with labels (known answers).
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Training data | Labeled — each input has a known correct output | Unlabeled — no correct answers provided |
| Goal | Learn the mapping from inputs to outputs | Discover hidden structure or patterns in data |
| Example tasks | Spam detection, price prediction, medical diagnosis | Customer segmentation, anomaly detection, topic modeling |
| Key algorithms | Linear regression, decision trees, SVMs, neural networks | K-means clustering, DBSCAN, PCA, autoencoders |
Regression: Predicting Numbers
Regression is a supervised learning task where the goal is to predict a continuous numeric value. If the answer to your question is a number on a spectrum, you likely need regression.
Examples of Regression Problems
- Predicting house prices based on square footage, location, and features
- Estimating a customer's lifetime value
- Forecasting next quarter's revenue
- Predicting how long a delivery will take
Linear Regression: The Simplest Model
Linear regression fits a straight line (or hyperplane in higher dimensions) through your data, choosing the line that minimizes the sum of squared differences between predicted and actual values. Despite its simplicity, linear regression is surprisingly effective for many real-world problems and serves as the foundation for understanding more complex models.
Linear regression with scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Example: predicting house prices
# X = features (square footage, bedrooms, age)
# y = price in thousands
X = np.array([[1400, 3, 10], [1600, 3, 5], [1700, 4, 15],
[1875, 4, 8], [1100, 2, 20], [2200, 5, 3]])
y = np.array([245, 312, 279, 308, 199, 420])
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(f"Predicted prices: {predictions}")
print(f"Model score (R²): {model.score(X_test, y_test):.2f}")
Classification: Predicting Categories
Classification is a supervised learning task where the goal is to assign inputs to discrete categories. If the answer to your question is one of several possible labels, you need classification.
Examples of Classification Problems
- Binary classification (two classes): spam or not spam, fraud or legitimate, tumor is benign or malignant
- Multi-class classification (more than two): categorizing emails into folders, identifying animal species from photos, sentiment analysis (positive, negative, neutral)
Logistic Regression for Classification
Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a particular class. If the probability exceeds a threshold (usually 0.5), the input is assigned to that class.
Binary classification with scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# Load a real dataset — breast cancer diagnosis
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Train a logistic regression classifier
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}") # Typically ~95-97%
# Predict on new data
prediction = clf.predict(X_test[:1])
probability = clf.predict_proba(X_test[:1])
print(f"Prediction: {'Benign' if prediction[0] else 'Malignant'}")
print(f"Confidence: {probability[0].max():.2%}")
Clustering: Finding Groups
Clustering is an unsupervised learning task where the algorithm finds natural groupings in your data without being told what the groups should be. The algorithm discovers structure on its own.
Examples of Clustering
- Segmenting customers into groups based on purchasing behavior
- Grouping similar news articles together
- Identifying distinct user behavior patterns on a website
- Detecting anomalies (data points that don't belong to any cluster)
K-means clustering example:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
# Example: customer segmentation based on annual spend and frequency
customers = np.array([
[15000, 52], [22000, 48], [18000, 55], # High-value frequent
[2000, 12], [1500, 8], [3000, 15], # Low-value infrequent
[8000, 45], [9500, 50], [7000, 40], # Medium-value frequent
])
# Always scale features before clustering
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)
# Fit K-means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)
print(f"Cluster assignments: {labels}")
# Output: each customer assigned to cluster 0, 1, or 2
# Examine cluster centers (inverse-transform to original scale)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Avg spend=${center[0]:.0f}, "
          f"Avg frequency={center[1]:.0f} visits/year")
Key Algorithms Explained Intuitively
Decision Trees
A decision tree makes predictions by asking a series of yes/no questions about the data, forming a tree-like flowchart. At each node, the algorithm chooses the question that best separates the data. Think of it like a game of 20 questions — each question narrows down the possibilities.
Strengths: Easy to understand and visualize. Handles both numerical and categorical data. Requires minimal data preprocessing.
Weaknesses: Prone to overfitting — a tree can memorize the training data instead of learning general patterns. Individual trees can be unstable (small changes in data can produce very different trees).
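The yes/no flowchart described above can be inspected directly. A minimal sketch using scikit-learn's `export_text` helper on the Iris dataset (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rules stay readable
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Print the learned yes/no questions as a text flowchart
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one question in the tree; following the branches for a given flower walks the same path the model takes when predicting.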
Random Forests
A random forest solves the overfitting problem of decision trees by building hundreds or thousands of trees, each trained on a random subset of the data and features. The final prediction is determined by majority vote (classification) or average (regression) across all trees.
The "wisdom of crowds" effect makes random forests remarkably accurate and robust. They are one of the most reliable out-of-the-box algorithms for tabular data and remain a top choice even in the age of deep learning. In many Kaggle competitions, gradient-boosted trees (a related ensemble method) still outperform neural networks on structured data.
Support Vector Machines (SVMs)
SVMs find the optimal boundary (hyperplane) that separates different classes with the widest possible margin. Imagine scattering red and blue dots on a table — an SVM finds the line that separates them with the most breathing room on each side.
The "kernel trick" allows SVMs to handle data that isn't linearly separable by projecting it into a higher-dimensional space where a clean boundary exists. SVMs work well with small-to-medium datasets and high-dimensional data (like text classification), but can be slow to train on very large datasets.
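To see the kernel trick in action, here is a sketch using scikit-learn's synthetic `make_moons` dataset (an illustrative choice: two interleaving half-circles that no straight line can separate):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A linear kernel is limited to a straight boundary; the RBF kernel
# implicitly projects the data into a space where a clean split exists
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    print(f"{kernel} kernel accuracy: {clf.score(X_test, y_test):.2%}")
```

On this data the RBF kernel should score noticeably higher than the linear one, which is the kernel trick doing its work.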
Comparing algorithms with scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
# Load the classic Iris dataset
data = load_iris()
X, y = data.data, data.target
# Compare three algorithms using 5-fold cross-validation
models = {
"Decision Tree": DecisionTreeClassifier(random_state=42),
"Random Forest": RandomForestClassifier(
n_estimators=100, random_state=42
),
"SVM": SVC(kernel="rbf", random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.2%} (+/- {scores.std():.2%})")
Evaluation Metrics: How Good Is Your Model?
Accuracy alone can be misleading. If 99% of emails are legitimate, a model that always predicts "not spam" is 99% accurate but completely useless. You need metrics that capture different aspects of model performance.
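The spam example above is easy to reproduce. A sketch with a synthetic imbalanced label set (the 1%-spam split is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1% of 1000 emails are spam (label 1); the rest are legitimate (label 0)
rng = np.random.default_rng(42)
y_true = np.zeros(1000, dtype=int)
y_true[rng.choice(1000, size=10, replace=False)] = 1

# A "model" that always predicts "not spam"
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.0%}")  # 99%: looks great
print(f"Recall:   {recall_score(y_true, y_pred):.0%}")    # 0%: catches no spam
```

The accuracy number alone hides the fact that the model never catches a single spam email, which is exactly what recall exposes.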
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | Percentage of correct predictions overall | When classes are balanced (roughly equal numbers) |
| Precision | Of all positive predictions, how many were actually positive? | When false positives are costly (e.g., flagging legitimate transactions as fraud) |
| Recall | Of all actual positives, how many did the model catch? | When false negatives are costly (e.g., missing a disease diagnosis) |
| F1 Score | Harmonic mean of precision and recall — balances both | When you need a single metric that accounts for both false positives and false negatives |
Computing evaluation metrics:
from sklearn.metrics import (
accuracy_score, precision_score,
recall_score, f1_score, classification_report
)
# Assume y_test and y_pred are your true labels and predictions
# Example with binary classification results
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall: {recall_score(y_test, y_pred):.2%}")
print(f"F1 Score: {f1_score(y_test, y_pred):.2%}")
# Full classification report with per-class metrics
print("\n", classification_report(y_test, y_pred,
      target_names=["Negative", "Positive"]))
Overfitting vs. Underfitting
This is arguably the single most important concept in machine learning. Getting this balance right is what separates a model that works in the real world from one that only works on your training data.
Overfitting (Too Complex)
The model memorizes the training data, including its noise and random fluctuations. It performs brilliantly on training data but poorly on new, unseen data.
Analogy: A student who memorizes every practice exam answer but can't solve a new problem they haven't seen before.
Signs: High training accuracy, low test accuracy. Large gap between training and validation performance.
Underfitting (Too Simple)
The model is too simple to capture the underlying patterns in the data. It performs poorly on both training data and new data.
Analogy: A student who only learned the chapter titles but never read the material — they can't answer any questions well.
Signs: Low training accuracy and low test accuracy. The model fails to capture obvious patterns.
Strategies to Combat Overfitting
- More training data — the most effective remedy when available
- Regularization — adds a penalty for model complexity (L1/L2 regularization)
- Simpler models — fewer parameters, shallower trees, fewer features
- Early stopping — stop training when validation performance starts declining
- Cross-validation — evaluate on multiple data splits (covered next)
- Dropout (for neural networks) — randomly deactivate neurons during training
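The telltale train/test gap, and the effect of one remedy (a simpler model), can be sketched with an unrestricted versus a depth-limited decision tree; the dataset and depth values are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# An unrestricted tree typically memorizes the training set outright;
# capping its depth trades training fit for a smaller train/test gap
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2%}, "
          f"test={tree.score(X_test, y_test):.2%}")
```

The unrestricted tree scores perfectly on the data it has seen but noticeably worse on held-out data: the signature of overfitting described above.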
Cross-Validation: Reliable Model Evaluation
A single train/test split can be misleading — you might get lucky or unlucky with how the data was divided. Cross-validation solves this by splitting the data multiple ways and averaging the results.
K-Fold Cross-Validation
The most common approach is k-fold cross-validation (typically k=5 or k=10). The data is divided into k equal parts (folds). The model is trained k times, each time using a different fold as the test set and the remaining k-1 folds as training data. The final score is the average across all k runs.
K-fold cross-validation in practice:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
# Load dataset
data = load_wine()
X, y = data.data, data.target
# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%}")
print(f"Std deviation: {scores.std():.2%}")
# A small std deviation means the model performs consistently
# A large std deviation suggests the model is sensitive to the data split
The Scikit-learn Ecosystem
Scikit-learn (sklearn) is the most widely used Python library for classical machine learning. Its consistent API, excellent documentation, and comprehensive collection of algorithms make it the standard tool for ML practitioners. As of 2026, it remains the go-to library for tabular data and traditional ML tasks — deep learning frameworks like PyTorch handle neural networks, but scikit-learn dominates everything else.
Core scikit-learn Modules
| Module | Purpose | Key Classes |
|---|---|---|
| sklearn.model_selection | Splitting data, cross-validation, hyperparameter tuning | train_test_split, cross_val_score, GridSearchCV |
| sklearn.preprocessing | Scaling, encoding, transforming features | StandardScaler, LabelEncoder, OneHotEncoder |
| sklearn.metrics | Evaluating model performance | accuracy_score, f1_score, classification_report |
| sklearn.ensemble | Ensemble methods that combine multiple models | RandomForestClassifier, GradientBoostingClassifier |
| sklearn.pipeline | Chaining preprocessing and modeling steps | Pipeline, make_pipeline |
A complete scikit-learn pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split, GridSearchCV
)
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
# Load data
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Build a pipeline: scale features → train model
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", RandomForestClassifier(random_state=42)),
])
# Hyperparameter tuning with grid search
param_grid = {
"clf__n_estimators": [50, 100, 200],
"clf__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2%}")
# Evaluate on test set
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred,
      target_names=data.target_names))
Resources
Machine Learning Specialization
Andrew Ng (Stanford / Coursera)
The gold standard introduction to machine learning. Andrew Ng's updated 2022 specialization covers supervised learning, unsupervised learning, and best practices with Python and scikit-learn.
StatQuest: Machine Learning
Josh Starmer
Brilliantly clear, visual explanations of ML concepts including decision trees, random forests, SVMs, cross-validation, and evaluation metrics. No prerequisites needed.
Scikit-learn User Guide
Scikit-learn Contributors
The official scikit-learn documentation with detailed explanations and code examples for every algorithm and utility in the library.
Kaggle Learn: Intro to Machine Learning
Kaggle
Free, hands-on micro-course that teaches ML fundamentals with real datasets in an interactive notebook environment. Great for learning by doing.
Key Takeaways
1. Machine learning learns patterns from data rather than following explicit rules — supervised learning uses labeled data, unsupervised learning finds hidden structure in unlabeled data.
2. Regression predicts continuous numbers (prices, quantities), classification predicts discrete categories (spam/not spam), and clustering discovers natural groups in data.
3. Decision trees are intuitive but overfit easily; random forests fix this by combining hundreds of trees trained on random subsets of data.
4. Accuracy alone can be misleading — precision, recall, and F1 score give you a complete picture of model performance, especially with imbalanced data.
5. Overfitting (memorizing training data) is the most common ML failure mode — combat it with more data, regularization, simpler models, and cross-validation.
6. Cross-validation evaluates models on multiple data splits, giving you a reliable performance estimate instead of a single potentially misleading number.
7. Scikit-learn provides a consistent API for the entire ML workflow: preprocessing, model training, evaluation, and hyperparameter tuning.