Intermediate · 55 min · Module 4 of 6

Introduction to Machine Learning

Supervised vs unsupervised learning. Regression, classification, clustering with scikit-learn.

Machine learning is the engine behind most modern AI. Rather than being explicitly programmed with rules, ML systems learn patterns from data and use those patterns to make predictions or decisions. This module covers the core concepts, algorithms, and practical tools you need to understand how machine learning works and start building your own models.

What Is Machine Learning?

Traditional programming follows a simple formula: you give the computer rules and data, and it produces answers. Machine learning flips this — you give the computer data and answers (labels), and it learns the rules on its own. Those learned rules can then be applied to new, unseen data.

A useful mental model: think of ML as pattern recognition at scale. Humans are great at recognizing patterns — faces, voices, handwriting — but we struggle with millions of data points across hundreds of dimensions. Machine learning excels exactly where human intuition breaks down.

Supervised vs. Unsupervised Learning

The two foundational paradigms of machine learning differ in one key way: whether or not the training data comes with labels (known answers).

| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Training data | Labeled — each input has a known correct output | Unlabeled — no correct answers provided |
| Goal | Learn the mapping from inputs to outputs | Discover hidden structure or patterns in data |
| Example tasks | Spam detection, price prediction, medical diagnosis | Customer segmentation, anomaly detection, topic modeling |
| Key algorithms | Linear regression, decision trees, SVMs, neural networks | K-means clustering, DBSCAN, PCA, autoencoders |

There Are Other Paradigms Too
Reinforcement learning is a third major paradigm where an agent learns by interacting with an environment and receiving rewards or penalties. This is how game-playing AIs like AlphaGo and robotics systems learn. Semi-supervised learning combines small amounts of labeled data with large amounts of unlabeled data — a practical approach when labeling is expensive.

Regression: Predicting Numbers

Regression is a supervised learning task where the goal is to predict a continuous numeric value. If the answer to your question is a number on a spectrum, you likely need regression.

Examples of Regression Problems

  • Predicting house prices based on square footage, location, and features
  • Estimating a customer's lifetime value
  • Forecasting next quarter's revenue
  • Predicting how long a delivery will take

Linear Regression: The Simplest Model

Linear regression fits a straight line (or hyperplane in higher dimensions) through your data. The line minimizes the sum of squared differences between predicted and actual values (ordinary least squares). Despite its simplicity, linear regression is surprisingly effective for many real-world problems and serves as the foundation for understanding more complex models.

Linear regression with scikit-learn:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Example: predicting house prices
# X = features (square footage, bedrooms, age)
# y = price in thousands
X = np.array([[1400, 3, 10], [1600, 3, 5], [1700, 4, 15],
              [1875, 4, 8], [1100, 2, 20], [2200, 5, 3]])
y = np.array([245, 312, 279, 308, 199, 420])

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
print(f"Predicted prices: {predictions}")
print(f"Model score (R²): {model.score(X_test, y_test):.2f}")

Classification: Predicting Categories

Classification is a supervised learning task where the goal is to assign inputs to discrete categories. If the answer to your question is one of several possible labels, you need classification.

Examples of Classification Problems

  • Binary classification (two classes): spam or not spam, fraud or legitimate, tumor is benign or malignant
  • Multi-class classification (more than two): categorizing emails into folders, identifying animal species from photos, sentiment analysis (positive, negative, neutral)

Logistic Regression for Classification

Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a particular class. If the probability exceeds a threshold (usually 0.5), the input is assigned to that class.

Binary classification with scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load a real dataset — breast cancer diagnosis
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")  # Typically ~95-97%

# Predict on new data
prediction = clf.predict(X_test[:1])
probability = clf.predict_proba(X_test[:1])
print(f"Prediction: {'Benign' if prediction[0] else 'Malignant'}")
print(f"Confidence: {probability[0].max():.2%}")

Clustering: Finding Groups

Clustering is an unsupervised learning task where the algorithm finds natural groupings in your data without being told what the groups should be. The algorithm discovers structure on its own.

Examples of Clustering

  • Segmenting customers into groups based on purchasing behavior
  • Grouping similar news articles together
  • Identifying distinct user behavior patterns on a website
  • Detecting anomalies (data points that don't belong to any cluster)

K-means clustering example:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Example: customer segmentation based on annual spend and frequency
customers = np.array([
    [15000, 52], [22000, 48], [18000, 55],   # High-value frequent
    [2000, 12], [1500, 8], [3000, 15],       # Low-value infrequent
    [8000, 45], [9500, 50], [7000, 40],      # Medium-value frequent
])

# Always scale features before clustering
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# Fit K-means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print(f"Cluster assignments: {labels}")
# Output: each customer assigned to cluster 0, 1, or 2

# Examine cluster centers (inverse-transform to original scale)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Avg spend=${center[0]:.0f}, "
          f"Avg frequency={center[1]:.0f} visits/year")

How Many Clusters?
Choosing the right number of clusters (k) is one of the trickiest parts of clustering. The "elbow method" plots the sum of squared distances for different values of k — the optimal k is where the curve bends sharply. The silhouette score is another useful metric that measures how well each data point fits its assigned cluster.
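The silhouette approach described above can be sketched in a few lines. This is a toy example, not part of the module's code: the synthetic data, the explicit cluster centers, and the range of k values tried are all illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=42,
)

# Try several values of k; inertia always decreases as k grows
# (the elbow method looks for where that decrease levels off),
# while the silhouette score should peak near the true k
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: inertia={kmeans.inertia_:.0f}, silhouette={score:.3f}")
```

For this deliberately clean data the silhouette score peaks at k=3; on real data the peak is often less pronounced, which is why both diagnostics are worth checking.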

Key Algorithms Explained Intuitively

Decision Trees

A decision tree makes predictions by asking a series of yes/no questions about the data, forming a tree-like flowchart. At each node, the algorithm chooses the question that best separates the data. Think of it like a game of 20 questions — each question narrows down the possibilities.
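You can inspect those yes/no questions directly. A small sketch using scikit-learn's `export_text` on the Iris dataset (the shallow `max_depth=2` setting is an assumption chosen to keep the printout readable):

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

data = load_iris()

# A shallow tree so the flowchart stays small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Print the learned yes/no questions as an indented text flowchart
print(export_text(tree, feature_names=list(data.feature_names)))
```

Each indented level is one question; following the answers from the root down to a leaf gives the prediction.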

Strengths: Easy to understand and visualize. Handles both numerical and categorical data. Requires minimal data preprocessing.

Weaknesses: Prone to overfitting — a tree can memorize the training data instead of learning general patterns. Individual trees can be unstable (small changes in data can produce very different trees).

Random Forests

A random forest solves the overfitting problem of decision trees by building hundreds or thousands of trees, each trained on a random subset of the data and features. The final prediction is determined by majority vote (classification) or average (regression) across all trees.

The "wisdom of crowds" effect makes random forests remarkably accurate and robust. They are one of the most reliable out-of-the-box algorithms for tabular data and remain a top choice even in the age of deep learning. In many Kaggle competitions, gradient-boosted trees (a related ensemble method) still outperform neural networks on structured data.

Support Vector Machines (SVMs)

SVMs find the optimal boundary (hyperplane) that separates different classes with the widest possible margin. Imagine scattering red and blue dots on a table — an SVM finds the line that separates them with the most breathing room on each side.

The "kernel trick" allows SVMs to handle data that isn't linearly separable by projecting it into a higher-dimensional space where a clean boundary exists. SVMs work well with small-to-medium datasets and high-dimensional data (like text classification), but can be slow to train on very large datasets.

Comparing algorithms with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Load the classic Iris dataset
data = load_iris()
X, y = data.data, data.target

# Compare three algorithms using 5-fold cross-validation
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    "SVM": SVC(kernel="rbf", random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.2%} (+/- {scores.std():.2%})")

Evaluation Metrics: How Good Is Your Model?

Accuracy alone can be misleading. If 99% of emails are legitimate, a model that always predicts "not spam" is 99% accurate but completely useless. You need metrics that capture different aspects of model performance.
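The spam example above can be made concrete with scikit-learn's `DummyClassifier`, a baseline that ignores the features entirely (the 99:1 toy dataset here is an illustrative assumption):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 99 legitimate emails (0), 1 spam (1)
y = np.array([0] * 99 + [1])
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# A baseline that always predicts the majority class ("not spam")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.0%}")  # 99% — looks great
print(f"Recall:   {recall_score(y, y_pred):.0%}")    # 0% — catches no spam
```

The baseline's 99% accuracy hides the fact that it never catches a single spam message, which is exactly what recall exposes.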

| Metric | What It Measures | When to Use |
| --- | --- | --- |
| Accuracy | Percentage of correct predictions overall | When classes are balanced (roughly equal numbers) |
| Precision | Of all positive predictions, how many were actually positive? | When false positives are costly (e.g., flagging legitimate transactions as fraud) |
| Recall | Of all actual positives, how many did the model catch? | When false negatives are costly (e.g., missing a disease diagnosis) |
| F1 Score | Harmonic mean of precision and recall — balances both | When you need a single metric that accounts for both false positives and false negatives |

Computing evaluation metrics:

from sklearn.metrics import (
    accuracy_score, precision_score,
    recall_score, f1_score, classification_report
)

# Assume y_test and y_pred are your true labels and predictions
# Example with binary classification results
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall:    {recall_score(y_test, y_pred):.2%}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.2%}")

# Full classification report with per-class metrics
print("\n", classification_report(y_test, y_pred,
      target_names=["Negative", "Positive"]))

The Precision-Recall Trade-off
Increasing precision typically decreases recall, and vice versa. A spam filter tuned for high precision will miss some spam (low recall) but rarely marks legitimate emails as spam. A filter tuned for high recall will catch almost all spam but may flag some legitimate emails too. Choose the balance based on the real-world cost of each type of error.
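You can watch this trade-off happen by moving the decision threshold yourself instead of using the default 0.5. A sketch, reusing the breast cancer classifier from earlier (the specific thresholds tried are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Probabilities for the positive class instead of hard 0/1 predictions
proba = clf.predict_proba(X_test)[:, 1]

# Sweep the decision threshold and watch precision and recall shift
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred)
    r = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold makes the model more cautious about predicting the positive class: recall can only go down, while precision typically goes up.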

Overfitting vs. Underfitting

This is arguably the single most important concept in machine learning. Getting this balance right is what separates a model that works in the real world from one that only works on your training data.

Overfitting (Too Complex)

The model memorizes the training data, including its noise and random fluctuations. It performs brilliantly on training data but poorly on new, unseen data.

Analogy: A student who memorizes every practice exam answer but can't solve a new problem they haven't seen before.

Signs: High training accuracy, low test accuracy. Large gap between training and validation performance.

Underfitting (Too Simple)

The model is too simple to capture the underlying patterns in the data. It performs poorly on both training data and new data.

Analogy: A student who only learned the chapter titles but never read the material — they can't answer any questions well.

Signs: Low training accuracy and low test accuracy. The model fails to capture obvious patterns.
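Both failure modes show up clearly when you compare training and test accuracy side by side. A minimal sketch (the breast cancer dataset and the specific tree depths are illustrative choices):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# An unconstrained tree memorizes the training set (overfitting);
# a depth-1 "stump" is too simple to capture the patterns (underfitting)
for name, depth in [("Deep tree", None), ("Stump", 1)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"{name}: train={tree.score(X_train, y_train):.2%}, "
          f"test={tree.score(X_test, y_test):.2%}")
```

The unconstrained tree hits 100% training accuracy but scores noticeably lower on the test set (the overfitting gap), while the stump scores lower on both.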

Strategies to Combat Overfitting

  • More training data — the most effective remedy when available
  • Regularization — adds a penalty for model complexity (L1/L2 regularization)
  • Simpler models — fewer parameters, shallower trees, fewer features
  • Early stopping — stop training when validation performance starts declining
  • Cross-validation — evaluate on multiple data splits (covered next)
  • Dropout (for neural networks) — randomly deactivate neurons during training

Cross-Validation: Reliable Model Evaluation

A single train/test split can be misleading — you might get lucky or unlucky with how the data was divided. Cross-validation solves this by splitting the data multiple ways and averaging the results.

K-Fold Cross-Validation

The most common approach is k-fold cross-validation (typically k=5 or k=10). The data is divided into k equal parts (folds). The model is trained k times, each time using a different fold as the test set and the remaining k-1 folds as training data. The final score is the average across all k runs.

K-fold cross-validation in practice:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%}")
print(f"Std deviation: {scores.std():.2%}")
# A small std deviation means the model performs consistently
# A large std deviation suggests the model is sensitive to the data split

The Scikit-learn Workflow
Almost every scikit-learn project follows the same pattern: (1) load and prepare data, (2) split into train/test, (3) choose a model and fit it, (4) evaluate with cross-validation, (5) tune hyperparameters, (6) evaluate on the held-out test set. Once you learn this workflow, you can apply it to any algorithm — the API is consistent across all models.

The Scikit-learn Ecosystem

Scikit-learn (sklearn) is the most widely used Python library for classical machine learning. Its consistent API, excellent documentation, and comprehensive collection of algorithms make it the standard tool for ML practitioners. As of 2026, it remains the go-to library for tabular data and traditional ML tasks — deep learning frameworks like PyTorch handle neural networks, but scikit-learn dominates everything else.

Core scikit-learn Modules

| Module | Purpose | Key Classes |
| --- | --- | --- |
| sklearn.model_selection | Splitting data, cross-validation, hyperparameter tuning | train_test_split, cross_val_score, GridSearchCV |
| sklearn.preprocessing | Scaling, encoding, transforming features | StandardScaler, LabelEncoder, OneHotEncoder |
| sklearn.metrics | Evaluating model performance | accuracy_score, f1_score, classification_report |
| sklearn.ensemble | Ensemble methods that combine multiple models | RandomForestClassifier, GradientBoostingClassifier |
| sklearn.pipeline | Chaining preprocessing and modeling steps | Pipeline, make_pipeline |

A complete scikit-learn pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, GridSearchCV
)
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report

# Load data
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Build a pipeline: scale features → train model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Hyperparameter tuning with grid search
param_grid = {
    "clf__n_estimators": [50, 100, 200],
    "clf__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2%}")

# Evaluate on test set
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred,
      target_names=data.target_names))

Key Takeaways

  1. Machine learning learns patterns from data rather than following explicit rules — supervised learning uses labeled data, unsupervised learning finds hidden structure in unlabeled data.
  2. Regression predicts continuous numbers (prices, quantities), classification predicts discrete categories (spam/not spam), and clustering discovers natural groups in data.
  3. Decision trees are intuitive but overfit easily; random forests fix this by combining hundreds of trees trained on random subsets of data.
  4. Accuracy alone can be misleading — precision, recall, and F1 score give you a complete picture of model performance, especially with imbalanced data.
  5. Overfitting (memorizing training data) is the most common ML failure mode — combat it with more data, regularization, simpler models, and cross-validation.
  6. Cross-validation evaluates models on multiple data splits, giving you a reliable performance estimate instead of a single potentially misleading number.
  7. Scikit-learn provides a consistent API for the entire ML workflow: preprocessing, model training, evaluation, and hyperparameter tuning.

