Introduction to Machine Learning
Supervised vs unsupervised learning. Regression, classification, clustering with scikit-learn.
Machine learning is the engine behind most modern AI. Rather than being explicitly programmed with rules, ML systems learn patterns from data and use those patterns to make predictions or decisions. This module covers the core concepts, algorithms, and practical tools you need to understand how machine learning works and start building your own models.
What Is Machine Learning?
Traditional programming follows a simple formula: you give the computer rules and data, and it produces answers. Machine learning flips this — you give the computer data and answers (labels), and it learns the rules on its own. Those learned rules can then be applied to new, unseen data.
A useful mental model: think of ML as pattern recognition at scale. Humans are great at recognizing patterns — faces, voices, handwriting — but we struggle with millions of data points across hundreds of dimensions. Machine learning excels exactly where human intuition breaks down.
Supervised vs. Unsupervised Learning
The two foundational paradigms of machine learning differ in one key way: whether or not the training data comes with labels (known answers).
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Training data | Labeled — each input has a known correct output | Unlabeled — no correct answers provided |
| Goal | Learn the mapping from inputs to outputs | Discover hidden structure or patterns in data |
| Example tasks | Spam detection, price prediction, medical diagnosis | Customer segmentation, anomaly detection, topic modeling |
| Key algorithms | Linear regression, decision trees, SVMs, neural networks | K-means clustering, DBSCAN, PCA, autoencoders |
Regression: Predicting Numbers
Regression is a supervised learning task where the goal is to predict a continuous numeric value. If the answer to your question is a number on a spectrum, you likely need regression.
Examples of Regression Problems
- Predicting house prices based on square footage, location, and features
- Estimating a customer's lifetime value
- Forecasting next quarter's revenue
- Predicting how long a delivery will take
Linear Regression: The Simplest Model
Linear regression fits a straight line (or hyperplane in higher dimensions) through your data, choosing the line that minimizes the sum of squared differences between predicted and actual values. Despite its simplicity, linear regression is surprisingly effective for many real-world problems and serves as the foundation for understanding more complex models.
Linear regression with scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Example: predicting house prices
# X = features (square footage, bedrooms, age)
# y = price in thousands
X = np.array([[1400, 3, 10], [1600, 3, 5], [1700, 4, 15],
[1875, 4, 8], [1100, 2, 20], [2200, 5, 3]])
y = np.array([245, 312, 279, 308, 199, 420])
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(f"Predicted prices: {predictions}")
print(f"Model score (R²): {model.score(X_test, y_test):.2f}")
Classification: Predicting Categories
Classification is a supervised learning task where the goal is to assign inputs to discrete categories. If the answer to your question is one of several possible labels, you need classification.
Examples of Classification Problems
- Binary classification (two classes): spam or not spam, fraud or legitimate, tumor is benign or malignant
- Multi-class classification (more than two): categorizing emails into folders, identifying animal species from photos, sentiment analysis (positive, negative, neutral)
Logistic Regression for Classification
Despite the name, logistic regression is a classification algorithm. It predicts the probability that an input belongs to a particular class. If the probability exceeds a threshold (usually 0.5), the input is assigned to that class.
Binary classification with scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
# Load a real dataset — breast cancer diagnosis
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Train a logistic regression classifier
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}") # Typically ~95-97%
# Predict on new data
prediction = clf.predict(X_test[:1])
probability = clf.predict_proba(X_test[:1])
print(f"Prediction: {'Benign' if prediction[0] else 'Malignant'}")
print(f"Confidence: {probability[0].max():.2%}")
Clustering: Finding Groups
Clustering is an unsupervised learning task where the algorithm finds natural groupings in your data without being told what the groups should be. The algorithm discovers structure on its own.
Examples of Clustering
- Segmenting customers into groups based on purchasing behavior
- Grouping similar news articles together
- Identifying distinct user behavior patterns on a website
- Detecting anomalies (data points that don't belong to any cluster)
K-means clustering example:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
# Example: customer segmentation based on annual spend and frequency
customers = np.array([
[15000, 52], [22000, 48], [18000, 55], # High-value frequent
[2000, 12], [1500, 8], [3000, 15], # Low-value infrequent
[8000, 45], [9500, 50], [7000, 40], # Medium-value frequent
])
# Always scale features before clustering
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)
# Fit K-means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)
print(f"Cluster assignments: {labels}")
# Output: each customer assigned to cluster 0, 1, or 2
# Examine cluster centers (inverse-transform to original scale)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Avg spend=${center[0]:.0f}, "
          f"Avg frequency={center[1]:.0f} visits/year")
Key Algorithms Explained Intuitively
Decision Trees
A decision tree makes predictions by asking a series of yes/no questions about the data, forming a tree-like flowchart. At each node, the algorithm chooses the question that best separates the data. Think of it like a game of 20 questions — each question narrows down the possibilities.
Strengths: Easy to understand and visualize. Handles both numerical and categorical data. Requires minimal data preprocessing.
Weaknesses: Prone to overfitting — a tree can memorize the training data instead of learning general patterns. Individual trees can be unstable (small changes in data can produce very different trees).
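The yes/no flowchart described above can be inspected directly. A minimal sketch using scikit-learn's `export_text` helper on the Iris dataset (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rules stay readable
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Print the learned yes/no questions as a text flowchart
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one question in the tree; following the branches for a given flower walks the same path the model takes when predicting.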
Random Forests
A random forest solves the overfitting problem of decision trees by building hundreds or thousands of trees, each trained on a random subset of the data and features. The final prediction is determined by majority vote (classification) or average (regression) across all trees.
The "wisdom of crowds" effect makes random forests remarkably accurate and robust. They are one of the most reliable out-of-the-box algorithms for tabular data and remain a top choice even in the age of deep learning. In many Kaggle competitions, gradient-boosted trees (a related ensemble method) still outperform neural networks on structured data.
Support Vector Machines (SVMs)
SVMs find the optimal boundary (hyperplane) that separates different classes with the widest possible margin. Imagine scattering red and blue dots on a table — an SVM finds the line that separates them with the most breathing room on each side.
The "kernel trick" allows SVMs to handle data that isn't linearly separable by projecting it into a higher-dimensional space where a clean boundary exists. SVMs work well with small-to-medium datasets and high-dimensional data (like text classification), but can be slow to train on very large datasets.
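To see the kernel trick in action, here is a sketch using scikit-learn's synthetic `make_moons` dataset (an illustrative choice: two interleaving half-circles that no straight line can separate):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A linear kernel is limited to a straight boundary; the RBF kernel
# implicitly projects the data into a space where a clean split exists
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    print(f"{kernel} kernel accuracy: {clf.score(X_test, y_test):.2%}")
```

On this data the RBF kernel should score noticeably higher than the linear one, which is the kernel trick doing its work.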
Comparing algorithms with scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
# Load the classic Iris dataset
data = load_iris()
X, y = data.data, data.target
# Compare three algorithms using 5-fold cross-validation
models = {
"Decision Tree": DecisionTreeClassifier(random_state=42),
"Random Forest": RandomForestClassifier(
n_estimators=100, random_state=42
),
"SVM": SVC(kernel="rbf", random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.2%} (+/- {scores.std():.2%})")
Evaluation Metrics: How Good Is Your Model?
Accuracy alone can be misleading. If 99% of emails are legitimate, a model that always predicts "not spam" is 99% accurate but completely useless. You need metrics that capture different aspects of model performance.
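The spam example above is easy to reproduce. A sketch with a synthetic imbalanced label set (the 1%-spam split is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1% of 1000 emails are spam (label 1); the rest are legitimate (label 0)
rng = np.random.default_rng(42)
y_true = np.zeros(1000, dtype=int)
y_true[rng.choice(1000, size=10, replace=False)] = 1

# A "model" that always predicts "not spam"
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.0%}")  # 99%: looks great
print(f"Recall:   {recall_score(y_true, y_pred):.0%}")    # 0%: catches no spam
```

The accuracy number alone hides the fact that the model never catches a single spam email, which is exactly what recall exposes.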
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | Percentage of correct predictions overall | When classes are balanced (roughly equal numbers) |
| Precision | Of all positive predictions, how many were actually positive? | When false positives are costly (e.g., flagging legitimate transactions as fraud) |
| Recall | Of all actual positives, how many did the model catch? | When false negatives are costly (e.g., missing a disease diagnosis) |
| F1 Score | Harmonic mean of precision and recall — balances both | When you need a single metric that accounts for both false positives and false negatives |
Computing evaluation metrics:
from sklearn.metrics import (
accuracy_score, precision_score,
recall_score, f1_score, classification_report
)
# Assume y_test and y_pred are your true labels and predictions
# Example with binary classification results
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall: {recall_score(y_test, y_pred):.2%}")
print(f"F1 Score: {f1_score(y_test, y_pred):.2%}")
# Full classification report with per-class metrics
print("\n", classification_report(y_test, y_pred,
      target_names=["Negative", "Positive"]))
Overfitting vs. Underfitting
This is arguably the single most important concept in machine learning. Getting this balance right is what separates a model that works in the real world from one that only works on your training data.
Overfitting (Too Complex)
The model memorizes the training data, including its noise and random fluctuations. It performs brilliantly on training data but poorly on new, unseen data.
Analogy: A student who memorizes every practice exam answer but can't solve a new problem they haven't seen before.
Signs: High training accuracy, low test accuracy. Large gap between training and validation performance.
Underfitting (Too Simple)
The model is too simple to capture the underlying patterns in the data. It performs poorly on both training data and new data.
Analogy: A student who only learned the chapter titles but never read the material — they can't answer any questions well.
Signs: Low training accuracy and low test accuracy. The model fails to capture obvious patterns.
Strategies to Combat Overfitting
- More training data — the most effective remedy when available
- Regularization — adds a penalty for model complexity (L1/L2 regularization)
- Simpler models — fewer parameters, shallower trees, fewer features
- Early stopping — stop training when validation performance starts declining
- Cross-validation — evaluate on multiple data splits (covered next)
- Dropout (for neural networks) — randomly deactivate neurons during training
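The telltale train/test gap, and the effect of one remedy (a simpler model), can be sketched with an unrestricted versus a depth-limited decision tree; the dataset and depth values are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# An unrestricted tree typically memorizes the training set outright;
# capping its depth trades training fit for a smaller train/test gap
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2%}, "
          f"test={tree.score(X_test, y_test):.2%}")
```

The unrestricted tree scores perfectly on the data it has seen but noticeably worse on held-out data: the signature of overfitting described above.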
Cross-Validation: Reliable Model Evaluation
A single train/test split can be misleading — you might get lucky or unlucky with how the data was divided. Cross-validation solves this by splitting the data multiple ways and averaging the results.
K-Fold Cross-Validation
The most common approach is k-fold cross-validation (typically k=5 or k=10). The data is divided into k equal parts (folds). The model is trained k times, each time using a different fold as the test set and the remaining k-1 folds as training data. The final score is the average across all k runs.
K-fold cross-validation in practice:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
# Load dataset
data = load_wine()
X, y = data.data, data.target
# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%}")
print(f"Std deviation: {scores.std():.2%}")
# A small std deviation means the model performs consistently
# A large std deviation suggests the model is sensitive to the data split
The Scikit-learn Ecosystem
Scikit-learn (sklearn) is the most widely used Python library for classical machine learning. Its consistent API, excellent documentation, and comprehensive collection of algorithms make it the standard tool for ML practitioners. As of 2026, it remains the go-to library for tabular data and traditional ML tasks — deep learning frameworks like PyTorch handle neural networks, but scikit-learn dominates everything else.
Core scikit-learn Modules
| Module | Purpose | Key Classes |
|---|---|---|
| sklearn.model_selection | Splitting data, cross-validation, hyperparameter tuning | train_test_split, cross_val_score, GridSearchCV |
| sklearn.preprocessing | Scaling, encoding, transforming features | StandardScaler, LabelEncoder, OneHotEncoder |
| sklearn.metrics | Evaluating model performance | accuracy_score, f1_score, classification_report |
| sklearn.ensemble | Ensemble methods that combine multiple models | RandomForestClassifier, GradientBoostingClassifier |
| sklearn.pipeline | Chaining preprocessing and modeling steps | Pipeline, make_pipeline |
A complete scikit-learn pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split, GridSearchCV
)
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
# Load data
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Build a pipeline: scale features → train model
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", RandomForestClassifier(random_state=42)),
])
# Hyperparameter tuning with grid search
param_grid = {
"clf__n_estimators": [50, 100, 200],
"clf__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2%}")
# Evaluate on test set
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred,
      target_names=data.target_names))
Resources
Machine Learning Specialization
Andrew Ng (Stanford / Coursera)
The gold standard introduction to machine learning. Andrew Ng's updated 2022 specialization covers supervised learning, unsupervised learning, and best practices with Python and scikit-learn.
StatQuest: Machine Learning
Josh Starmer
Brilliantly clear, visual explanations of ML concepts including decision trees, random forests, SVMs, cross-validation, and evaluation metrics. No prerequisites needed.
Scikit-learn User Guide
Scikit-learn Contributors
The official scikit-learn documentation with detailed explanations and code examples for every algorithm and utility in the library.
Kaggle Learn: Intro to Machine Learning
Kaggle
Free, hands-on micro-course that teaches ML fundamentals with real datasets in an interactive notebook environment. Great for learning by doing.
Key Takeaways
1. Machine learning learns patterns from data rather than following explicit rules — supervised learning uses labeled data, unsupervised learning finds hidden structure in unlabeled data.
2. Regression predicts continuous numbers (prices, quantities), classification predicts discrete categories (spam/not spam), and clustering discovers natural groups in data.
3. Decision trees are intuitive but overfit easily; random forests fix this by combining hundreds of trees trained on random subsets of data.
4. Accuracy alone can be misleading — precision, recall, and F1 score give you a complete picture of model performance, especially with imbalanced data.
5. Overfitting (memorizing training data) is the most common ML failure mode — combat it with more data, regularization, simpler models, and cross-validation.
6. Cross-validation evaluates models on multiple data splits, giving you a reliable performance estimate instead of a single potentially misleading number.
7. Scikit-learn provides a consistent API for the entire ML workflow: preprocessing, model training, evaluation, and hyperparameter tuning.