Advanced · 60 min · Module 5 of 6

Deep Learning Fundamentals

Neural networks, backpropagation, and gradient descent, with an introduction to PyTorch.

Deep learning is the branch of machine learning responsible for the AI revolution. From ChatGPT to self-driving cars to protein structure prediction, neural networks power the most impressive AI systems in the world. This module takes you inside the neural network — how data flows through it, how it learns from mistakes, and how to build one yourself with PyTorch.

What Is a Neural Network?

A neural network is a mathematical function composed of layers of interconnected nodes (neurons). Each neuron receives inputs, multiplies them by learned weights, adds a bias, and passes the result through an activation function. Stack enough of these layers together and you get a system capable of learning extraordinarily complex patterns.

The name "neural network" comes from a loose analogy to biological neurons, but modern neural networks are really just chains of matrix multiplications and nonlinear functions. Don't let the biological metaphor confuse you — think of it as a flexible mathematical model that can approximate almost any function given enough data and parameters.

Architecture of a Simple Neural Network

Structure of a feedforward neural network:

Input Layer        Hidden Layer(s)       Output Layer
(your features)    (learned features)    (predictions)

  x1 ──┐
       ├──→ [h1] ──┐
  x2 ──┤          ├──→ [h4] ──┐
       ├──→ [h2] ──┤          ├──→ [output]
  x3 ──┤          ├──→ [h5] ──┘
       ├──→ [h3] ──┘
  x4 ──┘

Each arrow = a weight (learned parameter)
Each node = weighted sum + activation function
"Deep" learning = many hidden layers
Why "Deep" Learning?
A neural network with one or two hidden layers is just a neural network. When you stack many layers — dozens, hundreds, or even thousands — it becomes a "deep" neural network, hence "deep learning." The depth allows the network to learn hierarchical representations: early layers learn simple features (edges, textures), middle layers combine them into complex patterns (shapes, objects), and later layers compose those into high-level concepts (faces, sentences, decisions).

The Forward Pass

The forward pass is how data flows through the network from input to output. Each layer transforms the data by applying weights, biases, and activation functions. The output of one layer becomes the input to the next.

Step by Step

1. Input

Raw data enters the network. For an image, this might be pixel values. For tabular data, these are your feature values. The input layer has one neuron per feature.

2. Weighted Sum

Each neuron in the next layer computes a weighted sum of its inputs: z = w1*x1 + w2*x2 + ... + wn*xn + bias. The weights determine how much each input matters. The bias shifts the result.

3. Activation Function

The weighted sum passes through a nonlinear activation function: output = activation(z). Without activation functions, stacking layers would just produce another linear function — the nonlinearity is what gives neural networks their expressive power.

4. Repeat Through Layers

Steps 2 and 3 repeat for each layer. The output of one layer becomes the input to the next. The final layer produces the network's prediction.
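The steps above can be sketched directly in PyTorch. This is an illustrative toy network, not one from the module — the layer sizes (4 inputs, 3 hidden neurons, 1 output) are arbitrary:

```python
import torch

torch.manual_seed(0)

# Step 1: input — four feature values
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# Randomly initialized weights and biases for two layers
W1, b1 = torch.randn(4, 3), torch.zeros(3)
W2, b2 = torch.randn(3, 1), torch.zeros(1)

z1 = x @ W1 + b1       # Step 2: weighted sum (matrix multiply + bias)
h1 = torch.relu(z1)    # Step 3: nonlinear activation
output = h1 @ W2 + b2  # Step 4: repeat for the next layer — final prediction

print(output.shape)  # torch.Size([1]) — a single output value
```

In a trained network, the weights would have been learned rather than drawn at random, but the data flow is identical.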

Backpropagation: How Neural Networks Learn

Backpropagation is the algorithm that makes learning possible. After each forward pass, the network compares its prediction to the correct answer and calculates an error (loss). Backpropagation then flows this error backwards through the network, computing how much each weight contributed to the mistake.

The Intuition

Imagine you're playing darts blindfolded. Someone tells you that your dart landed 3 inches to the left and 2 inches too high. You adjust your aim accordingly for the next throw. Backpropagation works the same way — it tells each weight in the network how to adjust to reduce the error.

Mathematically, backpropagation uses the chain rule of calculus to compute the gradient (direction and magnitude of change) of the loss function with respect to each weight. These gradients tell the optimizer which direction to nudge each weight and by how much.
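As a toy illustration of the chain rule at work — one weight, made-up numbers, and a squared-error loss:

```python
# Hand-worked chain rule for a single weight (toy numbers).
# Real frameworks compute this for every weight simultaneously.
w, x, y_true = 0.5, 2.0, 3.0

y_pred = w * x                    # forward pass: 0.5 * 2.0 = 1.0
loss = (y_pred - y_true) ** 2     # squared error: (1.0 - 3.0)^2 = 4.0

# Chain rule: dloss/dw = dloss/dy_pred * dy_pred/dw
dloss_dy = 2 * (y_pred - y_true)  # = -4.0
dy_dw = x                         # = 2.0
grad = dloss_dy * dy_dw           # = -8.0

print(grad)  # -8.0 — the negative sign says: increase w to reduce the loss
```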

You Don't Need to Derive the Math
Frameworks like PyTorch compute backpropagation automatically. You define the forward pass and the loss function, and PyTorch handles the gradient computation through its automatic differentiation engine (autograd). Understanding the concept is valuable; implementing the calculus by hand is not necessary.

Gradient Descent: The Optimization Algorithm

Gradient descent is how the network actually updates its weights based on the gradients computed by backpropagation. Think of it as standing on a hilly landscape in fog — you can't see the lowest point, but you can feel the slope beneath your feet. Gradient descent takes a step downhill in the steepest direction.

Variants of Gradient Descent

| Variant | How It Works | Trade-off |
| --- | --- | --- |
| Batch Gradient Descent | Computes the gradient using the entire training dataset | Precise but slow for large datasets |
| Stochastic (SGD) | Updates weights after each individual example | Fast but noisy — updates can be erratic |
| Mini-batch SGD | Computes the gradient on small batches (e.g., 32 or 64 examples) | The practical default — balances speed and stability |

The learning rate is a critical hyperparameter that controls the step size. Too large and you overshoot the minimum; too small and training takes forever (or gets stuck). Modern optimizers like Adam (Adaptive Moment Estimation) automatically adjust the learning rate per parameter, which is why Adam is the default choice for most deep learning tasks.
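The core update rule is a single line: new weight = old weight - learning rate × gradient. Here is a minimal sketch minimizing a toy one-parameter function, f(w) = (w - 3)², whose minimum is at w = 3:

```python
# Plain gradient descent on f(w) = (w - 3)^2 — a toy example.
w = 0.0
lr = 0.1  # learning rate: the step size

for step in range(100):
    grad = 2 * (w - 3)  # derivative of (w - 3)^2
    w = w - lr * grad   # step downhill, opposite the gradient

print(round(w, 4))  # 3.0 — converged to the minimum
```

Try lr = 1.5 to see overshooting in action, or lr = 0.001 to see painfully slow convergence — the same trade-off described above.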

Activation Functions

Activation functions introduce nonlinearity into the network. Without them, a deep network would behave identically to a single linear transformation, no matter how many layers it has.

| Function | Formula | Range | When to Use |
| --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers. Simple, fast, works well in practice. |
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Output layer for binary classification (probability). |
| Softmax | e^xi / Σ e^xj | (0, 1), sums to 1 | Output layer for multi-class classification (class probabilities). |
| GELU | x · Φ(x) | ≈ (-0.17, ∞) | Used in transformers (GPT, BERT). Smoother than ReLU. |

The Practical Default
For most projects, use ReLU (or its variant GELU) in hidden layers. Use sigmoid for binary classification output and softmax for multi-class classification output. Do not use sigmoid or tanh in hidden layers of deep networks — they suffer from the vanishing gradient problem, where gradients become extremely small in early layers, effectively stopping learning.
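The activations in the table above are only a few lines each. These scalar sketches mirror the formulas (frameworks apply them element-wise to whole tensors):

```python
import math

def relu(x):
    """max(0, x) — clips negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """1 / (1 + e^-x) — squashes any input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def softmax(xs):
    """e^xi / sum(e^xj) — turns scores into a probability distribution."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0))              # 0.0 — negative input clipped
print(sigmoid(0.0))            # 0.5 — the midpoint of (0, 1)
probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))    # 1.0 — a valid probability distribution
```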

Loss Functions

The loss function measures how wrong the network's predictions are. It's the number that backpropagation works to minimize. Choosing the right loss function is critical — it defines what "getting better" means for your model.

| Loss Function | Task Type | What It Does |
| --- | --- | --- |
| Mean Squared Error (MSE) | Regression | Average of squared differences between predicted and actual values. Penalizes large errors heavily. |
| Binary Cross-Entropy | Binary classification | Measures the difference between predicted probabilities and true binary labels (0 or 1). |
| Cross-Entropy | Multi-class classification | Measures the difference between the predicted class probability distribution and the true class. Used with softmax output. |
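Two of the losses above, sketched on toy values to show what each one actually computes:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error — average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    """Penalizes confident wrong probability predictions heavily."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)

print(mse([3.0, 5.0], [2.5, 5.5]))  # 0.25 — each prediction off by 0.5

# Correct, confident predictions (0.9 for class 1, 0.1 for class 0)
# give a small loss:
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
```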

PyTorch: Building Neural Networks in Code

PyTorch is the dominant deep learning framework, used by both researchers and industry practitioners. Developed by Meta AI, it provides a Pythonic, flexible API for building and training neural networks. As of 2026, PyTorch is the default choice for most deep learning projects, having overtaken TensorFlow in both academic and industry adoption.

Core PyTorch Concepts

Tensors — the fundamental data structure:

import torch

# Tensors are like NumPy arrays but can run on GPUs
x = torch.tensor([1.0, 2.0, 3.0])
W = torch.randn(3, 2)   # Random 3x2 matrix (weights)
b = torch.zeros(2)       # Bias vector

# Matrix multiplication + bias (a single layer's computation)
output = x @ W + b
print(output)  # tensor([...]) — two output values

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(device)
print(f"Tensor is on: {x_gpu.device}")

Autograd — automatic differentiation:

import torch

# Tell PyTorch to track operations for gradient computation
x = torch.tensor(3.0, requires_grad=True)

# A simple computation
y = x ** 2 + 2 * x + 1  # y = x² + 2x + 1

# Compute gradients automatically
y.backward()

# dy/dx = 2x + 2 = 2(3) + 2 = 8
print(f"Gradient: {x.grad}")  # tensor(8.)

# This is the engine behind backpropagation — PyTorch
# tracks every operation and can differentiate through
# arbitrarily complex computation graphs

Building a Neural Network with PyTorch

A complete neural network for classification:

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- 1. Prepare data ---
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)

# --- 2. Define the network ---
class IrisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4, 32),     # 4 input features → 32 neurons
            nn.ReLU(),            # Activation function
            nn.Linear(32, 16),    # 32 → 16 neurons
            nn.ReLU(),
            nn.Linear(16, 3),     # 16 → 3 output classes
        )

    def forward(self, x):
        return self.layers(x)

model = IrisNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")

# --- 3. Set up training ---
criterion = nn.CrossEntropyLoss()   # Loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)

# --- 4. Training loop ---
for epoch in range(100):
    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward pass and weight update
    optimizer.zero_grad()   # Reset gradients from last step
    loss.backward()         # Compute gradients (backpropagation)
    optimizer.step()        # Update weights (gradient descent)

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# --- 5. Evaluate ---
model.eval()  # Switch to evaluation mode
with torch.no_grad():  # No need to track gradients
    predictions = model(X_test).argmax(dim=1)
    accuracy = (predictions == y_test).float().mean()
    print(f"Test accuracy: {accuracy:.2%}")

The Training Loop Pattern
Every PyTorch training loop follows the same five steps: (1) forward pass to compute predictions, (2) compute the loss, (3) call optimizer.zero_grad() to clear stale gradients, (4) call loss.backward() to compute fresh gradients via backpropagation, (5) call optimizer.step() to update the weights. Once you know this pattern, you can train any neural network architecture.

Convolutional Neural Networks (CNNs) for Images

Standard neural networks treat each input feature independently. For images, this ignores the spatial structure — the fact that nearby pixels are related. Convolutional Neural Networks solve this by using small learnable filters that slide across the image, detecting local patterns like edges, textures, and shapes.

How CNNs Work

  • Convolutional layers apply small filters (e.g., 3x3 pixels) that scan across the image. Each filter learns to detect a specific pattern — early layers detect edges and colors, deeper layers detect complex features like eyes or wheels.
  • Pooling layers reduce the spatial dimensions (e.g., from 32x32 to 16x16) by taking the maximum or average value in each region. This reduces computation and makes the network more robust to small shifts in the input.
  • Fully connected layers at the end combine the extracted features into a final classification decision.
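The three layer types above compose into a complete CNN in a few lines. This is an illustrative sketch — the sizes (32×32 RGB input, 10 classes, 16 and 32 filters) are assumptions, not values from the module:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 filters scan the image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # deeper features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected classifier
)

x = torch.randn(1, 3, 32, 32)  # one fake RGB image (batch, channels, H, W)
print(cnn(x).shape)            # torch.Size([1, 10]) — one score per class
```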

CNNs power image classification, object detection, medical imaging analysis, and many computer vision tasks. Landmark architectures include ResNet, EfficientNet, and Vision Transformers (ViTs), which apply the transformer architecture from NLP to images and now achieve state-of-the-art results on many benchmarks.

Recurrent Neural Networks (RNNs) and LSTMs for Sequences

Standard neural networks process each input independently. But many tasks involve sequences where order matters — text, time series, audio, DNA sequences. Recurrent Neural Networks handle this by maintaining a hidden state that carries information from previous steps.

The Vanishing Gradient Problem

Basic RNNs struggle with long sequences because gradients either vanish (become too small to cause learning) or explode (become too large and destabilize training) as they propagate through many time steps. Long Short-Term Memory (LSTM) networks solve this with a gating mechanism that controls what information to remember, forget, and output at each step.
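PyTorch's built-in nn.LSTM handles the gating internally, so using one is a few lines. A minimal usage sketch — the sizes (sequences of 10 steps with 8 features each, hidden size 16) are illustrative:

```python
import torch
import torch.nn as nn

# One LSTM layer; batch_first=True means input shape (batch, time, features)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)      # batch of 4 sequences, 10 steps, 8 features
outputs, (h_n, c_n) = lstm(x)  # hidden state flows across the 10 steps

print(outputs.shape)  # torch.Size([4, 10, 16]) — one output per time step
print(h_n.shape)      # torch.Size([1, 4, 16]) — final hidden state
```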

Transformers Have Largely Replaced RNNs
While RNNs and LSTMs are important to understand historically, transformers (the architecture behind GPT, Claude, and most modern LLMs) have largely replaced them for sequence tasks. Transformers process all positions in parallel using self-attention, making them faster to train and better at capturing long-range dependencies. If you're starting a new sequence project in 2026, you would almost certainly choose a transformer-based approach.

Why Deep Learning Works: The Universal Approximation Theorem

The theoretical foundation for neural networks is the universal approximation theorem. In plain language, it states that a neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function to any desired accuracy.

Think of it this way: any pattern that exists in your data — no matter how complex — can theoretically be learned by a neural network, provided it has enough capacity (neurons and layers) and enough training data. This is why deep learning works on everything from language to images to protein structures — it's a universal pattern learner.

The practical caveats are important, though: the theorem says such a network exists, but not that gradient descent will find it efficiently. Training deep networks requires careful architecture design, good hyperparameters, sufficient data, and appropriate regularization. The theorem guarantees the ceiling is high; engineering determines how close you get.

When to Use Deep Learning vs. Classical ML

| Use Deep Learning When... | Use Classical ML When... |
| --- | --- |
| You have large amounts of data (tens of thousands+ examples) | You have small-to-medium datasets (hundreds to low thousands) |
| The data is unstructured (images, text, audio) | The data is structured/tabular (spreadsheets, databases) |
| You need to learn features automatically from raw data | You can define meaningful features manually |
| You have GPU compute resources available | You need fast training and inference on CPUs |

Deep Learning Is Not Always Better
For tabular data with well-defined features, gradient-boosted trees (XGBoost, LightGBM) still frequently outperform deep learning in both accuracy and training speed. Don't assume deep learning is always the right choice — start simple, establish a baseline with classical ML, and move to deep learning only if you need it.

Key Takeaways

  1. A neural network is a chain of matrix multiplications and nonlinear activation functions that learns patterns from data through iterative weight adjustments.
  2. The forward pass computes predictions by flowing data through layers; backpropagation flows errors backwards to compute how each weight should change.
  3. Gradient descent updates weights in the direction that reduces the loss — the Adam optimizer is the practical default for most deep learning tasks.
  4. ReLU is the standard activation function for hidden layers; sigmoid is for binary output; softmax is for multi-class output. Choose your loss function to match the task type.
  5. PyTorch is the dominant deep learning framework — its training loop pattern (forward pass, compute loss, backward, step, zero gradients) applies to any architecture.
  6. CNNs excel at image tasks by learning spatial features through convolutional filters; transformers have largely replaced RNNs for sequence tasks due to parallel processing and better long-range modeling.
  7. The universal approximation theorem guarantees neural networks can learn any pattern in theory — but practical success depends on architecture, data, and training choices.

Test Your Understanding

Module Assessment

5 questions · Score 70% or higher to complete this module

You can retake the quiz as many times as you need. Your best score is saved.
