Deep Learning Fundamentals
Neural networks, backpropagation, gradient descent. PyTorch introduction.
Deep learning is the branch of machine learning responsible for the AI revolution. From ChatGPT to self-driving cars to protein structure prediction, neural networks power the most impressive AI systems in the world. This module takes you inside the neural network — how data flows through it, how it learns from mistakes, and how to build one yourself with PyTorch.
What Is a Neural Network?
A neural network is a mathematical function composed of layers of interconnected nodes (neurons). Each neuron receives inputs, multiplies them by learned weights, adds a bias, and passes the result through an activation function. Stack enough of these layers together and you get a system capable of learning extraordinarily complex patterns.
The name "neural network" comes from a loose analogy to biological neurons, but modern neural networks are really just chains of matrix multiplications and nonlinear functions. Don't let the biological metaphor confuse you — think of it as a flexible mathematical model that can approximate almost any function given enough data and parameters.
Architecture of a Simple Neural Network
Structure of a feedforward neural network:
Input Layer        Hidden Layer(s)       Output Layer
(your features)    (learned features)    (predictions)

x1 ──┐
     ├──→ [h1] ──┐
x2 ──┤           ├──→ [h4] ──┐
     ├──→ [h2] ──┤           ├──→ [output]
x3 ──┤           ├──→ [h5] ──┘
     ├──→ [h3] ──┘
x4 ──┘

Each arrow = a weight (learned parameter)
Each node = weighted sum + activation function
"Deep" learning = many hidden layers
The Forward Pass
The forward pass is how data flows through the network from input to output. Each layer transforms the data by applying weights, biases, and activation functions. The output of one layer becomes the input to the next.
Step by Step
Input
Raw data enters the network. For an image, this might be pixel values. For tabular data, these are your feature values. The input layer has one neuron per feature.
Weighted Sum
Each neuron in the next layer computes a weighted sum of its inputs: z = w1*x1 + w2*x2 + ... + wn*xn + bias. The weights determine how much each input matters. The bias shifts the result.
Activation Function
The weighted sum passes through a nonlinear activation function: output = activation(z). Without activation functions, stacking layers would just produce another linear function — the nonlinearity is what gives neural networks their expressive power.
Repeat Through Layers
Steps 2 and 3 repeat for each layer. The output of one layer becomes the input to the next. The final layer produces the network's prediction.
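The steps above can be sketched in a few lines of plain Python. This is a toy single neuron with made-up weights, not a trained network:

```python
def relu(z):
    """Nonlinear activation: pass positives through, zero out negatives."""
    return max(0.0, z)

# Step 1: input features (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0]

# Step 2: weighted sum for one hidden neuron
weights = [0.5, -0.25, 0.1, 0.3]
bias = 0.2
z = sum(w * xi for w, xi in zip(weights, x)) + bias  # 0.5 - 0.5 + 0.3 + 1.2 + 0.2

# Step 3: activation
h = relu(z)
print(h)  # ≈ 1.7 (positive, so ReLU passes it through unchanged)
```

A real layer does this for every neuron at once, which is why the computation is expressed as a matrix multiplication in practice.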
Backpropagation: How Neural Networks Learn
Backpropagation is the algorithm that makes learning possible. After each forward pass, the network compares its prediction to the correct answer and calculates an error (loss). Backpropagation then flows this error backwards through the network, computing how much each weight contributed to the mistake.
The Intuition
Imagine you're playing darts blindfolded. Someone tells you that your dart landed 3 inches to the left and 2 inches too high. You adjust your aim accordingly for the next throw. Backpropagation works the same way — it tells each weight in the network how to adjust to reduce the error.
Mathematically, backpropagation uses the chain rule of calculus to compute the gradient (direction and magnitude of change) of the loss function with respect to each weight. These gradients tell the optimizer which direction to nudge each weight and by how much.
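Here is the chain rule in action for the smallest possible case: a single neuron with a squared-error loss. All the values are made up for illustration:

```python
# One neuron, one input: prediction = w * x + b, loss = (prediction - y)**2
# Chain rule: dL/dw = dL/dpred * dpred/dw
x, y = 2.0, 10.0          # one training example (hypothetical values)
w, b = 3.0, 1.0           # current parameters

pred = w * x + b          # forward pass: 7.0
loss = (pred - y) ** 2    # loss: 9.0

# Backward pass via the chain rule
dL_dpred = 2 * (pred - y)      # -6.0
dpred_dw = x                   # 2.0
dpred_db = 1.0
dL_dw = dL_dpred * dpred_dw    # -12.0: negative, so increasing w reduces the loss
dL_db = dL_dpred * dpred_db    # -6.0
print(dL_dw, dL_db)  # -12.0 -6.0
```

Backpropagation applies exactly this logic layer by layer, reusing the gradient of each layer's output when computing the gradients of the layer before it.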
Gradient Descent: The Optimization Algorithm
Gradient descent is how the network actually updates its weights based on the gradients computed by backpropagation. Think of it as standing on a hilly landscape in fog — you can't see the lowest point, but you can feel the slope beneath your feet. Gradient descent takes a step downhill in the steepest direction.
Variants of Gradient Descent
| Variant | How It Works | Trade-off |
|---|---|---|
| Batch Gradient Descent | Computes gradient using the entire training dataset | Precise but slow for large datasets |
| Stochastic (SGD) | Updates weights after each individual example | Fast but noisy — updates can be erratic |
| Mini-batch SGD | Computes gradient on small batches (e.g., 32 or 64 examples) | The practical default — balances speed and stability |
The learning rate is a critical hyperparameter that controls the step size. Too large and you overshoot the minimum; too small and training takes forever (or gets stuck). Modern optimizers like Adam (Adaptive Moment Estimation) automatically adjust the learning rate per parameter, which is why Adam is the default choice for most deep learning tasks.
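The update rule itself is a single line: new_weight = weight - learning_rate * gradient. A minimal sketch, minimizing a simple one-dimensional function rather than a real network's loss:

```python
# Minimize f(x) = (x - 3)**2, whose gradient is f'(x) = 2 * (x - 3).
# The minimum is at x = 3; each step moves against the gradient.
def grad(x):
    return 2 * (x - 3)

x = 0.0               # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    x = x - learning_rate * grad(x)  # the gradient descent update rule

print(round(x, 4))  # 3.0 (converges to the minimum)
```

Try learning_rate = 1.5 and the iterates overshoot and diverge; try 0.001 and fifty steps barely move x. That sensitivity is what adaptive optimizers like Adam are designed to soften.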
Activation Functions
Activation functions introduce nonlinearity into the network. Without them, a deep network would behave identically to a single linear transformation, no matter how many layers it has.
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, infinity) | Default for hidden layers. Simple, fast, works well in practice. |
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Output layer for binary classification (probability). |
| Softmax | e^xi / sum(e^xj) | (0, 1), sums to 1 | Output layer for multi-class classification (class probabilities). |
| GELU | x * phi(x) | approx (-0.17, infinity) | Used in transformers (GPT, BERT). Smoother than ReLU. |
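The formulas in the table are short enough to implement directly. A sketch in plain Python:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(2.0))   # 0.0 2.0
print(sigmoid(0.0))            # 0.5
probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 2))    # 1.0 (softmax outputs always sum to 1)
```

In practice you would use the framework's versions (torch.relu, torch.sigmoid, torch.softmax), which are vectorized and numerically stable, but the math is exactly this.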
Loss Functions
The loss function measures how wrong the network's predictions are. It's the number that backpropagation works to minimize. Choosing the right loss function is critical — it defines what "getting better" means for your model.
| Loss Function | Task Type | What It Does |
|---|---|---|
| Mean Squared Error (MSE) | Regression | Average of squared differences between predicted and actual values. Penalizes large errors heavily. |
| Binary Cross-Entropy | Binary classification | Measures difference between predicted probabilities and true binary labels (0 or 1). |
| Cross-Entropy | Multi-class classification | Measures difference between predicted class probability distribution and the true class. Used with softmax output. |
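Two of these losses sketched in plain Python, with hand-picked toy values:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Binary cross-entropy: y_true is 0 or 1, p_pred is a predicted probability."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

print(mse([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 = 2.5

# A confident correct prediction gives low loss; a confident wrong one, high loss
print(round(binary_cross_entropy([1], [0.9]), 3))  # 0.105
print(round(binary_cross_entropy([1], [0.1]), 3))  # 2.303
```

Note how cross-entropy punishes confident mistakes much harder than hesitant ones; that asymmetry is what pushes the model toward calibrated probabilities.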
PyTorch: Building Neural Networks in Code
PyTorch is the dominant deep learning framework, used by both researchers and industry practitioners. Developed by Meta AI, it provides a Pythonic, flexible API for building and training neural networks. As of 2026, PyTorch is the default choice for most deep learning projects, having overtaken TensorFlow in both academic and industry adoption.
Core PyTorch Concepts
Tensors — the fundamental data structure:
import torch
# Tensors are like NumPy arrays but can run on GPUs
x = torch.tensor([1.0, 2.0, 3.0])
W = torch.randn(3, 2) # Random 3x2 matrix (weights)
b = torch.zeros(2) # Bias vector
# Matrix multiplication + bias (a single layer's computation)
output = x @ W + b
print(output) # tensor([...]) — two output values
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(device)
print(f"Tensor is on: {x_gpu.device}")
Autograd — automatic differentiation:
import torch
# Tell PyTorch to track operations for gradient computation
x = torch.tensor(3.0, requires_grad=True)
# A simple computation
y = x ** 2 + 2 * x + 1 # y = x² + 2x + 1
# Compute gradients automatically
y.backward()
# dy/dx = 2x + 2 = 2(3) + 2 = 8
print(f"Gradient: {x.grad}") # tensor(8.)
# This is the engine behind backpropagation — PyTorch
# tracks every operation and can differentiate through
# arbitrarily complex computation graphs
Building a Neural Network with PyTorch
A complete neural network for classification:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# --- 1. Prepare data ---
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)
# --- 2. Define the network ---
class IrisNet(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(4, 32), # 4 input features → 32 neurons
nn.ReLU(), # Activation function
nn.Linear(32, 16), # 32 → 16 neurons
nn.ReLU(),
nn.Linear(16, 3), # 16 → 3 output classes
)
def forward(self, x):
return self.layers(x)
model = IrisNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
# --- 3. Set up training ---
criterion = nn.CrossEntropyLoss() # Loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
# --- 4. Training loop ---
for epoch in range(100):
# Forward pass
outputs = model(X_train)
loss = criterion(outputs, y_train)
# Backward pass and weight update
optimizer.zero_grad() # Reset gradients from last step
loss.backward() # Compute gradients (backpropagation)
optimizer.step() # Update weights (gradient descent)
if (epoch + 1) % 20 == 0:
print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
# --- 5. Evaluate ---
model.eval() # Switch to evaluation mode
with torch.no_grad(): # No need to track gradients
predictions = model(X_test).argmax(dim=1)
accuracy = (predictions == y_test).float().mean()
print(f"Test accuracy: {accuracy:.2%}")
Convolutional Neural Networks (CNNs) for Images
Standard neural networks treat each input feature independently. For images, this ignores the spatial structure — the fact that nearby pixels are related. Convolutional Neural Networks solve this by using small learnable filters that slide across the image, detecting local patterns like edges, textures, and shapes.
How CNNs Work
- Convolutional layers apply small filters (e.g., 3x3 pixels) that scan across the image. Each filter learns to detect a specific pattern — early layers detect edges and colors, deeper layers detect complex features like eyes or wheels.
- Pooling layers reduce the spatial dimensions (e.g., from 32x32 to 16x16) by taking the maximum or average value in each region. This reduces computation and makes the network more robust to small shifts in the input.
- Fully connected layers at the end combine the extracted features into a final classification decision.
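A minimal shape check with PyTorch's built-in layers shows the convolution and pooling steps in action (the channel counts and image size here are arbitrary):

```python
import torch
import torch.nn as nn

# A toy grayscale "image" batch: 1 image, 1 channel, 28x28 pixels
x = torch.randn(1, 1, 28, 28)

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

features = conv(x)        # 8 learned 3x3 filters scan the image
print(features.shape)     # torch.Size([1, 8, 28, 28]): padding=1 keeps the size
downsampled = pool(features)
print(downsampled.shape)  # torch.Size([1, 8, 14, 14]): pooling halves each dimension
```

Stacking several conv/pool pairs and finishing with nn.Linear layers gives the classic CNN classifier described above.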
CNNs power image classification, object detection, medical imaging analysis, and many computer vision tasks. Landmark architectures include ResNet, EfficientNet, and Vision Transformers (ViTs), which apply the transformer architecture from NLP to images and now achieve state-of-the-art results on many benchmarks.
Recurrent Neural Networks (RNNs) and LSTMs for Sequences
Standard neural networks process each input independently. But many tasks involve sequences where order matters — text, time series, audio, DNA sequences. Recurrent Neural Networks handle this by maintaining a hidden state that carries information from previous steps.
The Vanishing Gradient Problem
Basic RNNs struggle with long sequences because gradients either vanish (become too small to cause learning) or explode (become too large and destabilize training) as they propagate through many time steps. Long Short-Term Memory (LSTM) networks solve this with a gating mechanism that controls what information to remember, forget, and output at each step.
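A minimal PyTorch sketch showing the shapes an LSTM produces; the batch size, sequence length, and feature sizes are arbitrary:

```python
import torch
import torch.nn as nn

# A batch of 4 sequences, each 10 time steps long, 16 features per step
x = torch.randn(4, 10, 16)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
outputs, (h_n, c_n) = lstm(x)

print(outputs.shape)  # torch.Size([4, 10, 32]): hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 32]): final hidden state per sequence
```

The final hidden state h_n is what you would typically feed into a linear layer for sequence classification, while the per-step outputs are used for tasks like tagging each token.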
Why Deep Learning Works: The Universal Approximation Theorem
The theoretical foundation for neural networks is the universal approximation theorem. In plain language, it states that a neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function to any desired accuracy.
Think of it this way: any pattern that exists in your data — no matter how complex — can theoretically be learned by a neural network, provided it has enough capacity (neurons and layers) and enough training data. This is why deep learning works on everything from language to images to protein structures — it's a universal pattern learner.
The practical caveats matter, though: the theorem says such a network exists, but not that gradient descent will find it efficiently. Training deep networks requires careful architecture design, good hyperparameters, sufficient data, and appropriate regularization. The theorem guarantees the ceiling is high; engineering determines how close you get.
When to Use Deep Learning vs. Classical ML
| Use Deep Learning When... | Use Classical ML When... |
|---|---|
| You have large amounts of data (tens of thousands+ examples) | You have small-to-medium datasets (hundreds to low thousands) |
| The data is unstructured (images, text, audio) | The data is structured/tabular (spreadsheets, databases) |
| You need to learn features automatically from raw data | You can define meaningful features manually |
| You have GPU compute resources available | You need fast training and inference on CPUs |
Resources
Neural Networks: Zero to Hero
Andrej Karpathy
A masterclass in building neural networks from scratch. Karpathy (former Tesla AI director) walks through backpropagation, GPT-style language models, and more — all coded from the ground up in Python.
Practical Deep Learning for Coders
fast.ai (Jeremy Howard)
A top-down, practical approach to deep learning. Start building working models from lesson one, then gradually peel back the abstractions to understand how everything works.
Neural Networks (3Blue1Brown)
Grant Sanderson
Beautiful visual explanations of how neural networks learn. Covers forward propagation, backpropagation, and gradient descent with stunning animations that build deep intuition.
PyTorch Tutorials
PyTorch Team
Official PyTorch tutorials covering everything from tensors and autograd basics to building CNNs, RNNs, and transformer models. The 60-minute blitz is a great starting point.
Key Takeaways
1. A neural network is a chain of matrix multiplications and nonlinear activation functions that learns patterns from data through iterative weight adjustments.
2. The forward pass computes predictions by flowing data through layers; backpropagation flows errors backwards to compute how each weight should change.
3. Gradient descent updates weights in the direction that reduces the loss — the Adam optimizer is the practical default for most deep learning tasks.
4. ReLU is the standard activation function for hidden layers; sigmoid is for binary output; softmax is for multi-class output. Choose your loss function to match the task type.
5. PyTorch is the dominant deep learning framework — its training loop pattern (forward pass, compute loss, backward, step, zero gradients) applies to any architecture.
6. CNNs excel at image tasks by learning spatial features through convolutional filters; transformers have largely replaced RNNs for sequence tasks due to parallel processing and better long-range modeling.
7. The universal approximation theorem guarantees neural networks can learn any pattern in theory — but practical success depends on architecture, data, and training choices.