8 PyTorch

This is an EARLY DRAFT.
This chapter will show you how to use PyTorch to build neural network models. PyTorch has emerged as a powerful framework for deep learning research and applications, developed by Facebook's AI Research lab (FAIR).
In this chapter, we will systematically explore the fundamental components of PyTorch, beginning with its basic data structure—the tensor—and gradually building toward complete neural network implementations. Our approach will emphasize understanding the underlying mechanisms rather than simply applying pre-built solutions, providing a foundation for both practical applications and further exploration.
8.1 Deep Learning Software Frameworks
In the early days of deep learning, researchers often had to write their own routines for matrix operations, differentiation, and training loops. Over time, dedicated software libraries emerged to make these tasks much easier. Early examples include Theano (from the University of Montreal) and Torch (developed by the broader machine-learning community). These pioneers introduced the idea of automatic differentiation and laid the foundation for modern frameworks.
Today, popular libraries include:
- PyTorch
- TensorFlow
- JAX
- Keras
- PyTorch Lightning (built on PyTorch)
A low-level framework typically gives you considerable control over all the details, such as how the computational graph is built or how gradients are computed. This can be powerful for research and custom applications but may require more code to set up even basic neural networks. JAX and “pure” TensorFlow or PyTorch (without additional abstractions) lean in this direction.
In contrast, high-level frameworks provide a simpler interface that handles many details for you (such as training loops, logging, or distributed computing). Libraries like Keras or PyTorch Lightning let you write less code by wrapping common steps—like setting up optimizers or running epochs—in pre-built routines. This can be appealing for rapid prototyping or for ensuring more standardized code.
In this book, we will use PyTorch and PyTorch Lightning as these are the most popular neural network frameworks at the time of writing. However, the basic concepts are common across frameworks, and having learnt PyTorch you will not find it too difficult to switch to a different framework if you have to.
PyTorch is installed by default on Google Colab. You can install it on your machine using
pip install torch
8.2 Tensors: Multi-dimensional arrays
At the heart of PyTorch’s computational framework lies the tensor—a multi-dimensional array that serves as the primary data structure for all operations. While superficially similar to NumPy’s ndarray, PyTorch tensors incorporate crucial additional functionality specifically designed for deep learning workloads.
8.2.1 Tensor Fundamentals
A tensor can be conceptualized as a generalization of vectors and matrices to potentially higher dimensions. The dimensionality of a tensor is referred to as its “rank”:
- A scalar is a rank-0 tensor (single value)
- A vector is a rank-1 tensor (one-dimensional array)
- A matrix is a rank-2 tensor (two-dimensional array)
- Higher-dimensional arrays constitute rank-n tensors
Tensors in PyTorch are instantiated through various factory methods, each serving different initialization needs:
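import torch

# Creating a rank-1 tensor (vector)
x = torch.tensor([1, 2, 3, 4])

# Creating a rank-2 tensor (matrix) with specific values
y = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# Creating tensors with uniform initialization
zeros = torch.zeros(2, 3)   # 2×3 tensor filled with zeros
ones = torch.ones(2, 3)     # 2×3 tensor filled with ones

# Creating tensors with random values
rand_uniform = torch.rand(2, 3)   # Uniform distribution [0, 1)
rand_normal = torch.randn(2, 3)   # Standard normal distribution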
Each tensor possesses several intrinsic attributes that characterize its structure and properties:
# Examining tensor attributes
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(f"Shape: {x.shape}") # Size along each dimension: torch.Size([2, 3])
print(f"Rank/Dimensionality: {x.ndim}") # Number of dimensions: 2
print(f"Data type: {x.dtype}") # Underlying data type: torch.int64
print(f"Number of elements: {x.numel()}") # Total element count: 6
The shape attribute is particularly important as it defines the tensor's dimensional structure—in this case, a 2×3 matrix with 2 rows and 3 columns.
8.2.2 Tensor Operations
PyTorch implements a comprehensive set of operations for tensor manipulation, broadly categorized into element-wise operations, reduction operations, and linear algebra operations.
Element-wise Operations: These apply a function independently to each element:
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

# Addition, subtraction, multiplication, division
c = a + b   # torch.tensor([5, 7, 9])
d = a * b   # torch.tensor([4, 10, 18])
Reduction Operations: These reduce a tensor along specified dimensions:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Sum of all elements
total = x.sum()   # tensor(21)

# Mean along rows (dimension 1); mean requires a floating-point tensor, hence the cast
row_means = x.float().mean(dim=1)   # tensor([2., 5.])
Linear Algebra Operations: These perform matrix operations essential for neural networks:
m1 = torch.tensor([[1, 2], [3, 4]])
m2 = torch.tensor([[5, 6], [7, 8]])

# Matrix multiplication
mm = torch.matmul(m1, m2)   # tensor([[19, 22], [43, 50]])

# Equivalent syntax using @ operator
mm_alt = m1 @ m2            # tensor([[19, 22], [43, 50]])
8.2.3 Tensor Memory Management
A critical aspect of working with tensors, particularly when implementing complex neural networks, is understanding how PyTorch manages tensor memory and views.
Reshaping and Views: PyTorch offers two primary mechanisms for rearranging tensor dimensions:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Creating a new shape
reshaped = x.reshape(3, 2)   # tensor([[1, 2], [3, 4], [5, 6]])

# Creating a view (shares the same memory)
view = x.view(3, 2)          # tensor([[1, 2], [3, 4], [5, 6]])
The distinction between reshape and view is subtle but important:
- view() creates a new tensor that shares the same underlying data with the original tensor. It requires the tensor to be contiguous in memory.
- reshape() may or may not share memory with the original tensor. If the tensor is not contiguous in memory, reshape() will create a copy.
This distinction becomes particularly relevant when modifying elements:
# Modifying an element in the view
view[0, 0] = 99
# Original tensor is also modified because the view shares the same memory
print(x) # tensor([[99, 2, 3], [4, 5, 6]])
Device Management: One of PyTorch’s strengths is its seamless handling of different computational devices, particularly CPUs and GPUs:
# Creating a tensor on CPU (default)
cpu_tensor = torch.tensor([1, 2, 3])

# Moving to GPU if available
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.to('cuda')
    # Operations on gpu_tensor will execute on the GPU
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables dramatic performance increases by harnessing the power of GPUs. When PyTorch code uses CUDA, computations are offloaded to the GPU, which can perform thousands of operations simultaneously, making it ideal for the parallel nature of neural network calculations.
PyTorch’s device management system handles two critical aspects:
- Data Placement: Tensors can reside on different hardware devices (CPU or GPU). Unlike traditional programming where all data lives in the same memory space, GPUs have their own dedicated memory that's separate from system RAM. When you move a tensor to a GPU with .to('cuda'), PyTorch physically transfers that data from CPU memory to GPU memory. This is a crucial concept because:
  - GPU memory is typically more limited than system RAM
  - Data transfers between CPU and GPU involve overhead
  - Operations can only be accelerated if the data actually resides in GPU memory
- Computation Location: Operations execute on the device where the tensors are located. The computational benefits of GPUs only apply to tensors that have been explicitly placed in GPU memory.
When operating on tensors that reside on different devices, PyTorch enforces explicit data movement:
# Attempting operations on tensors from different devices
if torch.cuda.is_available():
    cpu_tensor = torch.tensor([1, 2, 3])
    gpu_tensor = torch.tensor([4, 5, 6], device='cuda')

    # This would raise a RuntimeError:
    # result = cpu_tensor + gpu_tensor

    # Correct approaches:
    result_on_gpu = cpu_tensor.to('cuda') + gpu_tensor   # Computation on GPU
    result_on_cpu = cpu_tensor + gpu_tensor.to('cpu')    # Computation on CPU
This explicit device management prevents unintended performance bottlenecks from device-to-device transfers and gives developers precise control over where computations occur.
This device management system allows for flexible deployment across heterogeneous computing environments, from development laptops to GPU clusters, with minimal code changes.
8.3 Autograd: Automatic Differentiation
As we have seen in the last chapter, neural networks are trained with variants of stochastic gradient descent, for which we need to compute the gradient of the function defined by the network. We also saw how automatic differentiation software can automatically calculate gradients from the code implementing a function. PyTorch includes such a framework, called Autograd, tightly integrated with its tensors.
8.3.1 Computational Graphs and Automatic Differentiation
At its core, Autograd builds a computational graph that tracks how values are calculated. Think of this graph as a flowchart:
- Each operation (like addition or multiplication) is a box in the flowchart
- The data flows along arrows between these boxes
- The graph is “directed” because data flows in one direction (from inputs to outputs)
- The graph is “acyclic” because data never loops back (no circular dependencies)
When you create a tensor with requires_grad=True, PyTorch automatically records all operations performed on that tensor, building this flowchart-like structure behind the scenes.
Consider a simple computational example:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Define a computation
z = x**2 + y**3
In this example, PyTorch constructs a computational graph where:
- Leaf nodes are the input tensors x and y
- Intermediate nodes represent operations (x**2, y**3, and addition)
- The output node is the tensor z
By recording all the computational steps, the graph allows the autograd system to calculate the gradient of the final outputs with respect to the inputs using the chain rule from calculus.
8.3.2 Computing Gradients
Once a computational graph is constructed, PyTorch can automatically compute gradients. It uses reverse-mode automatic differentiation for efficiency, as discussed in the previous chapter. The calculation is initiated by calling the .backward() method on a scalar output tensor:
# Continuing from the previous example
z.backward()
# Accessing gradients
print(f"dz/dx: {x.grad}") # Should be 4.0 (derivative of x^2 is 2x)
print(f"dz/dy: {y.grad}") # Should be 27.0 (derivative of y^3 is 3y^2)
Mathematically, we can verify these gradients:
- For function \(z = x^2 + y^3\)
- \(\frac{\partial z}{\partial x} = 2x = 2 \cdot 2 = 4\)
- \(\frac{\partial z}{\partial y} = 3y^2 = 3 \cdot 3^2 = 27\)
This automatic calculation of gradients is fundamental to training neural networks, as it provides the necessary derivatives for parameter updates without requiring manual implementation of derivative calculations for each operation.
8.3.3 Gradient Behavior and Control
Several aspects of gradient behavior require careful consideration when implementing neural networks:
Gradient Accumulation: By default, PyTorch accumulates gradients when .backward() is called multiple times:
x = torch.tensor(2.0, requires_grad=True)

# First operation and backward pass
y1 = x**2
y1.backward()
print(f"Gradient after first backward: {x.grad}")   # 4.0

# Second operation and backward pass (gradients accumulate)
y2 = x**3
y2.backward()
print(f"Gradient after second backward: {x.grad}")  # 4.0 + 12.0 = 16.0
This accumulation behavior is particularly valuable for neural networks because many loss functions are inherently sums over data points or batches. For example, when computing the mean squared error over a batch, we’re summing individual squared errors.
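As a minimal sketch of this point (toy values assumed), accumulating per-sample gradients with repeated backward() calls gives the same result as a single backward() call on the summed loss:

w = torch.tensor(1.0, requires_grad=True)
data = torch.tensor([1.0, 2.0, 3.0])

# Accumulate gradients one sample at a time
for sample in data:
    loss = (w * sample) ** 2   # per-sample squared error (target 0 for simplicity)
    loss.backward()            # adds this sample's gradient to w.grad
print(w.grad)                  # tensor(28.)

# Same gradient from a single backward pass on the summed loss
w2 = torch.tensor(1.0, requires_grad=True)
((w2 * data) ** 2).sum().backward()
print(w2.grad)                 # tensor(28.)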
This accumulation necessitates explicit gradient zeroing at the beginning of training loops:
# Zeroing gradients
x.grad.zero_()
Detaching Computation: In some scenarios, we may want to temporarily stop gradient tracking:
# Detaching from computation graph
x = torch.tensor(2.0, requires_grad=True)
y = x**2
z = y.detach() + 5   # z has no gradient relationship to x
The detach() method creates a new tensor that shares the same data but does not track computational history. This is useful for implementing techniques like semi-supervised learning or when working with pre-trained model components.
Gradient Context Management: For evaluation phases where gradient computation is unnecessary and potentially wasteful:
with torch.no_grad():
    # No operations inside this block will track gradients
    evaluation_result = model(test_data)
The torch.no_grad() context manager temporarily disables gradient tracking, reducing memory consumption and computational overhead during inference or evaluation.
8.4 Neural Network Building Blocks: Modules and Layers
Neural networks in PyTorch are constructed using a hierarchical system of building blocks, with the nn.Module class serving as the foundational component. This design facilitates both the organization of model architecture and the management of learnable parameters.
8.4.1 The nn.Module Class
The nn.Module class is a versatile container that serves several critical functions:
- Parameter management: Automatically tracks and registers learnable parameters
- Hierarchical composition: Enables nested structures of modules within modules
- Training state: Manages mode switching between training and evaluation
- Serialization: Facilitates saving and loading model states
Every custom neural network component in PyTorch inherits from this base class:
import torch.nn as nn

class SimpleNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Initialize the parent class
        super().__init__()
        # Define layers as module attributes
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Define the computation flow
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
This structure demonstrates several key aspects of PyTorch’s module system:
- The constructor (__init__) establishes the module's structure by defining its constituent components.
- The forward method defines the computation flow when the module is called.
- Modules can contain other modules (here, each nn.Linear layer is itself a module).
When instantiated, this network automatically registers all parameters from its submodules:
model = SimpleNetwork(10, 20, 1)

# Access parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
This might output:
layer1.weight: torch.Size([20, 10])
layer1.bias: torch.Size([20])
layer2.weight: torch.Size([1, 20])
layer2.bias: torch.Size([1])
The module system handles parameter registration without requiring manual intervention, simplifying model development and maintenance.
8.4.2 Parameters and Buffers
PyTorch distinguishes between two types of tensor state within modules:
Parameters: Learnable tensors that will be updated during training:
# A parameter is automatically registered with requires_grad=True
self.weight = nn.Parameter(torch.randn(output_size, input_size))
self.bias = nn.Parameter(torch.randn(output_size))
Buffers: Non-learnable tensors that are part of the module’s state:
# A buffer is registered but not included in parameters()
self.register_buffer('running_mean', torch.zeros(num_features))
Buffers typically store statistical information or fixed values that should be serialized (saved and loaded) with the model but not updated by optimizers.
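The following minimal sketch (a hypothetical RunningNorm module, not a PyTorch built-in) contrasts the two: the parameter appears in named_parameters() and will be updated by an optimizer, while the buffer is saved with the model's state but receives no gradients:

import torch
import torch.nn as nn

class RunningNorm(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Learnable scale: registered as a parameter
        self.scale = nn.Parameter(torch.ones(num_features))
        # Running statistic: registered as a buffer
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):
        return (x - self.running_mean) * self.scale

m = RunningNorm(3)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']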
8.4.3 Layer Initialization and the Linear Layer
The nn.Linear layer, one of the most fundamental components in neural networks, implements a fully connected (dense) layer that performs a linear transformation:
\(y = xW^T + b\)
where \(x\) is the input, \(W\) is the weight matrix, and \(b\) is the bias vector.
Examining its attributes provides insight into PyTorch’s layer design:
linear = nn.Linear(in_features=10, out_features=5)
# Examining the layer parameters
print(f"Weight shape: {linear.weight.shape}") # torch.Size([5, 10])
print(f"Bias shape: {linear.bias.shape}") # torch.Size([5])
The weight matrix dimensions are [out_features, in_features], which may initially seem counterintuitive. This arrangement, however, facilitates efficient batch processing when the input x has a batch dimension as the first dimension:
batch_size = 32
x = torch.randn(batch_size, 10)   # 32 samples, each with 10 features
output = linear(x)                # Shape: [32, 5]
The matrix multiplication is effectively \((32 \times 10) \cdot (10 \times 5) = (32 \times 5)\), where the \((10 \times 5)\) factor is \(W^T\), preserving the batch dimension.
8.4.3.1 Parameter Initialization
Parameter initialization plays a crucial role in neural network training dynamics. PyTorch layers initialize parameters with specific schemes by default, but these can be customized:
import torch.nn.init as init
# Custom initialization of a layer's parameters
linear = nn.Linear(10, 5)

# Xavier/Glorot initialization
init.xavier_uniform_(linear.weight)
# Initialize biases to zero
init.zeros_(linear.bias)
Common initialization methods include:
- Xavier/Glorot initialization: Designed to maintain variance across layers, particularly effective with tanh or sigmoid activations
- Kaiming/He initialization: Adapted for ReLU-based networks
- Constant initialization: Often used for biases, typically with zeros
The choice of initialization strategy can significantly impact training convergence and should be aligned with the activation functions and network architecture.
For linear layers, by default, weights are initialized using Kaiming uniform initialization (also called He initialization), which is designed for ReLU activations. Biases are drawn from a uniform distribution whose bounds depend on the layer's fan-in.
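If you want to apply Kaiming initialization explicitly, for example after changing the activation function, a short sketch (reusing the linear layer defined above) looks like this:

# Kaiming/He initialization tuned for ReLU activations
init.kaiming_uniform_(linear.weight, nonlinearity='relu')
init.zeros_(linear.bias)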
8.4.4 The Callable Interface: Module Forward Pass
A module object can be called like a function. When it is called with input data, e.g., output = model(input), PyTorch invokes a series of methods:
- The __call__ method of the parent nn.Module class is triggered.
- __call__ performs several housekeeping operations:
  - Ensures the module is properly initialized
  - Triggers registered hooks (pre-forward hooks)
  - Calls the module's forward method with the provided inputs - this is the only method you must implement when creating a custom module
  - Triggers post-forward hooks
  - Returns the result of the forward method, which is typically the output tensor(s) produced by the module's computation
Hooks are customizable callback functions that allow you to “hook into” specific points in PyTorch’s execution flow.
Hooks enable you to:
- Monitor internal values during training
- Modify inputs or outputs on-the-fly
- Collect statistics without changing the model code
- Debug complex networks by inspecting intermediate values
This sophisticated mechanism enables transparent integration of functionality like hook registration, module mode management, and debugging tools without cluttering the user-defined forward method.
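As a brief illustration, the sketch below (using a small assumed model) registers a forward hook that prints the output shape of a layer without touching the model's code:

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

def shape_logger(module, inputs, output):
    # Called automatically after the module's forward pass
    print(f"{module.__class__.__name__} output shape: {tuple(output.shape)}")

handle = model[0].register_forward_hook(shape_logger)
_ = model(torch.randn(4, 10))   # prints: Linear output shape: (4, 20)
handle.remove()                 # detach the hook when it is no longer needed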
8.4.5 Building Complex Architectures
Complex neural network architectures can be constructed by composing modules in various ways:
Sequential Containers: For linear sequences of layers:
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 15),
    nn.ReLU(),
    nn.Linear(15, 1)
)
This nn.Sequential container is a complete, trainable module without requiring any subclassing of nn.Module. However, sequential models only support simple linear topologies where each layer feeds directly into the next. For more complicated architectures you'll still need to create a custom nn.Module subclass.
Module Lists and Dictionaries: For more flexible arrangements:
class DynamicNetwork(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        # Create a ModuleList of layers
        self.layers = nn.ModuleList([
            nn.Linear(layer_sizes[i], layer_sizes[i+1])
            for i in range(len(layer_sizes)-1)
        ])

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Apply ReLU to all but the last layer
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x
Unlike regular Python lists and dictionaries, nn.ModuleList and nn.ModuleDict are special container classes that:
- Register parameters: When you add a module to these containers, all its parameters are automatically registered with the parent module, ensuring they're properly tracked for gradient updates.
- Device management: When you move a model to a device (e.g., model.to('cuda')), all modules in these containers are automatically moved too.
- State management: These containers participate in module state changes like train() and eval() mode, propagating these calls to all contained modules.
- Serialization: When saving or loading a model, modules in these containers are properly included in the serialization process.

If you used regular Python lists or dictionaries instead, the modules would exist but wouldn't be recognized as part of the model's parameter hierarchy, leading to parameters that don't get updated during training.
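For completeness, here is a minimal sketch (a hypothetical multi-head model) using nn.ModuleDict to look up a sub-network by name; every head is registered as a submodule automatically:

class MultiHeadNetwork(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.backbone = nn.Linear(input_dim, 16)
        self.heads = nn.ModuleDict({
            'regression': nn.Linear(16, 1),
            'classification': nn.Linear(16, 3),
        })

    def forward(self, x, task):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

model = MultiHeadNetwork(8)
out = model(torch.randn(4, 8), task='classification')   # shape: [4, 3]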
These building blocks provide a flexible system for architecting neural networks of arbitrary complexity while maintaining a clean separation between structural definition and computational flow.
8.5 Activation Functions
Activation functions introduce non-linearity into neural networks, a crucial property that enables these models to learn complex patterns and relationships in data. Without non-linear activations, multiple layers would simply collapse into a single linear transformation, regardless of network depth.
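A quick sketch (dimensions chosen arbitrarily) makes this concrete: two stacked linear layers with no activation in between compute exactly the same function as a single linear layer whose weight matrix is the product of the two:

import torch
import torch.nn as nn

lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 2, bias=False)

x = torch.randn(3, 4)
stacked = lin2(lin1(x))

# The composition collapses to a single weight matrix W2 @ W1
collapsed = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, collapsed, atol=1e-6))   # True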
PyTorch provides both functional and module-based implementations of activation functions. The functional variants operate directly on tensors, while the module variants can be incorporated into nn.Module
hierarchies:
import torch.nn.functional as F
# Functional form
x = torch.randn(5)
relu_output = F.relu(x)

# Module form
relu_layer = nn.ReLU()
same_output = relu_layer(x)
The most widely used activation functions include:
ReLU (Rectified Linear Unit): \(\text{ReLU}(x) = \max(0, x)\)
activation = nn.ReLU()
# or functional form: F.relu(x)
ReLU is computationally efficient and mitigates the vanishing gradient problem that plagued earlier neural networks with sigmoid activations. However, it can suffer from the “dying ReLU” problem, where neurons permanently output zero for all inputs.
Leaky ReLU: \(\text{LeakyReLU}(x) = \max(\alpha x, x)\) where \(\alpha\) is a small constant (typically 0.01)
activation = nn.LeakyReLU(negative_slope=0.01)
# or functional form: F.leaky_relu(x, negative_slope=0.01)
Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient when the unit is not active.
Sigmoid: \(\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}\)
activation = nn.Sigmoid()
# or functional form: F.sigmoid(x)
The sigmoid function maps inputs to the range (0, 1), making it suitable for binary classification output layers. However, it suffers from vanishing gradients for inputs with large magnitudes.
Tanh (Hyperbolic Tangent): \(\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
activation = nn.Tanh()
# or functional form: F.tanh(x)
Similar to sigmoid but mapping to the range (-1, 1), tanh is often preferred over sigmoid for hidden layers due to its zero-centered output.
8.6 Loss Functions
PyTorch also provides commonly used loss functions in its nn module, specialized for different prediction tasks:
Mean Squared Error (MSE): For regression tasks:
\(L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
criterion = nn.MSELoss()
predictions = torch.tensor([0.5, 1.8, 2.5])
targets = torch.tensor([1.0, 2.0, 2.0])
loss = criterion(predictions, targets)
MSE penalizes larger errors more heavily due to the squared term, making it sensitive to outliers.
Binary Cross-Entropy (BCE): For binary classification tasks:
\(L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]\)
# For predictions already passed through sigmoid
criterion = nn.BCELoss()
probabilities = torch.tensor([0.2, 0.7, 0.9])
targets = torch.tensor([0.0, 1.0, 1.0])
loss = criterion(probabilities, targets)
BCE with Logits: Combines sigmoid activation with BCE for numerical stability:
# For raw logits (pre-sigmoid outputs)
criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([1.5, 0.3, -0.5])
targets = torch.tensor([1.0, 0.0, 0.0])
loss = criterion(logits, targets)
This implementation is more numerically stable than separate sigmoid and BCE operations, particularly for extreme input values.
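A small check of this claim (illustrative values assumed): for logits of large magnitude, the separate sigmoid saturates to exactly 0 or 1 in float32 and the plain BCE loses the true loss value, while the fused version recovers it:

extreme_logits = torch.tensor([30.0, -30.0])
targets = torch.tensor([0.0, 1.0])

# Separate sigmoid + BCE: sigmoid saturates, so the result is clamped/uninformative
separate = nn.BCELoss()(torch.sigmoid(extreme_logits), targets)

# Fused version works on the logits directly and stays numerically accurate
fused = nn.BCEWithLogitsLoss()(extreme_logits, targets)

print(separate)  # a large, clamped value that no longer reflects the logits
print(fused)     # tensor(30.), matching the analytic result log(1 + e^30) ≈ 30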
Cross-Entropy: For multi-class classification tasks:
\(L_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})\)
criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[1.5, 0.3, -0.5], [0.2, 2.3, 0.5]])   # Batch of 2, 3 classes
targets = torch.tensor([0, 1])                               # Class indices for each sample
loss = criterion(logits, targets)
CrossEntropyLoss combines softmax activation with negative log-likelihood loss, providing a more numerically stable implementation than separate operations.
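We can verify this equivalence directly; the brief sketch below (reusing the logits and targets from above) compares CrossEntropyLoss with an explicit log-softmax followed by the negative log-likelihood loss:

import torch.nn.functional as F

ce = nn.CrossEntropyLoss()(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.isclose(ce, nll))   # tensor(True)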
8.7 Optimization
Optimization algorithms update model parameters to minimize the loss function. While the backpropagation algorithm computes gradients, optimizers determine how these gradients are used to adjust parameters.
8.7.1 The Optimization Process
The general training process in PyTorch follows this pattern:
# Conceptual implementation (not actual PyTorch code)
for epoch in range(num_epochs):
    # Forward pass
    predictions = model(inputs)
    loss = loss_function(predictions, targets)

    # Reset gradients
    optimizer.zero_grad()

    # Backward pass (compute gradients)
    loss.backward()

    # Update parameters
    optimizer.step()
Here optimizer is an object created earlier in the program that implements a particular optimization algorithm. Its step method performs a single update.
This cycle of forward pass, gradient computation, and parameter updates forms the core of neural network training.
8.7.2 Gradient Descent Variants
PyTorch implements multiple gradient descent variants through its optim module. The simplest form is vanilla stochastic gradient descent:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
The learning rate (lr) determines the step size for parameter updates:
\[\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)\]
where \(\theta\) represents the parameters, \(\alpha\) is the learning rate, and \(\nabla_\theta L(\theta_t)\) is the gradient of the loss with respect to the parameters.
SGD with Momentum incorporates past gradient information to smooth updates:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
The momentum term accumulates a velocity vector that persists across updates:
\[v_{t+1} = \gamma v_t + \alpha \nabla_\theta L(\theta_t)\] \[\theta_{t+1} = \theta_t - v_{t+1}\]
where \(\gamma\) is the momentum coefficient (typically 0.9). This approach helps overcome local minima and accelerates convergence in directions of persistent gradient.
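To make the update concrete, here is a minimal hand-written sketch on a toy quadratic loss (values assumed); note that PyTorch's built-in SGD folds the learning rate into the velocity slightly differently, but the behaviour is analogous:

import torch

theta = torch.tensor(5.0)
velocity = torch.tensor(0.0)
lr, gamma = 0.1, 0.9

for step in range(3):
    grad = 2 * theta                        # gradient of L(theta) = theta**2
    velocity = gamma * velocity + lr * grad # accumulate the velocity
    theta = theta - velocity                # apply the update
    print(f"step {step}: theta = {theta.item():.4f}")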
Adam (Adaptive Moment Estimation) combines momentum with per-parameter learning rate adaptation:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
Adam maintains both a first moment estimate (momentum) and a second moment estimate (velocity) for each parameter:
\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t)\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2\]
These estimates are then bias-corrected and used to update parameters with adaptive step sizes. Adam typically converges quickly and requires less manual tuning of the learning rate than SGD.
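Written out in full, the bias-corrected estimates and the resulting update take the standard form (with a small constant \(\epsilon\) added for numerical stability):

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\] \[\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]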
For these reasons, Adam is often the recommended starting point for new deep learning projects, though SGD with momentum sometimes achieves better final performance with proper tuning, particularly for computer vision tasks.
8.7.3 Learning Rate Scheduling
Adjusting the learning rate during training can improve convergence and final performance:
# Create an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Create a learning rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# In the training loop
for epoch in range(num_epochs):
    # Training operations...

    # Adjust learning rate after each epoch
    scheduler.step()
Common scheduling strategies include:
- Step decay: Reduce the learning rate by a factor after a fixed number of epochs
- Exponential decay: Continuously decrease the learning rate by a factor each epoch
- Cosine annealing: Cyclically vary the learning rate between a maximum and minimum value
Learning rate scheduling can help overcome training plateaus and fine-tune model parameters in later training stages.
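Creating the other schedulers follows the same pattern; a brief sketch of exponential decay and cosine annealing (reusing the optimizer from above for illustration; in practice you would attach a single scheduler to an optimizer):

# Exponential decay: multiply the learning rate by gamma after every epoch
exp_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Cosine annealing: anneal from the initial lr down to eta_min over T_max epochs
cos_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-4)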
8.8 Data Handling with PyTorch
Efficient data handling is crucial for neural network training, especially with large datasets. PyTorch provides dedicated abstractions for data loading and processing.
8.8.1 The Dataset Abstraction
PyTorch's Dataset class represents a map from indices to data samples:
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
A Dataset implementation must define:
- __len__: Returns the number of samples
- __getitem__: Returns the sample at a given index
This abstraction separates data access from training logic, enabling reusable and composable data processing pipelines.
PyTorch provides subclasses of Dataset for common cases. We will see below how to create a dataset from tensors.
8.8.2 Efficient Batch Processing with DataLoader
The DataLoader class wraps a dataset and provides batches of data for training. Stochastic Gradient Descent (SGD) and its variants work by processing small subsets (batches) of the training data at each step, rather than the entire dataset:
from torch.utils.data import DataLoader
dataset = CustomDataset(features, labels)
dataloader = DataLoader(
    dataset,
    batch_size=32,    # Number of samples per batch
    shuffle=True,     # Randomize order each epoch
    num_workers=4     # Parallel data loading processes
)

# Training loop
for batch_idx, (batch_features, batch_labels) in enumerate(dataloader):
    # Process batch
    outputs = model(batch_features)
    loss = criterion(outputs, batch_labels)
    # ...
When we loop through the DataLoader as shown above, it automatically:
- Divides the dataset into batches of size 32
- Returns each batch one at a time when requested
- Starts over with reshuffled data when all batches have been processed
The DataLoader handles:
- Batching samples together
- Shuffling data between epochs
- Parallel data loading with multiple worker processes (controlled by num_workers)
- Automatic memory pinning for faster CPU-GPU transfer. Memory pinning (via pin_memory=True) allocates tensor data in a special "pinned" memory region that allows for faster direct memory access (DMA) transfers to the GPU, as shown in the sketch after this list.
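A short sketch of this option (assuming the dataset defined earlier and an available GPU):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True   # allocate batches in page-locked memory for faster GPU copies
)

for features, labels in loader:
    # non_blocking=True lets the copy overlap with computation when memory is pinned
    features = features.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)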
Choosing an appropriate batch size involves balancing several considerations:
Memory constraints: Larger batch sizes consume more memory on your device. If your batch size is too large, you may encounter out-of-memory errors, especially when working with large models or high-dimensional data.
Computational efficiency: Larger batches typically enable more efficient computation due to parallelization, especially on GPUs. This often results in faster training time per epoch.
Convergence dynamics: Smaller batches introduce more noise into gradient estimates, which can:
- Help escape local minima and saddle points
- Provide regularization effects that may improve generalization
- Require lower learning rates for stability
Generalization performance: Research suggests that smaller batch sizes often lead to better generalization, while very large batches may converge to sharper minima that generalize poorly.
Common practice is to start with a moderate batch size (32-128) and adjust based on your specific constraints.
8.8.3 Built-in Dataset Utilities
PyTorch provides utilities for common dataset operations:
TensorDataset: Wraps tensors as a dataset:
from torch.utils.data import TensorDataset
features = torch.randn(100, 10)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)
Random Splitting: Divides a dataset into subsets:
from torch.utils.data import random_split
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
These utilities facilitate dataset manipulation without requiring custom implementations for common operations.
8.9 Putting It All Together: The Adult Dataset Example
Having explored PyTorch’s fundamental components, we now implement a complete neural network for the UCI Adult dataset, which predicts whether an individual’s income exceeds $50,000 based on census attributes.
8.9.1 Data Preparation and Preprocessing
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
# Set the seed for all torch operations on all devices
torch.manual_seed(110)

# Load the dataset
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = adult.target

# Convert target to binary values (0/1)
y = (y == ">50K").astype(int)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define preprocessing pipeline
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

preprocessor = make_column_transformer(
    # Standardize numeric features
    (StandardScaler(), num_selector),
    # One-hot encode categorical features
    (OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_selector),
    remainder='drop'
)

# Note: Setting sparse_output=False in OneHotEncoder produces dense numpy arrays
# instead of sparse matrices, simplifying conversion to PyTorch tensors

# Fit preprocessor on training data only
preprocessor.fit(X_train)

# Transform both training and test data
X_train_proc = preprocessor.transform(X_train)
X_test_proc = preprocessor.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train_proc, dtype=torch.float32)
y_train_tensor = torch.tensor(np.array(y_train), dtype=torch.float32)

X_test_tensor = torch.tensor(X_test_proc, dtype=torch.float32)
y_test_tensor = torch.tensor(np.array(y_test), dtype=torch.float32)
8.9.2 Creating the Dataset and DataLoader
from torch.utils.data import TensorDataset, DataLoader
# Create datasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
The DataLoader facilitates batch processing and data shuffling during training.
8.9.3 Defining the Neural Network Model
class IncomePredictor(nn.Module):
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        # First fully connected layer
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Second fully connected layer (output)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # First layer with ReLU activation
        x = torch.relu(self.fc1(x))
        # Output layer (no activation - we'll use BCEWithLogitsLoss)
        x = self.fc2(x)
        return x
This model implements a simple feedforward neural network with:
- A single hidden layer with ReLU activation
- An output layer producing a single logit (pre-sigmoid value)
8.9.4 Training Loop
Now we’ll implement the core training process for our neural network. This process follows the standard pattern we’ve discussed:
- Forward pass: Run data through the model to get predictions
- Calculate loss: Compare predictions to actual labels
- Backward pass: Compute gradients of the loss with respect to model parameters
- Update parameters: Adjust weights using the optimizer
For this illustrative example, we’ve chosen reasonable values for hyperparameters like learning rate (0.001) and number of epochs (10). In practice, these would typically be determined through more sophisticated approaches like grid search, random search, or Bayesian optimization, which we’ll explore in later chapters.
import torch.optim as optim
# Hyperparameters
input_dim = X_train_tensor.shape[1]
hidden_dim = 32
learning_rate = 0.001
num_epochs = 10

# Create model, loss function, and optimizer
model = IncomePredictor(input_dim, hidden_dim)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    # Set model to training mode
    model.train()

    # Process each batch
    for batch_features, batch_labels in train_loader:
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        logits = model(batch_features).squeeze()

        # Compute loss
        loss = criterion(logits, batch_labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()
8.9.5 Model Evaluation
After training our model, we need to evaluate its performance on unseen data to assess how well it generalizes. The evaluation phase differs from training in several important ways:
- We set the model to evaluation mode with model.eval(). While this is not strictly necessary for our simple model, more complicated models behave differently during training and evaluation.
- We disable gradient calculation with torch.no_grad() to save memory and computation.
- We use metrics beyond just the loss function to get a comprehensive view of model performance.
Let’s evaluate our income prediction model on the test dataset:
from sklearn.metrics import classification_report
# Generate final predictions on test set
model.eval()
with torch.no_grad():
    all_preds = []
    all_labels = []
    # The squeeze() method removes dimensions of size 1 from the tensor shape
    # Our model outputs shape [batch_size, 1], but we need [batch_size] for the loss function
    # tolist() converts tensor values to Python lists so they can be used with sklearn metrics
    for features, labels in test_loader:
        logits = model(features).squeeze()
        preds = (torch.sigmoid(logits) >= 0.5).float()

        all_preds.extend(preds.tolist())
        all_labels.extend(labels.tolist())

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(all_labels, all_preds,
                            target_names=['<=50K', '>50K'],
                            digits=4))
Classification Report:
precision recall f1-score support
<=50K 0.8957 0.9310 0.9130 7479
>50K 0.7414 0.6459 0.6903 2290
accuracy 0.8642 9769
macro avg 0.8185 0.7884 0.8017 9769
weighted avg 0.8595 0.8642 0.8608 9769
The classification report provides a comprehensive evaluation of model performance, including precision, recall, and F1-score for each class. These metrics offer more nuanced insight than accuracy alone, particularly for imbalanced datasets.
8.10 Conclusion
This chapter has presented PyTorch’s core abstractions for neural network implementation: tensors as the fundamental data structure, automatic differentiation for gradient computation, and modular components for network construction.
In the next chapter we will build on this foundation to construct networks with more complicated structures and deploy more sophisticated training strategies.
8.11 Exercises
- Tensor Manipulation
  - Create a 3D tensor with shape (2, 3, 4) filled with random values.
  - Compute the mean along each dimension and explain the resulting shapes.
  - Reshape the tensor to (6, 4) and then to (4, 6), verifying data consistency.
- Autograd Exploration
  - Create a computational graph with multiple operations on tensor variables.
  - Compute gradients and verify them against manual calculations.
  - Experiment with detach() and with torch.no_grad() to understand gradient flow control.
- Network Architecture Variation
  - Modify the IncomePredictor model to include two hidden layers instead of one.
  - Experiment with different activation functions (Leaky ReLU, tanh) and compare performance.
  - Implement dropout regularization between layers and evaluate its impact.
- Optimizer Comparison
  - Train the same model architecture with different optimizers (SGD, SGD+momentum, RMSprop, Adam).
  - Plot learning curves for each optimizer and analyze convergence speed and stability.
  - Experiment with learning rate values and explain their impact on training dynamics.
- Custom Dataset Implementation
  - Create a custom Dataset class that adds noise to features as a form of data augmentation.
  - Implement on-the-fly normalization within the dataset rather than preprocessing.
  - Compare model performance with and without these data handling techniques.