8  PyTorch

Note

This is an EARLY DRAFT.

This chapter will show you how to use PyTorch to build neural network models. PyTorch, developed by Facebook’s AI Research lab (FAIR), has emerged as a powerful framework for deep learning research and applications.

In this chapter, we will systematically explore the fundamental components of PyTorch, beginning with its basic data structure—the tensor—and gradually building toward complete neural network implementations. Our approach will emphasize understanding the underlying mechanisms rather than simply applying pre-built solutions, providing a foundation for both practical applications and further exploration.

8.1 Deep Learning Software Frameworks

In the early days of deep learning, researchers often had to write their own routines for matrix operations, differentiation, and training loops. Over time, dedicated software libraries emerged to make these tasks much easier. Early examples include Theano (from the University of Montreal) and Torch (a Lua-based scientific computing framework). These pioneers popularized automatic differentiation for neural networks and laid the foundation for modern frameworks.

Today, popular libraries include:

  • PyTorch
  • TensorFlow
  • JAX
  • Keras
  • PyTorch Lightning (built on PyTorch)

A low-level framework typically gives you considerable control over all the details, such as how the computational graph is built or how gradients are computed. This can be powerful for research and custom applications but may require more code to set up even basic neural networks. JAX and “pure” TensorFlow or PyTorch (without additional abstractions) lean in this direction.

In contrast, high-level frameworks provide a simpler interface that handles many details for you (such as training loops, logging, or distributed computing). Libraries like Keras or PyTorch Lightning let you write less code by wrapping common steps—like setting up optimizers or running epochs—in pre-built routines. This can be appealing for rapid prototyping or for ensuring more standardized code.

In this book, we will use PyTorch and PyTorch Lightning, as these are the most popular neural network frameworks at the time of writing. However, the basic concepts are common across frameworks, and having learnt PyTorch you will not find it too difficult to pick up a different framework if you need to.

PyTorch is installed by default on Google Colab. You can install it on your machine using

pip install torch

8.2 Tensors: Multi-dimensional arrays

At the heart of PyTorch’s computational framework lies the tensor—a multi-dimensional array that serves as the primary data structure for all operations. While superficially similar to NumPy’s ndarray, PyTorch tensors incorporate crucial additional functionality specifically designed for deep learning workloads.

8.2.1 Tensor Fundamentals

A tensor can be conceptualized as a generalization of vectors and matrices to potentially higher dimensions. The dimensionality of a tensor is referred to as its “rank”:

  • A scalar is a rank-0 tensor (single value)
  • A vector is a rank-1 tensor (one-dimensional array)
  • A matrix is a rank-2 tensor (two-dimensional array)
  • Higher-dimensional arrays constitute rank-n tensors
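To make the rank terminology concrete, here is a small illustrative snippet checking the ndim attribute of tensors of different ranks:

import torch

s = torch.tensor(3.14)            # rank-0: a scalar
v = torch.tensor([1., 2., 3.])    # rank-1: a vector
m = torch.ones(2, 3)              # rank-2: a matrix
t = torch.zeros(2, 3, 4)          # rank-3 tensor

print(s.ndim, v.ndim, m.ndim, t.ndim)  # 0 1 2 3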

Tensors in PyTorch are instantiated through various factory methods, each serving different initialization needs:

import torch

# Creating a rank-1 tensor (vector)
x = torch.tensor([1, 2, 3, 4])

# Creating a rank-2 tensor (matrix) with specific values
y = torch.tensor([[1, 2, 3], 
                  [4, 5, 6]])

# Creating tensors with uniform initialization
zeros = torch.zeros(2, 3)  # 2×3 tensor filled with zeros
ones = torch.ones(2, 3)    # 2×3 tensor filled with ones

# Creating tensors with random values
rand_uniform = torch.rand(2, 3)      # Uniform distribution [0, 1)
rand_normal = torch.randn(2, 3)      # Standard normal distribution

Each tensor possesses several intrinsic attributes that characterize its structure and properties:

# Examining tensor attributes
x = torch.tensor([[1, 2, 3], [4, 5, 6]])

print(f"Shape: {x.shape}")         # Size along each dimension: torch.Size([2, 3])
print(f"Rank/Dimensionality: {x.ndim}")  # Number of dimensions: 2
print(f"Data type: {x.dtype}")     # Underlying data type: torch.int64
print(f"Number of elements: {x.numel()}")  # Total element count: 6

The shape attribute is particularly important as it defines the tensor’s dimensional structure—in this case, a 2×3 matrix with 2 rows and 3 columns.

8.2.2 Tensor Operations

PyTorch implements a comprehensive set of operations for tensor manipulation, broadly categorized into element-wise operations, reduction operations, and linear algebra operations.

Element-wise Operations: These apply a function independently to each element:

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

# Addition, subtraction, multiplication, division
c = a + b     # torch.tensor([5, 7, 9])
d = a * b     # torch.tensor([4, 10, 18])

Reduction Operations: These reduce a tensor along specified dimensions:

# Note: mean() requires a floating-point tensor, hence the float literals
x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])

# Sum of all elements
total = x.sum()           # tensor(21.)

# Mean of each row (reducing along dimension 1)
row_means = x.mean(dim=1)  # tensor([2., 5.])

Linear Algebra Operations: These perform matrix operations essential for neural networks:

m1 = torch.tensor([[1, 2], [3, 4]])
m2 = torch.tensor([[5, 6], [7, 8]])

# Matrix multiplication
mm = torch.matmul(m1, m2)   # tensor([[19, 22], [43, 50]])
# Equivalent syntax using @ operator
mm_alt = m1 @ m2            # tensor([[19, 22], [43, 50]])

8.2.3 Tensor Memory Management

A critical aspect of working with tensors, particularly when implementing complex neural networks, is understanding how PyTorch manages tensor memory and views.

Reshaping and Views: PyTorch offers two primary mechanisms for rearranging tensor dimensions:

x = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Creating a new shape
reshaped = x.reshape(3, 2)  # tensor([[1, 2], [3, 4], [5, 6]])

# Creating a view (shares the same memory)
view = x.view(3, 2)         # tensor([[1, 2], [3, 4], [5, 6]])

The distinction between reshape and view is subtle but important:

  • view() creates a new tensor that shares the same underlying data with the original tensor. It requires the tensor to be contiguous in memory.
  • reshape() may or may not share memory with the original tensor. If the tensor is not contiguous in memory, reshape() will create a copy.

This distinction becomes particularly relevant when modifying elements:

# Modifying an element in the view
view[0, 0] = 99

# Original tensor is also modified because the view shares the same memory
print(x)  # tensor([[99, 2, 3], [4, 5, 6]])
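A case where view and reshape genuinely differ is a non-contiguous tensor, such as the result of a transpose. The following sketch illustrates this:

t = torch.tensor([[1, 2, 3], [4, 5, 6]])
t_T = t.t()                          # transpose: shape (3, 2), non-contiguous

print(t_T.is_contiguous())           # False
# t_T.view(6)                        # would raise a RuntimeError
flat = t_T.reshape(6)                # works: reshape copies the data here
flat2 = t_T.contiguous().view(6)     # alternative: make a contiguous copy first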

Device Management: One of PyTorch’s strengths is its seamless handling of different computational devices, particularly CPUs and GPUs:

# Creating a tensor on CPU (default)
cpu_tensor = torch.tensor([1, 2, 3])

# Moving to GPU if available
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.to('cuda')
    # Operations on gpu_tensor will execute on the GPU

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables dramatic performance increases by harnessing the power of GPUs. When PyTorch code uses CUDA, computations are offloaded to the GPU, which can perform thousands of operations simultaneously, making it ideal for the parallel nature of neural network calculations.

PyTorch’s device management system handles two critical aspects:

  1. Data Placement: Tensors can reside on different hardware devices (CPU or GPU). Unlike traditional programming where all data lives in the same memory space, GPUs have their own dedicated memory that’s separate from system RAM. When you move a tensor to a GPU with .to('cuda'), PyTorch physically transfers that data from CPU memory to GPU memory. This is a crucial concept because:
    • GPU memory is typically more limited than system RAM
    • Data transfers between CPU and GPU involve overhead
    • Operations can only be accelerated if the data actually resides in GPU memory
  2. Computation Location: Operations execute on the device where the tensors are located. The computational benefits of GPUs only apply to tensors that have been explicitly placed in GPU memory.

When operating on tensors that reside on different devices, PyTorch enforces explicit data movement:

# Attempting operations on tensors from different devices
if torch.cuda.is_available():
    cpu_tensor = torch.tensor([1, 2, 3])
    gpu_tensor = torch.tensor([4, 5, 6], device='cuda')
    
    # This would raise a RuntimeError:
    # result = cpu_tensor + gpu_tensor
    
    # Correct approaches:
    result_on_gpu = cpu_tensor.to('cuda') + gpu_tensor  # Computation on GPU
    result_on_cpu = cpu_tensor + gpu_tensor.to('cpu')   # Computation on CPU

This explicit device management prevents unintended performance bottlenecks from device-to-device transfers and gives developers precise control over where computations occur.

This device management system allows for flexible deployment across heterogeneous computing environments, from development laptops to GPU clusters, with minimal code changes.
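A common idiom for device-agnostic code is to determine the device once and reuse it, for example:

# Pick the best available device once, then reuse it everywhere
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(3, 4, device=device)   # create a tensor directly on the chosen device
b = torch.randn(3, 4).to(device)       # or move an existing tensor to it
c = a + b                              # computation runs on `device`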

8.3 Autograd: Automatic Differentiation

As we saw in the last chapter, neural networks are trained with variants of stochastic gradient descent, for which we need to compute the gradient of the function defined by the network. We also saw how automatic differentiation software can calculate gradients directly from the code implementing a function. For this purpose, PyTorch includes an automatic differentiation engine called Autograd that is tightly integrated with its tensors.

8.3.1 Computational Graphs and Automatic Differentiation

At its core, Autograd builds a computational graph that tracks how values are calculated. Think of this graph as a flowchart:

  • Each operation (like addition or multiplication) is a box in the flowchart
  • The data flows along arrows between these boxes
  • The graph is “directed” because data flows in one direction (from inputs to outputs)
  • The graph is “acyclic” because data never loops back (no circular dependencies)

When you create a tensor with requires_grad=True, PyTorch automatically records all operations performed on that tensor, building this flowchart-like structure behind the scenes.

Consider a simple computational example:

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Define a computation
z = x**2 + y**3

In this example, PyTorch constructs a computational graph where:

  • Leaf nodes are the input tensors x and y
  • Intermediate nodes represent operations (x**2, y**3, and addition)
  • The output node is the tensor z

By recording all the computational steps, the graph allows the autograd system to calculate the gradient of the final outputs with respect to the inputs using the chain rule from calculus.
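You can peek at the recorded graph through the grad_fn attribute of result tensors; this is purely for inspection and is not needed for training:

print(z.grad_fn)                  # the node that produced z, e.g. <AddBackward0 ...>
print(z.grad_fn.next_functions)   # the operations feeding into that addition
print(x.grad_fn)                  # None: x is a leaf tensor created by the user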

8.3.2 Computing Gradients

Once a computational graph is constructed, PyTorch can automatically compute gradients. It uses the reverse mode automatic differentiation for efficiency as discussed in the previous chapter. The calculation is initiated by calling the .backward() method on a scalar output tensor:

# Continuing from the previous example
z.backward()

# Accessing gradients
print(f"dz/dx: {x.grad}")  # Should be 4.0 (derivative of x^2 is 2x)
print(f"dz/dy: {y.grad}")  # Should be 27.0 (derivative of y^3 is 3y^2)

Mathematically, we can verify these gradients:

  • For function \(z = x^2 + y^3\)
  • \(\frac{\partial z}{\partial x} = 2x = 2 \cdot 2 = 4\)
  • \(\frac{\partial z}{\partial y} = 3y^2 = 3 \cdot 3^2 = 27\)

This automatic calculation of gradients is fundamental to training neural networks, as it provides the necessary derivatives for parameter updates without requiring manual implementation of derivative calculations for each operation.

8.3.3 Gradient Behavior and Control

Several aspects of gradient behavior require careful consideration when implementing neural networks:

Gradient Accumulation: By default, PyTorch accumulates gradients when .backward() is called multiple times:

x = torch.tensor(2.0, requires_grad=True)

# First operation and backward pass
y1 = x**2
y1.backward()
print(f"Gradient after first backward: {x.grad}")  # 4.0

# Second operation and backward pass (gradients accumulate)
y2 = x**3
y2.backward()
print(f"Gradient after second backward: {x.grad}")  # 4.0 + 12.0 = 16.0

This accumulation behavior is particularly valuable for neural networks because many loss functions are inherently sums over data points or batches. For example, when computing the mean squared error over a batch, we’re summing individual squared errors.

This accumulation necessitates explicit gradient zeroing at the beginning of training loops:

# Zeroing gradients
x.grad.zero_()

Detaching Computation: In some scenarios, we may want to temporarily stop gradient tracking:

# Detaching from computation graph
x = torch.tensor(2.0, requires_grad=True)
y = x**2
z = y.detach() + 5  # z has no gradient relationship to x

The detach() method creates a new tensor that shares the same data but does not track computational history. This is useful for implementing techniques like semi-supervised learning or when working with pre-trained model components.

Gradient Context Management: For evaluation phases where gradient computation is unnecessary and potentially wasteful:

with torch.no_grad():
    # No operations inside this block will track gradients
    evaluation_result = model(test_data)

The torch.no_grad() context manager temporarily disables gradient tracking, reducing memory consumption and computational overhead during inference or evaluation.

8.4 Neural Network Building Blocks: Modules and Layers

Neural networks in PyTorch are constructed using a hierarchical system of building blocks, with the nn.Module class serving as the foundational component. This design facilitates both the organization of model architecture and the management of learnable parameters.

8.4.1 The nn.Module Class

The nn.Module class is a versatile container that serves several critical functions:

  1. Parameter management: Automatically tracks and registers learnable parameters
  2. Hierarchical composition: Enables nested structures of modules within modules
  3. Training state: Manages mode switching between training and evaluation
  4. Serialization: Facilitates saving and loading model states

Every custom neural network component in PyTorch inherits from this base class:

import torch.nn as nn

class SimpleNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Initialize the parent class
        super().__init__()
        
        # Define layers as module attributes
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Define the computation flow
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x

This structure demonstrates several key aspects of PyTorch’s module system:

  • The constructor (__init__) establishes the module’s structure by defining its constituent components.
  • The forward method defines the computation flow when the module is called.
  • Modules can contain other modules (here, each nn.Linear layer is itself a module).

When instantiated, this network automatically registers all parameters from its submodules:

model = SimpleNetwork(10, 20, 1)

# Access parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

This might output:

layer1.weight: torch.Size([20, 10])
layer1.bias: torch.Size([20])
layer2.weight: torch.Size([1, 20])
layer2.bias: torch.Size([1])

The module system handles parameter registration without requiring manual intervention, simplifying model development and maintenance.

8.4.2 Parameters and Buffers

PyTorch distinguishes between two types of tensor state within modules:

Parameters: Learnable tensors that will be updated during training:

# A parameter is automatically registered with requires_grad=True
self.weight = nn.Parameter(torch.randn(output_size, input_size))
self.bias = nn.Parameter(torch.randn(output_size))

Buffers: Non-learnable tensors that are part of the module’s state:

# A buffer is registered but not included in parameters()
self.register_buffer('running_mean', torch.zeros(num_features))

Buffers typically store statistical information or fixed values that should be serialized (saved and loaded) with the model but not updated by optimizers.
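The following toy module (a hypothetical example, not a PyTorch built-in) shows both kinds of state side by side:

class RunningNormalizer(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Learnable scale: appears in parameters() and is updated by the optimizer
        self.scale = nn.Parameter(torch.ones(num_features))
        # Running statistic: saved with the model but ignored by optimizers
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):
        if self.training:
            # Update the buffer in place with an exponential moving average
            with torch.no_grad():
                self.running_mean.lerp_(x.mean(dim=0), 0.1)
        return (x - self.running_mean) * self.scale

m = RunningNormalizer(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']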

8.4.3 Layer Initialization and the Linear Layer

The nn.Linear layer, one of the most fundamental components in neural networks, implements a fully connected (dense) layer that performs a linear transformation:

\(y = xW^T + b\)

where \(x\) is the input, \(W\) is the weight matrix, and \(b\) is the bias vector.

Examining its attributes provides insight into PyTorch’s layer design:

linear = nn.Linear(in_features=10, out_features=5)

# Examining the layer parameters
print(f"Weight shape: {linear.weight.shape}")  # torch.Size([5, 10])
print(f"Bias shape: {linear.bias.shape}")      # torch.Size([5])

The weight matrix dimensions are [out_features, in_features], which may initially seem counterintuitive. This arrangement, however, facilitates efficient batch processing when the input x has a batch dimension as the first dimension:

batch_size = 32
x = torch.randn(batch_size, 10)  # 32 samples, each with 10 features
output = linear(x)               # Shape: [32, 5]

The matrix multiplication is effectively \((32 \times 10) \cdot (10 \times 5) = (32 \times 5)\), where the \((10 \times 5)\) factor is \(W^T\), preserving the batch dimension.

8.4.3.1 Parameter Initialization

Parameter initialization plays a crucial role in neural network training dynamics. PyTorch layers initialize parameters with specific schemes by default, but these can be customized:

import torch.nn.init as init

# Custom initialization of a layer's parameters
linear = nn.Linear(10, 5)
init.xavier_uniform_(linear.weight)  # Xavier/Glorot initialization
init.zeros_(linear.bias)             # Initialize biases to zero

Common initialization methods include:

  • Xavier/Glorot initialization: Designed to maintain variance across layers, particularly effective with tanh or sigmoid activations
  • Kaiming/He initialization: Adapted for ReLU-based networks
  • Constant initialization: Often used for biases, typically with zeros

The choice of initialization strategy can significantly impact training convergence and should be aligned with the activation functions and network architecture.

For linear layers, PyTorch’s default is Kaiming uniform initialization (also called He initialization) for the weights, which is designed for ReLU activations; the biases are initialized from a uniform distribution whose range depends on the layer’s fan-in (the number of input features).
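To apply a chosen scheme across a whole model, a common idiom (sketched here using the SimpleNetwork class defined earlier) is to pass an initialization function to apply(), which visits every submodule:

def init_weights(module):
    # Initialize only the Linear layers; leave other module types untouched
    if isinstance(module, nn.Linear):
        init.kaiming_uniform_(module.weight, nonlinearity='relu')
        init.zeros_(module.bias)

model = SimpleNetwork(10, 20, 1)
model.apply(init_weights)   # recursively applies init_weights to every submodule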

8.4.4 The Callable Interface: Module Forward Pass

A module object can be called like a function. When it is called with input data, e.g., output = model(input), PyTorch invokes a series of methods:

  1. The __call__ method of the parent nn.Module class is triggered
  2. __call__ performs several housekeeping operations:
    • Ensures the module is properly initialized
    • Triggers registered hooks (pre-forward hooks)
    • Calls the module’s forward method with the provided inputs - this is the only method you must implement when creating a custom module
    • Triggers post-forward hooks
    • Returns the result of the forward method, which is typically the output tensor(s) produced by the module’s computation

Hooks are customizable callback functions that allow you to “hook into” specific points in PyTorch’s execution flow.

Hooks enable you to:

  • Monitor internal values during training
  • Modify inputs or outputs on-the-fly
  • Collect statistics without changing the model code
  • Debug complex networks by inspecting intermediate values

This sophisticated mechanism enables transparent integration of functionality like hook registration, module mode management, and debugging tools without cluttering the user-defined forward method.
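As a minimal illustration, the sketch below registers a forward hook that logs the output shape of a layer; the (module, input, output) signature is the standard forward-hook interface:

linear = nn.Linear(10, 5)

def shape_logger(module, inputs, output):
    print(f"{module.__class__.__name__} output shape: {tuple(output.shape)}")

handle = linear.register_forward_hook(shape_logger)
_ = linear(torch.randn(3, 10))   # prints: Linear output shape: (3, 5)
handle.remove()                  # detach the hook when it is no longer needed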

8.4.5 Building Complex Architectures

Complex neural network architectures can be constructed by composing modules in various ways:

Sequential Containers: For linear sequences of layers:

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 15),
    nn.ReLU(),
    nn.Linear(15, 1)
)

This nn.Sequential container is a complete, trainable module without requiring any subclassing of nn.Module. However, sequential models only support simple linear topologies where each layer feeds directly into the next. For more complicated architectures, such as those with branching or skip connections, you will still need to create a custom nn.Module subclass.

Module Lists and Dictionaries: For more flexible arrangements:

class DynamicNetwork(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        # Create a ModuleList of layers
        self.layers = nn.ModuleList([
            nn.Linear(layer_sizes[i], layer_sizes[i+1])
            for i in range(len(layer_sizes)-1)
        ])
        
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            # Apply ReLU to all but the last layer
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x

Unlike regular Python lists and dictionaries, nn.ModuleList and nn.ModuleDict are special container classes that:

  1. Register parameters: When you add a module to these containers, all its parameters are automatically registered with the parent module, ensuring they’re properly tracked for gradient updates.

  2. Device management: When you move a model to a device (e.g., model.to('cuda')), all modules in these containers are automatically moved too.

  3. State management: These containers participate in module state changes like train() and eval() mode, propagating these calls to all contained modules.

  4. Serialization: When saving or loading a model, modules in these containers are properly included in the serialization process.

If you used regular Python lists or dictionaries instead, the modules would exist but wouldn’t be recognized as part of the model’s parameter hierarchy, leading to parameters that don’t get updated during training.
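A quick way to see the difference is to compare parameter counts; the broken variant below (a deliberately incorrect sketch) stores its layers in a plain Python list:

class BrokenNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: these layers are NOT registered as submodules
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 1)]

broken = BrokenNetwork()
proper = DynamicNetwork([4, 4, 1])     # the ModuleList version defined above

print(len(list(broken.parameters())))  # 0 -- nothing for an optimizer to update
print(len(list(proper.parameters())))  # 4 -- two weight and two bias tensors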

These building blocks provide a flexible system for architecting neural networks of arbitrary complexity while maintaining a clean separation between structural definition and computational flow.

8.5 Activation Functions

Activation functions introduce non-linearity into neural networks, a crucial property that enables these models to learn complex patterns and relationships in data. Without non-linear activations, multiple layers would simply collapse into a single linear transformation, regardless of network depth.

PyTorch provides both functional and module-based implementations of activation functions. The functional variants operate directly on tensors, while the module variants can be incorporated into nn.Module hierarchies:

import torch.nn.functional as F

# Functional form
x = torch.randn(5)
relu_output = F.relu(x)

# Module form
relu_layer = nn.ReLU()
same_output = relu_layer(x)

The most widely used activation functions include:

ReLU (Rectified Linear Unit): \(\text{ReLU}(x) = \max(0, x)\)

activation = nn.ReLU()
# or functional form: F.relu(x)

ReLU is computationally efficient and mitigates the vanishing gradient problem that plagued earlier neural networks with sigmoid activations. However, it can suffer from the “dying ReLU” problem, where neurons permanently output zero for all inputs.

Leaky ReLU: \(\text{LeakyReLU}(x) = \max(\alpha x, x)\) where \(\alpha\) is a small constant (typically 0.01)

activation = nn.LeakyReLU(negative_slope=0.01)
# or functional form: F.leaky_relu(x, negative_slope=0.01)

Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient when the unit is not active.

Sigmoid: \(\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}\)

activation = nn.Sigmoid()
# or functional form: torch.sigmoid(x)

The sigmoid function maps inputs to the range (0, 1), making it suitable for binary classification output layers. However, it suffers from vanishing gradients for inputs with large magnitudes.

Tanh (Hyperbolic Tangent): \(\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

activation = nn.Tanh()
# or functional form: torch.tanh(x)

Similar to sigmoid but mapping to the range (-1, 1), tanh is often preferred over sigmoid for hidden layers due to its zero-centered output.

8.6 Loss Functions

PyTorch also provides commonly used loss functions in its nn module, specialized for different prediction tasks:

Mean Squared Error (MSE): For regression tasks:

\(L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

criterion = nn.MSELoss()
predictions = torch.tensor([0.5, 1.8, 2.5])
targets = torch.tensor([1.0, 2.0, 2.0])
loss = criterion(predictions, targets)

MSE penalizes larger errors more heavily due to the squared term, making it sensitive to outliers.

Binary Cross-Entropy (BCE): For binary classification tasks:

\(L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]\)

# For predictions already passed through sigmoid
criterion = nn.BCELoss()
probabilities = torch.tensor([0.2, 0.7, 0.9])
targets = torch.tensor([0.0, 1.0, 1.0])
loss = criterion(probabilities, targets)

BCE with Logits: Combines sigmoid activation with BCE for numerical stability:

# For raw logits (pre-sigmoid outputs)
criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([1.5, 0.3, -0.5])
targets = torch.tensor([1.0, 0.0, 0.0])
loss = criterion(logits, targets)

This implementation is more numerically stable than separate sigmoid and BCE operations, particularly for extreme input values.
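A small sketch of the effect: with a large logit, sigmoid saturates to exactly 1.0 in float32, so the separate sigmoid-then-BCE path can no longer recover the true loss, while the fused version computes it in log-space:

logit = torch.tensor([40.0])
target = torch.tensor([0.0])

# Separate sigmoid + BCE: the probability has saturated to 1.0, so the
# computed loss is a clamped stand-in rather than the true value
print(nn.BCELoss()(torch.sigmoid(logit), target))

# Fused version: works directly with the logit and returns ~40, the true loss
print(nn.BCEWithLogitsLoss()(logit, target))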

Cross-Entropy: For multi-class classification tasks:

\(L_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})\)

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[1.5, 0.3, -0.5], [0.2, 2.3, 0.5]])  # Batch of 2, 3 classes
targets = torch.tensor([0, 1])  # Class indices for each sample
loss = criterion(logits, targets)

CrossEntropyLoss combines softmax activation with negative log-likelihood loss, providing a more numerically stable implementation than separate operations.

8.7 Optimization

Optimization algorithms update model parameters to minimize the loss function. While the backpropagation algorithm computes gradients, optimizers determine how these gradients are used to adjust parameters.

8.7.1 The Optimization Process

The general training process in PyTorch follows this pattern:

# Conceptual implementation (not actual PyTorch code)
for epoch in range(num_epochs):
    # Forward pass
    predictions = model(inputs)
    loss = loss_function(predictions, targets)
    
    # Reset gradients
    optimizer.zero_grad()
    
    # Backward pass (compute gradients)
    loss.backward()
    
    # Update parameters
    optimizer.step()

Here optimizer is an object created earlier in the program that implements a particular optimization algorithm. Its step method performs a single update.

This cycle of forward pass, gradient computation, and parameter updates forms the core of neural network training.

8.7.2 Gradient Descent Variants

PyTorch implements multiple gradient descent variants through its optim module. The simplest form is vanilla stochastic gradient descent:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

The learning rate (lr) determines the step size for parameter updates:

\[\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)\]

where \(\theta\) represents the parameters, \(\alpha\) is the learning rate, and \(\nabla_\theta L(\theta_t)\) is the gradient of the loss with respect to the parameters.
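To connect the formula to code, here is the plain SGD update written out manually for a single parameter tensor, a sketch of what optimizer.step() does internally for this rule:

w = torch.randn(3, requires_grad=True)   # a parameter
loss = (w ** 2).sum()                    # a toy loss
loss.backward()                          # populates w.grad

lr = 0.01
with torch.no_grad():                    # the update itself is not tracked
    w -= lr * w.grad                     # theta <- theta - alpha * gradient
w.grad.zero_()                           # reset for the next iteration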

SGD with Momentum incorporates past gradient information to smooth updates:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

The momentum term accumulates a velocity vector that persists across updates:

\[v_{t+1} = \gamma v_t + \alpha \nabla_\theta L(\theta_t)\] \[\theta_{t+1} = \theta_t - v_{t+1}\]

where \(\gamma\) is the momentum coefficient (typically 0.9). This approach helps overcome local minima and accelerates convergence in directions of persistent gradient.

Adam (Adaptive Moment Estimation) combines momentum with per-parameter learning rate adaptation:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

Adam maintains both a first moment estimate (momentum) and a second moment estimate (velocity) for each parameter:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t)\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2\]

These estimates are then bias-corrected and used to update parameters with adaptive step sizes. Adam typically converges quickly and requires less manual tuning of the learning rate than SGD.

For these reasons, Adam is often the recommended starting point for new deep learning projects, though SGD with momentum sometimes achieves better final performance with proper tuning, particularly for computer vision tasks.

8.7.3 Learning Rate Scheduling

Adjusting the learning rate during training can improve convergence and final performance:

# Create an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Create a learning rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# In the training loop
for epoch in range(num_epochs):
    # Training operations...
    scheduler.step()  # Adjust learning rate after each epoch

Common scheduling strategies include:

  • Step decay: Reduce the learning rate by a factor after a fixed number of epochs
  • Exponential decay: Continuously decrease the learning rate by a factor each epoch
  • Cosine annealing: Cyclically vary the learning rate between a maximum and minimum value

Learning rate scheduling can help overcome training plateaus and fine-tune model parameters in later training stages.

8.8 Data Handling with PyTorch

Efficient data handling is crucial for neural network training, especially with large datasets. PyTorch provides dedicated abstractions for data loading and processing.

8.8.1 The Dataset Abstraction

PyTorch’s Dataset class represents a map from indices to data samples:

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
    
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

A Dataset implementation must define:

  • __len__: Returns the number of samples
  • __getitem__: Returns the sample at a given index

This abstraction separates data access from training logic, enabling reusable and composable data processing pipelines.
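For example, the CustomDataset above can be indexed like any mapping from integer indices to samples:

features = torch.randn(100, 10)           # 100 samples with 10 features each
labels = torch.randint(0, 2, (100,))      # binary labels

dataset = CustomDataset(features, labels)
print(len(dataset))       # 100
x0, y0 = dataset[0]       # first (feature, label) pair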

PyTorch provides subclasses of Dataset for common cases. We will see below how to create a dataset directly from tensors.

8.8.2 Efficient Batch Processing with DataLoader

The DataLoader class wraps a dataset and provides batches of data for training. Stochastic Gradient Descent (SGD) and its variants work by processing small subsets (batches) of the training data at each step, rather than the entire dataset:

from torch.utils.data import DataLoader

dataset = CustomDataset(features, labels)
dataloader = DataLoader(
    dataset,
    batch_size=32,       # Number of samples per batch
    shuffle=True,        # Randomize order each epoch
    num_workers=4        # Parallel data loading processes
)

# Training loop
for batch_idx, (batch_features, batch_labels) in enumerate(dataloader):
    # Process batch
    outputs = model(batch_features)
    loss = criterion(outputs, batch_labels)
    # ...

When we loop through the DataLoader as shown above, it automatically:

  1. Divides the dataset into batches of size 32
  2. Returns each batch one at a time when requested
  3. Starts over with reshuffled data when all batches have been processed

The DataLoader handles:

  • Batching samples together
  • Shuffling data between epochs
  • Parallel data loading with multiple worker processes (controlled by num_workers)
  • Optional memory pinning for faster CPU-to-GPU transfer: setting pin_memory=True allocates batch tensors in a special “pinned” memory region that allows faster direct memory access (DMA) transfers to the GPU

Choosing an appropriate batch size involves balancing several considerations:

  1. Memory constraints: Larger batch sizes consume more memory on your device. If your batch size is too large, you may encounter out-of-memory errors, especially when working with large models or high-dimensional data.

  2. Computational efficiency: Larger batches typically enable more efficient computation due to parallelization, especially on GPUs. This often results in faster training time per epoch.

  3. Convergence dynamics: Smaller batches introduce more noise into gradient estimates, which can:

    • Help escape local minima and saddle points
    • Provide regularization effects that may improve generalization
    • Require lower learning rates for stability
  4. Generalization performance: Research suggests that smaller batch sizes often lead to better generalization, while very large batches may converge to sharper minima that generalize poorly.

Common practice is to start with a moderate batch size (32-128) and adjust based on your specific constraints.

8.8.3 Built-in Dataset Utilities

PyTorch provides utilities for common dataset operations:

TensorDataset: Wraps tensors as a dataset:

from torch.utils.data import TensorDataset

features = torch.randn(100, 10)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

Random Splitting: Divides a dataset into subsets:

from torch.utils.data import random_split

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

These utilities facilitate dataset manipulation without requiring custom implementations for common operations.

8.9 Putting It All Together: The Adult Dataset Example

Having explored PyTorch’s fundamental components, we now implement a complete neural network for the UCI Adult dataset, which predicts whether an individual’s income exceeds $50,000 based on census attributes.

8.9.1 Data Preparation and Preprocessing

import numpy as np
import pandas as pd
import torch
import torch.nn as nn

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer

# Set the seed for all torch operations on all devices

torch.manual_seed(110)

# Load the dataset
adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data
y = adult.target

# Convert target to binary values (0/1)
y = (y == ">50K").astype(int)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define preprocessing pipeline
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

preprocessor = make_column_transformer(
    (StandardScaler(), num_selector),        # Standardize numeric features
    (OneHotEncoder(handle_unknown='ignore',
                   sparse_output=False),
     cat_selector),                          # One-hot encode categorical features
    remainder='drop'
)

# Note: Setting sparse_output=False in OneHotEncoder produces dense numpy arrays
# instead of sparse matrices, simplifying conversion to PyTorch tensors

# Fit preprocessor on training data only
preprocessor.fit(X_train)

# Transform both training and test data
X_train_proc = preprocessor.transform(X_train)
X_test_proc = preprocessor.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train_proc, dtype=torch.float32)
y_train_tensor = torch.tensor(np.array(y_train), dtype=torch.float32)

X_test_tensor = torch.tensor(X_test_proc, dtype=torch.float32)
y_test_tensor = torch.tensor(np.array(y_test), dtype=torch.float32)

8.9.2 Creating the Dataset and DataLoader

from torch.utils.data import TensorDataset, DataLoader

# Create datasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

The DataLoader facilitates batch processing and data shuffling during training.

8.9.3 Defining the Neural Network Model

class IncomePredictor(nn.Module):
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        # First fully connected layer
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Second fully connected layer (output)
        self.fc2 = nn.Linear(hidden_dim, 1)
        
    def forward(self, x):
        # First layer with ReLU activation
        x = torch.relu(self.fc1(x))
        # Output layer (no activation - we'll use BCEWithLogitsLoss)
        x = self.fc2(x)
        return x

This model implements a simple feedforward neural network with:

  • A single hidden layer with ReLU activation
  • An output layer producing a single logit (pre-sigmoid value)

8.9.4 Training Loop

Now we’ll implement the core training process for our neural network. This process follows the standard pattern we’ve discussed:

  1. Forward pass: Run data through the model to get predictions
  2. Calculate loss: Compare predictions to actual labels
  3. Backward pass: Compute gradients of the loss with respect to model parameters
  4. Update parameters: Adjust weights using the optimizer

For this illustrative example, we’ve chosen reasonable values for hyperparameters like learning rate (0.001) and number of epochs (10). In practice, these would typically be determined through more sophisticated approaches like grid search, random search, or Bayesian optimization, which we’ll explore in later chapters.

import torch.optim as optim

# Hyperparameters
input_dim = X_train_tensor.shape[1]
hidden_dim = 32
learning_rate = 0.001
num_epochs = 10

# Create model, loss function, and optimizer
model = IncomePredictor(input_dim, hidden_dim)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    # Set model to training mode
    model.train()
    
    # Process each batch
    for batch_features, batch_labels in train_loader:
        # Zero the parameter gradients
        optimizer.zero_grad()
        
        # Forward pass
        logits = model(batch_features).squeeze()
        
        # Compute loss
        loss = criterion(logits, batch_labels)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()

8.9.5 Model Evaluation

After training our model, we need to evaluate its performance on unseen data to assess how well it generalizes. The evaluation phase differs from training in several important ways:

  1. We set the model to evaluation mode with model.eval(). While this is not strictly necessary for our simple model, layers such as dropout and batch normalization behave differently during training and evaluation, so calling eval() is a good habit.
  2. We disable gradient calculation with torch.no_grad() to save memory and computation
  3. We use metrics beyond just the loss function to get a comprehensive view of model performance

Let’s evaluate our income prediction model on the test dataset:

from sklearn.metrics import classification_report

# Generate final predictions on test set
model.eval()
with torch.no_grad():
    all_preds = []
    all_labels = []
    # The squeeze() method removes the trailing dimension of size 1:
    # the model outputs shape [batch_size, 1], but we want [batch_size] to match the labels
    # tolist() converts tensor values to Python lists so they can be used with sklearn metrics
    for features, labels in test_loader:
        logits = model(features).squeeze()
        preds = (torch.sigmoid(logits) >= 0.5).float()
        all_preds.extend(preds.tolist())
        all_labels.extend(labels.tolist())

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(all_labels, all_preds, 
                          target_names=['<=50K', '>50K'],
                          digits=4))

Classification Report:
              precision    recall  f1-score   support

       <=50K     0.8957    0.9310    0.9130      7479
        >50K     0.7414    0.6459    0.6903      2290

    accuracy                         0.8642      9769
   macro avg     0.8185    0.7884    0.8017      9769
weighted avg     0.8595    0.8642    0.8608      9769

The classification report provides a comprehensive evaluation of model performance, including precision, recall, and F1-score for each class. These metrics offer more nuanced insight than accuracy alone, particularly for imbalanced datasets.

8.10 Conclusion

This chapter has presented PyTorch’s core abstractions for neural network implementation: tensors as the fundamental data structure, automatic differentiation for gradient computation, and modular components for network construction.

In the next chapter we will see how to build on this foundation to build networks with more complicated structures and deploy more sophisticated training strategies.

8.11 Exercises

  1. Tensor Manipulation
    • Create a 3D tensor with shape (2, 3, 4) filled with random values.
    • Compute the mean along each dimension and explain the resulting shapes.
    • Reshape the tensor to (6, 4) and then to (4, 6), verifying data consistency.
  2. Autograd Exploration
    • Create a computational graph with multiple operations on tensor variables.
    • Compute gradients and verify them against manual calculations.
    • Experiment with detach() and with torch.no_grad() to understand gradient flow control.
  3. Network Architecture Variation
    • Modify the IncomePredictor model to include two hidden layers instead of one.
    • Experiment with different activation functions (Leaky ReLU, tanh) and compare performance.
    • Implement dropout regularization between layers and evaluate its impact.
  4. Optimizer Comparison
    • Train the same model architecture with different optimizers (SGD, SGD+momentum, RMSprop, Adam).
    • Plot learning curves for each optimizer and analyze convergence speed and stability.
    • Experiment with learning rate values and explain their impact on training dynamics.
  5. Custom Dataset Implementation
    • Create a custom Dataset class that adds noise to features as a form of data augmentation.
    • Implement on-the-fly normalization within the dataset rather than preprocessing.
    • Compare model performance with and without these data handling techniques.