9  Neural Network Architectures

Note

This is an EARLY DRAFT.

In previous chapters, we explored the foundational principles of neural networks. Now, we advance to more sophisticated architectures that represent the state of the art in deep learning research and industry applications. These architectures enable models to learn complex patterns, handle diverse data types, and achieve strong performance across a wide range of tasks.

This chapter examines several key neural network architectures and techniques that have proven effective in modern deep learning practice. We begin with the fundamental multilayer perceptron and progressively introduce more complex structures, including embedding layers, multi-block architectures, multi-output networks, and regularization techniques. Throughout the chapter, we provide theoretical insights and implementation considerations to help you understand not just how these architectures work, but why they are effective for specific problems.

9.1 Multilayer Perceptrons

The multilayer perceptron (MLP) forms the foundation of deep learning. Despite its relative simplicity, this architecture remains a powerful tool for many machine learning tasks, particularly those involving tabular data.

9.1.1 Structure and Forward Pass

A standard MLP consists of an input layer, one or more hidden layers, and an output layer. Each layer contains multiple neurons, with each neuron in a given layer connected to all neurons in the adjacent layers. This fully-connected structure is why MLPs are also known as dense networks.

Mathematically, the forward pass through a single hidden layer can be expressed as:

\[h = \sigma(W_1 x + b_1)\]

where \(x \in \mathbb{R}^{d_{in}}\) is the input vector, \(W_1 \in \mathbb{R}^{d_{hidden} \times d_{in}}\) is the weight matrix, \(b_1 \in \mathbb{R}^{d_{hidden}}\) is the bias vector, and \(\sigma\) is a non-linear activation function. The output layer then transforms the hidden representation:

\[y = W_2 h + b_2\]

where \(y \in \mathbb{R}^{d_{out}}\) is the output vector, \(W_2 \in \mathbb{R}^{d_{out} \times d_{hidden}}\) is the output weight matrix, and \(b_2 \in \mathbb{R}^{d_{out}}\) is the output bias vector.

For deeper networks with \(L\) hidden layers, the forward pass involves repeated application of affine transformations followed by non-linearities:

\[h_1 = \sigma(W_1 x + b_1)\] \[h_2 = \sigma(W_2 h_1 + b_2)\] \[\vdots\] \[h_L = \sigma(W_L h_{L-1} + b_L)\] \[y = W_{L+1} h_L + b_{L+1}\]

9.1.2 Activation Functions

The choice of activation function \(\sigma\) significantly impacts network behavior. Common activation functions include:

  • ReLU (Rectified Linear Unit): \(\sigma(z) = \max(0, z)\)
  • Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
  • Tanh: \(\sigma(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
  • LeakyReLU: \(\sigma(z) = \max(\alpha z, z)\) where \(\alpha\) is a small constant (e.g., 0.01)
  • GELU (Gaussian Error Linear Unit): \(\sigma(z) = z \cdot \Phi(z)\) where \(\Phi\) is the cumulative distribution function of the standard normal distribution

ReLU is often the default choice due to its computational efficiency and effectiveness in mitigating the vanishing gradient problem. However, LeakyReLU or GELU may offer better performance for deeper networks.
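Each of these activations is available as a PyTorch module, which makes swapping them in an architecture a one-line change. A quick sketch, evaluating them on a small range of inputs:

import torch
from torch import nn

z = torch.linspace(-3, 3, steps=7)

# Each activation function is available as a module (or a functional equivalent)
print(nn.ReLU()(z))
print(nn.Sigmoid()(z))
print(nn.Tanh()(z))
print(nn.LeakyReLU(negative_slope=0.01)(z))
print(nn.GELU()(z))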

9.1.3 Implementation in PyTorch

PyTorch makes it straightforward to implement MLPs:

import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev_dim = input_dim
        
        # Create hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        
        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

This implementation allows for flexible specification of network depth and width through the hidden_dims parameter, which accepts a list of integers representing the size of each hidden layer.
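For instance, a network with two hidden layers of sizes 64 and 32 can be instantiated and applied to a batch of inputs as follows (the dimensions here are illustrative):

model = MLP(input_dim=10, hidden_dims=[64, 32], output_dim=1)
x = torch.randn(16, 10)    # a batch of 16 samples with 10 features each
y_hat = model(x)
print(y_hat.shape)         # torch.Size([16, 1])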

9.1.4 Strengths and Limitations

MLPs offer several advantages:

  • Universal approximation: Theoretically, MLPs with sufficient hidden units can approximate any continuous function on a compact domain to arbitrary precision.
  • Conceptual simplicity: Their straightforward structure makes them easy to understand and implement.
  • Versatility: They can handle various data types and learning tasks.

However, they also have important limitations:

  • No spatial awareness: MLPs treat inputs as flat vectors, ignoring any inherent structure in the data.
  • Parameter inefficiency: The fully-connected nature requires many parameters, which can lead to overfitting.
  • Difficulty with sequences: Standard MLPs struggle with variable-length inputs and temporal dependencies.

9.2 Embedding Layers for Categorical Variables

Many real-world datasets contain categorical variables with high cardinality (many possible values), such as user IDs, product codes, or location identifiers. Effectively incorporating these variables into neural networks requires special attention.

9.2.1 Limitations of One-Hot Encoding

The traditional approach to handling categorical variables is one-hot encoding, where each category gets its own binary feature. However, this approach has serious drawbacks for high-cardinality variables:

  1. Dimensionality explosion: With thousands or millions of possible values, one-hot encoding creates extremely sparse, high-dimensional inputs.
  2. Memory inefficiency: Storing these sparse vectors wastes memory.
  3. No semantic relationship: One-hot encoding places all categories equidistant from each other, failing to capture any semantic relationships.

9.2.2 Neural Embeddings

Embedding layers address these limitations by mapping each category to a dense vector in a lower-dimensional space. These vectors are learned during model training, allowing the network to discover meaningful relationships between categories.

Formally, an embedding layer for a categorical variable with \(K\) possible values creates a lookup table \(E \in \mathbb{R}^{K \times d}\), where \(d\) is the embedding dimension. When processing an input with category \(i\), the layer outputs the embedding vector \(E_i \in \mathbb{R}^d\) (the \(i\)-th row of the embedding matrix).
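The lookup semantics can be seen directly in a small example; the vocabulary size and embedding dimension below are arbitrary choices for illustration:

K, d = 1000, 16
embedding = nn.Embedding(K, d)           # lookup table E with shape (K, d)

idx = torch.tensor([3, 42, 999])         # a batch of integer category indices
vectors = embedding(idx)                 # rows E_3, E_42 and E_999
print(vectors.shape)                     # torch.Size([3, 16])
print(torch.equal(vectors[0], embedding.weight[3]))   # True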

9.2.3 Embedding Dimension

The embedding dimension \(d\) is a hyperparameter that balances expressiveness against complexity. While there’s no universal rule, common heuristics include:

  • \(d \approx \sqrt[4]{K}\) where \(K\) is the number of categories
  • \(d \in [8, 512]\) with smaller values for simpler relationships and larger values for more complex ones
  • \(d \propto \log(K)\) for very large vocabularies

In practice, the optimal embedding dimension often requires experimentation.
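As a rough illustration, the fourth-root heuristic can be wrapped in a small helper; the clipping bounds used here are arbitrary assumptions rather than a standard:

def suggest_embed_dim(num_categories, min_dim=4, max_dim=512):
    """Fourth-root heuristic for choosing an embedding dimension."""
    return int(min(max_dim, max(min_dim, round(num_categories ** 0.25))))

print(suggest_embed_dim(300))       # 4
print(suggest_embed_dim(50_000))    # 15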

9.2.4 Learning Meaningful Representations

Embeddings learn representations based on the prediction task, capturing aspects of the categories that are relevant to the model’s objective.

For example, in our taxi fare prediction task, location embeddings might encode:

  • Neighborhood affluence (affecting tip amounts)
  • Proximity to tourist attractions (affecting demand)
  • Traffic patterns (affecting trip duration)
  • Distance from city center (affecting base fares)

These learned representations often reveal intriguing semantic relationships. In natural language processing, word embeddings famously capture analogies like “king - man + woman ≈ queen”.

9.2.5 Implementation in PyTorch

PyTorch provides a dedicated nn.Embedding layer:

class LocationEmbeddingModel(nn.Module):
    def __init__(self, num_locations, embed_dim, other_features_dim):
        super().__init__()
        self.location_embedding = nn.Embedding(num_locations, embed_dim)
        self.fc = nn.Linear(embed_dim + other_features_dim, 1)
    
    def forward(self, location_idx, other_features):
        # location_idx: tensor of location indices
        # other_features: tensor of other input features
        
        # Get embeddings for locations
        loc_embedded = self.location_embedding(location_idx)
        
        # Concatenate with other features
        combined = torch.cat([loc_embedded, other_features], dim=1)
        
        # Final prediction
        return self.fc(combined)

This model embeds location IDs into dense vectors, concatenates them with other features, and then makes predictions through a fully-connected layer.
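A hypothetical forward call looks like this; note that the embedding indices must be an integer (long) tensor, while the remaining features are floats (all sizes are illustrative):

model = LocationEmbeddingModel(num_locations=300, embed_dim=16, other_features_dim=5)

location_idx = torch.randint(0, 300, (32,))   # long tensor of location IDs
other_features = torch.randn(32, 5)           # continuous trip features
fare_pred = model(location_idx, other_features)
print(fare_pred.shape)                        # torch.Size([32, 1])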

9.2.6 Shared Embeddings

When multiple categorical variables represent the same type of entity (e.g., pickup and dropoff locations), they can share an embedding table. This not only reduces parameters but also leverages the semantic relationships between these variables:

# In __init__: both pickup and dropoff locations share the same embedding table
self.location_embedding = nn.Embedding(num_locations, embed_dim)

# In forward: look up embeddings for the pickup location ...
pickup_embedded = self.location_embedding(pickup_idx)

# ... and for the dropoff location, reusing the same table
dropoff_embedded = self.location_embedding(dropoff_idx)

9.2.7 Pre-trained Embeddings

For some domains, pre-trained embeddings are available (e.g., GloVe or Word2Vec for text). These can provide a strong starting point, especially when training data is limited:

# Initialize with pre-trained embeddings
self.embedding = nn.Embedding.from_pretrained(
    torch.FloatTensor(pretrained_vectors),
    freeze=False  # Allow fine-tuning
)

Embeddings have become a cornerstone of modern neural networks, enabling efficient representation learning for categorical data and forming the foundation for many advanced architectures.

9.3 Multiple Block Architectures with Combined Output

Modern neural networks often employ multiple specialized blocks or modules, each designed to capture different aspects of the data or solve different sub-problems. These blocks’ outputs are then combined in diverse ways to produce the final prediction.

9.3.1 Block-Based Design

A block-based architecture separates the network into distinct components with specific roles. This modular approach offers several advantages:

  1. Specialization: Each block can focus on a specific aspect of the problem.
  2. Maintainability: Individual blocks can be modified or replaced without affecting the entire network.
  3. Interpretability: The function of each block can provide insights into the model’s decision-making process.

9.3.2 Conditional Processing Pattern

A particularly powerful pattern involves conditional processing, where one block’s output determines how another block’s output is used. Consider a time series prediction task with variables \(Y_t\) (target) and \(X_t\) (covariates), where we want to predict \(Y_{t+1}\).

We can design a two-block architecture:

  1. Persistence Block: Predicts the probability \(p\) that \(Y_{t+1} = Y_t\) (i.e., the value remains unchanged).
  2. Change Block: Predicts the value of \(Y_{t+1}\) conditional on it being different from \(Y_t\).

The final prediction combines these outputs based on the predicted persistence probability:

\[\hat{Y}_{t+1} = p \cdot Y_t + (1 - p) \cdot \text{Change}(X_t, Y_t)\]

where \(\text{Change}(X_t, Y_t)\) is the output of the change block.
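In code, this combination rule is a single line once the two block outputs are available; a minimal sketch, assuming p, y_t, and change_pred are tensors of shape (batch, 1):

def combine_predictions(p, y_t, change_pred):
    # Convex mixture of the persistence and change predictions
    return p * y_t + (1 - p) * change_pred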

9.3.3 Implementation with Trainable Threshold

This approach can be refined by introducing a trainable threshold parameter \(\tau\) that determines when to use the change block’s prediction:

class DualBlockModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.persistence_block = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()  # Output is probability of persistence
        )
        
        self.change_block = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # Output is the predicted new value
        )
        
        # Trainable threshold parameter (initialized at 0.5)
        self.threshold = nn.Parameter(torch.tensor([0.5]))
        
    def forward(self, x, y_t):
        # Predict persistence probability
        p = self.persistence_block(x)
        
        # Predict new value (if change occurs)
        new_value = self.change_block(x)
        
        # Compare persistence probability with threshold
        use_persistence = (p > self.threshold).float()
        
        # Final prediction: weighted combination based on threshold comparison
        y_pred = use_persistence * y_t + (1 - use_persistence) * new_value
        
        return y_pred, p

In this implementation, the threshold \(\tau\) is declared as a learnable parameter, with the intent that the model learns when to trust the persistence prediction versus the change prediction. Note, however, that the hard comparison (p > self.threshold) is a step function and passes no gradient back to \(\tau\); for the threshold to actually be learned from data, the indicator is usually replaced by a smooth surrogate.
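A minimal sketch of such a smooth gate, anticipating the soft-transition variant listed in the next subsection: a temperature-controlled sigmoid replaces the hard indicator so that gradients reach the threshold (the temperature value is an illustrative assumption).

class SoftGate(nn.Module):
    def __init__(self, init_threshold=0.5, temperature=0.1):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor([init_threshold]))
        self.temperature = temperature

    def forward(self, p, y_t, new_value):
        # Smooth approximation of the indicator (p > threshold): close to 1 when
        # p is well above the threshold, close to 0 when it is well below.
        gate = torch.sigmoid((p - self.threshold) / self.temperature)
        return gate * y_t + (1 - gate) * new_value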

9.3.4 Variants and Extensions

This pattern can be extended in numerous ways:

  1. Multiple Thresholds: Different thresholds for different regions of the input space.
  2. Soft Transitions: Replacing the hard threshold with a smooth function.
  3. Ensemble Approach: Using multiple specialized blocks and a meta-learner to weight their outputs.
  4. Hierarchical Structure: Organizing blocks in a tree-like structure for hierarchical decision-making.

9.3.5 Attention-Based Combination

An alternative to threshold-based combination is using attention mechanisms to dynamically weight the contributions of different blocks:

class AttentionCombinedModel(nn.Module):
    def __init__(self, input_dim, output_dim, num_blocks=3):
        super().__init__()
        # Create multiple processing blocks
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, 64),
                nn.ReLU(),
                nn.Linear(64, output_dim)
            ) for _ in range(num_blocks)
        ])
        
        # Attention mechanism for weighting block outputs
        self.attention = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.Tanh(),
            nn.Linear(64, num_blocks),
            nn.Softmax(dim=1)  # Ensures weights sum to 1
        )
    
    def forward(self, x):
        # Get block outputs
        block_outputs = [block(x) for block in self.blocks]
        stacked_outputs = torch.stack(block_outputs, dim=1)
        
        # Generate attention weights
        weights = self.attention(x).unsqueeze(2)
        
        # Weighted combination of block outputs
        combined = torch.sum(stacked_outputs * weights, dim=1)
        
        return combined

This approach allows the model to adaptively focus on different processing pathways based on the input, effectively learning which block is most relevant for each sample.
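A quick usage sketch with illustrative dimensions:

model = AttentionCombinedModel(input_dim=20, output_dim=1, num_blocks=3)
x = torch.randn(8, 20)
y_hat = model(x)
print(y_hat.shape)   # torch.Size([8, 1])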

9.3.6 Domain-Specific Block Design

The design of specialized blocks should be informed by domain knowledge. For economic time series, blocks might specialize in:

  • Trend components
  • Seasonal patterns
  • Exogenous shocks
  • Mean-reversion dynamics

By encoding such domain insights into the architecture, we create models that better align with the underlying data-generating processes.

9.4 Multiple Outcomes with Separate Losses

Many real-world problems require predicting multiple related outputs simultaneously. For instance, a model might need to predict both a worker’s future sector of employment and their expected wages. These multi-output scenarios present unique challenges and opportunities for neural network design.

9.4.1 Multi-Task Learning Framework

Multi-task learning trains a single model to perform multiple related tasks, leveraging shared representations to improve performance across all tasks. This approach has several benefits:

  1. Data Efficiency: Learning shared representations from multiple tasks can reduce the data needed for each task.
  2. Regularization: Additional tasks can act as regularizers for the primary task.
  3. Feature Importance: The model can leverage complementary information across tasks.

Formally, a multi-task model learns a mapping:

\[f: \mathbb{R}^d \rightarrow \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \times \cdots \times \mathbb{R}^{m_k}\]

where \(d\) is the input dimension and \(m_i\) is the output dimension for task \(i\).

9.4.2 Architecture Design

Multi-output architectures typically feature:

  1. Shared Layers: Initial layers that learn common representations
  2. Task-Specific Heads: Specialized output layers for each task
  3. Custom Loss Functions: Appropriate loss functions for each output type

Here’s a model that predicts both employment sector (categorical) and wages (continuous):

class WorkerPredictionModel(nn.Module):
    def __init__(self, input_dim, shared_dim, num_sectors):
        super().__init__()
        # Shared representation layers
        self.shared_network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, shared_dim),
            nn.ReLU()
        )
        
        # Sector prediction head
        self.sector_head = nn.Sequential(
            nn.Linear(shared_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_sectors)
            # No softmax here - will use CrossEntropyLoss which includes softmax
        )
        
        # Wage prediction head
        self.wage_head = nn.Sequential(
            nn.Linear(shared_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    
    def forward(self, x):
        shared_features = self.shared_network(x)
        sector_logits = self.sector_head(shared_features)
        wage_prediction = self.wage_head(shared_features)
        
        return sector_logits, wage_prediction

9.4.3 Combined Loss Function

Training multi-output models requires combining the losses from each task. The simplest approach is a weighted sum:

\[\mathcal{L}_{\text{total}} = \sum_{i=1}^k \lambda_i \mathcal{L}_i\]

where \(\mathcal{L}_i\) is the loss for task \(i\) and \(\lambda_i\) is its weight.

# training_step method of the model (assumes: import torch.nn.functional as F;
# self.log uses the PyTorch Lightning logging interface covered in the next chapter)
def training_step(self, batch, batch_idx):
    x, sector_true, wage_true = batch
    
    # Forward pass
    sector_logits, wage_pred = self(x)
    
    # Task-specific losses
    sector_loss = F.cross_entropy(sector_logits, sector_true)
    wage_loss = F.mse_loss(wage_pred, wage_true)
    
    # Combined loss with weighting
    total_loss = 0.7 * sector_loss + 0.3 * wage_loss
    
    self.log("sector_loss", sector_loss)
    self.log("wage_loss", wage_loss)
    self.log("total_loss", total_loss)
    
    return total_loss

9.4.4 Handling Loss Scale Disparities

Different tasks often produce losses with different scales, which can bias training toward tasks with larger loss values. Several techniques address this issue:

  1. Task Weighting: Manually assign weights to balance task contributions.
  2. Uncertainty Weighting: Learn the optimal weights during training based on task uncertainty.
  3. Gradient Normalization: Scale gradients to have similar magnitudes across tasks.
  4. Loss Normalization: Normalize each loss by its moving average.
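As an illustration of the last option, each loss can be divided by an exponential moving average of its own recent values so that all tasks contribute on a comparable scale. The sketch below assumes a simple EMA with decay 0.99; it is not a standard library utility:

class LossNormalizer:
    """Track an exponential moving average of each task loss and rescale by it."""

    def __init__(self, task_names, decay=0.99):
        self.decay = decay
        self.ema = {name: None for name in task_names}

    def normalize(self, name, loss):
        value = loss.detach().item()
        if self.ema[name] is None:
            self.ema[name] = value
        else:
            self.ema[name] = self.decay * self.ema[name] + (1 - self.decay) * value
        return loss / (self.ema[name] + 1e-8)

# Illustrative use inside a training step:
# normalizer = LossNormalizer(["sector", "wage"])
# total_loss = normalizer.normalize("sector", sector_loss) + \
#              normalizer.normalize("wage", wage_loss)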

9.4.5 Adaptive Loss Weighting

An elegant approach to loss weighting involves learning task weights based on uncertainty, as proposed by Kendall et al. (2018):

class UncertaintyWeightedModel(nn.Module):
    def __init__(self, input_dim, shared_dim, num_sectors):
        super().__init__()
        # ... same architecture as WorkerPredictionModel ...
        
        # Learnable log variances for loss weighting
        self.log_var_sector = nn.Parameter(torch.zeros(1))
        self.log_var_wage = nn.Parameter(torch.zeros(1))
    
    def forward(self, x):
        # ... same forward pass ...
        return sector_logits, wage_prediction
    
    def training_step(self, batch, batch_idx):
        x, sector_true, wage_true = batch
        
        # Forward pass
        sector_logits, wage_pred = self(x)
        
        # Task-specific losses
        sector_loss = F.cross_entropy(sector_logits, sector_true)
        wage_loss = F.mse_loss(wage_pred, wage_true)
        
        # Uncertainty-weighted loss
        precision_sector = torch.exp(-self.log_var_sector)
        precision_wage = torch.exp(-self.log_var_wage)
        
        total_loss = (precision_sector * sector_loss + 0.5 * self.log_var_sector) + \
                     (precision_wage * wage_loss + 0.5 * self.log_var_wage)
        
        return total_loss

This approach automatically balances the tasks by learning their relative importance during training.

9.4.6 Task Relationships

Understanding the relationships between tasks can inform better architecture design:

  1. Task Hierarchy: Some tasks naturally build upon others.
  2. Auxiliary Tasks: Secondary tasks that help learn useful representations.
  3. Competing Tasks: Tasks with conflicting objectives that require careful balancing.
  4. Sequential Dependencies: Tasks where one output influences another.

For our worker prediction example, we might leverage the relationship between sector and wages by making wage prediction conditionally dependent on the predicted sector:

def forward(self, x):
    shared_features = self.shared_network(x)
    
    # Predict sector first
    sector_logits = self.sector_head(shared_features)
    sector_probs = F.softmax(sector_logits, dim=1)
    
    # Condition wage prediction on sector probabilities (note: the wage head must
    # be built with input size shared_dim + num_sectors for this to work)
    combined_features = torch.cat([shared_features, sector_probs], dim=1)
    wage_prediction = self.wage_head(combined_features)
    
    return sector_logits, wage_prediction

This captures the real-world relationship where wages depend partly on employment sector.

9.5 Dropout

As neural networks grow deeper and wider, they become increasingly susceptible to overfitting—memorizing training data rather than learning generalizable patterns. Dropout is a simple yet remarkably effective regularization technique that addresses this problem.

9.5.1 The Dropout Mechanism

Dropout, introduced by Srivastava et al. (2014), works by randomly “dropping” (setting to zero) a fraction of neurons during each training iteration. Mathematically, for each neuron in a layer, we apply:

\[ y = \begin{cases} 0 & \text{with probability } p \\ \frac{x}{1-p} & \text{with probability } 1-p \end{cases} \]

where \(p\) is the dropout rate (typically 0.2-0.5), \(x\) is the neuron’s original output, and \(y\) is the final output. The scaling factor \(\frac{1}{1-p}\) ensures that the expected value of the output remains unchanged.

During inference (testing), no neurons are dropped. In the original formulation of dropout, the outputs are instead scaled by \(1-p\) at test time to keep the expected value unchanged; most modern frameworks, including PyTorch, use the inverted formulation shown above, which applies the \(\frac{1}{1-p}\) scaling during training so that no adjustment is needed at inference.
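This behavior is easy to verify with nn.Dropout; the exact pattern of zeros in training mode is random, so the training-mode output shown is just one possible draw:

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()       # training mode: random zeros, survivors scaled by 1/(1-p) = 2
print(drop(x))     # e.g. tensor([[2., 0., 2., 2., 0., 0., 2., 0.]])

drop.eval()        # evaluation mode: dropout is a no-op
print(drop(x))     # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])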

9.5.2 Conceptual Understanding

Dropout can be understood from several perspectives:

  1. Ensemble Interpretation: Each training iteration samples a different “thinned” network from the exponentially many possible subnetworks. This effectively trains an ensemble of networks that share parameters.

  2. Co-adaptation Prevention: Neurons can’t rely on specific other neurons being present, forcing them to learn robust features useful in multiple contexts.

  3. Noise Robustness: By injecting noise into the hidden activations, dropout encourages models to learn representations that are robust to perturbations.

9.5.3 Implementation in PyTorch

PyTorch provides a straightforward Dropout module:

class DropoutMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_dim = input_dim
        
        for i, hidden_dim in enumerate(hidden_dims):
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            
            # Apply dropout to all hidden layers except the last one
            if i < len(hidden_dims) - 1:
                layers.append(nn.Dropout(dropout_rate))
            
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        
        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

9.5.4 Dropout Variants

Several variants of dropout have been proposed to address specific challenges:

  1. Spatial Dropout: Drops entire feature maps in convolutional networks, preserving spatial coherence.

  2. DropConnect: Randomly drops weights rather than activations, providing a different form of regularization.

  3. Concrete Dropout: Learns the optimal dropout rate for each layer during training.

  4. Alpha Dropout: Designed for self-normalizing neural networks, preserves the mean and variance of inputs.

  5. Variational Dropout: Uses Bayesian principles to determine which weights to drop.

9.5.5 Placement and Rate Selection

Effective dropout implementation requires careful consideration of:

  1. Placement: Typically applied after activation functions in hidden layers, but not usually on input features or output layers.

  2. Rate Selection: Higher dropout rates (e.g., 0.5) for layers with many parameters, lower rates (e.g., 0.2) for layers with fewer parameters.

  3. Model Size Adjustment: Because dropout reduces the network's effective capacity, larger models are often needed; increasing model size when adding dropout frequently improves performance.

9.5.6 Interaction with Other Techniques

Dropout interacts with other aspects of neural network training:

  1. Learning Rate: Models with dropout often benefit from higher learning rates.

  2. Weight Decay: The combination of dropout and weight decay can provide complementary forms of regularization.

  3. Batch Normalization: Batch normalization and dropout can work together but require careful ordering (more on this in the next section).

  4. Data Augmentation: Both techniques inject noise but in different spaces (feature space vs. input space).

9.6 Batch Normalization

Training deep neural networks is challenging partly due to the phenomenon of internal covariate shift—the change in the distribution of network activations due to updates to preceding layers. Batch Normalization (BatchNorm), introduced by Ioffe and Szegedy (2015), addresses this issue by normalizing layer inputs, dramatically improving training stability and speed.

9.6.1 The BatchNorm Operation

For a mini-batch of activations \(\{x_1, x_2, ..., x_m\}\), BatchNorm performs the following transformation:

  1. Calculate batch mean: \(\mu_B = \frac{1}{m} \sum_{i=1}^m x_i\)

  2. Calculate batch variance: \(\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2\)

  3. Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\) where \(\epsilon\) is a small constant for numerical stability

  4. Scale and shift: \(y_i = \gamma \hat{x}_i + \beta\) where \(\gamma\) and \(\beta\) are learnable parameters

This process ensures that each feature has approximately zero mean and unit variance, while the learnable parameters \(\gamma\) and \(\beta\) allow the network to undo the normalization if needed.
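The four steps can be reproduced in a few lines and checked against nn.BatchNorm1d; at initialization \(\gamma = 1\) and \(\beta = 0\), so the manual computation should match the module's output in training mode up to numerical precision:

x = torch.randn(32, 4)                        # mini-batch of 32 samples, 4 features

mu = x.mean(dim=0)                            # step 1: per-feature batch mean
var = x.var(dim=0, unbiased=False)            # step 2: per-feature batch variance
x_hat = (x - mu) / torch.sqrt(var + 1e-5)     # step 3: normalize
# step 4 (scale and shift) is the identity here because gamma = 1 and beta = 0

bn = nn.BatchNorm1d(4, eps=1e-5)
bn.train()                                    # use batch statistics, not running averages
print(torch.allclose(bn(x), x_hat, atol=1e-5))   # True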

9.6.2 Benefits of Batch Normalization

Batch Normalization offers several key advantages:

  1. Faster Convergence: By reducing internal covariate shift, BatchNorm allows higher learning rates and accelerates training.

  2. Regularization Effect: The batch statistics introduce noise, providing a regularizing effect similar to dropout.

  3. Reduced Sensitivity to Initialization: BatchNorm makes networks more robust to poor weight initialization.

  4. Gradient Flow: Normalization helps prevent exploding or vanishing gradients in deep networks.

  5. Smoother Optimization Landscape: BatchNorm smooths the optimization landscape, making it easier to navigate.

9.6.3 Implementation in PyTorch

Implementing BatchNorm in PyTorch is straightforward:

class BatchNormMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.BatchNorm1d(hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        
        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

Note that BatchNorm is applied after the linear transformation but before the activation function.

9.6.4 Inference Behavior

During inference, BatchNorm uses a moving average of mean and variance calculated during training, rather than batch statistics:

\[\hat{x} = \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}}\]

PyTorch handles this transition automatically through the model.train() and model.eval() methods.

9.6.5 BatchNorm Variants

Several variants of BatchNorm exist for different scenarios:

  1. Layer Normalization: Normalizes across features for each sample independently, useful for recurrent networks and when batch size is small.

  2. Instance Normalization: Normalizes each channel in each sample independently, popular in style transfer.

  3. Group Normalization: Divides channels into groups and normalizes within each group, providing a middle ground between Layer and Instance Normalization.

  4. Weight Normalization: Normalizes the weights rather than the activations.

9.6.6 Placement Considerations

The placement of BatchNorm layers requires careful consideration:

  1. Pre-activation vs. Post-activation: BatchNorm is most commonly placed pre-activation, i.e., after the linear operation but before the non-linearity.

  2. Interaction with Dropout: Typically, BatchNorm is applied before dropout to normalize the activations that dropout will randomly zero out.

  3. First Layer: BatchNorm is sometimes omitted after the first layer when inputs are already normalized.

  4. Last Layer: BatchNorm is typically not applied after the output layer to avoid constraining the output distribution.

9.6.7 BatchNorm and Residual Networks

BatchNorm is particularly effective when combined with residual connections (discussed in the next section). The standard placement in residual blocks is:

Conv → BatchNorm → ReLU → Conv → BatchNorm → Add → ReLU

This arrangement ensures that the normalized activations are passed through the non-linearity and that the residual addition occurs before the final activation.

9.7 Residual Networks

Training very deep neural networks has historically been challenging due to optimization difficulties, vanishing/exploding gradients, and degradation problems. Residual Networks (ResNets), introduced by He et al. (2016), address these issues through a simple yet profound architectural innovation: skip connections.

9.7.1 The Residual Block

The core innovation of ResNets is the residual block, which can be expressed as:

\[y = F(x, W) + x\]

where \(x\) is the input to the block, \(F(x, W)\) is a residual mapping (typically a sequence of layers with weights \(W\)), and \(y\) is the output. The direct addition of the input \(x\) creates a shortcut connection that bypasses the residual mapping.

Intuitively, instead of learning a direct mapping \(H(x)\) from input to output, the network learns the residual mapping \(F(x) = H(x) - x\). This approach makes it easier for the network to learn identity mappings when optimal, allowing the effective training of much deeper networks.

9.7.2 Gradient Flow in ResNets

The success of ResNets can be attributed to improved gradient flow during backpropagation. Consider a loss function \(L\) and its gradient with respect to the input \(x\):

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left(\frac{\partial F(x)}{\partial x} + I\right)\]

The identity term \(I\) ensures that gradients can flow directly from later layers to earlier ones, mitigating the vanishing gradient problem.

9.7.3 Basic Implementation in PyTorch

A basic residual block in PyTorch might look like:

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, x):
        residual = x
        
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.conv2(out)
        out = self.bn2(out)
        
        out += residual  # Add the shortcut connection
        out = self.relu(out)
        
        return out

For the case of dense layers, which are more common in economic applications, a residual block might look like:

class DenseResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, x):
        residual = x
        
        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.fc2(out)
        out = self.bn2(out)
        
        out += residual
        out = self.relu(out)
        
        return out
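These blocks can be stacked into a residual MLP for tabular problems; a minimal sketch with illustrative layer sizes:

class TabularResNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_blocks=3):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)   # project to the block width
        self.blocks = nn.Sequential(
            *[DenseResidualBlock(hidden_dim) for _ in range(num_blocks)]
        )
        self.head = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        h = self.input_proj(x)
        h = self.blocks(h)
        return self.head(h)

# model = TabularResNet(input_dim=30, hidden_dim=64, output_dim=1)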

9.7.4 Dimension Matching

When the input and output dimensions differ, the shortcut connection must be adjusted. Common approaches include:

  1. Zero Padding: Pad the shortcut connection with zeros to match dimensions.
  2. Projection Shortcut: Use a linear transformation (usually 1×1 convolution or linear layer) to project the input to the desired dimension.

The projection approach for dense layers might look like this:

class DimensionChangingBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.bn1 = nn.BatchNorm1d(out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)
        self.bn2 = nn.BatchNorm1d(out_dim)
        
        # Projection shortcut for dimension matching
        self.shortcut = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim)
        )
        
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, x):
        residual = self.shortcut(x)
        
        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.fc2(out)
        out = self.bn2(out)
        
        out += residual
        out = self.relu(out)
        
        return out

9.7.5 ResNet Variations and Extensions

The basic ResNet architecture has inspired numerous variations:

  1. Pre-activation ResNet: Moves the batch normalization and activation before the convolution, improving gradient flow.

  2. Wide ResNet: Uses wider layers with fewer blocks, achieving similar performance with reduced depth.

  3. ResNeXt: Introduces a cardinality dimension by using grouped convolutions within residual blocks.

  4. DenseNet: Instead of simple addition, concatenates features from earlier layers, creating dense connections.

  5. SE-ResNet: Incorporates Squeeze-and-Excitation blocks that adaptively recalibrate channel-wise feature responses.

9.7.6 Bottleneck Architecture

For very deep networks, a bottleneck architecture reduces computational complexity:

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()
        bottleneck_channels = out_channels // expansion
        
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 
                               kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, kernel_size=1)
        self.bn3 = nn.BatchNorm2d(out_channels)
        
        self.relu = nn.ReLU(inplace=True)
        
        # Shortcut connection if dimensions change
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        residual = self.shortcut(x)
        
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        
        out = self.conv3(out)
        out = self.bn3(out)
        
        out += residual
        out = self.relu(out)
        
        return out

This bottleneck design uses 1×1 convolutions to reduce and then restore dimensions, with a 3×3 convolution in between, significantly reducing the number of parameters and computations.

9.7.7 Application Beyond Computer Vision

While ResNets were initially developed for image classification, the concept of residual connections has proven valuable across domains:

  1. Natural Language Processing: Transformers use residual connections around self-attention and feed-forward layers.

  2. Time Series Analysis: Residual connections can help models capture both short-term fluctuations and long-term trends.

  3. Tabular Data: Dense residual blocks improve performance on structured data problems common in economics and finance.

  4. Generative Models: Many state-of-the-art generative architectures incorporate residual connections.

The widespread adoption of residual connections across diverse domains underscores their fundamental importance in deep learning architecture design.

9.8 Conclusion

This chapter has explored a range of neural network architectures that form the foundation of modern deep learning. We began with the simple multilayer perceptron and progressively examined more sophisticated designs: embedding layers for categorical data, multi-block architectures with conditional processing, multi-output networks, and powerful regularization techniques including dropout and batch normalization. We concluded with residual networks, which have revolutionized the training of very deep neural networks.

Several key principles emerge from this exploration:

  1. Problem-Specific Design: Neural network architecture should reflect the specific structure and requirements of the problem domain.

  2. Modularity: Breaking networks into specialized components enhances interpretability and facilitates experimentation.

  3. Information Flow: Many architectural innovations focus on improving the flow of information (and gradients) through the network.

  4. Regularization: Techniques like dropout and batch normalization help prevent overfitting while enabling more efficient training.

  5. Knowledge Transfer: While many architectures originated in computer vision or NLP, their principles transfer effectively to economic and financial applications.

In the next chapter, we will explore how to implement these architectural concepts efficiently using PyTorch Lightning, focusing on practical aspects like logging, checkpointing, early stopping, and hyperparameter search. These implementation details complement the architectural principles discussed here, enabling you to build and train sophisticated neural networks for real-world applications.