9 Neural Network Architectures
This is an EARLY DRAFT.
In previous chapters, we explored the foundational principles of neural networks. Now, we advance to more sophisticated architectures that represent the state-of-the-art in deep learning research and industry applications. These advanced architectures enable models to learn complex patterns, handle diverse data types, and achieve superior performance across a wide range of tasks.
This chapter examines several key neural network architectures and techniques that have proven effective in modern deep learning practice. We begin with the fundamental multilayer perceptron and progressively introduce more complex structures, including embedding layers, multi-block architectures, multi-output networks, and regularization techniques. Throughout the chapter, we provide theoretical insights and implementation considerations to help you understand not just how these architectures work, but why they are effective for specific problems.
9.1 Multilayer Perceptrons
The multilayer perceptron (MLP) forms the foundation of deep learning. Despite its relative simplicity, this architecture remains a powerful tool for many machine learning tasks, particularly those involving tabular data.
9.1.1 Structure and Forward Pass
A standard MLP consists of an input layer, one or more hidden layers, and an output layer. Each layer contains multiple neurons, with each neuron in a given layer connected to all neurons in the adjacent layers. This fully-connected structure is why MLPs are also known as dense networks.
Mathematically, the forward pass through a single hidden layer can be expressed as:
\[h = \sigma(W_1 x + b_1)\]
where \(x \in \mathbb{R}^{d_{in}}\) is the input vector, \(W_1 \in \mathbb{R}^{d_{hidden} \times d_{in}}\) is the weight matrix, \(b_1 \in \mathbb{R}^{d_{hidden}}\) is the bias vector, and \(\sigma\) is a non-linear activation function. The output layer then transforms the hidden representation:
\[y = W_2 h + b_2\]
where \(y \in \mathbb{R}^{d_{out}}\) is the output vector, \(W_2 \in \mathbb{R}^{d_{out} \times d_{hidden}}\) is the output weight matrix, and \(b_2 \in \mathbb{R}^{d_{out}}\) is the output bias vector.
For deeper networks with \(L\) hidden layers, the forward pass involves repeated application of affine transformations followed by non-linearities:
\[h_1 = \sigma(W_1 x + b_1)\] \[h_2 = \sigma(W_2 h_1 + b_2)\] \[\vdots\] \[h_L = \sigma(W_L h_{L-1} + b_L)\] \[y = W_{L+1} h_L + b_{L+1}\]
9.1.2 Activation Functions
The choice of activation function \(\sigma\) significantly impacts network behavior. Common activation functions include:
- ReLU (Rectified Linear Unit): \(\sigma(z) = \max(0, z)\)
- Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
- Tanh: \(\sigma(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
- LeakyReLU: \(\sigma(z) = \max(\alpha z, z)\) where \(\alpha\) is a small constant (e.g., 0.01)
- GELU (Gaussian Error Linear Unit): \(\sigma(z) = z \cdot \Phi(z)\) where \(\Phi\) is the cumulative distribution function of the standard normal distribution
ReLU is often the default choice due to its computational efficiency and effectiveness in mitigating the vanishing gradient problem. However, LeakyReLU or GELU may offer better performance for deeper networks.
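As a minimal sketch (input values chosen arbitrarily), these activations can be compared directly using torch and torch.nn.functional:

import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.relu(z))                             # max(0, z)
print(torch.sigmoid(z))                      # 1 / (1 + exp(-z))
print(torch.tanh(z))                         # hyperbolic tangent
print(F.leaky_relu(z, negative_slope=0.01))  # max(0.01 * z, z)
print(F.gelu(z))                             # z * Phi(z)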
9.1.3 Implementation in PyTorch
PyTorch makes it straightforward to implement MLPs:
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev_dim = input_dim

        # Create hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
This implementation allows for flexible specification of network depth and width through the hidden_dims
parameter, which accepts a list of integers representing the size of each hidden layer.
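For example (the dimensions here are purely illustrative), a network with two hidden layers for 10 input features and a single output can be built and applied to a random batch:

model = MLP(input_dim=10, hidden_dims=[64, 32], output_dim=1)
x = torch.randn(16, 10)   # batch of 16 samples with 10 features each
y = model(x)
print(y.shape)            # torch.Size([16, 1])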
9.1.4 Strengths and Limitations
MLPs offer several advantages:
- Universal approximation: Theoretically, MLPs with sufficient hidden units can approximate any continuous function on a compact domain to arbitrary precision.
- Conceptual simplicity: Their straightforward structure makes them easy to understand and implement.
- Versatility: They can handle various data types and learning tasks.
However, they also have important limitations:
- No spatial awareness: MLPs treat inputs as flat vectors, ignoring any inherent structure in the data.
- Parameter inefficiency: The fully-connected nature requires many parameters, which can lead to overfitting.
- Difficulty with sequences: Standard MLPs struggle with variable-length inputs and temporal dependencies.
9.2 Embedding Layers for Categorical Variables
Many real-world datasets contain categorical variables with high cardinality (many possible values), such as user IDs, product codes, or location identifiers. Effectively incorporating these variables into neural networks requires special attention.
9.2.1 Limitations of One-Hot Encoding
The traditional approach to handling categorical variables is one-hot encoding, where each category gets its own binary feature. However, this approach has serious drawbacks for high-cardinality variables:
- Dimensionality explosion: With thousands or millions of possible values, one-hot encoding creates extremely sparse, high-dimensional inputs.
- Memory inefficiency: Storing these sparse vectors wastes memory.
- No semantic relationship: One-hot encoding places all categories equidistant from each other, failing to capture any semantic relationships.
9.2.2 Neural Embeddings
Embedding layers address these limitations by mapping each category to a dense vector in a lower-dimensional space. These vectors are learned during model training, allowing the network to discover meaningful relationships between categories.
Formally, an embedding layer for a categorical variable with \(K\) possible values creates a lookup table \(E \in \mathbb{R}^{K \times d}\), where \(d\) is the embedding dimension. When processing an input with category \(i\), the layer outputs the embedding vector \(E_i \in \mathbb{R}^d\) (the \(i\)-th row of the embedding matrix).
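A minimal illustration of the lookup (sizes chosen arbitrarily): an embedding table with \(K = 5\) categories and \(d = 3\) returns the corresponding row of \(E\) for each index.

import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)  # E is 5 x 3
idx = torch.tensor([0, 2, 2])     # a batch of category indices
vectors = embedding(idx)          # rows 0, 2, 2 of E
print(vectors.shape)              # torch.Size([3, 3])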
9.2.3 Embedding Dimension
The embedding dimension \(d\) is a hyperparameter that balances expressiveness against complexity. While there’s no universal rule, common heuristics include:
- \(d \approx \sqrt[4]{K}\) where \(K\) is the number of categories
- \(d \in [8, 512]\) with smaller values for simpler relationships and larger values for more complex ones
- \(d \propto \log(K)\) for very large vocabularies
In practice, the optimal embedding dimension often requires experimentation.
9.2.4 Learning Meaningful Representations
Embeddings learn representations based on the prediction task, capturing aspects of the categories that are relevant to the model’s objective.
For example, in our taxi fare prediction task, location embeddings might encode:
- Neighborhood affluence (affecting tip amounts)
- Proximity to tourist attractions (affecting demand)
- Traffic patterns (affecting trip duration)
- Distance from city center (affecting base fares)
These learned representations often reveal intriguing semantic relationships. In natural language processing, word embeddings famously capture analogies like “king - man + woman ≈ queen”.
9.2.5 Implementation in PyTorch
PyTorch provides a dedicated nn.Embedding
layer:
class LocationEmbeddingModel(nn.Module):
    def __init__(self, num_locations, embed_dim, other_features_dim):
        super().__init__()
        self.location_embedding = nn.Embedding(num_locations, embed_dim)
        self.fc = nn.Linear(embed_dim + other_features_dim, 1)

    def forward(self, location_idx, other_features):
        # location_idx: tensor of location indices
        # other_features: tensor of other input features

        # Get embeddings for locations
        loc_embedded = self.location_embedding(location_idx)

        # Concatenate with other features
        combined = torch.cat([loc_embedded, other_features], dim=1)

        # Final prediction
        return self.fc(combined)
This model embeds location IDs into dense vectors, concatenates them with other features, and then makes predictions through a fully-connected layer.
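A usage sketch with made-up sizes (say, 265 location IDs, 8-dimensional embeddings, and 5 other features):

model = LocationEmbeddingModel(num_locations=265, embed_dim=8, other_features_dim=5)

location_idx = torch.randint(0, 265, (32,))   # batch of 32 location IDs
other_features = torch.randn(32, 5)

fare_pred = model(location_idx, other_features)
print(fare_pred.shape)   # torch.Size([32, 1])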
9.2.6 Pre-trained Embeddings
For some domains, pre-trained embeddings are available (e.g., GloVe or Word2Vec for text). These can provide a strong starting point, especially when training data is limited:
# Initialize with pre-trained embeddings
self.embedding = nn.Embedding.from_pretrained(
    torch.FloatTensor(pretrained_vectors),
    freeze=False  # Allow fine-tuning
)
Embeddings have become a cornerstone of modern neural networks, enabling efficient representation learning for categorical data and forming the foundation for many advanced architectures.
9.3 Multiple Block Architectures with Combined Output
Modern neural networks often employ multiple specialized blocks or modules, each designed to capture different aspects of the data or solve different sub-problems. These blocks’ outputs are then combined in diverse ways to produce the final prediction.
9.3.1 Block-Based Design
A block-based architecture separates the network into distinct components with specific roles. This modular approach offers several advantages:
- Specialization: Each block can focus on a specific aspect of the problem.
- Maintainability: Individual blocks can be modified or replaced without affecting the entire network.
- Interpretability: The function of each block can provide insights into the model’s decision-making process.
9.3.2 Conditional Processing Pattern
A particularly powerful pattern involves conditional processing, where one block’s output determines how another block’s output is used. Consider a time series prediction task with variables \(Y_t\) (target) and \(X_t\) (covariates), where we want to predict \(Y_{t+1}\).
We can design a two-block architecture:
- Persistence Block: Predicts the probability \(p\) that \(Y_{t+1} = Y_t\) (i.e., the value remains unchanged).
- Change Block: Predicts the value of \(Y_{t+1}\) conditional on it being different from \(Y_t\).
The final prediction combines these outputs based on the predicted persistence probability:
\[\hat{Y}_{t+1} = p \cdot Y_t + (1 - p) \cdot \text{Change}(X_t, Y_t)\]
where \(\text{Change}(X_t, Y_t)\) is the output of the change block.
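A minimal sketch of this combination, assuming modules persistence_block (ending in a sigmoid) and change_block like the ones defined in the next subsection:

def forward(self, x, y_t):
    p = self.persistence_block(x)           # probability that the value persists
    new_value = self.change_block(x)        # predicted value if a change occurs
    y_pred = p * y_t + (1 - p) * new_value  # probability-weighted combination from the equation
    return y_pred, p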
9.3.3 Implementation with Trainable Threshold
This approach can be refined by introducing a trainable threshold parameter \(\tau\) that determines when to use the change block’s prediction:
class DualBlockModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.persistence_block = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()  # Output is probability of persistence
        )

        self.change_block = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # Output is the predicted new value
        )

        # Trainable threshold parameter (initialized at 0.5)
        self.threshold = nn.Parameter(torch.tensor([0.5]))

    def forward(self, x, y_t):
        # Predict persistence probability
        p = self.persistence_block(x)

        # Predict new value (if change occurs)
        new_value = self.change_block(x)

        # Compare persistence probability with threshold
        use_persistence = (p > self.threshold).float()

        # Final prediction: weighted combination based on threshold comparison
        y_pred = use_persistence * y_t + (1 - use_persistence) * new_value

        return y_pred, p
In this implementation, the threshold \(\tau\) is declared as a learnable parameter, with the intent of letting the model decide when to trust the persistence prediction versus the change prediction. Note, however, that the hard comparison (p > self.threshold) is piecewise constant, so backpropagation alone provides no gradient signal for \(\tau\); replacing the hard cut-off with a smooth transition, as discussed in the variants below, is what makes the threshold genuinely trainable.
9.3.4 Variants and Extensions
This pattern can be extended in numerous ways:
- Multiple Thresholds: Different thresholds for different regions of the input space.
- Soft Transitions: Replacing the hard threshold with a smooth function, as sketched after this list.
- Ensemble Approach: Using multiple specialized blocks and a meta-learner to weight their outputs.
- Hierarchical Structure: Organizing blocks in a tree-like structure for hierarchical decision-making.
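A minimal sketch of such a soft transition, replacing the hard comparison in DualBlockModel.forward with a sigmoid gate (the temperature value is illustrative):

def forward(self, x, y_t, temperature=0.1):
    p = self.persistence_block(x)
    new_value = self.change_block(x)

    # Smooth gate: near 1 when p is well above the threshold, near 0 when well below.
    # Unlike the hard comparison, this passes gradients to both p and self.threshold.
    gate = torch.sigmoid((p - self.threshold) / temperature)

    y_pred = gate * y_t + (1 - gate) * new_value
    return y_pred, p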
9.3.5 Attention-Based Combination
An alternative to threshold-based combination is using attention mechanisms to dynamically weight the contributions of different blocks:
class AttentionCombinedModel(nn.Module):
    def __init__(self, input_dim, output_dim, num_blocks=3):
        super().__init__()
        # Create multiple processing blocks
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, 64),
                nn.ReLU(),
                nn.Linear(64, output_dim)
            )
            for _ in range(num_blocks)
        ])

        # Attention mechanism for weighting block outputs
        self.attention = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.Tanh(),
            nn.Linear(64, num_blocks),
            nn.Softmax(dim=1)  # Ensures weights sum to 1
        )

    def forward(self, x):
        # Get block outputs
        block_outputs = [block(x) for block in self.blocks]
        stacked_outputs = torch.stack(block_outputs, dim=1)

        # Generate attention weights
        weights = self.attention(x).unsqueeze(2)

        # Weighted combination of block outputs
        combined = torch.sum(stacked_outputs * weights, dim=1)

        return combined
This approach allows the model to adaptively focus on different processing pathways based on the input, effectively learning which block is most relevant for each sample.
9.3.6 Domain-Specific Block Design
The design of specialized blocks should be informed by domain knowledge. For economic time series, blocks might specialize in:
- Trend components
- Seasonal patterns
- Exogenous shocks
- Mean-reversion dynamics
By encoding such domain insights into the architecture, we create models that better align with the underlying data-generating processes.
9.4 Multiple Outcomes with Separate Losses
Many real-world problems require predicting multiple related outputs simultaneously. For instance, a model might need to predict both a worker’s future sector of employment and their expected wages. These multi-output scenarios present unique challenges and opportunities for neural network design.
9.4.1 Multi-Task Learning Framework
Multi-task learning trains a single model to perform multiple related tasks, leveraging shared representations to improve performance across all tasks. This approach has several benefits:
- Data Efficiency: Learning shared representations from multiple tasks can reduce the data needed for each task.
- Regularization: Additional tasks can act as regularizers for the primary task.
- Feature Importance: The model can leverage complementary information across tasks.
Formally, a multi-task model learns a mapping:
\[f: \mathbb{R}^d \rightarrow \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \times \cdots \times \mathbb{R}^{m_k}\]
where \(d\) is the input dimension and \(m_i\) is the output dimension for task \(i\).
9.4.2 Architecture Design
Multi-output architectures typically feature:
- Shared Layers: Initial layers that learn common representations
- Task-Specific Heads: Specialized output layers for each task
- Custom Loss Functions: Appropriate loss functions for each output type
Here’s a model that predicts both employment sector (categorical) and wages (continuous):
class WorkerPredictionModel(nn.Module):
    def __init__(self, input_dim, shared_dim, num_sectors):
        super().__init__()
        # Shared representation layers
        self.shared_network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, shared_dim),
            nn.ReLU()
        )

        # Sector prediction head
        self.sector_head = nn.Sequential(
            nn.Linear(shared_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_sectors)
            # No softmax here - will use CrossEntropyLoss which includes softmax
        )

        # Wage prediction head
        self.wage_head = nn.Sequential(
            nn.Linear(shared_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, x):
        shared_features = self.shared_network(x)
        sector_logits = self.sector_head(shared_features)
        wage_prediction = self.wage_head(shared_features)

        return sector_logits, wage_prediction
9.4.3 Combined Loss Function
Training multi-output models requires combining the losses from each task. The simplest approach is a weighted sum:
\[\mathcal{L}_{\text{total}} = \sum_{i=1}^k \lambda_i \mathcal{L}_i\]
where \(\mathcal{L}_i\) is the loss for task \(i\) and \(\lambda_i\) is its weight.
# Assumes the model is a LightningModule and that torch.nn.functional is imported as F
def training_step(self, batch, batch_idx):
    x, sector_true, wage_true = batch

    # Forward pass
    sector_logits, wage_pred = self(x)

    # Task-specific losses
    sector_loss = F.cross_entropy(sector_logits, sector_true)
    wage_loss = F.mse_loss(wage_pred, wage_true)

    # Combined loss with weighting
    total_loss = 0.7 * sector_loss + 0.3 * wage_loss

    self.log("sector_loss", sector_loss)
    self.log("wage_loss", wage_loss)
    self.log("total_loss", total_loss)

    return total_loss
9.4.4 Handling Loss Scale Disparities
Different tasks often produce losses with different scales, which can bias training toward tasks with larger loss values. Several techniques address this issue:
- Task Weighting: Manually assign weights to balance task contributions.
- Uncertainty Weighting: Learn the optimal weights during training based on task uncertainty.
- Gradient Normalization: Scale gradients to have similar magnitudes across tasks.
- Loss Normalization: Normalize each loss by its moving average, as sketched below.
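As an illustration of the last technique, here is a minimal sketch of a helper that keeps a running average of each task loss and rescales accordingly (the class and parameter names are hypothetical, not part of any library):

class LossNormalizer:
    """Rescale each task loss by its running average so tasks contribute comparably."""
    def __init__(self, num_tasks, momentum=0.99, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.running = [1.0] * num_tasks

    def __call__(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            # Update the running average with the detached scalar value of the loss
            self.running[i] = self.momentum * self.running[i] + \
                              (1 - self.momentum) * float(loss.detach())
            total = total + loss / (self.running[i] + self.eps)
        return total

In a training step, this could be used as, e.g., total_loss = self.loss_normalizer([sector_loss, wage_loss]).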
9.4.5 Adaptive Loss Weighting
An elegant approach to loss weighting involves learning task weights based on uncertainty, as proposed by Kendall et al. (2018):
class UncertaintyWeightedModel(nn.Module):
    def __init__(self, input_dim, shared_dim, num_sectors):
        super().__init__()
        # ... same architecture as WorkerPredictionModel ...

        # Learnable log variances for loss weighting
        self.log_var_sector = nn.Parameter(torch.zeros(1))
        self.log_var_wage = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # ... same forward pass ...
        return sector_logits, wage_prediction

    def training_step(self, batch, batch_idx):
        x, sector_true, wage_true = batch

        # Forward pass
        sector_logits, wage_pred = self(x)

        # Task-specific losses
        sector_loss = F.cross_entropy(sector_logits, sector_true)
        wage_loss = F.mse_loss(wage_pred, wage_true)

        # Uncertainty-weighted loss
        precision_sector = torch.exp(-self.log_var_sector)
        precision_wage = torch.exp(-self.log_var_wage)

        total_loss = (precision_sector * sector_loss + 0.5 * self.log_var_sector) + \
                     (precision_wage * wage_loss + 0.5 * self.log_var_wage)

        return total_loss
This approach automatically balances tasks by learning their relative importances.
9.4.6 Task Relationships
Understanding the relationships between tasks can inform better architecture design:
- Task Hierarchy: Some tasks naturally build upon others.
- Auxiliary Tasks: Secondary tasks that help learn useful representations.
- Competing Tasks: Tasks with conflicting objectives that require careful balancing.
- Sequential Dependencies: Tasks where one output influences another.
For our worker prediction example, we might leverage the relationship between sector and wages by making wage prediction conditionally dependent on the predicted sector:
def forward(self, x):
    shared_features = self.shared_network(x)

    # Predict sector first
    sector_logits = self.sector_head(shared_features)
    sector_probs = F.softmax(sector_logits, dim=1)

    # Condition wage prediction on sector probabilities
    combined_features = torch.cat([shared_features, sector_probs], dim=1)
    wage_prediction = self.wage_head(combined_features)

    return sector_logits, wage_prediction
This captures the real-world relationship where wages depend partly on employment sector. Note that the wage head must now be constructed with an input size of shared_dim + num_sectors so that it accepts the concatenated features.
9.5 Dropout
As neural networks grow deeper and wider, they become increasingly susceptible to overfitting—memorizing training data rather than learning generalizable patterns. Dropout is a simple yet remarkably effective regularization technique that addresses this problem.
9.5.1 The Dropout Mechanism
Dropout, introduced by Srivastava et al. (2014), works by randomly “dropping” (setting to zero) a fraction of neurons during each training iteration. Mathematically, for each neuron in a layer, we apply:
\[ y = \begin{cases} 0 & \text{with probability } p \\ \frac{x}{1-p} & \text{with probability } 1-p \end{cases} \]
where \(p\) is the dropout rate (typically 0.2-0.5), \(x\) is the neuron’s original output, and \(y\) is the final output. The scaling factor \(\frac{1}{1-p}\) ensures that the expected value of the output remains unchanged.
During inference (testing), no neurons are dropped. In the original formulation of dropout, outputs are instead scaled by \(1-p\) at test time to keep the expected value consistent; with the inverted dropout shown above, where the scaling by \(\frac{1}{1-p}\) happens during training (the convention used by most modern frameworks, including PyTorch), no scaling is needed at inference.
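A quick sketch of this behavior in PyTorch, using arbitrary values:

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10000)

drop.train()
y_train = drop(x)      # about half the entries are zeroed, survivors become 2.0
print(y_train.mean())  # close to 1.0, so the expected value is preserved

drop.eval()
y_eval = drop(x)       # inference: identity, no dropping and no rescaling
print(y_eval.mean())   # exactly 1.0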
9.5.2 Conceptual Understanding
Dropout can be understood from several perspectives:
Ensemble Interpretation: Each training iteration samples a different “thinned” network from the exponentially many possible subnetworks. This effectively trains an ensemble of networks that share parameters.
Co-adaptation Prevention: Neurons can’t rely on specific other neurons being present, forcing them to learn robust features useful in multiple contexts.
Noise Robustness: By injecting noise into the hidden activations, dropout encourages models to learn representations that are robust to perturbations.
9.5.3 Implementation in PyTorch
PyTorch provides a straightforward Dropout
module:
class DropoutMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_dim = input_dim

        for i, hidden_dim in enumerate(hidden_dims):
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            # Apply dropout to all hidden layers except the last one
            if i < len(hidden_dims) - 1:
                layers.append(nn.Dropout(dropout_rate))
            prev_dim = hidden_dim

        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
9.5.4 Dropout Variants
Several variants of dropout have been proposed to address specific challenges:
Spatial Dropout: Drops entire feature maps in convolutional networks, preserving spatial coherence.
DropConnect: Randomly drops weights rather than activations, providing a different form of regularization.
Concrete Dropout: Learns the optimal dropout rate for each layer during training.
Alpha Dropout: Designed for self-normalizing neural networks, preserves the mean and variance of inputs.
Variational Dropout: Uses Bayesian principles to determine which weights to drop.
9.5.5 Placement and Rate Selection
Effective dropout implementation requires careful consideration of:
Placement: Typically applied after activation functions in hidden layers, but not usually on input features or output layers.
Rate Selection: Higher dropout rates (e.g., 0.5) for layers with many parameters, lower rates (e.g., 0.2) for layers with fewer parameters.
Model Size Adjustment: When using dropout, increasing model size often improves performance by counteracting the regularization effect.
9.5.6 Interaction with Other Techniques
Dropout interacts with other aspects of neural network training:
Learning Rate: Models with dropout often benefit from higher learning rates.
Weight Decay: The combination of dropout and weight decay can provide complementary forms of regularization.
Batch Normalization: Batch normalization and dropout can work together but require careful ordering (more on this in the next section).
Data Augmentation: Both techniques inject noise but in different spaces (feature space vs. input space).
9.6 Batch Normalization
Training deep neural networks is challenging partly due to the phenomenon of internal covariate shift—the change in the distribution of network activations due to updates to preceding layers. Batch Normalization (BatchNorm), introduced by Ioffe and Szegedy (2015), addresses this issue by normalizing layer inputs, dramatically improving training stability and speed.
9.6.1 The BatchNorm Operation
For a mini-batch of activations \(\{x_1, x_2, ..., x_m\}\), BatchNorm performs the following transformation:
Calculate batch mean: \(\mu_B = \frac{1}{m} \sum_{i=1}^m x_i\)
Calculate batch variance: \(\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2\)
Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\) where \(\epsilon\) is a small constant for numerical stability
Scale and shift: \(y_i = \gamma \hat{x}_i + \beta\) where \(\gamma\) and \(\beta\) are learnable parameters
This process ensures that each feature has approximately zero mean and unit variance, while the learnable parameters \(\gamma\) and \(\beta\) allow the network to undo the normalization if needed.
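A small sketch of the effect (sizes arbitrary): with \(\gamma\) initialized to 1 and \(\beta\) to 0, the output of nn.BatchNorm1d in training mode has roughly zero mean and unit variance per feature.

import torch
from torch import nn

bn = nn.BatchNorm1d(num_features=3)
x = torch.randn(64, 3) * 5.0 + 10.0  # batch far from zero mean / unit variance

bn.train()
y = bn(x)
print(y.mean(dim=0))  # approximately [0, 0, 0]
print(y.std(dim=0))   # approximately [1, 1, 1]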
9.6.2 Benefits of Batch Normalization
Batch Normalization offers several key advantages:
Faster Convergence: By reducing internal covariate shift, BatchNorm allows higher learning rates and accelerates training.
Regularization Effect: The batch statistics introduce noise, providing a regularizing effect similar to dropout.
Reduced Sensitivity to Initialization: BatchNorm makes networks more robust to poor weight initialization.
Gradient Flow: Normalization helps prevent exploding or vanishing gradients in deep networks.
Smoother Optimization Landscape: BatchNorm smooths the optimization landscape, making it easier to navigate.
9.6.3 Implementation in PyTorch
Implementing BatchNorm in PyTorch is straightforward:
class BatchNormMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.BatchNorm1d(hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
Note that BatchNorm is applied after the linear transformation but before the activation function.
9.6.4 Inference Behavior
During inference, BatchNorm uses a moving average of mean and variance calculated during training, rather than batch statistics:
\[\hat{x} = \frac{x - E[x]}{\sqrt{Var[x] + \epsilon}}\]
PyTorch handles this transition automatically through the model.train()
and model.eval()
methods.
9.6.5 BatchNorm Variants
Several variants of BatchNorm exist for different scenarios:
Layer Normalization: Normalizes across features for each sample independently, useful for recurrent networks and when batch size is small.
Instance Normalization: Normalizes each channel in each sample independently, popular in style transfer.
Group Normalization: Divides channels into groups and normalizes within each group, providing a middle ground between Layer and Instance Normalization.
Weight Normalization: Normalizes the weights rather than the activations.
9.6.6 Placement Considerations
The placement of BatchNorm layers requires careful consideration:
Pre-activation vs. Post-activation: Most commonly used pre-activation, i.e., after the linear operation but before the non-linearity.
Interaction with Dropout: Typically, BatchNorm is applied before dropout to normalize the activations that dropout will randomly zero out.
First Layer: BatchNorm is sometimes omitted after the first layer when inputs are already normalized.
Last Layer: BatchNorm is typically not applied after the output layer to avoid constraining the output distribution.
9.6.7 BatchNorm and Residual Networks
BatchNorm is particularly effective when combined with residual connections (discussed in the next section). The standard placement in residual blocks is:
Conv → BatchNorm → ReLU → Conv → BatchNorm → Add → ReLU
This arrangement ensures that the normalized activations are passed through the non-linearity and that the residual addition occurs before the final activation.
9.7 Residual Networks
Training very deep neural networks has historically been challenging due to optimization difficulties, vanishing/exploding gradients, and degradation problems. Residual Networks (ResNets), introduced by He et al. (2016), address these issues through a simple yet profound architectural innovation: skip connections.
9.7.1 The Residual Block
The core innovation of ResNets is the residual block, which can be expressed as:
\[y = F(x, W) + x\]
where \(x\) is the input to the block, \(F(x, W)\) is a residual mapping (typically a sequence of layers with weights \(W\)), and \(y\) is the output. The direct addition of the input \(x\) creates a shortcut connection that bypasses the residual mapping.
Intuitively, instead of learning a direct mapping \(H(x)\) from input to output, the network learns the residual mapping \(F(x) = H(x) - x\). This approach makes it easier for the network to learn identity mappings when optimal, allowing the effective training of much deeper networks.
9.7.2 Gradient Flow in ResNets
The success of ResNets can be attributed to improved gradient flow during backpropagation. Consider a loss function \(L\) and its gradient with respect to the input \(x\):
\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left(\frac{\partial F(x)}{\partial x} + 1\right)\]
The constant term 1 ensures that gradients can flow directly from later layers to earlier ones, mitigating the vanishing gradient problem.
9.7.3 Basic Implementation in PyTorch
A basic residual block in PyTorch might look like:
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += residual  # Add the shortcut connection
        out = self.relu(out)

        return out
For the case of dense layers, which are more common in economic applications, a residual block might look like:
class DenseResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x

        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.fc2(out)
        out = self.bn2(out)

        out += residual
        out = self.relu(out)

        return out
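As a sketch of how such blocks might be composed into a full network for tabular data (layer sizes are illustrative, not a prescription):

class DenseResNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_blocks=3):
        super().__init__()
        # Project inputs to the working width, then stack residual blocks
        self.input_layer = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(
            *[DenseResidualBlock(hidden_dim) for _ in range(num_blocks)]
        )
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.input_layer(x)
        x = self.blocks(x)
        return self.output_layer(x)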
9.7.4 Dimension Matching
When the input and output dimensions differ, the shortcut connection must be adjusted. Common approaches include:
- Zero Padding: Pad the shortcut connection with zeros to match dimensions.
- Projection Shortcut: Use a linear transformation (usually 1×1 convolution or linear layer) to project the input to the desired dimension.
class DimensionChangingBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.bn1 = nn.BatchNorm1d(out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)
        self.bn2 = nn.BatchNorm1d(out_dim)

        # Projection shortcut for dimension matching
        self.shortcut = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim)
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.shortcut(x)

        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.fc2(out)
        out = self.bn2(out)

        out += residual
        out = self.relu(out)

        return out
9.7.5 ResNet Variations and Extensions
The basic ResNet architecture has inspired numerous variations:
Pre-activation ResNet: Moves the batch normalization and activation before the convolution, improving gradient flow.
Wide ResNet: Uses wider layers with fewer blocks, achieving similar performance with reduced depth.
ResNeXt: Introduces a cardinality dimension by using grouped convolutions within residual blocks.
DenseNet: Instead of simple addition, concatenates features from earlier layers, creating dense connections.
SE-ResNet: Incorporates Squeeze-and-Excitation blocks that adaptively recalibrate channel-wise feature responses.
9.7.6 Bottleneck Architecture
For very deep networks, a bottleneck architecture reduces computational complexity:
class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expansion=4):
        super().__init__()
        bottleneck_channels = out_channels // expansion

        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels,
                               kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, kernel_size=1)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut connection if dimensions change
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = self.shortcut(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += residual
        out = self.relu(out)

        return out
This bottleneck design uses 1×1 convolutions to reduce and then restore dimensions, with a 3×3 convolution in between, significantly reducing the number of parameters and computations.
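To make the savings concrete, a back-of-the-envelope comparison (ignoring biases, assuming 256 channels and a bottleneck width of 64, i.e. expansion 4):

# Two plain 3x3 convolutions at 256 channels
plain = 2 * (3 * 3 * 256 * 256)                           # 1,179,648 weights

# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (256 * 64) + (3 * 3 * 64 * 64) + (64 * 256)  # 69,632 weights

print(plain, bottleneck)  # roughly a 17x reduction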
9.7.7 Application Beyond Computer Vision
While ResNets were initially developed for image classification, the concept of residual connections has proven valuable across domains:
Natural Language Processing: Transformers use residual connections around self-attention and feed-forward layers.
Time Series Analysis: Residual connections can help models capture both short-term fluctuations and long-term trends.
Tabular Data: Dense residual blocks improve performance on structured data problems common in economics and finance.
Generative Models: Many state-of-the-art generative architectures incorporate residual connections.
The widespread adoption of residual connections across diverse domains underscores their fundamental importance in deep learning architecture design.
9.8 Conclusion
This chapter has explored a range of neural network architectures that form the foundation of modern deep learning. We began with the simple multilayer perceptron and progressively examined more sophisticated designs: embedding layers for categorical data, multi-block architectures with conditional processing, multi-output networks, and powerful regularization techniques including dropout and batch normalization. We concluded with residual networks, which have revolutionized the training of very deep neural networks.
Several key principles emerge from this exploration:
Problem-Specific Design: Neural network architecture should reflect the specific structure and requirements of the problem domain.
Modularity: Breaking networks into specialized components enhances interpretability and facilitates experimentation.
Information Flow: Many architectural innovations focus on improving the flow of information (and gradients) through the network.
Regularization: Techniques like dropout and batch normalization help prevent overfitting while enabling more efficient training.
Knowledge Transfer: While many architectures originated in computer vision or NLP, their principles transfer effectively to economic and financial applications.
In the next chapter, we will explore how to implement these architectural concepts efficiently using PyTorch Lightning, focusing on practical aspects like logging, checkpointing, early stopping, and hyperparameter search. These implementation details complement the architectural principles discussed here, enabling you to build and train sophisticated neural networks for real-world applications.