9 Lightning and Advanced NN Techniques

In previous chapters, we delved into the theoretical foundations of neural networks and explored their implementation with PyTorch. Now, we advance to more sophisticated methods for neural network training and architecture design. These techniques represent the modern practice of deep learning, enabling more efficient model development, improved performance, and increased reproducibility.

This chapter introduces PyTorch Lightning, a high-level framework that streamlines neural network implementation while maintaining PyTorch’s flexibility. We will also explore embedding layers for categorical data, examine regularization techniques such as dropout and batch normalization, and utilize TensorBoard for monitoring model training. Building on our NYC taxi fare prediction task, we will progressively enhance our models while observing how each improvement affects performance.

9.1 Loading and Cleaning the Data and Preparing DataLoaders

We begin with our familiar NYC taxi dataset. As in previous chapters, we’ll download the data and perform initial cleaning operations.

from pathlib import Path
import requests

local_path = Path('data/fhvhv_tripdata_2024-01.parquet')

url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2024-01.parquet'

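if not local_path.exists():
    local_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page
    local_path.write_bytes(response.content)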

Now, we import the necessary libraries. Note that we’re adding Lightning to our toolkit:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import torch
from torch import nn, optim, utils
import lightning as L

# Set random seed for reproducibility
L.pytorch.seed_everything(110, workers=True)

The L.pytorch.seed_everything() function ensures reproducibility by setting random seeds for all relevant libraries (PyTorch, NumPy, Python’s random) in a single call. The workers=True parameter ensures that data loading workers also maintain deterministic behavior.

Let’s load and prepare our dataset:

df = pd.read_parquet('data/fhvhv_tripdata_2024-01.parquet',
                     columns = ['hvfhs_license_num','request_datetime',
                                'trip_miles','trip_time','base_passenger_fare',
                                'driver_pay','PULocationID','DOLocationID']).sample(1_000_000)

# Clean the data by filtering outliers
df = df[(df['trip_miles']>=1) 
        & (df['trip_miles']<=20) 
        & (df['base_passenger_fare']<200)]

# Feature engineering
df['request_day_of_week'] = df['request_datetime'].dt.dayofweek
df['request_hour_of_day'] = df['request_datetime'].dt.hour
df['fare_per_mile'] = df['base_passenger_fare']/df['trip_miles']

Note that we’re sampling 1 million records from the dataset to make training more manageable. While this is still a substantial amount of data, it allows for faster experimentation without overly compromising model quality.

Next, we define our feature sets and split the data:

categorical_features = ['hvfhs_license_num', 'request_day_of_week', 'request_hour_of_day']
numerical_features = ['trip_miles', 'trip_time']

X = df[categorical_features+numerical_features+['PULocationID','DOLocationID']]
y = df['fare_per_mile']

# Create training, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y,
                                    test_size=0.1, random_state=100)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                    test_size=0.1, random_state=100)
# Free memory
del X_train_val
del y_train_val

Notice that we’ve created three distinct datasets:

  • Training set (81% of data): Used to update model parameters
  • Validation set (9% of data): Used to tune hyperparameters and monitor for overfitting
  • Test set (10% of data): Used for final evaluation only

This three-way split is a standard practice in deep learning. The validation set helps us monitor the model’s generalization ability during training and make informed decisions about hyperparameters, while the test set provides an unbiased final evaluation.
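The percentages listed above follow from applying two successive 10% holdouts. A quick arithmetic check in plain Python (no data required) makes this explicit:

```python
# Fractions produced by two successive train_test_split calls with test_size=0.1.
test_frac = 0.1                      # first split: held out for final testing
remaining = 1.0 - test_frac          # 90% left for training + validation
val_frac = remaining * 0.1           # second split: 10% of the remainder
train_frac = remaining - val_frac    # what is left for training

print(f"train={train_frac:.0%}, val={val_frac:.0%}, test={test_frac:.0%}")
# prints: train=81%, val=9%, test=10%
```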

For the preprocessing of our numerical and categorical features, we’ll use scikit-learn’s ColumnTransformer:

ct = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
])

ct.fit(X_train)

We’re treating the pickup and dropoff location IDs (PULocationID and DOLocationID) differently, as we’ll use embedding layers for them:

# Get min and max values for location IDs to determine embedding sizes
PUmax, PUmin = df['PULocationID'].max(), df['PULocationID'].min()
PUvals = PUmax - PUmin + 1
DOmax, DOmin = df['DOLocationID'].max(), df['DOLocationID'].min()
DOvals = DOmax - DOmin + 1

Now, let’s create a function to process our data and turn it into PyTorch’s TensorDataset:

def mk_dataset(X, y):
    # Transform features with ColumnTransformer
    X_trans = ct.transform(X).astype(np.float32)
    return utils.data.TensorDataset(
        torch.from_numpy(X_trans),
        torch.from_numpy(X['PULocationID'].values - PUmin),
        torch.from_numpy(X['DOLocationID'].values - DOmin),
        torch.from_numpy(y.values.astype(np.float32))
    )

This function:

  1. Transforms the input features using our preprocessor

  2. Shifts the location IDs to zero-based indices (required for embedding layers)

  3. Assembles a TensorDataset with four components: transformed features, pickup location indices, dropoff location indices, and target values
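The index shift in step 2 can be illustrated on a handful of made-up IDs. The values below are hypothetical and are not taken from the dataset; the variable names are chosen only for this sketch.

```python
# Hypothetical location IDs; the real values come from the PULocationID column.
ids = [4, 7, 265, 4, 132]

id_min, id_max = min(ids), max(ids)
num_rows = id_max - id_min + 1          # rows the embedding table needs (cf. PUvals)
indices = [i - id_min for i in ids]     # zero-based indices, as in mk_dataset

print(num_rows, indices)
# prints: 262 [0, 3, 261, 0, 128]
```

Every index lies between 0 and num_rows - 1, which is the range an embedding table with num_rows rows can look up. IDs inside the range that never occur in the data still receive a row; that row is simply never used.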

Finally, we create our DataLoaders:

# Create dataloaders
train_dl = utils.data.DataLoader(mk_dataset(X_train, y_train),
                               shuffle=True, batch_size=1024, num_workers=4)
val_dl = utils.data.DataLoader(mk_dataset(X_val, y_val),
                             batch_size=1024, num_workers=4)
test_dl = utils.data.DataLoader(mk_dataset(X_test, y_test),
                              batch_size=1024, num_workers=4)

Note these important DataLoader parameters:

  • shuffle=True for the training set ensures that each epoch sees the samples in a different order, so that mini-batches are not tied to the row order of the data
  • batch_size=1024 defines how many samples are processed in each iteration
  • num_workers=4 enables parallel data loading in four worker processes, which can significantly speed up training
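To get a feel for what batch_size implies, the number of iterations per epoch can be computed directly. The training-set size below is a hypothetical round number; the actual size depends on how many rows survive the outlier filter.

```python
import math

n_train = 810_000      # hypothetical training-set size (assumption, not from the data)
batch_size = 1024

# With the DataLoader default drop_last=False, the final partial batch is kept.
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch)
# prints: 792 16
```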

9.2 Why Use Lightning

Before we start implementing our models, let’s discuss why we’re using PyTorch Lightning in this chapter. Traditional PyTorch code, while flexible, often contains repetitive boilerplate code that can obscure the core model logic.

PyTorch Lightning is a lightweight wrapper around PyTorch that provides a structured organization for neural network code while preserving all of PyTorch’s flexibility. It offers several advantages:

  1. Code Organization: Lightning enforces a clean separation between research code (model architecture, loss functions) and engineering code (training loops, GPU handling, distributed training).

  2. Reduced Boilerplate: Common operations like moving tensors to the correct device, gradient calculation, and parameter updates are handled automatically.

  3. Built-in Features: Lightning provides out-of-the-box support for logging, checkpointing, early stopping, and other training utilities.

  4. Scalability: The same code can easily scale from a single CPU to multiple GPUs or even multiple machines with minimal changes.

  5. Reproducibility: Lightning makes it easier to ensure consistent results by standardizing the training process.

Lightning is particularly beneficial for economics research, where reproducibility and transparency are paramount. By separating the scientific components (model definition, hyperparameters) from the engineering details, Lightning makes it easier to communicate and replicate research findings.

The core abstraction in Lightning is the LightningModule class, which encapsulates:

  • The model architecture (__init__ method)
  • The forward pass (forward method)
  • Training, validation, and test logic (training_step, validation_step, test_step methods)
  • Optimization configuration (configure_optimizers method)

This organization makes the code more readable and maintainable while reducing potential sources of error.

9.3 Setting Up a Base Model Class in Lightning

Let’s create a base class that implements the common functionality needed by all our models:

class BasicModel(L.LightningModule):
    def __init__(self, lr=1e-4):
        super().__init__()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("training_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("test_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), self.lr)
        return optimizer

This BasicModel class:

  • Inherits from L.LightningModule
  • Accepts a learning rate parameter
  • Defines methods for training, validation, and testing steps
  • Uses mean squared error (MSE) as the loss function
  • Configures the Adam optimizer
  • Logs the loss for each phase (training, validation, test)

The common_step method is not implemented in the base class—it will be provided by each specific model subclass. This follows the template method pattern from software design: the base class defines the general algorithm structure, while subclasses implement the specific details.
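The template method pattern itself does not depend on Lightning. The following plain-Python sketch shows the idea with toy classes; Base and MeanPredictor are hypothetical names used only for illustration. Unlike BasicModel, the sketch adds an explicit stub that raises NotImplementedError, which is one common way to document the contract.

```python
class Base:
    """Defines the overall algorithm; leaves one step to subclasses."""

    def training_step(self, batch):
        y, y_hat = self.common_step(batch)   # supplied by the subclass
        # Mean squared error over the batch
        return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

    def common_step(self, batch):
        raise NotImplementedError("subclasses must implement common_step")


class MeanPredictor(Base):
    """Toy subclass: predicts the batch mean for every sample."""

    def common_step(self, batch):
        x, y = batch
        mean = sum(y) / len(y)
        return y, [mean] * len(y)


loss = MeanPredictor().training_step(([0, 1, 2], [1.0, 2.0, 3.0]))
print(round(loss, 3))
```

The base class never needs to know how predictions are produced; it only relies on common_step returning a pair of targets and predictions.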

9.4 Setting Up a Trivial Model in Lightning

Now, let’s implement a simple model that doesn’t use embedding layers but treats location IDs as regular numerical features:

class TrivialModel(BasicModel):
    def __init__(self, other_dim, lr=1e-4):
        # other_dim is the dimension of features other than PU and DO
        super().__init__(lr)
        self.model = nn.Sequential(
            nn.Linear(other_dim + 2, 512),  # +2 for the two location IDs
            nn.ReLU(),
            nn.Linear(512, 1)
        )
        
    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        # Concatenate all features
        X = torch.hstack((x, torch.unsqueeze(pu, 1), torch.unsqueeze(do, 1)))
        y_hat = self.model(X)
        y = y.view(-1, 1)  # Reshape target to match prediction shape
        return y, y_hat

The TrivialModel:

  • Treats location IDs as regular numerical features
  • Uses a simple network with one hidden layer of 512 units
  • Concatenates all input features into a single tensor
  • Implements the common_step method required by our base class

Before training, we need to determine the input dimension:

other_dim = train_dl.dataset[0][0].shape[0]

This extracts the dimension of the preprocessed features (excluding location IDs) from the first element of our training dataset.

9.5 Training and Testing

With our model defined, we can now train and evaluate it:

trivial_trainer = L.Trainer(deterministic=True, max_epochs=10)
trivial_model = TrivialModel(other_dim)
trivial_trainer.fit(trivial_model, train_dl, val_dl)
trivial_trainer.test(trivial_model, test_dl)

The L.Trainer class automates the training process. We specify:

  • deterministic=True to ensure reproducible results
  • max_epochs=10 to limit the number of training epochs

The fit method handles the entire training process, including:

  • Iterating through epochs
  • Processing batches
  • Computing gradients
  • Updating parameters
  • Validating after each epoch
  • Logging metrics

The test method evaluates the model on the test set after training is complete.

One of Lightning’s key advantages is that all these steps are handled automatically, with sensible defaults that can be customized when needed. This dramatically reduces the amount of code we need to write while ensuring best practices are followed.
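To make concrete what is being automated, here is a minimal sketch of the kind of loop one would otherwise write by hand in plain PyTorch. It uses a tiny synthetic regression problem (an assumption made for illustration) and is not Lightning's actual implementation, which also handles devices, logging, checkpointing, and more.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Synthetic data (hypothetical): y = 3x + noise
X = torch.randn(512, 1)
y = 3 * X + 0.1 * torch.randn(512, 1)
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Linear(1, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-1)
val_losses = []

for epoch in range(20):                              # iterate through epochs
    model.train()
    for i in range(0, len(X_train), 64):             # process mini-batches
        xb, yb = X_train[i:i + 64], y_train[i:i + 64]
        loss = nn.functional.mse_loss(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()                              # compute gradients
        optimizer.step()                             # update parameters
    model.eval()
    with torch.no_grad():                            # validate after each epoch
        val_loss = nn.functional.mse_loss(model(X_val), y_val)
    val_losses.append(val_loss.item())               # "log" the metric

print(f"first val loss {val_losses[0]:.3f}, last val loss {val_losses[-1]:.3f}")
```

Each commented line corresponds to one of the responsibilities listed above; with Lightning, only the loss computation remains in user code.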

9.6 Embedding Layer in Neural Networks

Our trivial model treats location IDs as regular numerical features, but this approach has limitations. Location IDs are categorical variables with no inherent ordering—treating them as numbers can introduce arbitrary relationships that don’t exist in reality.

A more appropriate approach for categorical variables with high cardinality (many possible values) is to use embedding layers. An embedding layer maps each categorical value to a dense vector in a lower-dimensional space, learning meaningful representations during training.

9.6.1 Understanding Embeddings

An embedding layer can be thought of as a lookup table: given a categorical ID, it returns a corresponding vector. The embedding vectors are learned parameters that the model optimizes during training.

Formally, for a categorical variable with \(K\) possible values, an embedding layer creates a matrix \(E \in \mathbb{R}^{K \times d}\) where \(d\) is the embedding dimension. When processing an input with category \(i\), the layer outputs the vector \(E_i \in \mathbb{R}^d\) (the \(i\)-th row of the embedding matrix).
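The lookup-table view can be checked numerically. The NumPy sketch below uses a small random matrix in place of learned parameters (the sizes K = 5 and d = 3 are arbitrary choices for illustration) and shows that selecting rows of \(E\) gives the same result as multiplying a one-hot encoding by \(E\).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 3                        # 5 categories, 3-dimensional embeddings
E = rng.normal(size=(K, d))        # embedding matrix (learned in a real model)

idx = np.array([2, 0, 2])          # a batch of category indices
looked_up = E[idx]                 # row lookup: shape (3, d)

one_hot = np.eye(K)[idx]           # equivalent one-hot encoding: shape (3, K)
via_matmul = one_hot @ E           # one-hot times E selects the same rows

print(looked_up.shape, np.allclose(looked_up, via_matmul))
# prints: (3, 3) True
```

An embedding layer is therefore equivalent to a linear layer applied to one-hot inputs, but it skips building the one-hot vectors and the multiplication by zeros.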

Embeddings offer several advantages:

  • They capture semantic relationships between categories
  • They reduce dimensionality compared to one-hot encoding
  • They learn useful representations based on the prediction task

In our taxi fare prediction task, embeddings can learn representations of locations that capture relevant factors like neighborhood affluence, distance from tourist attractions, or traffic patterns—all of which might affect fare prices.

9.6.2 Adding Embeddings to the Model

Let’s implement a model that uses embedding layers for the pickup and dropoff locations:

class Model3(BasicModel):
    def __init__(self, other_dim, embed_dim=16, lr=1e-4):
        super().__init__(lr)
        self.model = nn.Sequential(
            nn.Linear(other_dim + 2*embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1)
        )
        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)
        
    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))
        y_hat = self.model(X)
        y = y.view(-1, 1)
        return y, y_hat

This model:

  • Creates two embedding layers, one for pickup locations and one for dropoff locations
  • Sets the embedding dimension to 16 (a hyperparameter we can tune)
  • Extracts embedding vectors for each location ID
  • Concatenates these embedding vectors with the other features
  • Processes the combined features through a neural network

The size of an embedding (embed_dim) is typically a hyperparameter that needs tuning, but a common rule of thumb is to use \(\text{embed\_dim} \approx \sqrt[4]{n}\) where \(n\) is the number of possible categories. However, embedding dimensions between 8 and 512 are common, with smaller values for simpler relationships and larger values for more complex ones.
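To see what the fourth-root rule suggests in practice, the short calculation below evaluates it for a few category counts. The value 265 is an assumption standing in for the approximate number of taxi zones; the chapter computes the actual table sizes as PUvals and DOvals. The function name is hypothetical.

```python
def rule_of_thumb_dim(n_categories: int) -> float:
    """Fourth root of the number of categories."""
    return n_categories ** 0.25

for n in (10, 265, 10_000, 1_000_000):
    print(n, round(rule_of_thumb_dim(n), 2))
```

For a few hundred categories the rule gives a value of about 4, well below the 16 and 32 used in this chapter, which illustrates why the rule is best treated as a starting point for tuning rather than a prescription.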

Now let’s train this model:

model3 = Model3(other_dim, embed_dim=32)
trainer3 = L.Trainer(deterministic=True, max_epochs=15)
trainer3.fit(model3, train_dl, val_dl)
trainer3.test(model3, test_dl)

Notice that we’ve increased the embedding dimension to 32 and the number of epochs to 15, allowing the model more capacity and training time to learn meaningful representations.

9.7 Using TensorBoard for Monitoring

While the logging we’ve done so far provides basic metrics, we often need more detailed insights into the training process. TensorBoard is a visualization toolkit that provides graphical representations of model metrics over time.

Lightning integrates seamlessly with TensorBoard. To use it, we simply add a TensorBoard logger to our trainer:

from lightning.pytorch.loggers import TensorBoardLogger

logger = TensorBoardLogger("tb_logs", name="taxi_fare_model")
trainer = L.Trainer(logger=logger, max_epochs=15)

Lightning automatically logs metrics defined with the self.log() method. For more complex logging, we can use the log_dict() method to track multiple metrics at once.

To enhance our monitoring, we can add additional metrics to our model’s validation step:

def validation_step(self, batch, batch_idx):
    y, y_hat = self.common_step(batch, batch_idx)
    loss = nn.functional.mse_loss(y, y_hat)
    mae = torch.abs(y - y_hat).mean()
    self.log_dict({
        "val_loss": loss,
        "val_mae": mae,
    })
    return loss

This adds the mean absolute error (MAE) to our logged metrics, providing another perspective on model performance.
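A small made-up example shows why the two metrics can tell different stories. The numbers below are hypothetical; the last prediction is deliberately far off.

```python
y     = [10.0, 12.0, 8.0, 11.0]
y_hat = [11.0, 11.0, 9.0, 20.0]     # last prediction misses by 9

errors = [p - t for p, t in zip(y_hat, y)]
mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

print(mse, mae)
# prints: 21.0 3.0
```

The single large error dominates the MSE, while the MAE weights all errors linearly. Tracking both helps distinguish a model that is slightly off everywhere from one that is accurate on most samples but badly wrong on a few.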

TensorBoard offers several visualization tools:

  • Scalars: Plots of metrics over time
  • Distributions and Histograms: How logged tensors, such as parameter values or gradients, are distributed and how they change over time
  • Images: Visualizations of model inputs/outputs
  • Graphs: Computational graph visualization
  • Embeddings: Projections of high-dimensional embeddings

To view TensorBoard while training or after, run:

%load_ext tensorboard
%tensorboard --logdir tb_logs

This provides a real-time view of your model’s performance, helping you identify issues like overfitting (validation loss increases while training loss continues to decrease) or learning rate problems (erratic loss curves).

9.8 Dropout and Batch Normalization

As we build more complex models, we need to consider regularization techniques to prevent overfitting. Two powerful regularization methods in deep learning are dropout and batch normalization.

9.8.1 Dropout

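Dropout is a regularization technique that randomly sets a fraction \(p\) of a layer's inputs to zero during training, which helps prevent co-adaptation of neurons. At evaluation time, all neurons are active. In the original formulation, the weights are scaled down at test time to compensate; PyTorch's nn.Dropout instead scales the surviving activations up by \(1/(1-p)\) during training, so no adjustment is needed at evaluation time.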

The intuition behind dropout is similar to ensemble methods: by randomly dropping neurons, we’re effectively training a different sub-network at each training step. This forces the network to learn redundant representations and not rely too heavily on any single neuron.
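The difference between training and evaluation mode can be checked directly on nn.Dropout. The sketch below feeds a vector of ones through a dropout layer with p = 0.25 (the rate used later in this chapter); the input size of 10,000 is an arbitrary choice for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.25)
x = torch.ones(10_000)

drop.train()                       # training mode: units are dropped at random
out_train = drop(x)
kept = out_train[out_train != 0]   # surviving activations
frac_zeroed = (out_train == 0).float().mean().item()

drop.eval()                        # evaluation mode: dropout is the identity
out_eval = drop(x)

print(f"fraction zeroed: {frac_zeroed:.3f}")
print(f"value of kept units: {kept[0].item():.4f}, 1/(1-p) = {1 / 0.75:.4f}")
print(f"eval output unchanged: {torch.equal(out_eval, x)}")
```

Because the surviving units are scaled up during training, the expected activation is the same in both modes. Lightning switches between the two modes automatically when moving between training and validation or testing.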

9.8.2 Batch Normalization

Batch normalization addresses the problem of internal covariate shift, where the distribution of each layer’s inputs changes during training as the parameters of the previous layers change. It normalizes the layer inputs to have zero mean and unit variance across each mini-batch, then applies a learnable scale and shift.
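The normalization step described above can be written out in a few lines. The NumPy sketch below reproduces the training-mode computation for one mini-batch; it is an illustration of the formula rather than the nn.BatchNorm1d implementation, which additionally keeps running statistics for use at evaluation time. The batch shape and input distribution are arbitrary, and eps = 1e-5 matches the PyTorch default.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(1024, 4))   # one mini-batch, 4 features

eps = 1e-5                     # small constant for numerical stability
gamma = np.ones(4)             # learnable scale (initial value)
beta = np.zeros(4)             # learnable shift (initial value)

mu = x.mean(axis=0)            # per-feature batch mean
var = x.var(axis=0)            # per-feature (biased) batch variance
x_hat = (x - mu) / np.sqrt(var + eps)
out = gamma * x_hat + beta

print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With gamma at one and beta at zero, each feature leaves the layer with approximately zero mean and unit standard deviation, regardless of the scale of the input. During training, gamma and beta are updated like any other parameters.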

Batch normalization offers several benefits:

  • Faster training (allows higher learning rates)
  • Reduced sensitivity to initialization
  • Acts as a form of regularization
  • Improves gradient flow in deep networks

Let’s implement a model that incorporates both dropout and batch normalization:

class Model5(BasicModel):
    def __init__(self, other_dim, embed_dim=16, inner_dim=64, lr=1e-4):
        super().__init__(lr)
        n = inner_dim
        input_dim = other_dim + 2*embed_dim
        
        self.block1 = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.ReLU(),
            nn.Linear(n, input_dim)
        )
        
        self.block2 = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.ReLU(),
            nn.Linear(2*n, 1)
        )
        
        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)
        
    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))
        out1 = self.block1(X)
        y_hat = self.block2(X + out1)  # Residual connection
        y = y.view(-1, 1)
        return y, y_hat

This model includes:

  • Two embedding layers for location IDs
  • Batch normalization after each hidden linear layer (the final linear layer of each block is left unnormalized)
  • Dropout with a rate of 0.25 after several layers
  • A residual connection (X + out1) that helps gradient flow in deep networks

The model structure is also more complex, with two “blocks” of layers. The first block processes the input and produces an output with the same dimension, which is then added to the original input (a residual connection). The second block further processes this combined representation to produce the final prediction.

Let’s train this model:

model5 = Model5(other_dim, embed_dim=32)
trainer5 = L.Trainer(deterministic=True, max_epochs=15)
trainer5.fit(model5, train_dl, val_dl)
trainer5.test(model5, test_dl)

In practice, you might want to experiment with different dropout rates and batch normalization configurations, for example by changing the rate passed to nn.Dropout or by removing individual nn.BatchNorm1d layers.

9.9 A More Complicated Model

Building upon our knowledge of embeddings, batch normalization, and dropout, let’s implement an even more sophisticated model that incorporates additional architectural elements:

class AdvancedModel(BasicModel):
    def __init__(self, other_dim, embed_dim=32, inner_dim=128, lr=1e-4):
        super().__init__(lr)
        n = inner_dim
        input_dim = other_dim + 2*embed_dim
        
        # Feature extraction network
        self.feature_net = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2)
        )
        
        # Prediction heads
        self.direct_head = nn.Sequential(
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.LeakyReLU(0.1),
            nn.Linear(n, 1)
        )
        
        self.residual_head = nn.Sequential(
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2),
            nn.Linear(n, n//2),
            nn.BatchNorm1d(n//2),
            nn.LeakyReLU(0.1),
            nn.Linear(n//2, 1)
        )
        
        # Location embeddings
        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)
        
        # Store constructor arguments in self.hparams (also saved in checkpoints)
        self.save_hyperparameters()
        
    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))
        
        # Extract features
        features = self.feature_net(X)
        
        # Get predictions from both heads
        pred1 = self.direct_head(features)
        pred2 = self.residual_head(features)
        
        # Combine predictions
        y_hat = (pred1 + pred2) / 2
        y = y.view(-1, 1)
        
        return y, y_hat
    
    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), lr=self.lr, weight_decay=1e-4)
        # The verbose argument is omitted: it is deprecated in recent PyTorch releases
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=3
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",
                "frequency": 1
            }
        }

This advanced model incorporates several sophisticated techniques:

  1. LeakyReLU Activation: Instead of standard ReLU, we use LeakyReLU, which allows a small gradient when the unit is not active, helping to prevent “dying ReLU” problems.

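  2. Ensemble-like Architecture: The model uses two “heads” on top of a shared feature extractor. Each head produces its own prediction, and the final output is their average. This ensemble-like approach can improve robustness, although the heads are not fully independent models because they share the feature network and are trained jointly.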

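  3. AdamW Optimizer: We use AdamW, which applies weight decay directly to the parameters (decoupled weight decay) rather than folding it into the gradient as an L2 penalty, as the weight_decay option of Adam does.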

  4. Learning Rate Scheduler: The ReduceLROnPlateau scheduler reduces the learning rate when the validation loss plateaus, allowing the model to make more refined updates as training progresses.

  5. Hyperparameter Tracking: The save_hyperparameters() method stores all constructor arguments in self.hparams, saves them with checkpoints, and passes them to the logger, making it easier to track experiments.

This model architecture demonstrates how deep learning allows us to combine multiple techniques in flexible ways. The specific design choices reflect common practices in modern neural network architecture, though there’s still considerable art and experimentation involved in finding the best configuration for a particular problem.
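The behavior of the ReduceLROnPlateau scheduler can be observed in isolation. The sketch below drives it with a hypothetical sequence of validation losses and a dummy parameter (both assumptions made for illustration), using the same factor and patience as in configure_optimizers. In the model, Lightning makes the scheduler.step call itself, using the metric named under "monitor".

```python
import torch
from torch import optim

param = torch.nn.Parameter(torch.zeros(1))            # dummy parameter
optimizer = optim.AdamW([param], lr=1e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

# Hypothetical validation losses: one improvement, then a plateau.
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
lrs = []
for loss in val_losses:
    scheduler.step(loss)                               # report the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The learning rate stays at its initial value while the loss improves and is halved once the loss has failed to improve for more than patience consecutive checks.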

Let’s train this advanced model:

advanced_model = AdvancedModel(other_dim, embed_dim=32, inner_dim=128)
trainer = L.Trainer(
    logger=TensorBoardLogger("tb_logs", name="advanced_model"),
    max_epochs=20,
    callbacks=[
        L.pytorch.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, mode="min"
        )
    ]
)
trainer.fit(advanced_model, train_dl, val_dl)
trainer.test(advanced_model, test_dl)

We’ve added an early stopping callback, which will halt training if the validation loss doesn’t improve for 5 consecutive epochs. This helps prevent overfitting and saves computational resources.

9.10 Conclusion

In this chapter, we’ve advanced our neural network toolkit considerably. We’ve moved from basic PyTorch implementations to using Lightning for cleaner, more organized code. We’ve explored embedding layers for categorical variables, leveraged TensorBoard for monitoring, and implemented regularization techniques such as dropout and batch normalization.

These tools and techniques form the foundation of modern deep learning practice. While the examples in this chapter focused on tabular data (taxi fares), the same principles apply to more complex domains like computer vision, natural language processing, and time series analysis.

Key takeaways from this chapter include:

  • PyTorch Lightning provides structure and reduces boilerplate while maintaining flexibility
  • Embedding layers offer an effective way to handle categorical variables with high cardinality
  • Monitoring training with TensorBoard provides insights that help diagnose and improve models
  • Regularization techniques like dropout and batch normalization are essential for preventing overfitting
  • Modern architectures often combine multiple techniques and require experimentation

As you continue your deep learning journey, remember that there’s no one-size-fits-all approach. The best model for a given problem depends on the data, constraints, and specific requirements. The techniques covered in this chapter provide a solid foundation, but successful application requires experimentation, intuition, and domain knowledge.