9 Lightning and Advanced NN Techniques
In previous chapters, we delved into the theoretical foundations of neural networks and explored their implementation with PyTorch. Now, we advance to more sophisticated methods for neural network training and architecture design. These techniques represent the modern practice of deep learning, enabling more efficient model development, improved performance, and increased reproducibility.
This chapter introduces PyTorch Lightning, a high-level framework that streamlines neural network implementation while maintaining PyTorch’s flexibility. We will also explore embedding layers for categorical data, examine regularization techniques such as dropout and batch normalization, and utilize TensorBoard for monitoring model training. Building on our NYC taxi fare prediction task, we will progressively enhance our models while observing how each improvement affects performance.
9.1 Loading and Cleaning the Data and Preparing DataLoaders
We begin with our familiar NYC taxi dataset. As in previous chapters, we’ll download the data and perform initial cleaning operations.
from pathlib import Path
import requests
local_path = Path('data/fhvhv_tripdata_2024-01.parquet')
url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2024-01.parquet'

if not local_path.exists():
    local_path.parent.mkdir(exist_ok=True)
    local_path.write_bytes(requests.get(url).content)
Now, we import the necessary libraries. Note that we’re adding Lightning to our toolkit:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import torch
from torch import nn, optim, utils
import lightning as L
# Set random seed for reproducibility
L.pytorch.seed_everything(110, workers=True)
The L.pytorch.seed_everything()
function ensures reproducibility by setting random seeds for all relevant libraries (PyTorch, NumPy, Python’s random) in a single call. The workers=True
parameter ensures that data loading workers also maintain deterministic behavior.
Let’s load and prepare our dataset:
df = pd.read_parquet('data/fhvhv_tripdata_2024-01.parquet',
                     columns=['hvfhs_license_num', 'request_datetime',
                              'trip_miles', 'trip_time', 'base_passenger_fare',
                              'driver_pay', 'PULocationID', 'DOLocationID']).sample(1_000_000)

# Clean the data by filtering outliers
df = df[(df['trip_miles'] >= 1)
        & (df['trip_miles'] <= 20)
        & (df['base_passenger_fare'] < 200)]

# Feature engineering
df['request_day_of_week'] = df['request_datetime'].dt.dayofweek
df['request_hour_of_day'] = df['request_datetime'].dt.hour
df['fare_per_mile'] = df['base_passenger_fare'] / df['trip_miles']
Note that we’re sampling 1 million records from the dataset to make training more manageable. While this is still a substantial amount of data, it allows for faster experimentation without overly compromising model quality.
Next, we define our feature sets and split the data:
categorical_features = ['hvfhs_license_num', 'request_day_of_week', 'request_hour_of_day']
numerical_features = ['trip_miles', 'trip_time']

X = df[categorical_features + numerical_features + ['PULocationID', 'DOLocationID']]
y = df['fare_per_mile']

# Create training, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y,
                                                            test_size=0.1, random_state=100)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                  test_size=0.1, random_state=100)

# Free memory
del X_train_val
del y_train_val
Notice that we’ve created three distinct datasets:
- Training set (81% of data): Used to update model parameters
- Validation set (9% of data): Used to tune hyperparameters and monitor for overfitting
- Test set (10% of data): Used for final evaluation only
This three-way split is a standard practice in deep learning. The validation set helps us monitor the model’s generalization ability during training and make informed decisions about hyperparameters, while the test set provides an unbiased final evaluation.
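As a quick sanity check, we can confirm these proportions directly (a minimal sketch, assuming the split code above has already run):

# Verify the split proportions: roughly 81% / 9% / 10%
n_total = len(X)
print(f"train: {len(X_train)/n_total:.2%}, "
      f"val: {len(X_val)/n_total:.2%}, "
      f"test: {len(X_test)/n_total:.2%}")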
For the preprocessing of our numerical and categorical features, we'll use scikit-learn's ColumnTransformer:
ct = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
])
ct.fit(X_train)
We’re treating the pickup and dropoff location IDs (PULocationID
and DOLocationID
) differently, as we’ll use embedding layers for them:
# Get min and max values for location IDs to determine embedding sizes
PUmax, PUmin = df['PULocationID'].max(), df['PULocationID'].min()
PUvals = PUmax - PUmin + 1
DOmax, DOmin = df['DOLocationID'].max(), df['DOLocationID'].min()
DOvals = DOmax - DOmin + 1
Now, let's create a function to process our data and turn it into PyTorch's TensorDataset:
def mk_dataset(X, y):
    # Transform features with ColumnTransformer
    X_trans = ct.transform(X).astype(np.float32)
    return utils.data.TensorDataset(
        torch.from_numpy(X_trans),
        torch.from_numpy(X['PULocationID'].values - PUmin),
        torch.from_numpy(X['DOLocationID'].values - DOmin),
        torch.from_numpy(y.values.astype(np.float32))
    )
This function:
1. Transforms the input features using our preprocessor
2. Converts the location IDs to zero-based indices (required for embedding layers)
3. Assembles a TensorDataset with four components: transformed features, pickup location indices, dropoff location indices, and target values
Finally, we create our DataLoaders:
# Create dataloaders
train_dl = utils.data.DataLoader(mk_dataset(X_train, y_train),
                                 shuffle=True, batch_size=1024, num_workers=4)
val_dl = utils.data.DataLoader(mk_dataset(X_val, y_val),
                               batch_size=1024, num_workers=4)
test_dl = utils.data.DataLoader(mk_dataset(X_test, y_test),
                                batch_size=1024, num_workers=4)
Note these important DataLoader parameters (a quick batch inspection is sketched after this list):
- shuffle=True for the training set ensures that each epoch sees a different order of samples, which helps prevent overfitting
- batch_size=1024 defines how many samples are processed in each iteration
- num_workers=4 enables parallel data loading, which can significantly speed up training
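Here is the quick batch inspection mentioned above (a minimal sketch, assuming the DataLoaders have been created as shown). Pulling a single batch confirms that it contains the four tensors assembled by mk_dataset:

# Grab one training batch and check its shapes
x, pu, do, y = next(iter(train_dl))
print(x.shape, pu.shape, do.shape, y.shape)
# e.g. torch.Size([1024, num_features]), torch.Size([1024]), torch.Size([1024]), torch.Size([1024])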
9.2 Why Use Lightning
Before we start implementing our models, let’s discuss why we’re using PyTorch Lightning in this chapter. Traditional PyTorch code, while flexible, often contains repetitive boilerplate code that can obscure the core model logic.
PyTorch Lightning is a lightweight wrapper around PyTorch that provides a structured organization for neural network code while preserving all of PyTorch’s flexibility. It offers several advantages:
Code Organization: Lightning enforces a clean separation between research code (model architecture, loss functions) and engineering code (training loops, GPU handling, distributed training).
Reduced Boilerplate: Common operations like moving tensors to the correct device, gradient calculation, and parameter updates are handled automatically.
Built-in Features: Lightning provides out-of-the-box support for logging, checkpointing, early stopping, and other training utilities.
Scalability: The same code can easily scale from a single CPU to multiple GPUs or even multiple machines with minimal changes.
Reproducibility: Lightning makes it easier to ensure consistent results by standardizing the training process.
Lightning is particularly beneficial for economics research, where reproducibility and transparency are paramount. By separating the scientific components (model definition, hyperparameters) from the engineering details, Lightning makes it easier to communicate and replicate research findings.
The core abstraction in Lightning is the LightningModule class, which encapsulates:
- The model architecture (the __init__ method)
- The forward pass (the forward method)
- Training, validation, and test logic (the training_step, validation_step, and test_step methods)
- Optimization configuration (the configure_optimizers method)
This organization makes the code more readable and maintainable while reducing potential sources of error.
9.3 Setting Up a Base Model Class in Lightning
Let’s create a base class that implements the common functionality needed by all our models:
class BasicModel(L.LightningModule):
    def __init__(self, lr=1e-4):
        super().__init__()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("training_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("test_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        y, y_hat = self.common_step(batch, batch_idx)
        loss = nn.functional.mse_loss(y, y_hat)
        self.log("val_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), self.lr)
        return optimizer
This BasicModel class:
- Inherits from L.LightningModule
- Accepts a learning rate parameter
- Defines methods for training, validation, and testing steps
- Uses mean squared error (MSE) as the loss function
- Configures the Adam optimizer
- Logs the loss for each phase (training, validation, test)
The common_step
method is not implemented in the base class—it will be provided by each specific model subclass. This follows the template method pattern from software design: the base class defines the general algorithm structure, while subclasses implement the specific details.
9.4 Setting Up a Trivial Model in Lightning
Now, let’s implement a simple model that doesn’t use embedding layers but treats location IDs as regular numerical features:
class TrivialModel(BasicModel):
    def __init__(self, other_dim, lr=1e-4):
        # other_dim is the dimension of features other than PU and DO
        super().__init__(lr)
        self.model = nn.Sequential(
            nn.Linear(other_dim + 2, 512),  # +2 for the two location IDs
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        # Concatenate all features
        X = torch.hstack((x, torch.unsqueeze(pu, 1), torch.unsqueeze(do, 1)))
        y_hat = self.model(X)
        y = y.view(-1, 1)  # Reshape target to match prediction shape
        return y, y_hat
The TrivialModel:
- Treats location IDs as regular numerical features
- Uses a simple network with one hidden layer of 512 units
- Concatenates all input features into a single tensor
- Implements the common_step method required by our base class
Before training, we need to determine the input dimension:
other_dim = train_dl.dataset[0][0].shape[0]
This extracts the dimension of the preprocessed features (excluding location IDs) from the first element of our training dataset.
9.5 Training and Testing
With our model defined, we can now train and evaluate it:
trivial_trainer = L.Trainer(deterministic=True, max_epochs=10)
trivial_model = TrivialModel(other_dim)

trivial_trainer.fit(trivial_model, train_dl, val_dl)
trivial_trainer.test(trivial_model, test_dl)
The L.Trainer class automates the training process. We specify:
- deterministic=True to ensure reproducible results
- max_epochs=10 to limit the number of training epochs
The fit
method handles the entire training process, including:
- Iterating through epochs
- Processing batches
- Computing gradients
- Updating parameters
- Validating after each epoch
- Logging metrics
The test
method evaluates the model on the test set after training is complete.
One of Lightning’s key advantages is that all these steps are handled automatically, with sensible defaults that can be customized when needed. This dramatically reduces the amount of code we need to write while ensuring best practices are followed.
9.6 Embedding Layer in Neural Networks
Our trivial model treats location IDs as regular numerical features, but this approach has limitations. Location IDs are categorical variables with no inherent ordering—treating them as numbers can introduce arbitrary relationships that don’t exist in reality.
A more appropriate approach for categorical variables with high cardinality (many possible values) is to use embedding layers. An embedding layer maps each categorical value to a dense vector in a lower-dimensional space, learning meaningful representations during training.
9.6.1 Understanding Embeddings
An embedding layer can be thought of as a lookup table: given a categorical ID, it returns a corresponding vector. The embedding vectors are learned parameters that the model optimizes during training.
Formally, for a categorical variable with \(K\) possible values, an embedding layer creates a matrix \(E \in \mathbb{R}^{K \times d}\) where \(d\) is the embedding dimension. When processing an input with category \(i\), the layer outputs the vector \(E_i \in \mathbb{R}^d\) (the \(i\)-th row of the embedding matrix).
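To make the lookup-table picture concrete, here is a minimal sketch with toy values of K and d (unrelated to our taxi data), using the torch and nn imports from earlier:

# A toy embedding layer: K=5 categories, d=3 dimensions
K, d = 5, 3
embed = nn.Embedding(K, d)           # weight matrix E of shape (K, d)
ids = torch.tensor([0, 3, 3])        # zero-based category indices
vectors = embed(ids)                 # rows E_0, E_3, E_3
print(vectors.shape)                 # torch.Size([3, 3])
print(torch.equal(vectors[1], embed.weight[3]))  # True: a pure (learnable) row lookup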
Embeddings offer several advantages:
- They capture semantic relationships between categories
- They reduce dimensionality compared to one-hot encoding
- They learn useful representations based on the prediction task
In our taxi fare prediction task, embeddings can learn representations of locations that capture relevant factors like neighborhood affluence, distance from tourist attractions, or traffic patterns—all of which might affect fare prices.
9.6.2 Adding Embeddings to the Model
Let’s implement a model that uses embedding layers for the pickup and dropoff locations:
class Model3(BasicModel):
    def __init__(self, other_dim, embed_dim=16, lr=1e-4):
        super().__init__(lr)
        self.model = nn.Sequential(
            nn.Linear(other_dim + 2*embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1)
        )
        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)

    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))
        y_hat = self.model(X)
        y = y.view(-1, 1)
        return y, y_hat
This model:
- Creates two embedding layers, one for pickup locations and one for dropoff locations
- Sets the embedding dimension to 16 (a hyperparameter we can tune)
- Extracts embedding vectors for each location ID
- Concatenates these embedding vectors with the other features
- Processes the combined features through a neural network
The size of an embedding (embed_dim) is typically a hyperparameter that needs tuning, but a common rule of thumb is to use \(\text{embed\_dim} \approx \sqrt[4]{n}\) where \(n\) is the number of possible categories. However, embedding dimensions between 8 and 512 are common, with smaller values for simpler relationships and larger values for more complex ones.
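As a rough illustration (a minimal sketch using the PUvals count computed earlier; NYC taxi zone IDs run up to roughly 265), the fourth-root heuristic suggests a much smaller dimension than the ones we use below, which is why it is best treated as a starting point rather than a rule:

# Fourth-root heuristic for the pickup-location embedding dimension
suggested_dim = int(round(PUvals ** 0.25))
print(PUvals, suggested_dim)   # about 265 categories -> a suggested dimension of roughly 4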
Now let’s train this model:
model3 = Model3(other_dim, embed_dim=32)
trainer3 = L.Trainer(deterministic=True, max_epochs=15)

trainer3.fit(model3, train_dl, val_dl)
trainer3.test(model3, test_dl)
Notice that we’ve increased the embedding dimension to 32 and the number of epochs to 15, allowing the model more capacity and training time to learn meaningful representations.
9.7 Using TensorBoard for Monitoring
While the logging we’ve done so far provides basic metrics, we often need more detailed insights into the training process. TensorBoard is a visualization toolkit that provides graphical representations of model metrics over time.
Lightning integrates seamlessly with TensorBoard. To use it, we simply add a TensorBoard logger to our trainer:
from lightning.pytorch.loggers import TensorBoardLogger
logger = TensorBoardLogger("tb_logs", name="taxi_fare_model")
trainer = L.Trainer(logger=logger, max_epochs=15)
Lightning automatically logs metrics defined with the self.log()
method. For more complex logging, we can use the log_dict()
method to track multiple metrics at once.
To enhance our monitoring, we can add additional metrics to our model’s validation step:
def validation_step(self, batch, batch_idx):
    y, y_hat = self.common_step(batch, batch_idx)
    loss = nn.functional.mse_loss(y, y_hat)
    mae = torch.abs(y - y_hat).mean()
    self.log_dict({
        "val_loss": loss,
        "val_mae": mae,
    })
    return loss
This adds the mean absolute error (MAE) to our logged metrics, providing another perspective on model performance.
TensorBoard offers several visualization tools:
- Scalars: Plots of metrics over time
- Distributions: Histograms of parameter values
- Gradients: Statistics on parameter gradients
- Images: Visualizations of model inputs/outputs
- Graphs: Computational graph visualization
- Embeddings: Projections of high-dimensional embeddings
To view TensorBoard while training or after, run:
%load_ext tensorboard
%tensorboard --logdir tb_logs
This provides a real-time view of your model’s performance, helping you identify issues like overfitting (validation loss increases while training loss continues to decrease) or learning rate problems (erratic loss curves).
9.8 Dropout and Batch Normalization
As we build more complex models, we need to consider regularization techniques to prevent overfitting. Two powerful regularization methods in deep learning are dropout and batch normalization.
9.8.1 Dropout
Dropout is a regularization technique that randomly sets a fraction of input units to zero during training, which helps prevent co-adaptation of neurons. During testing, all neurons are active but with scaled-down weights.
The intuition behind dropout is similar to ensemble methods: by randomly dropping neurons, we’re effectively training a different sub-network at each training step. This forces the network to learn redundant representations and not rely too heavily on any single neuron.
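A minimal standalone sketch makes this behavior visible. Note that PyTorch implements "inverted" dropout: rather than scaling weights down at test time, the surviving activations are scaled up by 1/(1-p) during training, so evaluation mode is simply the identity:

drop = nn.Dropout(p=0.25)
x = torch.ones(1, 8)

drop.train()        # training mode: dropout active
print(drop(x))      # roughly a quarter of entries zeroed, the rest scaled to 1/(1-0.25)

drop.eval()         # evaluation mode: dropout disabled
print(drop(x))      # identity: all ones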
9.8.2 Batch Normalization
Batch normalization addresses the problem of internal covariate shift, where the distribution of each layer’s inputs changes during training as the parameters of the previous layers change. It normalizes the layer inputs to have zero mean and unit variance across each mini-batch, then applies a learnable scale and shift.
Batch normalization offers several benefits:
- Faster training (allows higher learning rates)
- Reduced sensitivity to initialization
- Acts as a form of regularization
- Improves gradient flow in deep networks
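The following minimal sketch (toy tensors, not our taxi features) shows the normalize-then-scale-and-shift behavior of nn.BatchNorm1d in training mode:

bn = nn.BatchNorm1d(3)                   # one learnable scale/shift pair per feature
x = torch.randn(16, 3) * 5 + 10          # a mini-batch with large mean and variance
out = bn(x)
print(out.mean(dim=0), out.std(dim=0))   # approximately zero mean and unit variance
print(bn.weight.shape, bn.bias.shape)    # the learnable scale (gamma) and shift (beta)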
Let’s implement a model that incorporates both dropout and batch normalization:
class Model5(BasicModel):
    def __init__(self, other_dim, embed_dim=16, inner_dim=64, lr=1e-4):
        super().__init__(lr)
        n = inner_dim
        input_dim = other_dim + 2*embed_dim

        self.block1 = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.ReLU(),
            nn.Linear(n, input_dim)
        )

        self.block2 = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.ReLU(),
            nn.Linear(2*n, 1)
        )

        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)

    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))
        out1 = self.block1(X)
        y_hat = self.block2(X + out1)  # Residual connection
        y = y.view(-1, 1)
        return y, y_hat
This model includes:
- Two embedding layers for location IDs
- Batch normalization after each linear layer
- Dropout with a rate of 0.25 after several layers
- A residual connection (X + out1) that helps gradient flow in deep networks
The model structure is also more complex, with two “blocks” of layers. The first block processes the input and produces an output with the same dimension, which is then added to the original input (a residual connection). The second block further processes this combined representation to produce the final prediction.
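The key constraint behind the residual connection is that block1 must return an output with the same dimension as its input so that the element-wise sum is well defined. A minimal sketch with toy dimensions (unrelated to the model above):

input_dim = 8
block = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, input_dim))
X = torch.randn(4, input_dim)
out = X + block(X)     # residual connection: requires block(X) to have the same shape as X
print(out.shape)       # torch.Size([4, 8])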
Let’s train this model:
model5 = Model5(other_dim, embed_dim=32)
trainer5 = L.Trainer(deterministic=True, max_epochs=15)

trainer5.fit(model5, train_dl, val_dl)
trainer5.test(model5, test_dl)
In practice, you might want to experiment with different dropout rates and batch normalization placements, for example lowering the dropout rate to 0.1 or applying dropout only after the wider layers, and compare the resulting validation losses.
9.9 A More Complicated Model
Building upon our knowledge of embeddings, batch normalization, and dropout, let’s implement an even more sophisticated model that incorporates additional architectural elements:
class AdvancedModel(BasicModel):
    def __init__(self, other_dim, embed_dim=32, inner_dim=128, lr=1e-4):
        super().__init__(lr)
        n = inner_dim
        input_dim = other_dim + 2*embed_dim

        # Feature extraction network
        self.feature_net = nn.Sequential(
            nn.Linear(input_dim, 4*n),
            nn.BatchNorm1d(4*n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2),
            nn.Linear(4*n, 2*n),
            nn.BatchNorm1d(2*n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2)
        )

        # Prediction heads
        self.direct_head = nn.Sequential(
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.LeakyReLU(0.1),
            nn.Linear(n, 1)
        )

        self.residual_head = nn.Sequential(
            nn.Linear(2*n, n),
            nn.BatchNorm1d(n),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.2),
            nn.Linear(n, n//2),
            nn.BatchNorm1d(n//2),
            nn.LeakyReLU(0.1),
            nn.Linear(n//2, 1)
        )

        # Location embeddings
        self.pu_embed = torch.nn.Embedding(PUvals, embed_dim)
        self.do_embed = torch.nn.Embedding(DOvals, embed_dim)

        # Save constructor arguments as hyperparameters
        self.save_hyperparameters()

    def common_step(self, batch, batch_idx):
        x, pu, do, y = batch
        pu_vec = self.pu_embed(pu)
        do_vec = self.do_embed(do)
        X = torch.hstack((x, pu_vec, do_vec))

        # Extract features
        features = self.feature_net(X)

        # Get predictions from both heads
        pred1 = self.direct_head(features)
        pred2 = self.residual_head(features)

        # Combine predictions
        y_hat = (pred1 + pred2) / 2
        y = y.view(-1, 1)

        return y, y_hat

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), lr=self.lr, weight_decay=1e-4)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=3, verbose=True
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",
                "frequency": 1
            }
        }
This advanced model incorporates several sophisticated techniques:
LeakyReLU Activation: Instead of standard ReLU, we use LeakyReLU, which allows a small gradient when the unit is not active, helping to prevent "dying ReLU" problems.
Ensemble-like Architecture: The model uses two "heads" that make predictions independently, then averages their results. This ensemble-like approach can improve robustness.
AdamW Optimizer: We use AdamW, which adds proper weight decay regularization to the Adam optimizer.
Learning Rate Scheduler: The ReduceLROnPlateau scheduler reduces the learning rate when the validation loss plateaus, allowing the model to make more refined updates as training progresses (sketched below).
Hyperparameter Tracking: The save_hyperparameters() method automatically logs all constructor arguments, making it easier to track experiments.
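To see what the scheduler does, here is a minimal standalone sketch outside Lightning, with a made-up sequence of validation losses: once the monitored metric fails to improve for more than patience epochs, the learning rate is multiplied by factor.

param = nn.Parameter(torch.zeros(1))
opt = optim.AdamW([param], lr=1e-4)
sched = optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.5, patience=3)

fake_val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]   # improvement stops after the second value
for epoch, val_loss in enumerate(fake_val_losses):
    sched.step(val_loss)
    print(epoch, opt.param_groups[0]['lr'])   # lr drops from 1e-4 to 5e-5 once patience is exhausted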
This model architecture demonstrates how deep learning allows us to combine multiple techniques in flexible ways. The specific design choices reflect common practices in modern neural network architecture, though there’s still considerable art and experimentation involved in finding the best configuration for a particular problem.
Let’s train this advanced model:
advanced_model = AdvancedModel(other_dim, embed_dim=32, inner_dim=128)
trainer = L.Trainer(
    logger=TensorBoardLogger("tb_logs", name="advanced_model"),
    max_epochs=20,
    callbacks=[
        L.pytorch.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, mode="min"
        )
    ]
)

trainer.fit(advanced_model, train_dl, val_dl)
trainer.test(advanced_model, test_dl)
We’ve added an early stopping callback, which will halt training if the validation loss doesn’t improve for 5 consecutive epochs. This helps prevent overfitting and saves computational resources.
9.10 Conclusion
In this chapter, we’ve advanced our neural network toolkit considerably. We’ve moved from basic PyTorch implementations to using Lightning for cleaner, more organized code. We’ve explored embedding layers for categorical variables, leveraged TensorBoard for monitoring, and implemented regularization techniques such as dropout and batch normalization.
These tools and techniques form the foundation of modern deep learning practice. While the examples in this chapter focused on tabular data (taxi fares), the same principles apply to more complex domains like computer vision, natural language processing, and time series analysis.
Key takeaways from this chapter include:
- PyTorch Lightning provides structure and reduces boilerplate while maintaining flexibility
- Embedding layers offer an effective way to handle categorical variables with high cardinality
- Monitoring training with TensorBoard provides insights that help diagnose and improve models
- Regularization techniques like dropout and batch normalization are essential for preventing overfitting
- Modern architectures often combine multiple techniques and require experimentation
As you continue your deep learning journey, remember that there’s no one-size-fits-all approach. The best model for a given problem depends on the data, constraints, and specific requirements. The techniques covered in this chapter provide a solid foundation, but successful application requires experimentation, intuition, and domain knowledge.