6 Gradient Boosted Decision Trees
This is an EARLY DRAFT.
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the Adult dataset from OpenML
adult = fetch_openml(name='adult', version=2, as_frame=True)
df = adult.frame

# Rename target column for clarity
df.rename(columns={'class': 'income'}, inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Split features and target
X = df.drop(columns='income')
y = df['income']

# Convert target to binary (0/1)
y = (y == '>50K').astype(int)
In this chapter, we will combine our knowledge of decision trees and gradient descent optimization to learn about a very powerful machine learning framework: gradient boosted decision trees (GBDTs). At the time of writing, GBDTs are the best performing models for medium-sized tabular data, a position from which neural networks have not been able to displace them despite their great successes elsewhere.
GBDTs belong to the class of ensemble models, models which pool together predictions from other models. The prediction from a GBDT is of the form
\[f(x) = \sum_{i=0}^{M} f_i(x,\beta_i)\]
where each \(f_i\) represents the prediction from a decision tree, \(\beta_i\) is that tree’s parameters, representing in an abstract way its shape and splitting conditions, and \(M+1\) is the number of trees in the ensemble. GBDTs are fit to data in a sequential way. We first fit a decision tree represented by \(f_0\), look at its errors, then fit \(f_1\) to best rectify those errors, and so on. This general approach is known as boosting, and its development in the 1990s by Freund and Schapire, in the form of an algorithm called AdaBoost, was a major milestone in machine learning.
We will see that the process of fitting GBDTs can be fruitfully viewed as gradient descent in a function space. But before we get into that, we need to review the loss function that we are aiming to minimize.
6.1 Surrogate Loss Functions for Classification
In this first section, we focus on surrogate loss functions commonly used in gradient boosting for classification. The true target in many classification tasks is a discrete label (e.g., \(y \in \{0, 1\}\) for binary classification). In the previous chapter on decision trees we looked at different performance measures such as accuracy for this task. However, with a discrete variable like \(y\) we cannot use the tools of differential calculus. To get around this, we have our model output not the discrete class prediction but a score which reflects the model’s confidence that the observation belongs to a particular class. A very common score is the probability that the model assigns to an observation being in a certain class. We then define a proxy loss function, known as a surrogate loss function, that measures the ‘goodness’ of this score, and fit the model to minimize that. Once the model is fit, we can predict class membership if required on the basis of the scores. For example, if the scores are probabilities we might predict the observation to belong to the class which has the highest probability according to the model.
6.1.1 The Exponential Loss
The original AdaBoost algorithm effectively used the exponential loss function. Suppose we have a dataset of \(n\) points \((x_i, y_i)\), where \(x_i\) is a feature vector and \(y_i \in \{-1, +1\}\) is a class label (for convenience, we use \(\{-1, +1\}\) instead of \(\{0,1\}\) in this section). Our goal is to find a function \(F(x)\) that predicts a score; the final class prediction is \(+1\) if the predicted score is positive and \(-1\) if it is negative.
The exponential loss is given by
\[ L_{\mathrm{exp}}(y_i, F(x_i)) = \exp\bigl(-y_i \, F(x_i)\bigr). \]
During training, if the model makes the correct prediction, \(y_i\) and \(F(x_i)\) have the same sign and hence \(y_iF(x_i)\) is positive. On the other hand, if the model makes an incorrect prediction, \(y_i\) and \(F(x_i)\) have opposite signs and \(y_iF(x_i)\) is negative. Therefore, by calculating \(\exp(-y_i\,F(x_i))\) we get large losses for misclassifications and small losses for correct classifications.
It turns out that the quantity \(y_iF(x_i)\), called the margin, is interesting not just for its sign. When the margin is a large positive number, our model is not just correctly fitting the training sample; it is predicting the right answer with high confidence. When the margin is a large negative number, the model is not only wrong, it is wrong with high confidence. Fitting classification models to maximize the margin, rather than just getting the right predictions on the training sample, turns out to be a good way to obtain a model which also predicts well on new observations.
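To make this concrete, the short NumPy sketch below computes margins and exponential losses for a few made-up labels and scores (the numbers are purely illustrative and not from any real model):
import numpy as np

# Labels in {-1, +1} and hypothetical model scores F(x_i)
y = np.array([+1, +1, -1, -1])
F = np.array([2.0, 0.3, -1.5, 0.8])   # the last point is misclassified

margin = y * F                         # positive = correct, negative = wrong
exp_loss = np.exp(-margin)             # exponential loss per observation

print(margin)    # [ 2.   0.3  1.5 -0.8]
print(exp_loss)  # only the misclassified point receives a large loss (~2.23)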
6.1.2 The Log Loss
A more widely used surrogate loss for classification is the logistic loss, also known as the log loss. In binary classification with labels \(y_i \in \{0, 1\}\) (or equivalently \(\{-1, +1\}\) with some re-parameterization), we first derive the predicted probability that \(y_i = 1\) by the formula
\[ p_i = \sigma\bigl(F(x_i)\bigr) = \frac{1}{1 + \exp(-F(x_i))}, \]
and then define the loss to be the negative log-likelihood of the observed \(y_i\): \[ L_{\mathrm{log}}(y_i, F(x_i)) = -\Bigl[y_i \log p_i + (1-y_i)\log (1 - p_i)\Bigr], \]
where \(p_i = \sigma(F(x_i))\).
Log loss is more prevalent in modern machine learning frameworks due to its probabilistic interpretation. Minimizing log loss corresponds to maximizing the likelihood under a Bernoulli model. In practice, log loss has been found to be more robust than exponential loss. Accordingly, most gradient boosting libraries default to a variation of log loss for binary classification tasks.
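As a small illustration, the following NumPy sketch (again with made-up labels and scores) computes the sigmoid probabilities and the per-observation log loss:
import numpy as np

def sigmoid(F):
    # Map raw scores to probabilities
    return 1.0 / (1.0 + np.exp(-F))

def log_loss(y, F):
    # Per-observation logistic loss for labels y in {0, 1} and raw scores F
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
F = np.array([2.0, 0.3, -1.5, 0.8])   # the last point is misclassified

print(sigmoid(F))      # predicted probabilities of class 1
print(log_loss(y, F))  # the misclassified point receives the largest loss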
6.1.3 Cross-entropy loss
For multiclass classification with \(K\) classes, the most common loss function is the multiclass log loss (also known as cross-entropy loss). If we denote by \(p_{ik}\) the predicted probability that observation \(i\) belongs to class \(k\), and by \(y_{ik}\) the binary indicator that is 1 if observation \(i\) truly belongs to class \(k\) and 0 otherwise, then:
\[ L_{\mathrm{multi}}(y_i, F(x_i)) = -\sum_{k=1}^K y_{ik} \log p_{ik}, \]
where \(p_{ik}\) is typically computed using the softmax function:
\[ p_{ik} = \frac{\exp(F_k(x_i))}{\sum_{j=1}^K \exp(F_j(x_i))}. \]
Here, \(F_k(x_i)\) represents the score predicted by the model for class \(k\). This formulation naturally extends the binary log loss to multiple classes while maintaining its desirable properties of probabilistic interpretation and robustness.
The softmax function gets its name because it is a “soft” version of the maximum function. While a regular maximum would assign 1.0 to the highest score and 0.0 to all others, softmax produces a smoother probability distribution where larger inputs receive larger probabilities, but smaller inputs still get some non-zero probability mass. The exponential function ensures all probabilities are positive, and the denominator normalizes them to sum to one.
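The following NumPy sketch computes softmax probabilities and the cross-entropy loss for a single made-up observation with \(K = 3\) classes; the scores are illustrative:
import numpy as np

def softmax(scores):
    # Convert a vector of class scores F_k(x) into probabilities
    z = scores - scores.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical scores F_k(x_i) for one observation and K = 3 classes
scores = np.array([2.0, 1.0, -0.5])
p = softmax(scores)
print(p)            # roughly [0.69, 0.25, 0.06]; sums to one

# Cross-entropy loss if the true class is the second one (one-hot y = [0, 1, 0])
y_onehot = np.array([0, 1, 0])
loss = -np.sum(y_onehot * np.log(p))
print(loss)         # equals -log(p[1])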
6.1.4 Losses for regression problems
For regression tasks, several loss functions are commonly used, each with different properties. The Mean Squared Error (MSE) loss, \(L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2\), already familiar to us from linear regression, is perhaps the most widely used. It is continuously differentiable everywhere, making it particularly suitable for gradient-based optimization. Its gradient with respect to the prediction \(F(x_i)\) is simply \((F(x_i) - y_i)\), which gives us intuitive pseudo-residuals.
The Mean Absolute Error (MAE) loss, \(L(y_i, F(x_i)) = |y_i - F(x_i)|\), is another option that is more robust to outliers. However, it is not differentiable at \(y_i = F(x_i)\), which can make optimization more challenging. Its gradient with respect to the prediction is \(-1\) when the prediction is below the true value and \(+1\) when it is above.
The Huber loss provides a compromise between MSE and MAE. For a parameter \(\delta > 0\), it is defined as:
\[ L(y_i, F(x_i)) = \begin{cases} \frac{1}{2}(y_i - F(x_i))^2 & \text{if } |y_i - F(x_i)| \leq \delta \\ \delta|y_i - F(x_i)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} \]
Huber loss behaves quadratically for small errors (like MSE) but linearly for large errors (like MAE), combining the best of both: differentiability near zero and robustness to outliers.
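To see how these three losses compare, the NumPy sketch below evaluates each of them on a small, a moderate, and a large (outlier) error; the numbers are made up for illustration:
import numpy as np

def mse_loss(y, F):
    # Squared error with the 1/2 factor used in the text
    return 0.5 * (y - F) ** 2

def mae_loss(y, F):
    return np.abs(y - F)

def huber_loss(y, F, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond that
    err = y - F
    small = np.abs(err) <= delta
    return np.where(small,
                    0.5 * err ** 2,
                    delta * np.abs(err) - 0.5 * delta ** 2)

y = np.array([3.0, 3.0, 3.0])
F = np.array([2.8, 1.0, -5.0])   # small, moderate, and outlier-sized errors

print(mse_loss(y, F))            # grows quadratically with the error
print(mae_loss(y, F))            # grows linearly
print(huber_loss(y, F))          # quadratic near zero, linear for the outlier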
6.2 Additive Models: Summation of Weak Learners
Having seen the kind of loss functions we might be interested in minimizing, let’s now look at how GBDTs go about minimizing them.
Gradient boosting is best viewed in the framework of additive models. Instead of fitting a single, monolithic model (like one large decision tree), we fit a sequence of simpler models (often called weak learners) that, when summed together, form a powerful predictor.
Mathematically, suppose we have a loss function \(L(y_i, F(x_i))\) we want to minimize over \(i=1,\dots,n\). We posit an additive form for \(F\):
\[ F(x) = \sum_{m=1}^M \gamma_m h_m(x), \]
where each \(h_m(x)\) is a decision tree (our weak learner) and \(\gamma_m\) is a scalar weight (which might be absorbed into the tree itself in certain formulations). In boosting, we build up \(F(x)\) one weak learner at a time:
\[ F_1(x) = \gamma_1 h_1(x), \]
\[ F_2(x) = \gamma_1 h_1(x) + \gamma_2 h_2(x), \]
\[ \cdots \]
\[ F_M(x) = \sum_{m=1}^M \gamma_m h_m(x). \]
After adding each new term, we hope to reduce the overall training loss. The final model is thus an ensemble (ensemble being just a fancy word for set or collection), with each new weak learner focusing on how to correct the mistakes of the already-fitted ensemble.
6.3 Gradient Boosting as Gradient Descent in Function Space
We now describe how the successive trees are fitted and why it is meaningful to view gradient boosting as gradient descent in function space.
6.3.1 Functions as Parameters
Recall from earlier chapters that in standard parameter optimization, we have a parameter vector \(\theta \in \mathbb{R}^d\) and a loss function \(L(\theta)\) to minimize. We apply gradient descent by iterating:
\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L\bigl(\theta^{(t)}\bigr), \]
where \(\eta\) is the learning rate. The key idea in functional gradient descent is that instead of \(\theta\), we treat the function \(F(x)\) itself as the parameter to be updated. That is, for classification or regression, we want to find \(F^{(t+1)}\) from \(F^{(t)}\) by stepping in the negative gradient direction of our loss with respect to the function \(F\).
6.3.2 The Gradient with Respect to a Function
At this point you may be feeling apprehensive if your calculus courses have not taught you to take derivatives with respect to functions (I don’t know why they don’t use Dieudonné’s calculus book in schools any more). On the other hand if you are a math person the situation may be looking as inviting as fresh gateaux. But hold on for a second. We observe \(F\) only at the finite number of \(x_i\) values in our sample, and it is with respect to the finite \(F(x_i)\) values that we can optimize. So our problem will ultimately turn out to be a routine multivariate calculus problem. We use the function space language only to get a clearer intuitive picture of what we are doing.
Let us denote our training dataset by \(\{(x_i, y_i)\}_{i=1}^n\) and the loss function by
\[ \mathcal{L}(F) = \sum_{i=1}^n L\bigl(y_i, F(x_i)\bigr). \]
When we speak of the gradient of \(\mathcal{L}\) with respect to \(F\), we are referring to how \(\mathcal{L}\) changes if we perturb \(F(x)\) at each \(x\). More concretely:
\[ \frac{\partial \mathcal{L}}{\partial F(x_i)} = \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}. \]
The negative gradient direction at each point \(x_i\) is thus
\[ r_i = -\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}. \]
These \(r_i\) are often called the pseudo-residuals.
The term pseudo-residuals comes from the fact that if we were minimizing the distance between \(y_i\) and \(F(x_i)\), it would have made sense to calculate the residual \((y_i-F(x_i))\) and set it as the target for the next weak predictor in our series.
In fact, if we use the squared error \((y_i-F(x_i))^2\) as our loss, the \(r_i\) above equal the residuals (times 2).
However, in general we want our algorithm to work with any differentiable loss function. The vector of pseudo-residuals then gives the direction of movement that reduces our loss the fastest, and we want our next weak predictor to move our predictions in that direction. That is what the gradient descent philosophy is all about.
6.3.3 Iterative Fitting
At iteration \(m\), suppose we have a current model \(F_{m-1}(x)\). We compute the pseudo-residuals for each data point \(i\):
\[ r_i^{(m)} = - \left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F=F_{m-1}}. \]
Then we fit a new function \(h_m(x)\) (in our case, a decision tree) to these residuals \(\{r_i^{(m)}\}\). Formally, we solve an optimization problem that tries to match \(h_m(x_i)\) to \(r_i^{(m)}\). Typically:
\[ h_m(x) = \underset{h \in \mathcal{H}}{\arg\min} \sum_{i=1}^n \Bigl(r_i^{(m)} - h(x_i)\Bigr)^2, \]
where \(\mathcal{H}\) is the hypothesis space of small regression trees. Note that we do not require \(h_m(x)\) to perfectly match \(r_i^{(m)}\); we just pick the best fit in the sense of least squares or a second-order approximation that we will detail soon.
Finally, we update the model:
\[ F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x), \]
where \(\nu\) is the learning rate (sometimes denoted by \(\eta\)) and \(\gamma_m\) is an optimal multiplier that we may solve for (or incorporate into the tree’s leaf values directly). The idea is that we are taking a gradient step in function space, with \(\nu\) controlling how big that step is. Remember that taking steps that are too big can cause us to overshoot the minimum.
In practice, for each tree, we might do:
\[ \gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^n L\Bigl(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i)\Bigr). \]
Hence the procedure is:
- Initialize \(F_0(x)\), typically a constant function that might be chosen by minimizing the loss with respect to a single scalar (for logistic loss, this might be the log-odds of the sample proportion).
- For \(m\) from 1 to \(M\):
- Compute pseudo-residuals \(r_i^{(m)}\).
- Fit a tree \(h_m(x)\) to \(\{(x_i, r_i^{(m)})\}\).
- Solve for \(\gamma_m\) to minimize the loss in a line search.
- Update \(F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)\).
The final \(F_M(x)\) is the boosted model.
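To make the procedure concrete, here is a minimal from-scratch sketch of the boosting loop for the squared-error loss, using small scikit-learn regression trees as the weak learners. For squared error the pseudo-residuals are just the ordinary residuals, so we skip the line search and use a fixed step size \(\nu\); the function names and the synthetic data are our own illustrative choices, not how any production library implements boosting.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_squared_error(X, y, M=100, nu=0.1, max_depth=3):
    # Minimal gradient boosting loop for the squared-error loss
    F0 = y.mean()                        # constant initial model F_0
    F = np.full(len(y), F0)              # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(M):
        r = y - F                        # pseudo-residuals (= residuals for squared error)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                   # fit weak learner h_m to the pseudo-residuals
        F += nu * tree.predict(X)        # shrunken gradient step in function space
        trees.append(tree)
    return F0, trees

def predict_gbdt(F0, trees, X, nu=0.1):
    F = np.full(X.shape[0], F0)
    for tree in trees:
        F += nu * tree.predict(X)
    return F

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)
F0, trees = fit_gbdt_squared_error(X_toy, y_toy)
print(np.mean((y_toy - predict_gbdt(F0, trees, X_toy)) ** 2))  # training MSE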
6.4 Fitting a Tree to Pseudo-Residuals: Second-Order Approximations and Regularization
In classical (first-order) gradient boosting, each new tree \(h_m(x)\) is trained to predict the pseudo-residuals: \[ r_i^{(m)} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}. \] We might use least-squares to fit \(h_m(x)\) to \(r_i^{(m)}\), or we could do something more specialized (e.g., fitting a classification tree to \(\operatorname{sign}(r_i^{(m)})\)). A more sophisticated improvement—used by current GBDT libraries—is to use a second-order (Newton) approximation of the loss. This approach leverages not only the gradient but also the (diagonal of the) Hessian of the loss, leading to more refined updates. Below we derive and explain how second-order approximations yield a splitting criterion and, ultimately, leaf values for the new tree.
6.4.1 Second-Order Expansion of the Loss
To illustrate second-order methods, recall that at iteration \(m\) of boosting, we want to add a new tree \(h_m(x)\) that best reduces the overall loss:
\[ \mathcal{L}(F_m) = \sum_{i=1}^n L\Bigl(y_i, F_{m-1}(x_i) + \gamma_m h_m(x_i)\Bigr). \]
Directly solving this can be expensive because \(h_m(x)\) is a tree with discrete splits and \(\gamma_m\) is a scalar (or leaf-wise scalars). Instead, we approximate \(L\) around the current model \(F_{m-1}\). A second-order Taylor expansion of \(L\) around \(F_{m-1}(x_i)\) yields:
\[ L\Bigl(y_i, F_{m-1}(x_i) + \delta\Bigr) \approx L\Bigl(y_i, F_{m-1}(x_i)\Bigr) + g_i \, \delta + \tfrac{1}{2} h_i \, \delta^2, \]
where
\[ g_i = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F=F_{m-1}}, \quad h_i = \left[ \frac{\partial^2 L(y_i, F(x_i))}{\partial F(x_i)^2} \right]_{F=F_{m-1}}, \]
are the first- and second-order derivatives (evaluated at the current model \(F_{m-1}\)). In boosting parlance, we typically call \(g_i\) the “gradient” and \(h_i\) the “Hessian” (per data point).
If you remember your Taylor approximation, this should give you pause: the Hessian is a matrix, and the second-order term should be a double sum. What GBDT libraries do here is drop all off-diagonal terms of the Hessian to simplify the calculation.
Given a candidate new tree \(h(x)\) whose leaf values we denote by \(w_j\) for leaf \(j\), each sample \(i\) belongs to exactly one leaf. Let’s define an indicator \(I_j(x_i)\) that is 1 if \(x_i\) falls into leaf \(j\) and 0 otherwise. Then the increment \(\delta_i\) in the second-order expansion is:
\[ \delta_i = \gamma_m \, h(x_i) \approx w_j \quad \text{where } j \text{ satisfies } I_j(x_i)=1. \]
(We often absorb \(\gamma_m\) into each leaf’s weight \(w_j\), so each leaf has its own scalar. For simplicity, let’s just write \(w_j\).)
Hence, the approximate change in the total loss from adding the new tree is:
\[ \sum_{i=1}^n \Bigl[ L\bigl(y_i, F_{m-1}(x_i)\bigr) + g_i \, w_{q(i)} + \tfrac{1}{2} h_i \, w_{q(i)}^2 \Bigr], \]
where \(q(i)\) is the leaf index of \(x_i\). Since \(L\bigl(y_i, F_{m-1}(x_i)\bigr)\) does not depend on \(w_j\), we focus on the term:
\[ \sum_{i=1}^n \Bigl[g_i \, w_{q(i)} + \tfrac{1}{2} h_i \, w_{q(i)}^2\Bigr]. \]
6.4.2 Regularization of Leaf Weights
An integral part of second-order boosting is regularization. We typically add a penalty on the leaf weights to avoid overfitting:
\[ \Omega(h) = \frac{\lambda}{2} \sum_{j=1}^J w_j^2 + \gamma J, \]
where:
- \(J\) is the number of leaves in the new tree \(h(x)\) (so \(\gamma\) penalizes adding more leaves).
- \(\lambda \ge 0\) is a parameter that penalizes large leaf weights.
The penalties serve distinct but complementary purposes in controlling model complexity. The penalty \(\lambda\) on leaf weights prevents any single tree from making overly confident predictions by shrinking the magnitude of updates. This is particularly important because each tree’s predictions are added to the ensemble - if leaf weights grow too large, a single tree could dominate the ensemble’s predictions. Meanwhile, the leaf count penalty \(\gamma\) discourages trees from creating too many leaves, effectively limiting the tree’s ability to partition the feature space too finely. Together, these penalties help prevent overfitting: \(\lambda\) controls the “vertical” complexity (prediction magnitudes) while \(\gamma\) controls the “horizontal” complexity (feature space partitioning).
Thus, the objective for the new tree’s leaf weights becomes:
\[ \widetilde{\mathcal{L}}(h) = \sum_{i=1}^n \Bigl[g_i \, w_{q(i)} + \tfrac{1}{2} h_i \, w_{q(i)}^2\Bigr] + \frac{\lambda}{2} \sum_{j=1}^J w_j^2 + \gamma J. \]
For fixed structure of the tree (i.e., the splits are fixed, and we have \(J\) leaves), we can solve for the optimal weight \(w_j\) of each leaf analytically. Indeed, let’s define
\[ G_j = \sum_{i \in \text{leaf } j} g_i, \quad H_j = \sum_{i \in \text{leaf } j} h_i. \]
Then we isolate the terms in \(\widetilde{\mathcal{L}}\) relevant to leaf \(j\):
\[ \sum_{i \in \text{leaf } j} \Bigl[g_i \, w_j + \tfrac{1}{2} h_i \, w_j^2\Bigr] + \frac{\lambda}{2} w_j^2. \]
Combining, we get:
\[ G_j \, w_j + \tfrac{1}{2} H_j \, w_j^2 + \frac{\lambda}{2} w_j^2 = G_j \, w_j + \frac{1}{2}(H_j + \lambda) \, w_j^2. \]
Taking the derivative w.r.t. \(w_j\) and setting it to zero:
\[ G_j + (H_j + \lambda) w_j = 0 \quad\Longrightarrow\quad w_j = - \frac{G_j}{H_j + \lambda}. \]
Hence, for a given fixed tree structure, the optimal leaf weight is the negative ratio of the sum of gradients to the sum of Hessians (plus \(\lambda\)).
6.4.3 Split Finding and the Gain Formula
The next question is how to choose the splits for the tree structure itself. We typically build a tree greedily, split by split. For each candidate split at a node, we partition the samples that reach that node into two subsets (left and right) and measure the gain in the approximate objective:
\[ \text{gain} = \widetilde{\mathcal{L}}(\text{before split}) \;-\; \widetilde{\mathcal{L}}(\text{after split}). \]
Using the summations above, one can show that if we have a node with aggregated gradient \(G\) and Hessian \(H\), splitting it into left child (\(G_L\), \(H_L\)) and right child (\(G_R\), \(H_R\)) yields the gain:
\[ \text{gain} \approx \frac{1}{2}\Bigl[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Bigr] - \gamma, \]
where \(\gamma\) is the penalty for adding a new leaf (often the same \(\gamma\) used in \(\Omega(h)\)). The larger this gain is (if it exceeds zero), the more beneficial it is to perform the split. If the gain is negative, the split is not beneficial and we keep the node as a leaf.
The term \(-\gamma\) reflects the structural penalty for adding two new leaves instead of one. It can be interpreted as a “cost” for model complexity. The rest of the expression captures how well the split separates the gradients and Hessians into two groups that, in aggregate, reduce the approximate loss.
In practice, implementations like XGBoost or LightGBM iterate over possible split points in each feature to find the one that maximizes the gain. Once a best split is found at the top node, the process repeats recursively for each child node until some stopping criterion is met (e.g., maximum depth, minimum samples per leaf, or a negative split gain).
6.4.4 Summary of the Second-Order Boosting Step
Putting it all together, the second-order approach to adding a new tree \(h_m\) at iteration \(m\) can be summarized:
Compute first and second derivatives: For each data point \(i\), compute \[ g_i^{(m)} = \left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad h_i^{(m)} = \left[ \frac{\partial^2 L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)^2} \right]_{F = F_{m-1}}. \]
Split finding: Build a decision tree greedily, evaluating potential splits by the gain formula: \[ \text{gain} = \frac{1}{2}\Bigl[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Bigr] - \gamma, \] where \(G_L = \sum_{i \in \text{left}} g_i^{(m)}\), \(H_L = \sum_{i \in \text{left}} h_i^{(m)}\), and similarly for \(G_R, H_R\).
Leaf value: Once the tree structure is fixed, compute the weight of leaf \(j\) via \[ w_j = - \frac{\sum_{i \in j} g_i^{(m)}}{\sum_{i \in j} h_i^{(m)} + \lambda}. \]
Update: \[ F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^J w_j \, \mathbb{1}_{\{x \in \text{leaf } j\}}, \] where \(\nu \in (0,1]\) is the learning rate (also called shrinkage).
This completes one iteration of the second-order gradient boosting step. Over many iterations \(m=1,\dots,M\), this yields a powerful ensemble of trees, each “paying attention” to the current gradients and Hessians of the training loss.
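To tie the formulas together, the sketch below computes the second-order quantities for the logistic loss on a toy node: per-sample gradients and Hessians, the gain of a candidate split, and the optimal leaf weights. The helper names, the toy data, and the choices \(\lambda = 1\), \(\gamma = 0\) are illustrative assumptions, not part of any particular library.
import numpy as np

def logistic_grad_hess(y, F):
    # Per-sample gradient and Hessian of the log loss w.r.t. the raw score F
    p = 1.0 / (1.0 + np.exp(-F))
    g = p - y              # dL/dF
    h = p * (1.0 - p)      # d2L/dF2
    return g, h

def leaf_weight(g, h, lam=1.0):
    # Optimal leaf value w_j = -G_j / (H_j + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Second-order gain of splitting a node into left/right children
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma

# Toy node: 6 samples, current raw scores F_{m-1}(x_i) all zero
y = np.array([1, 1, 1, 0, 0, 0])
F = np.zeros(6)
g, h = logistic_grad_hess(y, F)

# Candidate split that separates the classes perfectly
left = np.array([True, True, True, False, False, False])
print(split_gain(g, h, left))             # positive gain, so the split is worthwhile
print(leaf_weight(g[left], h[left]))      # positive weight pushes scores towards class 1
print(leaf_weight(g[~left], h[~left]))    # negative weight pushes scores towards class 0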
6.5 Adding a New Tree, the Learning Rate, and Key Hyperparameters
In the previous sections, we explored how gradient boosting builds an additive model step by step, fitting each new tree to pseudo-residuals (or, more generally, taking a second-order Newton step). In this section, we will consolidate these ideas by focusing on how a newly fitted tree is added to the ensemble and the major hyperparameters that govern the overall training process. We will also clarify how the learning rate connects gradient boosting to standard gradient descent.
6.5.1 Connection to Gradient Descent: The Learning Rate
Recall that in gradient descent for a parameter vector \(w \in \mathbb{R}^d\), we typically update \[ w \leftarrow w - \eta \nabla L(w), \] where \(\eta\) is a scalar step size (learning rate). If \(\eta\) is too large, updates may overshoot; if too small, convergence might be very slow.
In gradient boosting, the model \(F(x)\) plays the role of the “parameter.” Each new tree \(h_m(x)\) is like the negative gradient direction we wish to follow. Then the learning rate \(\nu\) (often denoted \(\eta\) in some references) scales how large a step we take in function space: \[ F_m(x) = F_{m-1}(x) + \nu \, h_m(x). \] Sometimes we incorporate an additional scalar \(\gamma_m\) that is solved via a line search per tree, but in many frameworks (like XGBoost and LightGBM by default), each leaf weight \(w_j\) already accounts for the line search implicitly. The extra learning rate \(\nu\) then multiplies these leaf weights. Either way, the essential idea is that the learning rate shrinks the newly fitted tree before adding it to the ensemble.
6.5.1.1 Why Smaller is Often Better
A smaller learning rate \(\nu\) (for example, 0.05 instead of 0.3) typically makes the training process more robust. By taking smaller steps, each new tree has a more modest impact, reducing the chance of large, sudden overfitting. However, small learning rates generally require more boosting iterations \(M\) to achieve similar training loss reduction.
6.5.2 Major Hyperparameters in Gradient Boosted Trees
The beauty (and curse) of gradient boosted decision trees is that there are numerous hyperparameters you can tune to optimize model performance. Some of them are of primary conceptual importance, and we list these below.
- Number of Trees (\(M\))
- Often denoted by n_estimators in many libraries.
- A large \(M\) means we keep adding more and more small corrections (trees) to our model. If \(\nu\) is fixed, typically training loss continues to decrease as \(M\) grows, but at some point generalization performance may start to decline if we overfit or if the data is noisy.
- Learning Rate (\(\nu\))
- As discussed, this is the shrinkage factor for each new tree.
- A small learning rate requires more iterations (larger \(M\)) to reach good performance but can reduce overfitting risk and lead to more stable convergence.
- Tree Complexity: Maximum Depth, Minimum Samples per Leaf, etc.
- Each new tree can range from a stump (depth=1) to quite a large tree.
- Typically, boosting uses smaller, weak trees—this is the original intuition behind the name “boosting.” Smaller trees (like depth=4 or depth=6) are less prone to overfitting.
- You can also control “complexity” by setting a minimum number of data points per leaf (for instance, min_child_samples in LightGBM). This ensures that no leaf is too small, indirectly limiting how specifically the tree can fit local noise.
- Regularization Terms (\(\lambda, \gamma\))
  - We introduced these in the second-order boosting section:
    - \(\lambda\) is an L2 regularization parameter on leaf weights. A larger \(\lambda\) means the model penalizes large leaf values, leading to smaller updates and smoother fits.
    - \(\gamma\) is a penalty term for each additional leaf. A larger \(\gamma\) means that any split must yield a sufficiently large reduction in the approximate loss to justify an additional leaf.
  - Together, these hyperparameters prevent the new tree from overfitting to the pseudo-residuals.
- Subsampling or Sampling Ratios (Row and Column Sampling)
  - To reduce variance, speed up training, and avoid overfitting, we can also subsample the training rows and/or the feature columns when fitting a new tree (similar to random forests).
  - We will discuss this in more detail later.
6.5.2.1 Practical Notes on Choosing Hyperparameters
- One typically tunes these major hyperparameters (learning rate, number of trees, max depth, min samples per leaf, and regularization parameters) by cross-validation or by using a separate validation set.
- Grid search or more advanced search methods (Bayesian optimization, hyperband, etc.) systematically explore combinations of these parameters, aiming to find a good balance between training accuracy and validation performance.
- Default values in libraries like XGBoost or LightGBM are usually decent starting points; for instance, XGBoost’s default often sets \(\nu=0.3\), max_depth=6, \(\lambda=1.0\), and \(\gamma=0\). But real-world tasks typically benefit from adjusting these defaults significantly.
6.5.3 Complexity and the Bias-Variance Trade-Off
Even though boosting tends to reduce bias by sequentially modeling the “residual error” from prior iterations, we must still be mindful of the variance. If the trees grow too large and the number of boosting rounds is also large, the final model can become extremely flexible and overfit. Conversely, if the learning rate is too small and the maximum depth is too shallow, the model may be underfitting, failing to capture the complexity of the data.
Hence, we can see that hyperparameter tuning for gradient boosted trees is essentially about managing this bias-variance trade-off through multiple levers: tree size, the number of sequential trees, subsampling, and so on.
6.5.4 The Process of Adding a New Tree
To concretize how a new tree is added at each iteration \(m\):
- Compute Derivatives: We compute the gradient (and Hessian if doing second-order boosting) for each sample \(i\), evaluated at \(F_{m-1}(x_i)\).
- Fit the Tree: Based on these derivatives, we greedily split the data space into leaves that (locally) maximize the gain.
- Regularize Leaf Values: We solve for the best leaf values (e.g., \(w_j\)) with L2 regularization and possibly a leaf penalty (\(\gamma\)).
- Shrink the Tree: We multiply each leaf value by the learning rate \(\nu\).
- Add to the Ensemble:
\[ F_m(x) \;=\; F_{m-1}(x) \;+\; \nu\, h_m(x). \]
We repeat these steps for \(m=1,\dots,M\). Once training finishes (or we decide to early-stop), we have \[ F_M(x) \;=\; \sum_{m=1}^M \nu\, h_m(x). \]
6.5.5 When Does This Process Stop?
Unlike a single decision tree that might stop splitting based on depth or a minimum leaf size, the overall gradient boosting process typically ends when we reach \(M\) trees or some early stopping condition. For instance:
- Fixed \(M\): We decide in advance how many trees to train.
- Early Stopping: We monitor the loss (or accuracy, or AUC) on a validation set. If the metric stops improving for (say) 50 consecutive trees, we revert to the best iteration found so far.
By adjusting \(\nu\) and \(M\), we can precisely control how long we train and how small each incremental improvement is. Indeed, many successful practitioners in Kaggle competitions or industrial applications rely heavily on early stopping to ensure that they do not overfit with an excessively large \(M\).
We have now fleshed out how gradient boosting with decision trees functions from a hyperparameter standpoint: controlling learning rate, tree structure, regularization, and the number of iterations. Next, we will turn to practical implementations, focusing on LightGBM as one prominent library that exemplifies these techniques. We will see how LightGBM speeds up tree building, handles categorical features, and offers additional functionality—then we will work through a case study on the UCI Adult dataset.
6.6 Practical Implementation with LightGBM
Having developed a theoretical foundation for gradient boosted decision trees—especially the second-order approximation and associated hyperparameters—let us now turn to LightGBM, a popular open-source framework for gradient boosting. LightGBM, developed by Microsoft Research, distinguishes itself from other implementations (such as XGBoost) through its efficient techniques for building trees on large datasets, its specialized handling of categorical features, and other engineering optimizations that make it both fast and memory-efficient.
In this section, we will highlight the key innovations in LightGBM and explain why they matter from both computational and modeling perspectives. Then, in the following sections, we will walk through a case study applying LightGBM to the UCI Adult dataset.
6.6.1 Key Innovations and Advantages
- Histogram-based Algorithm. Traditional decision tree learners consider all possible split points for each feature by sorting the feature values and evaluating split thresholds. This approach, while accurate, can be slow when the dataset and the number of features are large. LightGBM (like XGBoost’s histogram-based mode) uses binning: each feature’s continuous values are bucketed into a fixed number of bins (e.g., 255 by default). Instead of storing and scanning through all raw feature values, LightGBM scans the bins—effectively a compressed representation. This lowers computation time significantly, especially in repeated scan-and-split steps. Despite the approximate nature of binning, the impact on model accuracy is often negligible or even beneficial (less overfitting to micro variations).
- Gradient-based One-Side Sampling (GOSS). When computing tree splits, we must compute statistics based on the gradient (and Hessian) for each sample. LightGBM introduces gradient-based one-side sampling to accelerate training on large datasets: it keeps all samples with large gradient magnitudes (the ones that are presumably “hard” to fit and thus more critical) and randomly subsamples from those with small gradient magnitudes. This strategy preserves the samples that carry the most information about current errors while reducing the total data size. It can speed up training substantially.
- Exclusive Feature Bundling (EFB). Another trick for high-dimensional data is that many features may be mutually exclusive (never simultaneously non-zero). In sparse datasets, LightGBM can bundle such features into a single “combined feature” to reduce dimensionality and, hence, reduce overhead in split finding.
- Efficient Handling of Categorical Variables. Unlike some other libraries, LightGBM has native support for categorical features. It can automatically detect or be told which columns are categorical and then find smart ways to split them (for instance, ordering categorical values by their accumulated gradient statistics and splitting based on that ordering). This can outperform naive one-hot encoding, especially when the number of categories is large.
- Leaf-wise (Best-first) Tree Growth. By default, LightGBM grows its trees in a leaf-wise manner rather than a level-wise manner. It focuses on expanding the leaf with the highest loss reduction (highest split gain) first. This can lead to deeper, more specific leaves in areas of the feature space that need it, while leaving simpler splits in other regions. However, to manage the potential overfitting risk, LightGBM typically controls the max depth or uses a parameter called num_leaves to limit the total number of leaves in each tree.
- Speed and Memory Efficiency. The combination of histogram binning, GOSS, EFB, and leaf-wise tree growth yields a method that is often faster and more memory-friendly than older implementations (e.g., the initial versions of XGBoost) on large datasets. Furthermore, LightGBM supports multi-threading, GPU training, and various distributed training modes.
6.6.2 Typical Workflow with LightGBM
When using LightGBM, the general steps are similar to other gradient boosted tree workflows:
- Prepare the Data
  - Handle missing values (LightGBM can handle “NA” values natively in splits).
  - Identify which columns are categorical, numeric, or text. For truly large cardinalities, you might consider additional feature engineering.
- Create a LightGBM Dataset
  - In Python, for example, you can construct a lgb.Dataset (or multiple datasets for training and validation); a minimal sketch follows this list.
- Set Hyperparameters
  - Key parameters: num_leaves, learning_rate, n_estimators (or num_boost_round), max_depth, min_data_in_leaf, lambda_l2, etc.
  - For categorical features, set categorical_feature in Python or pass them into the CLI interface.
- Train with Early Stopping
  - Provide a validation set to LightGBM.
  - Set a parameter such as early_stopping_rounds=50 to stop if validation loss does not improve for 50 rounds.
  - After training completes, LightGBM reverts to the best iteration automatically (unless you disable this feature).
- Predict and Evaluate
  - For classification, LightGBM outputs raw scores or probabilities (depending on the prediction type).
  - Evaluate with log loss, accuracy, AUC, etc., as appropriate.
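Putting these steps together, here is a minimal sketch of the native (non-scikit-learn) LightGBM workflow using lgb.Dataset and lgb.train. It assumes training and validation splits X_train, y_train, X_val, y_val already exist as pandas objects with categorical columns converted to the category dtype (as we do later in this chapter), and the parameter values are purely illustrative, not recommendations.
import lightgbm as lgb

# Build LightGBM datasets; pandas 'category' columns are detected automatically
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {
    "objective": "binary",     # binary classification with log loss
    "learning_rate": 0.05,     # illustrative values, not tuned recommendations
    "num_leaves": 31,
    "lambda_l2": 1.0,
}

booster = lgb.train(
    params,
    train_set,
    num_boost_round=2000,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)

# predict() returns probabilities of the positive class for the binary objective
val_probs = booster.predict(X_val, num_iteration=booster.best_iteration)
In the rest of this chapter we will use the scikit-learn style LGBMClassifier interface instead, which wraps the same training machinery.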
6.6.3 Parameter Tuning Strategies
LightGBM offers many knobs—some the same as other boosting frameworks (learning rate, number of trees, max depth, etc.), and some unique (like num_leaves, which sets a maximum number of leaves per tree). A general tuning strategy might be:
- Start with a moderate number of leaves (e.g., 31), a default learning rate (0.1), and a moderate number of boosting rounds.
- Enable early stopping on a validation set.
- Adjust num_leaves or max_depth if you see underfitting or overfitting (small trees might underfit, large trees might overfit).
- Tune regularization parameters (lambda_l1, lambda_l2, min_data_in_leaf, feature_fraction, bagging_fraction) to balance complexity and generalization.
- Refine the learning rate and number of iterations once you have a sense of how quickly your model learns.
- Check if enabling GOSS or EFB yields speed benefits for your data type.
6.7 Fitting the UCI Adult Dataset Using LightGBM
To see these ideas in action, let us work through a case study using the UCI Adult dataset (sometimes called the “Census Income” dataset) that we used in the decision tree chapter. As you will recall, the goal in this dataset is to predict whether an individual earns more than $50K per year, based on features like age, education, occupation, etc.
6.7.1 Preparation Steps
Let’s start by downloading and preparing the data.
LightGBM provides excellent built-in support for categorical features, making it particularly efficient for datasets with categorical variables. Unlike some other frameworks that require one-hot encoding, LightGBM can work directly with categorical features - we simply need to specify which columns are categorical and convert them to pandas’ categorical type to take advantage of this capability.
Another powerful feature of LightGBM is its intelligent handling of missing values. Rather than requiring imputation or removal of missing data, LightGBM can use missing values as meaningful information during the tree building process, often finding that missingness itself is a useful signal for prediction.
# Identify categorical features
cat_features = ['workclass', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']

# Convert to categorical type - this preserves the strings
for col in cat_features:
    X[col] = X[col].astype('category')
Finally, we split our dataset into a training and test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=102, stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
Training set shape: (39073, 14)
Test set shape: (9769, 14)
6.7.2 Fitting the model
Fitting a model with default hyperparameter values is very easy. lightgbm provides a scikit-learn compatible interface with the fit and predict methods you are already familiar with.
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report

# Train the model with default parameters
model = lgb.LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
[LightGBM] [Info] Number of positive: 9349, number of negative: 29724
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001593 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 719
[LightGBM] [Info] Number of data points in the train set: 39073, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.239270 -> initscore=-1.156685
[LightGBM] [Info] Start training from score -1.156685
Test Accuracy: 0.8726584092537619
Classification Report:
precision recall f1-score support
0 0.90 0.94 0.92 7431
1 0.77 0.66 0.71 2338
accuracy 0.87 9769
macro avg 0.84 0.80 0.82 9769
weighted avg 0.87 0.87 0.87 9769
We already have an improvement in accuracy over what we got from a single decision tree. In fact, we see a particular improvement in the f1 score for the “>50K” class.
We could try to improve model performance further by tuning the hyperparameters, as we did in the chapter on decision trees.
We can control the parameters and the fitting process through the arguments to the LGBMClassifier constructor and the fit method.
In the code below, we set up a LightGBM classifier with a relatively small learning rate (learning_rate = 0.05) and a large number of potential iterations (n_estimators = 2000). However, iterations are expensive in a number of ways:
- Every iteration adds a tree to our additive model. By making the model more flexible, it increases the possibility of overfitting.
- Fitting more trees makes the fitting step slower.
- A model with more trees makes calculating predictions slower.
For these reasons we want to use only as many trees as are actually helpful. Early stopping is a crucial technique that helps with this. We split our training set further into a new training set and a validation set. While fitting with early stopping, at each iteration we calculate the performance of the fitted model on the validation set. If we see the performance on the validation set not improving for a specified number of rounds, we stop the fitting process.
LightGBM implements early stopping through a callback mechanism. Callbacks are functions that get called at specific points during the training process, allowing us to monitor and control the training. The early_stopping callback provided by LightGBM specifically tracks the validation metric (like accuracy or log loss) and stops training if it hasn’t improved for a specified number of rounds. The reason for waiting a number of rounds is that the error on the validation set is a noisy measure of the performance of the model, fluctuating from iteration to iteration, so we want to wait for some iterations to make sure that it has reliably reached a minimum. The number of rounds to wait is specified through the stopping_rounds parameter to early_stopping.
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report

# Split training set into a new training and validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=102,
    stratify=y_train
)

# Create the model with our choice of parameters
model = lgb.LGBMClassifier(learning_rate=0.05,
                           n_estimators=2000,
                           random_state=42)

# Fit with a validation set,
# with an early stopping callback
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=30)])

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
[LightGBM] [Info] Number of positive: 7479, number of negative: 23779
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001136 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 706
[LightGBM] [Info] Number of data points in the train set: 31258, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.239267 -> initscore=-1.156704
[LightGBM] [Info] Start training from score -1.156704
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[212] valid_0's binary_logloss: 0.267351
Test Accuracy: 0.871839492271471
Classification Report:
precision recall f1-score support
0 0.90 0.94 0.92 7431
1 0.78 0.65 0.71 2338
accuracy 0.87 9769
macro avg 0.84 0.80 0.81 9769
weighted avg 0.87 0.87 0.87 9769
In our example, early stopping triggered after approximately 200 iterations, well before reaching the maximum 2000 iterations we allowed. This demonstrates why it’s safe to set a generous upper limit on iterations - early stopping will prevent unnecessary computation while ensuring we don’t stop too early. While our chosen parameters didn’t significantly outperform the default settings in this case, this is quite common in practice. Hyperparameter tuning often yields modest improvements and becomes critical only in applications where even small performance gains translate to significant real-world impact.
6.8 Row and Column Sampling
While gradient boosting decision trees fit tree models sequentially, there is another popular tree ensemble method called random forests which builds an ensemble of trees in parallel by (i) sampling with replacement (bootstrapping) different subsets of the data rows for each tree, and (ii) randomly choosing subsets of features for each node split. Because of this randomization each tree is decorrelated from the other trees and gets a somewhat independent view of the data. For our final prediction we average the prediction over all trees.
LightGBM allows you to combine this idea with gradient boosting to get the best of both worlds:
- Row Subsampling (also known as Stochastic Gradient Boosting). For each new tree, we randomly sample a fraction (say 80%) of the training data. This fraction is often controlled by the parameter bagging_fraction (see the sketch after this list).
- Column Subsampling. For each new tree (or for each node split), we limit the set of candidate features, e.g. to 80% of them. This is often controlled by the parameter feature_fraction.
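In the scikit-learn interface these options appear as subsample (rows) and colsample_bytree (columns), which correspond to LightGBM’s native bagging_fraction and feature_fraction. The sketch below shows one way they might be set; the 0.8 fractions are illustrative, it reuses X_train and y_train from the case study above, and note that row subsampling only takes effect when a bagging frequency (subsample_freq) is also set.
import lightgbm as lgb

# Row and column sampling via the scikit-learn interface.
# subsample / colsample_bytree correspond to LightGBM's native
# bagging_fraction / feature_fraction parameters; the 0.8 values are illustrative.
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    subsample=0.8,          # use 80% of the rows for each tree ...
    subsample_freq=1,       # ... resampled at every iteration (needed for row sampling to apply)
    colsample_bytree=0.8,   # use 80% of the features for each tree
    random_state=42,
)
model.fit(X_train, y_train)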
6.9 Conclusion
Gradient boosted decision trees have emerged as one of the most versatile and
powerful tools in machine learning. Their ability to handle both numerical and categorical features with minimal preprocessing makes them an excellent
“out-of-the-box” solution for many real-world problems. Whether working with
financial data, customer behavior, or scientific measurements, these models can capture complex patterns while remaining computationally tractable.
What makes gradient boosted trees particularly appealing is their balance of
practical utility and theoretical depth. While they work remarkably well in
practice, they also raise fascinating questions about optimization in function spaces, the role of weak learners in ensemble methods, and the interplay between model complexity and generalization. The field continues to evolve, with ongoing research exploring both theoretical foundations and practical improvements.
6.10 Exercises
- Regression with LightGBM: Use the Boston Housing dataset (available via sklearn.datasets.load_boston()) to predict house prices:
  - Load the dataset and split into train/test sets
  - Create a LGBMRegressor with default parameters
  - Fit the model and evaluate using Mean Squared Error
  - Compare the results with a simple linear regression model
  - Hint: Use from sklearn.datasets import load_boston and from sklearn.metrics import mean_squared_error
- Hyperparameter Tuning: Using the Adult dataset from this chapter:
  - Create a parameter grid with:
    - learning_rate: [0.01, 0.05, 0.1]
    - n_estimators: [100, 500, 1000]
    - num_leaves: [31, 63, 127]
  - Use sklearn.model_selection.GridSearchCV with 5-fold cross-validation
  - Print the best parameters and score
  - Hint: Remember to convert categorical columns to the categorical type
- Random Forest Comparison: Compare Random Forest with LightGBM on the Iris dataset:
  - Load the Iris dataset using sklearn.datasets.load_iris()
  - Split into train/test sets
  - Fit both:
    - RandomForestClassifier(n_estimators=100, random_state=42)
    - LGBMClassifier(n_estimators=100, random_state=42)
  - Compare accuracy and training time
- GPU Acceleration: LightGBM supports the use of GPUs for training. For this exercise use a GPU-equipped computer and the HIGGS dataset from the UCI ML Repository:
  - Download the first 100,000 rows of the dataset
  - Compare training time between:
    - LGBMClassifier(device='cpu')
    - LGBMClassifier(device='gpu')
  - Use the same parameters for both: n_estimators=1000
  - Hint: The HIGGS dataset has 28 numerical features and 1 binary target
- Learning Curves: Using the Adult dataset:
  - Train a LightGBM model for exactly 1000 iterations
  - Use eval_set to track both training and validation error
  - Figure out how to obtain the iteration-wise training and validation error from the fitted model and plot them
  - Identify where overfitting might be occurring
  - Hint: Do not use early stopping
6.11 References
Finally, we close with a brief set of references. These sources provide additional details on specific aspects of gradient boosting, from its historical origins to modern implementations.
- Friedman, J. (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 29(5), 1189–1232.
- The seminal paper outlining gradient boosting in detail.
- Friedman, J. (2002). “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis, 38(4), 367–378.
- Introduces stochastic gradient boosting (row subsampling).
- Breiman, L. (2001). “Random Forests.” Machine Learning, 45(1), 5–32.
- While not on boosting, it provides background on row and feature subsampling.
- Chen, T., & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
- A prominent system for gradient boosted trees, discussing second-order approximation and histogram binning.
- Ke, G., et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” Advances in Neural Information Processing Systems, 30.
- Key paper describing LightGBM’s approach: GOSS, EFB, and leaf-wise growth.
- Dong, H., et al. (2018). “A Survey on Gradient Boosting.” IEEE Transactions on Neural Networks and Learning Systems.
- More recent overview of various boosting methods and theoretical underpinnings.
By synthesizing the ideas from these references and practicing on real-world datasets such as UCI Adult, you will develop both the theoretical insight and empirical skills to deploy gradient boosted decision trees effectively. With that, we conclude this chapter, having covered everything from loss functions and second-order residual fitting to modern implementations like LightGBM and the essential hyperparameter tuning strategies that bring this method to life in practical applications.