Adaptive Boosting vs Gradient Boosting

Boosting comes from the idea that a weak learner (i.e., models) can be enhanced by learning based on the errors by other weak learners. Grouping these weak learners together results in a strong learner.

It is an ensemble method that combines multiple weak learners (i.e., error rate around 0.5) to obtain a strong learner (i.e., error rate around 0). Weak learners are trained upon observations that are misclassified (or difficult to classify) and combined together to create a strong learner. It is implicit that the weak learners have low correlations with each other and therefore can be combined to be a strong learner. The lower the correlations between the weak learners, the stronger the ensemble of classifiers will be.

The two main boosting algorithms are Adaptive Boosting and Gradient Boosting. XGBoot, LightGBM and CatBoost are basically different implementations of Gradient Boosting.

  • Adaptive Boosting (Adaboost)

  • Gradient Boosted Trees
    • XGBoost - Released by Tianqi Chen (March, 2014)

    • LightGBM - Released by Microsoft (Jan 2017)

    • CatBoost - Released by Yandex (April 2017)

Adaptive Boosting (AdaBoost)

  • Alters the distribution of the training dataset to increase weights on sample observations that are difficult to classify.

  • The final prediction is based on a majority vote of the weak learner's predictions weighted by their individual accuracy.

Weak learners applied are decision stumps (i.e., decision trees with a single split or one-level).

The observations in the dataset are weighted such that a greater weight on observations that are misclassified (difficult to classify) versus the observations that have been classified correctly by the model. New weak learners are added sequentially to focus training on the difficult observations. In this manner, observations that are difficult to classify are weighted higher to the point that the algorithm identifies a model that accurately classifies these observations. Adaptive boosting changes the sample distribution being used for training.

Intuitively, Adaboost is known as a step-wise additive model. Adaboost initially specifies a set of weak learners and the boosting process is to learn how to add the weights of these weak learners to be a strong learner. The weight of each learner is based on whether it predicts each observation accurately or not. If the learner mispredicts an observation, its weight is reduced. This process is repeated until convergence.

Predictions are made by using a majority vote of the weak learner's predictions (i.e., using a set of weak learners and seeing which classification outcome is voted for the most) weighted by their individual accuracy.

Gradient Boosted Trees

  • The approach trains learners based upon minimizing the loss function of a strong learner (i.e., training on the residuals of the strong model) as an alternative means to focus on training upon misclassified observations.

  • The contribution of each weak learner to the final prediction is based on a gradient optimization process to minimize the overall error of the strong learner.

Weak learners are decision trees constructed in a greedy manner with split points based on purity scores (i.e., Gini, minimize loss) thus larger trees can be used around 4 to 8 levels. Learners should still remain weak and so they should be constrained (i.e., maximum number of layers, nodes, splits, leaf nodes)

Gradient boosting re-defines boosting as a numerical optimization problem where the objective is to minimize the loss function of the model by adding weak learners using a gradient-descent like procedure. As gradient boosting is based on minimizing a loss function, different types of loss functions can be used resulting in a flexible technique that can be applied to regression, multi-class classification, etc.

Intuitively, gradient boosting is a stage-wise additive model that generates learners during the learning process (i.e., trees are added one at a time, and existing trees in the model are not changed). It builds a first learner to predict the observations in the training dataset, and then it calculates the loss (i.e., the value between the first learner's outcomes and the actual values). It will build a second learner that is fitted/trained on the residual error produced by the first learner to predict the loss after the first step and continue to do so until it reachest a threshold (i.e., residuals are zero). By training the second learner on the gradient of the error with respect to the loss predictions of the first learner, the second learner is being trained to correct the mistakes of the first model. This is the core of gradient boosting, and allows many simple learners to compensate for each other’s weaknesses to better fit the data.

Gradient boosting does not modify the sample distribution as weak learners train on the remaining residual errors of a strong learner (i.e, pseudo-residuals). By training on the residuals of the model, this is an alternative means to give more importance to misclassified observations. Intuitively, new weak learners are being added to concentrate on the areas where the existing learners are performing poorly.

The contribution of the weak learner to the ensemble is based on a gradient descent optimization process. The calculated contribution of each tree is based on minimizing the overall error of the strong learner.

Extreme Gradient Boosting (XGBoost)

XGBoost uses pre-sorted algorithm & Histogram-based algorithm for computing the best split. The histogram-based algorithm splits all the data points for a feature into discrete bins and uses these bins to find the split value of histogram.

XGBoost cannot handle categorical features by itself, it only accepts numerical values similar to Random Forest. Therefore one has to perform various encodings like label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost.

Gradient Boosting algorithm also called gradient boosting machine including the learning rate. Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels. Regularized Gradient Boosting with both L1 and L2 regularization.

Clever Penalisation of Trees A Proportional shrinking of leaf nodes Newton Boosting Extra Randomisation Parameter

In XGBoost the trees can have a varying number of terminal nodes and left weights of the trees that are calculated with less evidence is shrunk more heavily.

Newton Boosting uses Newton-Raphson method of approximations which provides a direct route to the minima than gradient descent.

The extra randomisation parameter can be used to reduce the correlation between the trees, as seen in the previous article, the lesser the correlation among classifiers, the better our ensemble of classifiers will turn out.

Light GBM

LightGBM uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for finding a split value

LightGBM can also handle categorical features by taking the input of feature names. It does not convert to one-hot coding, and is much faster than one-hot coding. LGBM uses a special algorithm to find the split value of categorical features.

Cat Boost

CatBoost has the flexibility of giving indices of categorical columns so that it can be encoded as one-hot encoding using one_hot_max_size (Use one-hot encoding for all features with number of different values less than or equal to the given parameter value).

If you don’t pass any anything in cat_features argument, CatBoost will treat all the columns as numerical variables.

Catboost improves over LightGBM by handling categorical features better. Traditionally categorical features are one-hot-encoded, this incurs the curse of dimensionality if the feature has many distinct values. Previous algorithms tackled this issue of sparse features as we saw with EFB above. Catboost deals with categorical features by, “generating random permutations of the dataset and for each sample computing the average label value for the sample with the same category value placed before the given one in the permutation”. They also process the data with GPU acceleration, and do feature discretization into a fixed number of bins (128 and 32).


Comments powered by Disqus