
*To see this with a code example, check out my Kaggle Notebook.

XGBoost (Extreme Gradient Boosting) is a gradient-boosting algorithm built from an ensemble of decision trees. Its power comes from hardware and algorithmic optimizations that make it significantly faster, and often more accurate, than comparable tree-based methods.


XGBoost begins with a default prediction and calculates the “residuals”: the differences between the actual values and that prediction.
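For a rough sense of what that looks like, here is a toy sketch (made-up target values, with XGBoost’s default base score of 0.5 as the starting prediction):

```python
import numpy as np

# Hypothetical regression targets for four samples.
y = np.array([-10.0, 7.0, 8.0, -7.0])

initial_prediction = 0.5            # XGBoost's default base score
residuals = y - initial_prediction  # what the first tree will try to fit
print(residuals)                    # [-10.5   6.5   7.5  -7.5]
```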

We use the residuals to calculate a “similarity score” for each group of residuals; the formula includes a “regularization parameter” that shrinks the score to help prevent overfitting.
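For regression, the similarity score of a leaf is the squared sum of its residuals divided by the number of residuals plus the regularization parameter (lambda). A minimal sketch, continuing the toy example above:

```python
def similarity_score(residuals, reg_lambda=1.0):
    # Regression similarity score: (sum of residuals)^2 / (count + lambda).
    # A larger reg_lambda shrinks the score, discouraging overfitting.
    return np.sum(residuals) ** 2 / (len(residuals) + reg_lambda)

print(similarity_score(residuals))  # score for the root node
```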

We then grow a decision tree by splitting the data, calculate the similarity score for each resulting leaf, and compare the leaves to their parent node to compute a “gain” for the split.
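The gain of a candidate split is the combined similarity of the two child leaves minus the similarity of the parent node. A sketch, reusing similarity_score from above:

```python
def split_gain(left_residuals, right_residuals, reg_lambda=1.0):
    # Gain = how much the two child leaves improve on their parent node.
    parent = np.concatenate([left_residuals, right_residuals])
    return (similarity_score(left_residuals, reg_lambda)
            + similarity_score(right_residuals, reg_lambda)
            - similarity_score(parent, reg_lambda))

# Example: split the four residuals into two groups of two.
print(split_gain(residuals[:2], residuals[2:]))
```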

We continue splitting the data until we reach a maximum tree depth, then apply “pruning” to remove any split whose gain does not exceed a complexity threshold.
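In XGBoost that threshold is the gamma parameter; roughly, a split survives pruning only if its gain minus gamma is positive:

```python
def keep_split(gain, gamma=0.0):
    # Pruning rule: remove the split if gain - gamma is not positive.
    return gain - gamma > 0

print(keep_split(2.13, gamma=1.0))  # True: the split stays
print(keep_split(2.13, gamma=3.0))  # False: the split is pruned away
```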

Each remaining leaf produces an output value for its subset of the data; these outputs are scaled by a learning rate and added to the previous predictions.
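For regression, a leaf’s output value is the sum of its residuals divided by the number of residuals plus lambda, and the update is that output scaled by the learning rate (eta, 0.3 by default). Roughly:

```python
def leaf_output(residuals, reg_lambda=1.0):
    # Output value of a leaf: sum of residuals / (count + lambda).
    return np.sum(residuals) / (len(residuals) + reg_lambda)

learning_rate = 0.3  # eta
# Simplified: here all four samples share a single leaf.
new_prediction = initial_prediction + learning_rate * leaf_output(residuals)
print(new_prediction)
```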

This process of building trees on the residuals continues until the predictions stop improving, or a maximum number of iterations is reached.
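An intentionally simplified version of that loop, where every “tree” is a single leaf (no splits), to show how the predictions are refined round by round:

```python
predictions = np.full_like(y, initial_prediction)
for boosting_round in range(100):        # maximum number of iterations
    step = learning_rate * leaf_output(y - predictions)
    if abs(step) < 1e-6:                 # stop once improvement is negligible
        break
    predictions = predictions + step
print(predictions)
```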

Overall, XGBoost provides state-of-the-art performance in terms of speed and accuracy, making it a go-to algorithm in real-world applications. This is especially true for very large datasets, where tricks such as histogram-based split finding and parallelized tree construction keep computation time down.
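To tie the steps above back to the real library, here is a minimal sketch using the scikit-learn-style XGBRegressor (synthetic data and hypothetical hyperparameter values; parameter availability can vary slightly between xgboost versions):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 3 * X[:, 0] + rng.normal(size=1000)  # synthetic regression target

model = XGBRegressor(
    n_estimators=100,     # maximum number of boosting rounds
    learning_rate=0.3,    # eta: scales each tree's contribution
    max_depth=6,          # maximum tree depth before pruning
    reg_lambda=1.0,       # lambda in the similarity score
    gamma=0.0,            # minimum gain a split must provide to survive pruning
    tree_method="hist",   # histogram-based split finding for large datasets
)
model.fit(X, y)
print(model.predict(X[:5]))
```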