*To see this with a code example, check out my Kaggle Notebook.*
XGBoost is a super-charged gradient-boosting algorithm built from decision trees. Its power comes from hardware and algorithmic optimizations that make it significantly faster, and often more accurate, than comparable algorithms.

XGBoost begins with a default prediction and calculates the “residuals” between the prediction and the actual values.
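As a rough sketch with a toy set of target values (the values are hypothetical, and the 0.5 starting point mirrors XGBoost's default base score), the residuals are just the actual values minus the current predictions:

```python
import numpy as np

# Toy regression targets, purely for illustration
y = np.array([1.2, 0.3, 0.8, 2.0])

# XGBoost starts every observation at the same default prediction
initial_prediction = 0.5
predictions = np.full_like(y, initial_prediction)

# Residuals: how far the current predictions are from the actual values
residuals = y - predictions
print(residuals)  # [ 0.7 -0.2  0.3  1.5]
```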

We use the residuals to calculate a “similarity score” for each node, which introduces a “regularization parameter” to help prevent overfitting.
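A minimal sketch of the regression form of the score, the squared sum of a node's residuals divided by the number of residuals plus the regularization parameter lambda (the residual values here are just the toy numbers from above):

```python
import numpy as np

def similarity_score(residuals, lam=1.0):
    """Regression-style similarity score: (sum of residuals)^2 / (count + lambda)."""
    return residuals.sum() ** 2 / (len(residuals) + lam)

residuals = np.array([0.7, -0.2, 0.3, 1.5])   # residuals from the previous step
print(similarity_score(residuals))            # 2.3**2 / (4 + 1) = 1.058
print(similarity_score(residuals, lam=10.0))  # a larger lambda shrinks the score
```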

We then split the data into a decision tree, calculate the similarity score for each leaf, and compare them to the root node’s score to compute a “gain” for the split.
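A minimal sketch of the gain calculation, reusing the similarity score above with an arbitrary example split (both the residual values and the split point are hypothetical):

```python
import numpy as np

def similarity_score(residuals, lam=1.0):
    return residuals.sum() ** 2 / (len(residuals) + lam)

root = np.array([0.7, -0.2, 0.3, 1.5])  # residuals in the parent node
left, right = root[:2], root[2:]        # a hypothetical split of those residuals

# Gain: how much better the two leaves capture the residuals than the parent does
gain = similarity_score(left) + similarity_score(right) - similarity_score(root)
print(gain)  # roughly 0.105 with lambda = 1
```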

We continue splitting the data until we reach a maximum tree depth, then apply “pruning” to remove any split that did not provide sufficient gain.
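In XGBoost the pruning threshold is the gamma parameter. A sketch of the rule, assuming the gain value from the previous step:

```python
def keep_split(gain, gamma=0.0):
    """A split survives pruning only if its gain exceeds the gamma threshold."""
    return gain - gamma > 0

print(keep_split(0.105, gamma=0.0))  # True  -> the split is kept
print(keep_split(0.105, gamma=1.0))  # False -> the split is pruned away
```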

The remaining splits are used to calculate new output values for each subset of the data, which are scaled by a learning rate and added to the previous predictions.
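A sketch of how one leaf updates the prediction, assuming the regression output formula (sum of residuals divided by count plus lambda) and a learning rate of 0.3, the library default for eta (the leaf's residuals are again hypothetical):

```python
import numpy as np

def leaf_output(residuals, lam=1.0):
    """Leaf output value: sum of residuals / (count + lambda)."""
    return residuals.sum() / (len(residuals) + lam)

# A hypothetical leaf containing two residuals, starting from the previous prediction
leaf_residuals = np.array([0.3, 1.5])
previous_prediction = 0.5
learning_rate = 0.3

# The leaf's output is scaled by the learning rate and added to the old prediction
new_prediction = previous_prediction + learning_rate * leaf_output(leaf_residuals)
print(new_prediction)  # 0.5 + 0.3 * (1.8 / 3) = 0.68
```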

This process repeats, with each new tree fitting the updated residuals, until the predictions stop improving or a maximum number of iterations (trees) is reached.
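To see how these pieces map onto the library itself, here is a minimal end-to-end sketch using the xgboost Python package on synthetic data. The hyperparameter values are illustrative, not tuned, and passing early_stopping_rounds to the constructor assumes a recent xgboost version:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# How the hyperparameters map onto the ideas above:
#   reg_lambda            -> regularization in the similarity score
#   gamma                 -> minimum gain a split needs to survive pruning
#   max_depth             -> maximum tree depth before pruning
#   learning_rate         -> scaling applied to each tree's output values
#   n_estimators + early_stopping_rounds -> when boosting stops
model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.3,
    max_depth=6,
    reg_lambda=1.0,
    gamma=0.0,
    early_stopping_rounds=10,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.score(X_val, y_val))  # R^2 on the validation set
```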
Overall, XGBoost provides state-of-the-art performance in terms of speed and accuracy, making it a go-to algorithm in real-world applications. This is especially true for very large datasets, as XGBoost includes optimizations, such as approximate split finding and parallelized tree construction, that keep computation time down.