
*To see this with sample code, check out my Kaggle Notebook.

CatBoost, short for Category Boosting, is an algorithm based on decision trees and gradient boosting, like XGBoost, but it often delivers even better out-of-the-box performance!

CatBoost does especially well with data containing “categorical variables.”

In other models, categorical variables are typically handled through “one-hot encoding,” which creates an additional column per category to capture the information.
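
For intuition, here is a tiny pandas sketch of one-hot encoding in action (the “color” column is made up for illustration):

```python
import pandas as pd

# One-hot encoding: each distinct category becomes its own 0/1 column.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
print(pd.get_dummies(df, columns=["color"]))
# A column with k distinct values turns into k mostly-zero columns,
# which is the "sparsity" problem CatBoost sidesteps.
```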

CatBoost takes a different approach: it starts by shuffling the rows of the data into several random “permutations.”

For each permutation, the first few examples of each category are assigned a “default” value (a prior), since there is not yet any history to compute a statistic from.

Next, it calculates the encoded value for each subsequent row by looking only at the previous examples with the same category, counting the number of positive labels among them, and computing (positives + prior) / (count + 1).
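
Here is a minimal sketch of that “ordered” target statistic for a single categorical column and a binary target. This illustrates the principle, not CatBoost's optimized implementation, and the prior value of 0.5 is an assumption:

```python
import numpy as np

# Toy "ordered" target statistic: each row is encoded using only the
# rows that appear before it in one random permutation.
def ordered_target_stat(categories, labels, prior=0.5):
    rng = np.random.default_rng(0)
    order = rng.permutation(len(labels))      # one random permutation
    counts, positives = {}, {}
    encoded = np.empty(len(labels), dtype=float)
    for i in order:
        c = categories[i]
        n, p = counts.get(c, 0), positives.get(c, 0)
        encoded[i] = (p + prior) / (n + 1)    # (positives + prior) / (count + 1)
        counts[c] = n + 1                     # only now does row i join the history
        positives[c] = p + labels[i]
    return encoded

cats = np.array(["red", "blue", "red", "red", "blue"])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_stat(cats, y))
```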

This captures additional valuable information, avoids the “sparsity” of one-hot encoding, and speeds up computation.

Then the model proceeds by building “symmetric” (also called oblivious) binary trees over each permutation of the data, in which every node at the same depth uses the exact same split.
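
A minimal sketch of why symmetric trees are so fast: every level shares one split, so a depth-d tree is just d (feature, threshold) pairs, and finding a row's leaf takes only d comparisons. The splits below are made up for illustration:

```python
import numpy as np

def oblivious_leaf_index(x, splits):
    # Each level applies the SAME split to every node, so the leaf
    # index is just a d-bit number built from d comparisons.
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | int(x[feature] > threshold)
    return index

splits = [(0, 0.5), (2, 1.3), (1, -0.7)]  # depth 3 -> 2**3 = 8 leaves
x = np.array([0.9, -1.0, 2.0])
print(oblivious_leaf_index(x, splits))    # prints 6: leaf 0b110
```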

To avoid overfitting (what the CatBoost authors call “prediction shift”), CatBoost uses “ordered boosting”: at each step it reshuffles the rows, and the residual for each example is computed by a model trained only on the examples that come before it in the permutation.
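
To make that concrete, here is a deliberately slow toy sketch of the core trick (CatBoost uses a far more efficient scheme): the residual for each example comes from a model that has never seen that example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_residuals(X, y, min_history=20):
    # Shuffle, then predict each example with a model fit ONLY on the
    # examples that precede it, so no example "leaks" into its own residual.
    rng = np.random.default_rng(0)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    preds = np.zeros(len(y))
    for i in range(min_history, len(y)):
        model = DecisionTreeRegressor(max_depth=3)
        model.fit(X[:i], y[:i])                # "past" examples only
        preds[i] = model.predict(X[i:i + 1])[0]
    return y - preds                           # unbiased residuals

X = np.random.default_rng(1).normal(size=(100, 3))
y = X[:, 0] + 0.1 * np.random.default_rng(2).normal(size=100)
print(ordered_residuals(X, y)[-5:])
```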

CatBoost also ships with sensible default hyperparameters, so it often works well with minimal tuning, and it can even run on GPUs, resulting in incredible speedups.
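
Getting started takes only a few lines. A minimal sketch, with made-up toy data (the commented GPU line assumes a CUDA-capable machine):

```python
import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue", "green"],
                  "size":  [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]})
y = [1, 0, 1, 1, 0, 1]

# Categorical columns are passed by name; no manual encoding required.
model = CatBoostClassifier(iterations=100, verbose=False)
# model = CatBoostClassifier(iterations=100, task_type="GPU")  # GPU training
model.fit(X, y, cat_features=["color"])
print(model.predict(X))
```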

Overall, CatBoost is an extremely fast, accurate, and innovative algorithm, yet somehow it is not as widely used as its predecessors like XGBoost.  So if you haven’t already tried it, go implement it ASAP!

CatBoost GitHub: https://github.com/catboost/catboost