*To see this with a code example, check out my Kaggle Notebook.
Random Forests harness the simplicity of Decision Trees and the power of “ensemble methods”.
The result is a model that is significantly more accurate than any single decision tree.
They work by training many decision trees on different random subsets of the data.
To start, we build “bootstrapped” datasets by randomly sampling from our original dataset with replacement.
Because sampling is done with replacement, some samples appear multiple times while others are left out entirely. The omitted samples form the “out-of-bag” dataset, and are used to evaluate the accuracy of the model.
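As a rough sketch of what bootstrapping looks like in code (using NumPy row indices as a stand-in for a real dataset; the variable names here are illustrative, not from the notebook above):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
rows = np.arange(n_samples)  # stand-in for the rows of a real dataset

# Draw a bootstrapped sample: same size as the original, sampled WITH replacement,
# so some rows appear more than once and others never appear at all.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn form the "out-of-bag" dataset.
oob_idx = np.setdiff1d(rows, bootstrap_idx)

print("bootstrapped sample:", np.sort(bootstrap_idx))
print("out-of-bag rows:   ", oob_idx)
```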
For each bootstrapped dataset, we build a decision tree that considers only a random subset of the available variables at each split, creating greater variety in the resulting decision trees.
This is typically repeated 100 or more times, resulting in a “random forest” of decision trees that vary based on the data and variables they saw.
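In practice you rarely build the forest by hand; a minimal sketch with scikit-learn’s RandomForestClassifier (using the Iris dataset purely as a placeholder) ties the pieces together, including the per-split feature subsetting and out-of-bag evaluation described above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each grown on its own bootstrapped sample and limited to a random
# subset of features (sqrt of the feature count) at every split.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,   # evaluate accuracy on the out-of-bag samples
    random_state=0,
)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
```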
To create new predictions, we run a sample through each decision tree and track its prediction.
We then aggregate the results, taking the majority vote for classification or the average for regression, to generate our final prediction. This act of bootstrapping plus aggregation is known as “bagging”.
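To make the aggregation step concrete, here is a small sketch that peeks at a fitted forest’s individual trees (scikit-learn exposes them via the estimators_ attribute) and combines their votes by hand; a simplification, since the library normally does this for you inside predict:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)

# Run one sample through every tree, record each tree's prediction,
# then aggregate by majority vote (for regression we would average instead).
sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_]).astype(int)

print("individual tree votes:", votes)
print("aggregated prediction:", np.bincount(votes).argmax())
```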
Overall, Random Forests vastly outperform individual decision trees, but in practice they are much slower to train, a problem that has been addressed through the development of “boosting” algorithms.