*To see this with a code example, check out my Kaggle Notebook.
Random Forests harness the simplicity of Decision Trees and the power of “ensemble methods”.

The result is a model that is significantly more accurate.

They work by training many decision trees on different random subsets of the data and combining their predictions.

To start, we build “bootstrapped” datasets by randomly sampling, with replacement, from our original dataset.

Because we sample with replacement, some rows are repeated while others are omitted entirely. The omitted rows form the “out-of-bag dataset”, which is used to evaluate the accuracy of the model.
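
As a rough sketch of this step (the array names `X`, `y` and the toy data are my own illustration, not taken from the notebook), one bootstrapped dataset and its out-of-bag rows could be built like this:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy dataset: 10 samples, 3 features (purely illustrative values).
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)

n_samples = X.shape[0]

# Draw indices *with replacement* to form one bootstrapped dataset.
boot_idx = rng.integers(0, n_samples, size=n_samples)
X_boot, y_boot = X[boot_idx], y[boot_idx]

# Rows that were never drawn form the "out-of-bag" set for this tree.
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)
X_oob, y_oob = X[oob_idx], y[oob_idx]

print("bootstrap indices:", boot_idx)
print("out-of-bag indices:", oob_idx)
```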

For each bootstrapped dataset, we build a decision tree that considers only a random subset of the available variables at each split, creating greater variety in the resulting decision trees.

This is typically repeated 100 or more times, resulting in a “random forest” of decision trees that vary based on the data and variables they saw.
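
A minimal sketch of that training loop, assuming scikit-learn’s DecisionTreeClassifier and a synthetic dataset (the names, data, and tree count here are illustrative, not the notebook’s):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)

# Synthetic dataset (illustrative only): 200 samples, 6 features.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_trees = 100   # "typically repeated 100 or more times"
forest = []

for _ in range(n_trees):
    # Bootstrap sample for this tree.
    idx = rng.integers(0, len(X), size=len(X))

    # max_features="sqrt" makes the tree consider a random subset
    # of the variables at each split, as described above.
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(X[idx], y[idx])
    forest.append(tree)
```

Limiting `max_features` is what injects the per-split randomness: without it, trees trained on similar bootstrap samples would tend to make the same splits and the forest would lose much of its variety.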

To make a new prediction, we run the sample through each decision tree and record its prediction.

We then aggregate the results, taking the majority vote for classification or the average for regression, to produce our final prediction. This combination of bootstrapping plus aggregation is known as “bagging”.
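
Putting the pieces together, a hedged sketch of the prediction step (reusing the same illustrative forest-building loop as above, with hypothetical names) might look like:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)

# Synthetic dataset (illustrative only).
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Train a small bagged forest as in the previous sketch.
forest = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# New sample to classify.
x_new = rng.normal(size=(1, 4))

# Run the sample through every tree and record each prediction...
votes = np.array([tree.predict(x_new)[0] for tree in forest])

# ...then aggregate: majority vote for classification
# (for regression you would average instead).
final_prediction = np.bincount(votes.astype(int)).argmax()
print("votes:", votes, "-> final prediction:", final_prediction)
```

In practice, scikit-learn’s RandomForestClassifier performs the bootstrapping, per-split feature sampling, and vote aggregation internally, so the whole procedure collapses into a single fit/predict call.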

Overall, Random Forests vastly outperform single decision trees, but in practice they are much slower to train, a problem that has since been addressed by the development of “boosting” algorithms.