When building machine learning models, we use training data to identify patterns, validation data to measure our progress, and test data to evaluate how the model will perform on unseen data.
We get these buckets during data preprocessing by splitting our initial dataset three ways. The goal is to maximize the amount of training data while setting aside enough validation and test data to trust the results. A safe starting ratio is 70% / 15% / 15%.
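A 70/15/15 split can be sketched with two calls to scikit-learn’s `train_test_split`: first carve off 30% of the data, then split that remainder in half. The toy feature matrix and labels below are placeholders for your own dataset.

```python
# Sketch of a 70/15/15 split using scikit-learn's train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # toy feature matrix (1000 rows)
y = np.arange(1000)                 # toy labels

# First carve off 30% of the rows for validation + test...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# ...then split that 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Fixing `random_state` makes the split reproducible, so teammates rerunning the notebook get the same buckets.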
This split happens after exploratory data analysis (“EDA”) but before the remaining preprocessing steps, to avoid potential “data leakage,” where the model gets a sneak peek at information from the test set.
Splitting must be done carefully, and the best approach will vary based on the details of the dataset. In many cases, randomly sampling rows into each group is enough.
With time-series data, you may need to split the buckets chronologically, so that the validation and test sets sit after the training period and properly simulate new, incoming data.
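A chronological split is just index slicing once the rows are sorted by time. A minimal sketch, assuming the array below stands in for time-ordered observations:

```python
# Sketch of a chronological 70/15/15 split for time-series data.
# Assumes rows are already sorted by timestamp; no shuffling, so the
# validation and test sets always come after the training period.
import numpy as np

series = np.arange(1000)  # stand-in for 1000 time-ordered observations

n = len(series)
train_end = int(n * 0.70)
val_end = int(n * 0.85)

train = series[:train_end]       # oldest 70%
val = series[train_end:val_end]  # next 15%
test = series[val_end:]          # most recent 15%

print(len(train), len(val), len(test))  # 700 150 150
```

Because nothing is shuffled, every training example precedes every validation and test example, which is the property a random split would destroy.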
In other situations, additional considerations call for a more nuanced approach: for example, an imbalanced class label may warrant stratified sampling to preserve class proportions in each bucket, and repeated records from the same user or patient should all land in the same bucket.
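As one example of a more nuanced split, stratified sampling keeps the class ratio consistent across buckets. A sketch with an illustrative 90/10 imbalanced label, using `train_test_split`’s `stratify` parameter:

```python
# Sketch of a stratified split on an imbalanced binary label,
# preserving the class ratio in both buckets (illustrative data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)  # 90/10 class imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Both buckets keep roughly the 10% positive rate.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a small test set could end up with too few positives to evaluate the minority class reliably.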
After the data is split, the remaining preprocessing steps of cleaning, encoding, normalization, and feature engineering should be carefully applied to all three datasets. Critically, any statistics these steps rely on (such as the mean and variance used for normalization) should be computed from the training set only, then reused to transform the validation and test sets, so no information leaks out of those buckets.
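The leakage-safe pattern for normalization can be sketched with scikit-learn’s `StandardScaler`: fit on the training data only, then transform every bucket with those same training statistics (the tiny arrays below are illustrative).

```python
# Sketch of leakage-safe normalization: the scaler's mean/variance
# come from the training set only, then the same transform is applied
# to validation and test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_val = np.array([[2.5]])
X_test = np.array([[10.0]])  # unseen value, still scaled with train stats

scaler = StandardScaler().fit(X_train)  # statistics from train only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean())  # ~0: train is centered by its own mean
```

Calling `fit` (or `fit_transform`) on the full dataset before splitting would bake test-set statistics into the transform, which is exactly the leakage the split is meant to prevent.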