When solving a problem with machine learning, we must carefully prepare our data through “preprocessing” before any learning can take place.
While this may sound simple, it is often the most time-consuming part of the entire machine learning process.
“80% of machine learning is cleaning the data. 20% is complaining about cleaning the data.”
-Unknown
During this phase, we have four overarching goals: understand, split, resolve, and enhance.
To begin, we start with a problem and a dataset.
Understanding comes through exploratory data analysis, also known as “EDA”. We use visualizations and basic statistics to understand which aspects of our data may be useful or problematic.
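As a rough illustration of the "basic statistics" side of EDA, here is a minimal sketch using only Python's standard library. The toy dataset, its column names, and the `summarize` helper are all hypothetical, made up for this example. The point is that even a simple summary can surface problems such as missing values or an extreme value pulling the mean away from the median.

```python
import statistics

# Hypothetical toy dataset: each row is one example; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Austin"},
    {"age": 29, "income": None, "city": "Boston"},
    {"age": 41, "income": 61000.0, "city": "Austin"},
    {"age": 58, "income": 250000.0, "city": "Denver"},
]

def summarize(rows, column):
    """Return basic statistics for one numeric column, ignoring missing values."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

income_summary = summarize(rows, "income")
print(income_summary)
```

Here the mean income sits well above the median, which hints at an outlier, and the missing count flags a gap we will need to handle later.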
Next, we split the dataset three ways, into training, validation, and test sets, so that we can train our model, track its performance, and finally evaluate it on data it has never seen.
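One way to picture the three-way split is the sketch below. The function name `three_way_split` and the 70/15/15 proportions are assumptions chosen for illustration, not a prescribed recipe. The examples are shuffled with a fixed seed so the split is reproducible, then cut into three non-overlapping pieces.

```python
import random

def three_way_split(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the examples and cut it into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for final evaluation
    return train, val, test

examples = list(range(100))  # stand-in for 100 real examples
train, val, test = three_way_split(examples)
print(len(train), len(val), len(test))
```

Every example lands in exactly one of the three lists, so the test portion stays unseen until the final evaluation.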
Then we move on to resolving issues in our data that we found during EDA. This step includes data cleaning, encoding, and normalizing. These address issues that can lead to inaccurate models, or even prevent the model from training altogether.
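To make cleaning, encoding, and normalizing a little more concrete, here is a small sketch on a hypothetical dataset with one numeric column (`age`) and one categorical column (`color`). The specific choices here, median imputation, min-max scaling, and one-hot encoding, are just common options picked for illustration. Other techniques may suit a given dataset better.

```python
import statistics

# Hypothetical raw examples; None marks a missing value found during EDA.
raw = [
    {"age": 20, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 40, "color": "red"},
    {"age": 30, "color": "green"},
]

# Cleaning: fill missing ages with the median of the observed ages.
observed = [r["age"] for r in raw if r["age"] is not None]
median_age = statistics.median(observed)
ages = [r["age"] if r["age"] is not None else median_age for r in raw]

# Normalizing: min-max scale ages into the range [0, 1].
lo, hi = min(ages), max(ages)
scaled_ages = [(a - lo) / (hi - lo) for a in ages]

# Encoding: one-hot encode the categorical "color" column.
categories = sorted({r["color"] for r in raw})
one_hot = [[1 if r["color"] == c else 0 for c in categories] for r in raw]

# Combine into purely numeric feature vectors a model could consume.
processed = [[s] + o for s, o in zip(scaled_ages, one_hot)]
print(processed)
```

In practice, values such as the median, minimum, and maximum are usually computed from the training portion only and then reused on the other splits, so that no information leaks from held-out data.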
Lastly, we may want to enhance our dataset through feature engineering and/or data augmentation. Feature engineering creates new characteristics for our existing examples, while data augmentation creates new examples by slightly modifying current ones.
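The difference between the two can be sketched as follows. The rectangle dataset and the helper names `add_area` and `jitter` are hypothetical. Feature engineering adds a new column to each existing example, while augmentation adds new rows that are slightly perturbed copies of existing ones.

```python
import random

# Hypothetical examples: rectangle measurements with a label.
examples = [
    {"width": 2.0, "height": 3.0, "label": "small"},
    {"width": 10.0, "height": 4.0, "label": "large"},
]

def add_area(example):
    """Feature engineering: derive a new 'area' feature from existing ones."""
    enriched = dict(example)
    enriched["area"] = example["width"] * example["height"]
    return enriched

def jitter(example, rng, scale=0.05):
    """Data augmentation: create a new example by perturbing measurements slightly."""
    new = dict(example)
    for key in ("width", "height"):
        # Scale each measurement by a random factor within +/- `scale`.
        new[key] = example[key] * (1 + rng.uniform(-scale, scale))
    return new

rng = random.Random(0)  # fixed seed so the augmented copies are reproducible
augmented = examples + [jitter(e, rng) for e in examples]  # more rows
engineered = [add_area(e) for e in augmented]              # more columns
print(len(examples), len(engineered))
```

Augmentation like this is typically applied only to the training portion, since the validation and test examples are meant to reflect real, unmodified data.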
Overall, while data preprocessing may not be the most fun step, it is absolutely critical for the accurate training of any model.