The first step in data preprocessing is to develop a deep understanding of your dataset through exploratory data analysis. This process is typically unstructured, but there are some basic questions that should be answered every time.
We should also aim to identify issues that must be addressed during the data cleaning phase.
Before starting, we need some domain knowledge about what the dataset represents. If we lack it, we should gather more context on the problem and the data.
To begin, we should understand the size of our dataset and the information it contains.
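As a minimal sketch of this first look, assuming the data is loaded into a pandas DataFrame (the toy `df` and its column names below are hypothetical stand-ins for a real dataset):

```python
import pandas as pd

# Hypothetical toy dataset; in practice this would come from a file or database.
df = pd.DataFrame({
    "age": [34, 45, 23, 51, 29],
    "income": [52000.0, 61000.0, None, 87000.0, 43000.0],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

n_rows, n_cols = df.shape   # number of rows and columns
column_types = df.dtypes    # data type of each column
first_rows = df.head(3)     # first few records

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(first_rows)
df.info()                   # column names, non-null counts, memory usage
```

The non-null counts printed by `df.info()` already hint at which columns contain missing values.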
Next, we should understand the basic statistics of our numerical values.
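One way to get these statistics, assuming pandas and a hypothetical toy DataFrame, is `describe()`, which summarizes only the numeric columns by default:

```python
import pandas as pd

# Hypothetical toy dataset with two numeric columns and one text column.
df = pd.DataFrame({
    "age": [34, 45, 23, 51, 29],
    "income": [52000.0, 61000.0, 48000.0, 87000.0, 43000.0],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

# count, mean, std, min, quartiles, and max for each numeric column
numeric_summary = df.describe()

# The median is often worth checking next to the mean, since it is less sensitive to outliers.
medians = df.median(numeric_only=True)

print(numeric_summary)
print(medians)
```

A large gap between a column's mean and median, or a maximum far above the 75th percentile, is an early sign of skew or outliers.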
During this process we should identify missing values, outliers, and errors in the dataset, as these can lead to serious issues during training. We must address these during data cleaning.
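The checks below are a rough sketch of how these issues might be surfaced with pandas. The toy data is hypothetical, and the 1.5 × IQR rule is just one common outlier heuristic chosen for illustration, not the only option:

```python
import pandas as pd

# Hypothetical toy dataset: two missing incomes and one implausible age (230).
df = pd.DataFrame({
    "age": [34, 45, 23, 51, 29, 38, 41, 230],
    "income": [52000.0, 61000.0, None, 87000.0, 43000.0, None, 58000.0, 64000.0],
})

# Missing values per column and fully duplicated rows
missing_per_column = df.isna().sum()
duplicate_rows = df.duplicated().sum()

# Flag outliers in "age" with the 1.5 * IQR rule
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df[outlier_mask]

print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}")
print(outliers)
```

Flagged rows are candidates for review rather than automatic removal; whether a value is an error or a genuine extreme depends on domain knowledge.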
Finally, we should build many visualizations, as these will help us quickly identify patterns that exist in our data. Common examples include:
Bar charts:
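A bar chart shows how many rows fall into each category. A minimal sketch, assuming pandas and matplotlib and a hypothetical `city` column:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["Austin", "Boston", "Austin", "Denver", "Boston", "Austin"]})

# Count rows per category, then draw one bar per category
counts = df["city"].value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.values)
ax.set_xlabel("city")
ax.set_ylabel("number of rows")
ax.set_title("Rows per city")
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```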
Histograms:
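A histogram shows the distribution of a single numeric column. The sketch below assumes matplotlib and a hypothetical `age` column; the choice of five bins is arbitrary and worth varying:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"age": [23, 25, 29, 31, 34, 35, 38, 41, 45, 51]})

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges
bin_counts, bin_edges, _ = ax.hist(df["age"].to_numpy(), bins=5)
ax.set_xlabel("age")
ax.set_ylabel("frequency")
ax.set_title("Distribution of age")
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```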
Box plots:
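A box plot summarizes the median and quartiles of a column and draws points beyond the whiskers individually, which makes outliers easy to spot. A sketch with matplotlib, using a hypothetical `income` column that contains one extreme value:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric column with one extreme value (250000)
df = pd.DataFrame({
    "income": [43000, 48000, 52000, 58000, 61000, 64000, 87000, 250000],
})

fig, ax = plt.subplots()
# By default, whiskers extend to 1.5 * IQR; points beyond them are drawn as fliers
box = ax.boxplot(df["income"].to_numpy())
ax.set_xticklabels(["income"])
ax.set_title("Income box plot")

# The flier points can also be read back from the returned artists
flier_values = box["fliers"][0].get_ydata()
print(flier_values)
```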
Scatter plots:
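A scatter plot shows the relationship between two numeric columns. A minimal sketch with matplotlib, assuming hypothetical `age` and `income` columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pair of numeric columns
df = pd.DataFrame({
    "age": [23, 29, 34, 45, 51],
    "income": [43000, 48000, 52000, 61000, 87000],
})

fig, ax = plt.subplots()
# One point per row: x is age, y is income
points = ax.scatter(df["age"].to_numpy(), df["income"].to_numpy())
ax.set_xlabel("age")
ax.set_ylabel("income")
ax.set_title("Income vs. age")
# In a script, call plt.show() or fig.savefig(...) to view the figure.
```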
Correlation matrix:
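A correlation matrix gives the pairwise correlation between numeric columns, and drawing it as a heatmap makes strong relationships stand out. The sketch below assumes pandas 1.5 or later (for `numeric_only`) and matplotlib; the columns are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset with three numeric columns and one text column
df = pd.DataFrame({
    "age": [23, 29, 34, 45, 51],
    "income": [43000, 48000, 52000, 61000, 87000],
    "spending": [2100, 1900, 2500, 1800, 2200],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

# Pearson correlation between numeric columns; the text column is skipped
corr = df.corr(numeric_only=True)

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ticks = list(range(len(corr.columns)))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns.tolist(), rotation=45)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns.tolist())
fig.colorbar(image, ax=ax)
ax.set_title("Correlation matrix")

print(corr)
```

Correlation only captures linear relationships, so it is worth pairing the matrix with scatter plots of the most interesting column pairs.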
By the end of the EDA process, you should understand the nuances of your dataset, the issues that must be resolved, and the new features you want to create from the existing data.