
Data cleaning is often the most time-consuming step in the machine learning process. The goal is to address problematic examples, features, and formatting so that the model can successfully learn from the data.

First, we must address the formatting, since each column should represent a single feature that the algorithm can interpret.
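As a minimal pandas sketch of this kind of reshaping, consider a hypothetical listings table (the column names `location` and `price` are invented for illustration) where one column packs two features together and prices are stored as text:

```python
import pandas as pd

# Hypothetical raw data: "location" packs two features into one string,
# and "price" is stored as text with currency formatting.
df = pd.DataFrame({
    "location": ["Austin, TX", "Denver, CO", "Miami, FL"],
    "price": ["$1,200", "$950", "$1,450"],
})

# Split the combined column so each resulting column holds a single feature.
df[["city", "state"]] = df["location"].str.split(", ", expand=True)

# Strip formatting characters and convert the price column to a numeric type.
df["price"] = df["price"].str.replace("[$,]", "", regex=True).astype(float)

df = df.drop(columns=["location"])
print(df.dtypes)
```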

We must also address the consistency of the data. For example, timestamps may be recorded in local time, which creates major problems when the data spans multiple time zones, as in national or global datasets.
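One common fix is to localize each timestamp to its recorded zone and convert everything to a single reference such as UTC. Here is a sketch assuming hypothetical `event_time` and `timezone` columns:

```python
import pandas as pd

# Hypothetical timestamps recorded in different local time zones.
df = pd.DataFrame({
    "event_time": ["2023-06-01 09:00", "2023-06-01 12:30"],
    "timezone": ["US/Pacific", "US/Eastern"],
})

# Localize each timestamp to its recorded zone, then convert everything
# to UTC so that times are directly comparable across the dataset.
df["event_time_utc"] = [
    pd.Timestamp(ts).tz_localize(tz).tz_convert("UTC")
    for ts, tz in zip(df["event_time"], df["timezone"])
]
print(df["event_time_utc"])
```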

Next, we can look for and remove redundant features, which are columns that provide essentially the same information as one another. These add computational overhead without improving model performance.
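One simple way to find such columns is to look for pairs with very high correlation and keep only one from each pair. A sketch, assuming a hypothetical weather table where `temp_f` duplicates `temp_c` in different units:

```python
import numpy as np
import pandas as pd

# Hypothetical weather table where "temp_f" is just "temp_c" in different
# units, so the two columns carry the same information.
rng = np.random.default_rng(0)
temp_c = rng.normal(20, 5, size=100)
df = pd.DataFrame({
    "temp_c": temp_c,
    "temp_f": temp_c * 9 / 5 + 32,
    "humidity": rng.uniform(30, 90, size=100),
})

# Compute pairwise absolute correlations, keep only the upper triangle
# (so each pair is considered once), and drop one column from each
# highly correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=redundant)

print(redundant)  # expected: ['temp_f']
```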

We should do the same with duplicated examples in our dataset, since they give certain records extra weight during training and can lead to overfitting.
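In pandas this is typically a single call to `drop_duplicates`; a sketch on a hypothetical user table:

```python
import pandas as pd

# Hypothetical dataset containing an exact duplicate row.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [34, 28, 28, 45],
    "signed_up": ["2021-01-04", "2021-02-11", "2021-02-11", "2021-03-20"],
})

# Drop fully duplicated examples, keeping the first occurrence of each.
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} duplicate rows")
```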

Next, we should address outliers: data points that fall far outside the typical range of values, whether due to an error or an unusual event. In many cases, removing these examples improves the accuracy of the model.
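One common heuristic is the IQR rule, which flags points far outside the interquartile range. A sketch on a hypothetical `daily_orders` column with one extreme entry:

```python
import pandas as pd

# Hypothetical numeric column with one extreme value (e.g. a data-entry error).
df = pd.DataFrame({"daily_orders": [12, 15, 14, 13, 16, 11, 15, 900]})

# IQR rule: keep values within 1.5 * IQR of the first and third quartiles.
col = df["daily_orders"]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
mask = col.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(df_clean)  # the 900 row is removed
```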

Finally, we must deal with missing values, typically by dropping the affected examples or by filling in the value with the column mean or another heuristic.
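Both strategies are straightforward in pandas; here is a sketch on a hypothetical table with gaps in two numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries in two columns.
df = pd.DataFrame({
    "age": [34, np.nan, 45, 29],
    "income": [52000, 61000, np.nan, 48000],
    "label": [0, 1, 1, 0],
})

# Option 1: drop any example that has a missing value.
dropped = df.dropna()

# Option 2: fill missing numeric values with the column mean.
filled = df.fillna(df.mean(numeric_only=True))

print(dropped.shape, filled.isna().sum().sum())
```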

Additionally, each dataset may present unique challenges that will need to be addressed on a case-by-case basis.