
Feature Engineering is the data preprocessing step of enhancing the information contained in our raw data, using domain expertise about the specific problem we are trying to solve.

Before creating new features, we should have already completed exploratory data analysis (EDA), the train/test split, and data cleaning.

To begin, we should reflect deeply on our problem and identify any features that would help us solve it.

If we identify key features that are not present in our current dataset, we should attempt to add them through supplementation or inference.

We can supplement features by combining multiple datasets, which requires that we can "join" them on some shared feature(s).
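For example, here is a minimal sketch of such a join using pandas; the file names and columns (customers.csv, regions.csv, the shared zip_code key) are hypothetical stand-ins for whatever our own datasets contain:

```python
import pandas as pd

# Hypothetical raw dataset and external supplement, sharing a "zip_code" column.
customers = pd.read_csv("customers.csv")  # e.g. customer_id, zip_code, spend, signup_date
regions = pd.read_csv("regions.csv")      # e.g. zip_code, median_income, population

# A left join keeps every customer row and attaches the regional features.
enriched = customers.merge(regions, on="zip_code", how="left")
print(enriched.head())
```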

Inferring features means applying some logic to the data we already have. This tends to be far more complex, but it often leads to the most dramatic improvements in model performance.
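A common example is deriving date-based or ratio features from columns we already have. Here is a minimal sketch, continuing with the hypothetical enriched frame and column names from the join above:

```python
import pandas as pd

# Assume "enriched" from the join above, with a hypothetical "signup_date" column.
enriched["signup_date"] = pd.to_datetime(enriched["signup_date"])

# Date-based features inferred from an existing timestamp.
enriched["signup_month"] = enriched["signup_date"].dt.month
enriched["signup_is_weekend"] = enriched["signup_date"].dt.dayofweek >= 5

# A ratio feature inferred from two existing numeric columns.
enriched["spend_per_capita"] = enriched["spend"] / enriched["population"]
```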

However, it is critical that the engineered features used during training also be available when the model is deployed; otherwise, the results will not be reproducible.

When feature engineering is complete, we must remember to encode and scale the new features accordingly.
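A minimal sketch with scikit-learn, assuming the hypothetical columns above and existing train_df/test_df splits of the enriched data; fitting the transformers on the training split only and reusing them on the test split (and later in production) also helps with the reproducibility concern above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical engineered columns: numeric features to scale, categorical features to encode.
numeric_features = ["spend_per_capita", "median_income"]
categorical_features = ["signup_month"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_features),
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Fit on the training split only, then apply the same fitted transform to the test split.
X_train_prepared = preprocess.fit_transform(train_df[numeric_features + categorical_features])
X_test_prepared = preprocess.transform(test_df[numeric_features + categorical_features])
```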

Overall, this step is critical for optimizing performance; when done well, it typically yields larger improvements than hyperparameter tuning.