Encoding is the essential data preprocessing step of converting raw data into a format that an algorithm can interpret and learn from efficiently.

Typically, this is done on "categorical" data, where the values belong to a limited number of discrete groups. Most algorithms cannot operate on raw category labels, so if this data is not encoded, training will fail.

Categorical data can be ordinal, with an inherent order, such as grades or sizes.

Or it can be nominal, with no inherent order, such as types of animals or flavors.

When encoding ordinal data, we should preserve the ordering when converting categories into numbers, for example small = 0, medium = 1, large = 2.
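A minimal sketch of this idea in plain Python, assuming a hypothetical "size" feature (the category names and their order are illustrative):

```python
# Ordered categories: the position in this list defines each category's rank,
# so the encoding preserves small < medium < large.
SIZE_ORDER = ["small", "medium", "large"]
SIZE_TO_RANK = {cat: rank for rank, cat in enumerate(SIZE_ORDER)}

def encode_ordinal(values):
    """Map ordered categories to integer ranks, preserving the ordering."""
    return [SIZE_TO_RANK[v] for v in values]

sizes = ["medium", "small", "large", "small"]
print(encode_ordinal(sizes))  # [1, 0, 2, 0]
```

Libraries such as scikit-learn provide ready-made ordinal encoders, but the core idea is simply this rank lookup.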

With nominal data, we typically employ "one-hot encoding", as long as the number of categories is not extremely high. This creates a new column for each possible category, containing only a 0 or 1.
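A small sketch of one-hot encoding, using a hypothetical "flavor" feature as the example; each value becomes a row of 0s with a single 1 in its category's column:

```python
def one_hot_encode(values, categories):
    """Create one 0/1 column per category; exactly one column is 1 per value."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]

FLAVORS = ["chocolate", "strawberry", "vanilla"]
print(one_hot_encode(["vanilla", "chocolate"], FLAVORS))
# [[0, 0, 1], [1, 0, 0]]
```

Note how the number of columns grows with the number of categories, which is why one-hot encoding becomes impractical for very high-cardinality features.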

The optimal approach will vary by dataset, but in most cases, ordinal and one-hot encoding will be sufficient to prepare your categorical data for training.

However, when dealing with Natural Language Processing, we use a more complex form of encoding built on "tokenization" and "embeddings". This converts letters or words into numerical representations that carry additional information about language and context to help the algorithms learn.
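A toy sketch of the two stages: tokenization maps each word to an integer id via a vocabulary, and an embedding table maps each id to a vector. The vocabulary, the `<unk>` fallback token, and the 3-dimensional random embedding table here are all illustrative assumptions; real NLP systems learn the embedding values during training.

```python
import random

random.seed(0)

# Illustrative vocabulary; "<unk>" catches words not seen in the vocabulary.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# One small random vector per vocabulary entry (learned in a real system).
EMBEDDING_DIM = 3
embedding_table = [[random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
                   for _ in vocab]

def tokenize(sentence):
    """Split a sentence into words and map each to its vocabulary id."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

def embed(token_ids):
    """Look up the embedding vector for each token id."""
    return [embedding_table[tid] for tid in token_ids]

ids = tokenize("The cat sat")
print(ids)  # [0, 1, 2]
vectors = embed(ids)  # three 3-dimensional vectors
```

Unlike one-hot columns, nearby embedding vectors can encode similarity between words, which is what gives the model its extra language context.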