
Unsupervised learning is a major subfield of machine learning.

Its algorithms train on “unlabeled” data, meaning the data does not include a value we are learning to predict.

[Figure: Supervised vs. unsupervised datasets]

This makes unsupervised learning applicable to nearly any dataset, but the resulting models give less “direct” answers that often require additional interpretation or processing.
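To make the distinction concrete, here is a small, hypothetical example of the same dataset with and without labels (the houses and numbers are made up purely for illustration):

```python
import numpy as np

# Hypothetical example: four houses described by two features
# (square footage, number of bedrooms).
X = np.array([
    [1400, 3],
    [2100, 4],
    [ 800, 2],
    [1750, 3],
])

# A supervised dataset pairs each row of X with a value to predict,
# e.g. the sale price.
y = np.array([240_000, 410_000, 150_000, 325_000])

# An unsupervised dataset is just X on its own -- there is no y,
# so the algorithm can only look for structure within the features.
```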

There are two main applications of these algorithms: clustering and dimensionality reduction.

In clustering, we look for groups of data points that are similar to each other.

[Figure: Example of clusters]

This can be applied to a wide variety of problems such as document classification, fraud detection, and even modeling UFO sightings.
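For a concrete feel of what clustering looks like in practice, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the three-cluster blobs and the parameter choices are illustrative assumptions, not drawn from the examples above:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic 2-D data with three loose groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=42)

# Fit k-means, asking it to find three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Each point now has a cluster assignment (0, 1, or 2) even though
# the data carried no labels to begin with.
print(labels[:10])
print(kmeans.cluster_centers_)
```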

In dimensionality reduction, we distill many input variables down to a smaller set of features through clever mathematical techniques.

This is important for efficient training, but it can also be used to visualize high-dimensional datasets that would otherwise be impossible to display.
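As a sketch of that visualization use case, the snippet below projects scikit-learn's digits dataset, which has 64 pixel features per image, down to two principal components with PCA (the dataset choice is an assumption made here for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# The digits dataset has 64 pixel features per image -- too many to plot directly.
X, y = load_digits(return_X_y=True)

# Project the 64 dimensions down to 2 principal components for visualization.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)       # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```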

These unsupervised techniques can be combined with supervised methods to achieve “semi-supervised” learning.

This is particularly useful in cases where unlabeled data is abundant, but labeled data is scarce, such as in pre-training language models.
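One simple flavor of semi-supervised learning is self-training, sketched below with scikit-learn's SelfTrainingClassifier: a supervised model is fit on the few labeled points, then it pseudo-labels its confident predictions on the unlabeled rest and refits. The synthetic dataset and the 5% labeled split are assumptions for illustration, not a recipe from this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic problem: plenty of data, but pretend only ~5% of it is labeled.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.05
y_partial[unlabeled] = -1          # scikit-learn's convention for "no label"

# Self-training: fit on the labeled slice, then iteratively pseudo-label
# the confident predictions on the unlabeled points and refit.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)

print("labeled examples used at the start:", (~unlabeled).sum())
print("accuracy against the true labels:", model.score(X, y))
```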

It should also be noted that there are unsupervised deep learning algorithms, such as auto-encoders and Boltzmann machines, but they are beyond the scope of this post.