Augmentation is a data preprocessing step that increases the amount of training data you have available by making small modifications to examples already in your dataset.
This is typically used for supervised learning, where all data must be labeled. Augmentation allows us to transfer the labels from a base example to newly generated examples, saving significant time and money.
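To make the label transfer concrete, here is a minimal sketch for one case: a bounding-box label following its image through a horizontal flip. The `Box` class and `flip_box_horizontally` function are hypothetical names invented for this illustration, and the pixel-coordinate label format is an assumption rather than any particular tool's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical label format)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def flip_box_horizontally(box: Box, image_width: int) -> Box:
    """Mirror a box across the vertical centre line of an image of the given width."""
    return Box(
        label=box.label,                # the class label carries over unchanged
        x_min=image_width - box.x_max,  # old right edge becomes the new left edge
        y_min=box.y_min,                # vertical extent is unaffected by a horizontal flip
        x_max=image_width - box.x_min,  # old left edge becomes the new right edge
        y_max=box.y_max,
    )


original = Box("dog", x_min=10, y_min=20, x_max=50, y_max=80)
flipped = flip_box_horizontally(original, image_width=200)
```

The class name is copied as-is and only the coordinates are recomputed, so the new example needs no manual annotation.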
Augmentation is extremely useful in the field of computer vision, as augmentations can simulate a variety of settings using a small amount of data.
Common image augmentations include:
(Screenshots courtesy of Roboflow, which makes augmentation effortless!)
Flip
Rotate
Crop/Zoom
Shear
Hue
Saturation
Brightness
Greyscale
Cutout/Occlusion
Blur
Noise
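A few of the augmentations listed above can be sketched directly on a NumPy array. This is an illustrative sketch, not how Roboflow or any specific library implements them: it assumes an image stored as a `(height, width, 3)` array of `uint8` values, and all function names and parameters are made up for the example.

```python
import numpy as np


def flip_horizontal(image):
    """Mirror the image left to right by reversing the width axis."""
    return image[:, ::-1]


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor`, clipping to the valid 0-255 range."""
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)


def to_greyscale(image):
    """Replace each pixel with the plain mean of its channels, keeping 3 channels."""
    grey = image.astype(np.float32).mean(axis=2, keepdims=True)
    return np.repeat(grey, 3, axis=2).astype(np.uint8)


def cutout(image, top, left, size):
    """Black out a square patch to simulate occlusion."""
    out = image.copy()
    out[top:top + size, left:left + size] = 0
    return out


def add_noise(image, rng, sigma):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)


# A random array stands in for a real photo here.
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

augmented = {
    "flip": flip_horizontal(image),
    "brightness": adjust_brightness(image, factor=1.5),
    "greyscale": to_greyscale(image),
    "cutout": cutout(image, top=8, left=8, size=10),
    "noise": add_noise(image, rng, sigma=10.0),
}
```

Each function returns a new image of the same shape, so one source image yields several training examples.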
Experimentation is required to determine the “best” augmentations for each specific problem, especially as over-augmentation can decrease model performance.
Ideally, this is done through an “ablation study”, where augmentations are tested one at a time to isolate the performance impacts and determine the optimal combination.
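The one-at-a-time procedure might be organised roughly as follows. This is only a sketch: `one_at_a_time_study` is a hypothetical helper, and `toy_evaluate` is a placeholder returning made-up scores so the example runs. In practice it would be replaced by a function that trains a model with the given augmentations enabled and returns a validation metric.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def one_at_a_time_study(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], int],
) -> Dict[str, int]:
    """Score each augmentation alone and report its change relative to no augmentation."""
    baseline = evaluate(frozenset())  # model trained without any augmentation
    return {
        name: evaluate(frozenset({name})) - baseline
        for name in candidates
    }


def toy_evaluate(enabled: FrozenSet[str]) -> int:
    """PLACEHOLDER: returns invented scores (in percentage points), not real results."""
    made_up_effect = {"flip": 2, "blur": -1, "noise": 0}
    return 80 + sum(made_up_effect[name] for name in enabled)


deltas = one_at_a_time_study(["flip", "blur", "noise"], toy_evaluate)

# Keep only the augmentations that improved on the baseline.
helpful = [name for name, delta in deltas.items() if delta > 0]
```

Augmentations that help in isolation do not always help in combination, so the shortlisted set is worth evaluating together before it is adopted.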
More recently, augmentation has been extended through a process called synthetic data generation, where 3D models are created and then rendered in a number of simulated environments to produce new labeled examples.
Similar techniques can be applied to other domains such as Natural Language Processing to create robust training sets from limited data.
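As a rough illustration for text, the sketch below applies two simple perturbations, random word deletion and random word swap, and copies the original label onto each variant. The function names, the 10% deletion probability, and the sample sentence are all assumptions chosen for the example, not a recommended recipe.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability `p`, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, rng):
    """Swap the positions of two randomly chosen words."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment_text(text, label, n_copies, seed=0):
    """Return the original example plus `n_copies` perturbed variants with the same label."""
    rng = random.Random(seed)
    words = text.split()
    examples = [(text, label)]
    for _ in range(n_copies):
        new_words = random_swap(random_deletion(words, p=0.1, rng=rng), rng)
        examples.append((" ".join(new_words), label))
    return examples


examples = augment_text("the delivery arrived two days late", "complaint", n_copies=3)
```

As with images, perturbations that are too aggressive can change the meaning of a sentence, so the same caution about over-augmentation applies.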