“Quantization” is a fancy word for “rounding.”
In modern AI models, there are typically billions of tuned “parameters”.
During training, we “learn” these parameters, typically by comparing the model’s predictions to a “right answer” and adjusting the parameters whenever the model gets it wrong.
Training is typically done with 16- or 32-bit floats (numbers with a decimal point).
This keeps the calculations precise, but it also means a large memory footprint and slower processing.
Once training is complete, we use the model to make predictions. During this phase, it’s common practice to quantize the model to speed it up and shrink its memory requirements. However, quantizing below 16 bits typically requires that we move from “floats” to “integers”.
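For a rough sense of scale, take a hypothetical 7-billion-parameter model: at 16 bits (2 bytes) per parameter the weights alone occupy about 14 GB, while at 4 bits they fit in roughly 3.5 GB.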
First, we choose the target quantization level. (Ex: 4-bit, 8-bit)
Then we calculate a scaling factor (the real-valued width of one integer step) and a “zero-point” (the integer that the real value 0.0 maps to).
Then we simply divide each weight by the scale, round to the nearest integer, and add the zero-point to get our new values, as in the sketch below.
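To make the recipe concrete, here is a minimal NumPy sketch of min-max (asymmetric) quantization. The function names (quantize, dequantize) and the min-max calibration are illustrative choices, not the only way real toolkits do it:

```python
import numpy as np

def quantize(weights, num_bits=8):
    """Min-max (asymmetric) quantization: map floats onto a small integer grid."""
    qmin, qmax = 0, 2**num_bits - 1                 # e.g. 0..255 for 8-bit
    w_min, w_max = float(weights.min()), float(weights.max())

    scale = (w_max - w_min) / (qmax - qmin)         # real-valued width of one integer step
    zero_point = int(round(qmin - w_min / scale))   # the integer that real 0.0 maps to

    q = np.round(weights / scale) + zero_point      # divide by scale, round, add zero-point
    return np.clip(q, qmin, qmax).astype(np.int32), scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original floats."""
    return (q - zero_point) * scale

weights = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize(weights, num_bits=8)
print(q)                          # small integers in 0..255
print(dequantize(q, scale, zp))   # close to the originals, but rounded
```

Dequantizing gets you back numbers that are close to the originals; the difference is the rounding error that quantization trades for speed and memory.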
Because LLMs are so large, models are often quantized all the way down to 4 bits. That dramatically increases speed, but it limits the weights to only 16 unique values, which can impact “accuracy”.
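Re-using the quantize sketch above at 4 bits makes that limit visible: however many weights go in, they land on at most 2⁴ = 16 distinct integers.

```python
weights = np.random.randn(100_000).astype(np.float32)
q, scale, zp = quantize(weights, num_bits=4)
print(np.unique(q))   # at most 16 distinct values: 0..15
```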
There has even been research into quantizing weights down to just 3 unique values!