Sources

https://www.maartengrootendorst.com/blog/quantization/

https://www.youtube.com/watch?v=mii-xFaPCrA

https://www.youtube.com/watch?v=0VdNflU08yA

Quantization matters because large language models contain billions of parameters (weights). During inference, activations are produced by multiplying the inputs with these weights, and the activations are similarly large. Storing all of these values in GPU memory at full precision takes an enormous amount of space.

Quantization is the idea of storing these values in a more compact representation, so that they take up far less memory.
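As a rough back-of-the-envelope illustration (the 7-billion-parameter model and the helper function below are my own hypothetical example, not from the sources), the memory needed just to hold the weights scales directly with the number of bytes per parameter:

```python
def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9

# Hypothetical 7-billion-parameter model at different precisions.
params = 7_000_000_000

for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {weight_memory_gb(params, nbytes):.1f} GB")
```

At FP32 that is 28 GB for the weights alone, before counting activations and the KV cache, which is exactly why shrinking the per-parameter representation is so attractive.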

How do we represent numerical values?

Numerical values are represented as “bits” (binary digits), typically split into three fields: a sign, an exponent, and a fraction (mantissa).

The more bits we use to represent a value, the more precise it can be.

The more bits we have available, the larger the range of values that can be represented.
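To make the sign/exponent/mantissa split concrete, here is a small sketch (my own example, using only the standard library) that reinterprets an FP32 value as raw bits and slices out the three fields: 1 sign bit, 8 exponent bits, and 23 mantissa bits.

```python
import struct

def fp32_bits(x: float) -> tuple[str, str, str]:
    """Split an FP32 value into its sign, exponent, and mantissa bit fields."""
    # Reinterpret the 4 bytes of a big-endian float32 as a 32-bit unsigned int.
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{as_int:032b}"
    # FP32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = fp32_bits(-6.5)
print(sign, exponent, mantissa)  # -6.5 = -1.101b * 2^2
```

FP16 uses the same idea with fewer bits (1 sign, 5 exponent, 10 mantissa), which is precisely why it has a smaller range and coarser precision.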


Dynamic range - the range of values that a representation like FP32 or FP16 can represent, from its minimum to its maximum.

Precision - the distance between two neighboring representable values; smaller gaps mean higher precision.
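Both quantities can be inspected directly. The sketch below (my own example, assuming NumPy is available) uses `np.finfo` for the dynamic range and `np.spacing` for the gap to the next representable value, comparing FP32 against FP16. Note that the gap also grows as the magnitude grows, so precision is not uniform across the range.

```python
import numpy as np

# Compare dynamic range (min/max representable) and precision
# (gap between neighboring values) for FP32 vs FP16.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: range [{info.min}, {info.max}]")
    # np.spacing(x) is the distance from x to the next representable
    # value of the same dtype, i.e. the local precision around x.
    print(f"  gap next to 1.0:    {np.spacing(dtype(1.0))}")
    print(f"  gap next to 1000.0: {np.spacing(dtype(1000.0))}")
```

Running this shows FP16 tops out around 65504 and has visibly coarser spacing than FP32, which is the range/precision trade-off the definitions above describe.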