LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is quantization in LLM inference?


Quantization reduces the precision of model weights from 16 or 32 bits down to 8, 4, or even 2 bits per parameter. This shrinks memory footprint and speeds up inference, with modest quality loss when done carefully. Common methods include GPTQ, AWQ, and GGUF quantization. At 4 bits, a 70-billion-parameter model needs roughly 35 GB for weights, bringing it within reach of a single high-end GPU (or a consumer card with partial CPU offload).
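The core idea can be sketched with a minimal symmetric round-to-nearest int8 quantizer. This is illustrative only: real methods like GPTQ and AWQ additionally use calibration data to minimize layer output error, which this sketch omits.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric round-to-nearest quantization to int8.

    Maps the float range [-max|w|, +max|w|] onto [-127, 127]
    with a single per-tensor scale factor.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights for compute.
    return q.astype(np.float32) * scale

# Demo on a random weight tensor (hypothetical data, not a real model).
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
# Round-to-nearest bounds the per-weight error by scale / 2.
print(q.dtype, err <= s / 2)
```

The int8 tensor uses one quarter the memory of float32 weights; per-weight error is bounded by half the scale factor, which is why quality loss stays modest at 8 bits and grows as the bit width shrinks.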
huggingface.co