LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
What is quantization in LLM inference?
Quantization reduces the precision of model weights from 16 or 32 bits down to 8, 4, or even 2 bits per parameter. This shrinks memory footprint and speeds up inference (weight loading is often the bottleneck), with modest quality loss when done carefully. Common methods and formats include GPTQ, AWQ, and the GGUF quantization schemes. A 4-bit quantized 70-billion-parameter model needs roughly 35-40 GB for weights, so it fits on a single 48 GB workstation GPU or can be split across two 24 GB consumer GPUs.
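The core idea can be sketched in a few lines. This is a minimal symmetric int8 round-to-nearest example in NumPy, not GPTQ or AWQ (which add calibration and error compensation); the function names are illustrative:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map the largest absolute weight to the int8 limit 127 (symmetric scheme).
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.48, 0.03, 0.90], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; per-weight rounding
# error is bounded by scale/2 for values within range.
```

Real quantizers apply this per channel or per group of weights rather than per tensor, which keeps the scale small and the rounding error low.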
Source: huggingface.co