Memotiva: LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is the difference between weight-only and weight-and-activation quantization?

Nortren

Weight-only quantization compresses only the model weights (for example, to INT8 or INT4) and keeps activations in higher precision, such as FP16. It is simpler to apply and usually preserves quality better. Weight-and-activation quantization compresses both, which lets the matrix multiplications themselves run in low precision and enables more aggressive speedups, but it requires careful calibration to avoid quality loss. Most production deployments start with weight-only quantization and add activation quantization only if they need the extra speed.
huggingface.co
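
The contrast in the answer above can be illustrated with a small pure-Python sketch. Nothing here comes from a particular library: the helper names (`quantize_int8`, `matvec_weight_only`, `matvec_weight_and_activation`), the symmetric INT8 scheme, the matrix sizes, and the injected activation outlier are all assumptions chosen for illustration. Real deployments use fused low-precision kernels and calibrated scales rather than Python loops.

```python
import random

def quantize_int8(values):
    """Symmetric INT8 quantization (illustrative): returns (ints, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def matvec_fp(W, x):
    """Full-precision reference: y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_weight_only(W, x):
    """Weights stored as INT8 and dequantized before the multiply;
    activations stay in floating point."""
    out = []
    for row in W:
        q_row, s_w = quantize_int8(row)  # one scale per output row
        out.append(sum((qw * s_w) * xi for qw, xi in zip(q_row, x)))
    return out

def matvec_weight_and_activation(W, x):
    """Weights and activations both INT8; the dot product is accumulated
    in integers and rescaled once at the end."""
    q_x, s_x = quantize_int8(x)  # one scale for the whole activation vector
    out = []
    for row in W:
        q_row, s_w = quantize_int8(row)
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))  # integer arithmetic
        out.append(acc * s_w * s_x)
    return out

def max_abs_error(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(256)] for _ in range(16)]
x = [random.gauss(0, 1) for _ in range(256)]
x[3] = 40.0  # assumed activation outlier: it stretches the activation scale

ref = matvec_fp(W, x)
err_w = max_abs_error(ref, matvec_weight_only(W, x))
err_wa = max_abs_error(ref, matvec_weight_and_activation(W, x))
print(f"weight-only max error:           {err_w:.4f}")
print(f"weight-and-activation max error: {err_wa:.4f}")
```

In this sketch the single outlier forces a coarse scale on every activation, so the weight-and-activation path loses more accuracy than the weight-only path. That is one reason activation quantization tends to need calibration, while the integer accumulation it allows is where the extra speed would come from on suitable hardware.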