LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

This section focuses on inference optimization techniques that are pivotal for deploying LLMs in production. Key concepts include the KV cache, speculative decoding, and quantization, helping engineers improve serving latency, throughput, and cost.

13 audio · 4:10

What is the difference between training and inference for LLMs?

0:21
Training involves forward and backward passes, gradient computation, and weight updates, processing large batches at once. Inference only does forward passes, one or a few sequences at a time, with autoregressive token-by-token generation. Inference is much cheaper per token but happens vastly more often, making inference optimization the main cost lever in production.
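The inference side can be sketched as a plain autoregressive loop, using a toy stand-in for the forward pass (`toy_next_token` and `generate` are illustrative names, not a real API):

```python
def toy_next_token(tokens):
    # Stand-in for a forward pass: deterministically maps a token
    # sequence to a "next token" id in [0, 1000).
    return (sum(tokens) * 31 + len(tokens)) % 1000

def generate(prompt_tokens, max_new_tokens):
    """Inference: forward passes only, one token per step, each new
    token fed back in as input (no gradients, no weight updates)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(toy_next_token(tokens))
    return tokens

out = generate([1, 2, 3], max_new_tokens=4)
```

Training would wrap a similar forward pass with a loss, a backward pass, and an optimizer step over large batches; none of that machinery exists at inference time.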

What is TTFT and why does it matter?

0:18
TTFT stands for Time To First Token, the latency from request to the first generated token. It matters because users perceive responsiveness through TTFT, not through total generation time. For chat applications, TTFT under one second feels instant. TTFT is dominated by the prefill phase, where the model processes the entire prompt before generating any output.
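Measuring TTFT is just timing from request start to the first streamed token. A minimal sketch with a fake streaming generator (`fake_stream` and its delays are hypothetical, standing in for a real streaming API):

```python
import time

def fake_stream(n_tokens, prefill_s=0.05, per_token_s=0.01):
    # Hypothetical streaming response: a prefill delay up front (the
    # whole prompt is processed first), then one token at a time.
    time.sleep(prefill_s)
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"

def measure_ttft(stream):
    """TTFT = wall-clock time from request start to the first token."""
    start = time.monotonic()
    first = next(stream)
    return time.monotonic() - start, first

ttft, first = measure_ttft(fake_stream(3))
```

Note that TTFT here is dominated by the prefill delay, matching the point above: total generation time matters less for perceived responsiveness than this first measurement.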

What is the difference between prefill and decode in LLM inference?

0:20
Prefill is the initial forward pass that processes all input tokens in parallel and builds the KV cache. Decode is the subsequent autoregressive generation, producing one token per forward pass. Prefill is compute-bound and benefits from parallelism, while decode is memory-bandwidth-bound and benefits from KV cache optimization. Production systems optimize them separately.
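The two phases can be sketched via their effect on the KV cache, with toy placeholder K/V pairs instead of real tensors (function names are illustrative):

```python
def prefill(prompt_tokens):
    # Prefill: process every prompt token in one parallel pass,
    # producing one K/V entry per input token.
    return [(f"k{t}", f"v{t}") for t in prompt_tokens]

def decode_step(kv_cache, new_token):
    # Decode: attend over the cached K/V (omitted here), then append
    # exactly one new entry for the token just generated.
    kv_cache.append((f"k{new_token}", f"v{new_token}"))
    return kv_cache

cache = prefill([10, 11, 12])   # cache length == prompt length
cache = decode_step(cache, 13)  # grows by one per decoded token
```

This is why the phases bottleneck differently: prefill does a lot of compute at once, while each decode step mostly re-reads the ever-growing cache from memory.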

What is speculative decoding?

0:19
Speculative decoding uses a small fast draft model to propose several tokens at once, then verifies them with the large target model in a single forward pass. Tokens that match are accepted; the first mismatch and onward are discarded. This can speed up generation by two to three times because the verification of multiple tokens is nearly free compared to generating them one at a time.
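The propose-then-verify loop can be sketched with two toy deterministic "models" (`target_next` and `draft_next` are illustrative; a real system verifies all k draft tokens in one batched forward pass):

```python
def target_next(tokens):
    # Toy deterministic "target model": the token we ultimately want.
    return (tokens[-1] * 7 + 1) % 100

def draft_next(tokens):
    # Toy "draft model": usually agrees with the target, but is wrong
    # whenever the target token is a multiple of 5.
    t = target_next(tokens)
    return t + 1 if t % 5 == 0 else t

def speculative_step(tokens, k=4):
    """Draft k tokens, then accept the longest prefix the target agrees
    with; the first mismatch is replaced by the target's own token."""
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(draft))
        draft.append(proposed[-1])
    accepted = []
    for tok in proposed:
        correct = target_next(tokens + accepted)  # one verify pass in practice
        if tok == correct:
            accepted.append(tok)
        else:
            accepted.append(correct)  # target's token replaces the mismatch
            break
    return tokens + accepted
```

Because the mismatch is replaced by the target's own choice, the output sequence is identical to what the target model alone would have produced, just reached in fewer target passes.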

What is continuous batching?

0:17
Continuous batching, also called in-flight batching, is a serving technique where new requests can join the batch as soon as a slot opens, rather than waiting for the entire batch to finish. This dramatically improves throughput in production by keeping the GPU busy. vLLM, TensorRT-LLM, and most modern inference servers implement continuous batching.
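A toy scheduler illustrates the difference: each request needs some number of decode steps, and a waiting request joins the batch the moment a slot frees up (all names here are illustrative, not a real serving API):

```python
from collections import deque

def continuous_batch(requests, max_batch=2):
    """requests: list of (request_id, decode_steps_needed).
    Returns (total_steps, finish order). Free slots are refilled every
    step, not only at batch boundaries."""
    waiting = deque(requests)
    running, steps, finished = [], 0, []
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))  # join mid-flight
        steps += 1
        for req in running:
            req[1] -= 1  # one decode step for every running request
        finished += [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
    return steps, finished
```

For requests needing 3, 1, and 2 steps with a batch size of 2, this finishes in 3 total steps, whereas static batching (waiting for each full batch to drain) would take 5.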

What is PagedAttention and how does vLLM use it?

0:21
PagedAttention is a memory management technique inspired by virtual memory in operating systems. Instead of allocating contiguous memory for the KV cache of each request, it splits memory into fixed-size pages and uses a lookup table. This eliminates fragmentation and lets vLLM serve more concurrent requests with the same GPU memory. It is one of the main reasons vLLM is so popular for self-hosted inference.
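The page-table idea can be sketched as a tiny allocator: each request's logical KV positions map to arbitrary free physical pages, so no contiguous region is ever reserved (vLLM calls these pages "blocks"; this class is a simplified illustration, not vLLM's API):

```python
class PagedKVCache:
    """Toy paged allocator for KV-cache memory."""
    def __init__(self, num_pages, page_size=4):
        self.page_size = page_size
        self.free = list(range(num_pages))  # pool of physical page ids
        self.tables = {}    # request_id -> list of physical page ids
        self.lengths = {}   # request_id -> tokens stored so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.page_size == 0:  # current page full: grab ANY free page
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def free_request(self, req_id):
        # Pages return to the pool instantly, with zero fragmentation.
        self.free += self.tables.pop(req_id)
        del self.lengths[req_id]
```

Because every page is the same size and can live anywhere, finished requests never strand unusable gaps of memory, which is what lets more concurrent requests fit on the same GPU.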

What is quantization in LLM inference?

0:20
Quantization reduces the precision of model weights from 16 or 32 bits down to 8, 4, or even 2 bits per parameter. This shrinks memory and speeds up inference, with modest quality loss when done carefully. Common methods include GPTQ, AWQ, and GGUF quantization. A 4-bit quantized 70-billion-parameter model can run on a single high-end consumer GPU.
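The core idea in its simplest form is symmetric per-tensor int8 quantization: store integer values plus one float scale, and dequantize as `q * scale` (a minimal sketch; real methods like GPTQ and AWQ are considerably more sophisticated):

```python
def quantize_int8(weights):
    """Map floats to int8 via a single scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.27, 0.5, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# round-trip error is bounded by half a quantization step (scale / 2)
```

The same principle at 4 bits uses only 16 levels per group, which is why 4-bit methods add per-group scales and careful calibration to keep quality acceptable.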

What is the difference between weight-only and weight-and-activation quantization?

0:19
Weight-only quantization compresses only the model weights, keeping activations in higher precision. It is simpler and preserves quality better. Weight-and-activation quantization compresses both, enabling more aggressive speedups but requiring careful calibration to avoid quality loss. Most production deployments start with weight-only quantization and add activation quantization if needed.
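The weight-only case can be sketched as a matrix-vector product where the weights are stored as int8 with per-row scales and dequantized on the fly, while the activation vector stays in float throughout (illustrative names, not a real kernel):

```python
def matvec_weight_only(q_weights, scales, x):
    """q_weights: int8 rows; scales: one float per row; x: float
    activations. Only the weights were ever quantized."""
    out = []
    for row, s in zip(q_weights, scales):
        # Dequantize each weight on the fly; the activation xi is
        # already full precision.
        out.append(sum((qi * s) * xi for qi, xi in zip(row, x)))
    return out

q_w = [[100, -50, 0], [10, 20, 30]]   # int8 storage
scales = [0.01, 0.1]                  # per-row dequant scales
x = [1.0, 2.0, 3.0]                   # float activations
y = matvec_weight_only(q_w, scales, x)
```

Weight-and-activation schemes would quantize `x` too, allowing integer-only matmuls, but the activation statistics then have to be calibrated to avoid clipping outliers.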

What is GGUF and llama.cpp?

0:19
GGUF is a quantized model format used by llama.cpp, a high-performance C++ inference engine for LLMs. GGUF files contain quantized weights along with metadata, designed for fast loading and CPU or GPU inference. llama.cpp is the most popular way to run open-source LLMs locally on Mac, Windows, and Linux without GPU dependencies.

What is FlashAttention?

0:20
FlashAttention is an optimized implementation of self-attention that reduces memory usage and increases speed by computing attention in tiles that fit in fast on-chip memory. It avoids materializing the full attention matrix in slow main memory. FlashAttention 2 and 3 progressively improved on this and are now standard in PyTorch, used by virtually all production LLM training and inference.
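The key trick that makes tiling possible is online softmax: a running max, running normalizer, and running weighted sum are updated tile by tile, so the full score vector is never materialized. A pure-Python sketch for a single query vector (real FlashAttention does this with fused GPU kernels; names here are illustrative):

```python
import math

def attention_online(q, ks, vs, tile=2):
    """Single-query attention over keys/values, processed in tiles."""
    m = float("-inf")        # running max of scores seen so far
    l = 0.0                  # running softmax normalizer
    acc = [0.0] * len(vs[0]) # running weighted sum of values
    for start in range(0, len(ks), tile):
        for k, v in zip(ks[start:start + tile], vs[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k))  # dot-product score
            m_new = max(m, s)
            corr = math.exp(m - m_new)  # rescale earlier partial sums
            l = l * corr + math.exp(s - m_new)
            acc = [a * corr + math.exp(s - m_new) * vi
                   for a, vi in zip(acc, v)]
            m = m_new
    return [a / l for a in acc]
```

The output is numerically identical to standard softmax attention; what changes is that each tile of keys and values can live in fast on-chip memory while being processed.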

How do you reduce LLM inference latency?

0:21
Reduce inference latency by quantizing the model, using a smaller variant, enabling continuous batching, applying speculative decoding, using FlashAttention, optimizing the prompt to reduce input tokens, streaming output to improve perceived latency, and caching previous responses for repeated queries. Each technique helps a different bottleneck; profile first to find the actual constraint.

How do you reduce LLM inference cost?

0:17
Reduce cost by routing simple queries to smaller models, caching common requests, using prompt compression, reducing few-shot examples after collecting fine-tuning data, deploying open models for high-volume tasks while reserving frontier models for hard cases, and monitoring per-feature spend to catch regressions early.

What is prompt caching?

0:18
Prompt caching reuses the KV cache from the prefix of a previous request when a new request shares the same prefix. This skips redundant prefill computation, dramatically reducing latency and cost for repeated system prompts or long contexts. OpenAI, Anthropic, and Google all support prompt caching in their APIs as of 2026.
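The mechanism can be sketched as a cache keyed by the prompt prefix, where prefill is only paid once per distinct prefix (`PrefixCache` and its tuple-based "KV state" are a toy illustration, not a provider API):

```python
class PrefixCache:
    """Toy prompt cache: store the KV state for a shared prompt prefix
    and skip prefill for requests that reuse it."""
    def __init__(self):
        self.cache = {}
        self.prefill_calls = 0   # count how often we pay for prefill

    def _prefill(self, tokens):
        self.prefill_calls += 1
        return tuple(tokens)     # stand-in for a real KV cache

    def get_kv(self, prompt_tokens, prefix_len):
        key = tuple(prompt_tokens[:prefix_len])
        if key not in self.cache:
            self.cache[key] = self._prefill(key)  # pay once per prefix
        # Only the suffix (the part after the shared prefix) still
        # needs fresh prefill.
        suffix_kv = self._prefill(prompt_tokens[prefix_len:])
        return self.cache[key] + suffix_kv
```

Two requests sharing a system prompt then cost three prefill calls instead of four; with long system prompts and short user turns, the savings dominate.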