LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization

What is FlashAttention?


FlashAttention is an optimized implementation of self-attention that reduces memory usage and increases speed by computing attention in tiles that fit in fast on-chip SRAM, avoiding materializing the full N×N attention matrix in slower high-bandwidth memory (HBM). It uses an online (streaming) softmax so each tile can be processed with only running statistics. FlashAttention-2 and FlashAttention-3 progressively improved on this; the technique is built into PyTorch's `scaled_dot_product_attention` and is used in virtually all production LLM training and inference stacks.
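The tiling idea can be illustrated with a minimal NumPy sketch of the online-softmax recurrence. This is illustrative only, not the actual fused CUDA kernel: the real FlashAttention also tiles over query blocks and runs in shared memory, while here we only stream over key/value tiles to show why the full score matrix never needs to exist at once.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n x n) score matrix -- exactly what
    # FlashAttention avoids writing to HBM.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=4):
    # Online-softmax tiling: process K/V in blocks, keeping only running
    # statistics (row-wise max m, normalizer l) and the output accumulator.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax normalizer
    for j in range(0, n, tile):
        S = (Q @ K[j:j + tile].T) * scale        # (n, tile) partial scores
        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)           # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
```

The key point is that `tiled_attention` only ever holds an `(n, tile)` slice of scores, yet the correction factor `exp(m - m_new)` makes the streamed result numerically identical to the full softmax.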