What is FlashAttention?
LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
FlashAttention is an optimized implementation of self-attention that reduces memory usage and increases speed by computing attention in tiles small enough to fit in fast on-chip memory (SRAM). It never materializes the full attention matrix in slow off-chip memory (HBM); instead it streams over key/value blocks, maintaining a running softmax max and normalizer per query row. FlashAttention-2 and FlashAttention-3 progressively improved on this design, and the technique is now standard in PyTorch and used in virtually all production LLM training and inference.
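The tiling idea can be sketched in NumPy as a simplified, single-head, non-causal version (function and variable names here are illustrative, not from the FlashAttention codebase): the streaming version keeps only per-row running statistics and a partial output, yet matches the naive implementation that materializes the full score matrix.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) attention matrix -- the memory cost
    # FlashAttention is designed to avoid.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=4):
    # FlashAttention-style streaming: process K/V in tiles, keeping only
    # a running row max (m), running softmax denominator (l), and an
    # output accumulator (O) -- never the full n x n score matrix.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row max
    l = np.zeros((n, 1))           # running softmax normalizer
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = Q @ Kj.T * scale                       # (n, tile) partial scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)                  # rescale old accumulators
        P = np.exp(S - m_new)
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The `alpha` rescaling is the key trick (the "online softmax"): when a new tile raises the running max, previously accumulated terms are multiplied by `exp(m_old - m_new)` so the final result is numerically identical to a softmax over the full row.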