What is FlashAttention?
LLM Engineer Interview Questions: Inference Optimization, KV Cache, Speculative Decoding, Quantization
FlashAttention is an optimized implementation of self-attention that reduces memory usage and increases speed by computing attention in tiles small enough to fit in fast on-chip memory (SRAM). It never materializes the full attention matrix in slow off-chip memory (HBM); instead it streams over key/value blocks, maintaining a running softmax max and normalizer per query row. FlashAttention-2 and FlashAttention-3 progressively improved on this design, and the technique is now standard in PyTorch and used in virtually all production LLM training and inference.
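The tiling idea can be sketched in NumPy as a simplified, single-head, non-causal version (function and variable names here are illustrative, not from the FlashAttention codebase): the streaming version keeps only per-row running statistics and a partial output, yet matches the naive implementation that materializes the full score matrix.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) attention matrix -- the memory cost
    # FlashAttention is designed to avoid.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=4):
    # FlashAttention-style streaming: process K/V in tiles, keeping only
    # a running row max (m), running softmax denominator (l), and an
    # output accumulator (O) -- never the full n x n score matrix.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row max
    l = np.zeros((n, 1))           # running softmax normalizer
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = Q @ Kj.T * scale                       # (n, tile) partial scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)                  # rescale old accumulators
        P = np.exp(S - m_new)
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The `alpha` rescaling is the key trick (the "online softmax"): when a new tile raises the running max, previously accumulated terms are multiplied by `exp(m_old - m_new)` so the final result is numerically identical to a softmax over the full row.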