LLM Engineer Interview Questions: Transformer Architecture, Self-Attention, and Modern LLM Foundations
What is the difference between Multi-Head Attention, Grouped-Query Attention, and Multi-Query Attention?
Multi-Head Attention (MHA) gives each head its own query, key, and value projections. Multi-Query Attention (MQA) shares a single set of keys and values across all query heads, shrinking the KV cache and cutting memory bandwidth. Grouped-Query Attention (GQA) is the middle ground: query heads are divided into groups, and each group shares one key/value head. GQA is now standard in Llama 3, Mistral, and most modern LLMs because it dramatically reduces KV-cache memory during inference with little to no quality loss.
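The trade-off can be sketched in a few lines of NumPy. This is an illustrative toy (random tensors, arbitrary head counts and dimensions, no projections or masking, not a real model): the one knob, `num_kv_heads`, moves you from MHA (`num_kv_heads == num_q_heads`) through GQA to MQA (`num_kv_heads == 1`), and the KV-cache size scales with it.

```python
import numpy as np

def grouped_attention(num_q_heads, num_kv_heads, seq_len=8, head_dim=16):
    """Toy single-layer attention showing how Q heads map onto shared K/V heads.

    MHA: num_kv_heads == num_q_heads
    GQA: 1 < num_kv_heads < num_q_heads
    MQA: num_kv_heads == 1
    """
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads

    rng = np.random.default_rng(0)
    q = rng.standard_normal((num_q_heads, seq_len, head_dim))
    k = rng.standard_normal((num_kv_heads, seq_len, head_dim))
    v = rng.standard_normal((num_kv_heads, seq_len, head_dim))

    # Each group of `group_size` query heads reads the same K/V head,
    # so only num_kv_heads K/V tensors ever need to be cached.
    k_exp = np.repeat(k, group_size, axis=0)
    v_exp = np.repeat(v, group_size, axis=0)

    scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ v_exp

    # KV-cache elements per layer: 2 tensors (K and V) of shape
    # (num_kv_heads, seq_len, head_dim).
    kv_cache_elems = 2 * num_kv_heads * seq_len * head_dim
    return out.shape, kv_cache_elems
```

With 8 query heads, the output shape is identical in all three regimes, but the per-layer KV cache shrinks 4x going from MHA to 2-group GQA and 8x going to MQA, which is exactly the bandwidth saving the answer above describes.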