LLM Engineer Interview Questions: Transformer Architecture, Self-Attention, and Modern LLM Foundations
What is multi-head attention?
Multi-head attention runs several attention mechanisms in parallel, each with its own learned projections. Each head can focus on different relationships, such as syntactic structure, semantic similarity, or coreference. The outputs of all heads are concatenated and projected back to the original dimension. This gives the model richer representational capacity than a single attention layer.
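The mechanism described above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the weight matrices and variable names are hypothetical, and batching, masking, and dropout are omitted. Each head applies scaled dot-product attention with its own slice of the learned projections, and the concatenated head outputs are projected back to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Minimal multi-head self-attention for x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project inputs, then split into heads: (num_heads, seq_len, d_head)
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)       # (num_heads, seq_len, seq_len)
    heads = weights @ v                      # (num_heads, seq_len, d_head)
    # Concatenate heads and project back to the original dimension
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy example with illustrative sizes
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 5, 2
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (5, 8): same shape as the input, as the answer states
```

Note that because the attention outputs are concatenated and passed through the final projection `w_o`, the layer's output dimension matches its input dimension regardless of the number of heads, which is what lets attention layers stack.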