LLM Engineer Interview Questions: Transformer Architecture, Self-Attention, and Modern LLM Foundations

What is self-attention?


Self-attention is a mechanism that lets each token in a sequence attend to every other token when computing its own representation. For each token, the model produces a query, a key, and a value vector. Attention scores are computed as dot products of queries with keys, scaled and normalized with a softmax, and then used to take a weighted sum of the values. This is how transformers capture relationships between distant tokens.
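The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with made-up toy dimensions, not a production implementation; the projection matrices `Wq`, `Wk`, `Wv` stand in for learned parameters:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q = X @ Wq  # one query vector per token
    K = X @ Wk  # one key vector per token
    V = X @ Wv  # one value vector per token
    d_k = K.shape[-1]
    # Compare every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's new representation is a weighted sum of all values
    return weights @ V

# Toy example: 4 tokens, model and head dimension 8 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one updated 8-dim representation per token: (4, 8)
```

Because `weights` is a full token-by-token matrix, even the first and last tokens can influence each other directly, which is the property the answer highlights.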