LLM Engineer Interview Questions: Transformer Architecture, Self-Attention, and Modern LLM Foundations

Explore essential topics that form the basis of large language models. Understanding transformer architecture, self-attention mechanisms, and tokenization is crucial for any LLM engineer. This section sets the groundwork for more advanced topics.


What is a large language model?

A large language model is a neural network trained on massive text datasets to predict the next token in a sequence. Modern LLMs are typically decoder-only transformers with billions or trillions of parameters. They learn statistical patterns of language during pretraining and can then generate text, answer questions, write code, and reason about problems through prompting.

What is the transformer architecture?

The transformer is a neural network architecture introduced in the 2017 paper Attention Is All You Need. It replaces recurrence with self-attention, allowing the model to process all tokens in a sequence in parallel. The transformer is the foundation of every major LLM today, including the GPT, Claude, Llama, Gemini, and Mistral families.

What is self-attention?

Self-attention is a mechanism that lets each token in a sequence look at every other token to compute its own representation. For each token, the model produces a query, a key, and a value vector. The attention score is computed by comparing queries with keys, then used to weight the values. This is how transformers capture relationships between distant tokens.

What are query, key, and value vectors in attention?

Query, key, and value are three projections of each token's embedding produced by separate linear layers. The query represents what the current token is looking for. The key represents what other tokens offer. The value is the actual content that gets passed forward. Attention scores are computed as the dot product of queries and keys, then applied to the values.
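The project-score-weight pipeline described above can be sketched in plain Python. This is a minimal single-head sketch: the projection matrices `Wq`, `Wk`, `Wv` are stand-ins for learned weights, and all dimensions are assumed equal to keep it short.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, x):
    # multiply a matrix (list of rows) by a vector
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a list of token embeddings X.
    Sketch only: assumes query, key, and value dimensions are all equal."""
    Q = [matvec(Wq, x) for x in X]  # what each token is looking for
    K = [matvec(Wk, x) for x in X]  # what each token offers
    V = [matvec(Wv, x) for x in X]  # the content that gets passed forward
    d = len(Q[0])
    out = []
    for q in Q:
        # score = dot(query, key) / sqrt(d) against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # output = attention-weighted sum of the values
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out
```

With identity projections and one-hot embeddings, each output row is exactly the attention weight distribution, which makes the weighting easy to inspect.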

What is multi-head attention?

Multi-head attention runs several attention mechanisms in parallel, each with its own learned projections. Each head can focus on different relationships, such as syntactic structure, semantic similarity, or coreference. The outputs of all heads are concatenated and projected back to the original dimension. This gives the model richer representational capacity than a single attention layer.
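The split-attend-concatenate structure can be sketched as follows. To stay compact, this sketch uses identity projections (each head just sees its own slice of the embedding); real models apply learned Q/K/V projections per head and a final output projection.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(Q, K, V):
    # scaled dot-product attention for one head
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(d)])
    return out

def multi_head_attention(X, num_heads):
    """Split each d_model embedding into num_heads chunks, run attention
    per head, and concatenate the head outputs back to d_model per token.
    Sketch: identity projections replace the learned per-head weights."""
    d_model = len(X[0])
    assert d_model % num_heads == 0
    hd = d_model // num_heads  # per-head dimension
    heads = []
    for h in range(num_heads):
        # head h sees dimensions [h*hd, (h+1)*hd)
        sub = [x[h * hd:(h + 1) * hd] for x in X]
        heads.append(attend(sub, sub, sub))
    # concatenate head outputs: shape is preserved at d_model per token
    return [[v for h in range(num_heads) for v in heads[h][t]]
            for t in range(len(X))]
```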

What is the difference between encoder-only, decoder-only, and encoder-decoder transformers?

Encoder-only models like BERT process the full input bidirectionally and are used for understanding tasks. Decoder-only models like GPT and Llama generate text autoregressively, predicting one token at a time. Encoder-decoder models like T5 first encode input then decode output, suited for translation and summarization. Modern LLMs are predominantly decoder-only.

What is autoregressive language modeling?

Autoregressive language modeling means generating text one token at a time, where each new token is conditioned on all the previous tokens. The model predicts the probability distribution over the vocabulary for the next token, then samples or selects from it. This is how all decoder-only LLMs generate text.
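The generation loop itself is simple to sketch. Here a toy bigram lookup table stands in for the transformer forward pass (the vocabulary and logits are invented for illustration); the loop structure, predict logits, pick a token, append, repeat, is the same in a real LLM.

```python
# Toy "model": a bigram table standing in for a transformer forward pass.
# Vocabulary and logit values are made up for this sketch.
VOCAB = ["<s>", "the", "cat", "sat", "</s>"]
BIGRAM_LOGITS = {
    "<s>": [-9.0, 2.0, 0.0, -9.0, -9.0],
    "the": [-9.0, -9.0, 2.0, 0.5, -9.0],
    "cat": [-9.0, -9.0, -9.0, 2.0, 0.0],
    "sat": [-9.0, -9.0, -9.0, -9.0, 2.0],
}

def next_token_logits(tokens):
    # a real LLM would condition on the full context, not just the last token
    return BIGRAM_LOGITS[tokens[-1]]

def generate(max_new_tokens=10):
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # greedy decoding: take the argmax token;
        # sampling would instead draw from softmax(logits)
        tokens.append(VOCAB[max(range(len(logits)), key=logits.__getitem__)])
        if tokens[-1] == "</s>":
            break
    return tokens
```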

What is positional encoding and why is it needed?

Positional encoding adds information about token position to embeddings, because self-attention by itself is order-agnostic and treats input as a set rather than a sequence. Without position information, the model could not distinguish "dog bites man" from "man bites dog". Modern LLMs use rotary position embeddings, which encode position through rotation in vector space.
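Before RoPE became standard, the classic approach from Attention Is All You Need added fixed sinusoidal vectors to the embeddings. A minimal sketch of that scheme:

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sinusoidal encoding from "Attention Is All You Need":
        PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    The resulting vectors are added to the token embeddings so that
    attention can distinguish positions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            if i + 1 < d_model:
                row.append(math.cos(angle))
        pe.append(row)
    return pe
```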

What is Rotary Position Embedding (RoPE)?

Rotary Position Embedding, or RoPE, encodes token positions by rotating query and key vectors in pairs of dimensions by an angle proportional to the position. Unlike absolute positional encodings, RoPE naturally captures relative position and extrapolates better to longer sequences than seen in training. It is now standard in models like Llama, Mistral, and Qwen.
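The rotation idea can be sketched directly: each consecutive pair of dimensions is treated as a 2D point and rotated by an angle proportional to position, with a different frequency per pair. The key property, that the dot product of a rotated query and key depends only on their relative distance, falls out of rotation algebra.

```python
import math

def rope(vec, pos, base=10000.0):
    """Minimal RoPE sketch: rotate consecutive dimension pairs of a
    query or key vector by an angle proportional to the token position.
    vec must have even length."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # per-pair rotation frequency
        x, y = vec[i], vec[i + 1]
        # standard 2D rotation of the pair (x, y) by theta
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

Because each pair is rotated by an angle linear in position, dot(rope(q, m), rope(k, n)) is a function of m - n only, which is exactly the relative-position behavior described above.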

What is the difference between Multi-Head Attention, Grouped-Query Attention, and Multi-Query Attention?

Multi-Head Attention gives each head its own query, key, and value projections. Multi-Query Attention shares one set of keys and values across all heads to reduce memory bandwidth. Grouped-Query Attention is a middle ground where heads share keys and values in groups. GQA is now standard in Llama 3, Mistral, and most modern LLMs because it dramatically reduces inference memory cost without hurting quality.
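The three schemes differ only in how query heads map onto KV heads, which a tiny routing function makes concrete. (For reference, Llama 3 8B uses 32 query heads with 8 KV heads; the KV cache shrinks in proportion to the number of KV heads.)

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Which KV head serves a given query head.
    num_kv_heads == num_q_heads  -> MHA (one KV head per query head)
    num_kv_heads == 1            -> MQA (all query heads share one KV head)
    1 < num_kv_heads < num_q     -> GQA (query heads share KV heads in groups)
    """
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size
```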

What is the KV cache and why is it important?

The KV cache stores the key and value tensors computed for previous tokens during autoregressive generation. Without it, the model would recompute every previous token's keys and values at each generation step, redundant work that grows quadratically with sequence length. With KV caching, each new token computes only its own key and value and attends over the cached ones, keeping per-token cost linear in context length, which is essential for production inference performance.
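A minimal single-head decoding step with a cache looks like this: at each step we append the new token's key and value, then attend the new query against everything cached so far, never recomputing old entries.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

class KVCache:
    """Minimal single-head KV cache sketch for autoregressive decoding."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # cache this token's key and value; old entries are never recomputed
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        # attend the new query against every cached key
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        w = softmax(scores)
        # weighted sum over cached values
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(d)]
```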

What is Mixture of Experts (MoE)?

Mixture of Experts is an architecture where the feedforward layers are replaced with multiple expert networks, but only a subset is activated for each token. A learned router decides which experts to use. This allows the model to have many more parameters in total while keeping per-token compute low. Models like Mixtral and DeepSeek V3 use MoE architectures, and GPT-4 is widely reported to as well.
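The routing logic can be sketched as follows. Plain functions stand in for the expert feedforward networks, and the router logits are passed in directly; a real MoE layer computes them with a learned linear layer and typically adds a load-balancing loss.

```python
import math

def moe_layer(x, experts, router_logits, top_k=2):
    """Sparse MoE sketch: pick the top_k experts by router score, run only
    those experts, and mix their outputs by normalized router weights."""
    # select the top_k expert indices for this token
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # softmax over the chosen experts' logits only
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    z = sum(exps.values())
    # weighted sum of the selected experts' outputs;
    # unselected experts are never evaluated, which is the compute saving
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        w = exps[i] / z
        out = [o + w * yj for o, yj in zip(out, y)]
    return out
```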

What are state space models and how do they compare to transformers?

State space models like Mamba are sequence models that use a linear recurrence instead of attention. They scale linearly with sequence length rather than quadratically, making them efficient for very long contexts. Mamba and Mamba 2 emerged in 2023 and 2024 as competitive alternatives to transformers, though pure transformers still dominate at the largest scales as of 2026.
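The linear-recurrence idea reduces to a very small sketch. This is a scalar, time-invariant version; real SSMs like Mamba use vector states and input-dependent (selective) parameters, but the constant work per token, and hence O(n) total cost versus attention's O(n^2), is the same.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Minimal scalar state space recurrence:
        h_t = a * h_{t-1} + b * x_t    (state update)
        y_t = c * h_t                  (readout)
    Each step does constant work, so a length-n sequence costs O(n)."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # carry a compressed summary of the past
        ys.append(c * h)
    return ys
```

With a < 1 the state decays old inputs geometrically, which is visible in the impulse response: a single 1.0 followed by zeros yields outputs a^0, a^1, a^2, and so on.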

What is the difference between pretraining and fine-tuning?

Pretraining is the initial training of a language model on a massive corpus of unlabeled text using self-supervised objectives like next-token prediction. Fine-tuning takes a pretrained model and adapts it to a specific task, domain, or instruction-following style using a smaller labeled dataset. Pretraining costs millions of dollars; fine-tuning is typically much cheaper.