LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production
This section covers the foundational topics behind large language models. A working grasp of transformer architecture, self-attention, and tokenization is expected of any LLM engineer, and the material here lays the groundwork for the more advanced topics that follow.
Tokenization is the process of converting raw text into tokens, the discrete units that the model processes. Tokens are usually subword units, not full words or characters. The tokenizer assigns each token a numeric ID from a fixed vocabulary, typically containing 30 thousand to 200 thousand entries. Every input must be tokenized before the model can process it.
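The mapping from text to token IDs can be sketched in a few lines. This is a toy illustration, not a real model's tokenizer: the vocabulary below is hand-made and hypothetical, and real tokenizers use learned vocabularies with tens of thousands of entries rather than greedy longest-match over a six-entry dictionary.

```python
# Toy illustration of text -> token IDs. VOCAB is a hypothetical six-entry
# vocabulary; real vocabularies (30k-200k entries) are learned from data.
VOCAB = {"token": 0, "ization": 1, "the": 2, " ": 3, "izer": 4, "s": 5}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match segmentation of text into vocabulary IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at {text[i:]!r}")
    return ids

print(tokenize("tokenization"))   # -> [0, 1]  ("token" + "ization")
print(tokenize("the tokenizer"))  # -> [2, 3, 0, 4]
```

Note how "tokenization" becomes two IDs while "the" is a single ID, matching the subword behavior described above.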
Byte Pair Encoding (BPE) is a tokenization algorithm that starts with individual bytes and iteratively merges the most frequent adjacent pairs to form new tokens. The result is a vocabulary that captures common subwords as single tokens and breaks rare words into multiple pieces. BPE is used by GPT models, Llama, and most modern LLMs because it handles any input text, including unknown words.
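The merge loop can be sketched in plain Python. This version starts from characters rather than bytes for readability (byte-level BPE works the same way over byte values), and the tiny corpus is illustrative, not real training data.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    merged = []
    for symbols, freq in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

def train_bpe(corpus, num_merges):
    """Learn merge rules from a {word: frequency} corpus."""
    words = [(list(w), f) for w, f in corpus.items()]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe(corpus, 3))  # -> [('e', 's'), ('es', 't'), ('l', 'o')]
```

Frequent endings like "es" and "est" get merged into single tokens first, which is exactly how common subwords end up in the final vocabulary.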
What is the difference between BPE, WordPiece, and SentencePiece?
BPE merges the most frequent adjacent token pairs. WordPiece, used by BERT, merges based on likelihood improvement to a language model rather than raw frequency. SentencePiece is a tokenizer framework from Google that operates on raw text without pre-tokenization, treating spaces as regular characters. Llama uses SentencePiece with BPE; GPT uses tiktoken which is also BPE-based.
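SentencePiece's "spaces as regular characters" behavior can be illustrated with its meta-symbol convention: whitespace is mapped to a visible marker (U+2581, "▁") so segmentation needs no pre-tokenization and detokenization is lossless. The sketch below shows only that convention, not the SentencePiece library itself, and the example segmentation is a plausible one, not a real model's output.

```python
# SentencePiece-style space handling: spaces become the visible meta-symbol
# U+2581 so tokens can span word boundaries and the round trip is exact.
SPACE = "\u2581"  # the "▁" marker

def to_sp_stream(text: str) -> str:
    """Replace spaces with the marker; a leading marker mimics the usual
    dummy-prefix convention."""
    return SPACE + text.replace(" ", SPACE)

def from_sp_pieces(pieces: list[str]) -> str:
    """Detokenize: concatenate pieces and turn markers back into spaces."""
    return "".join(pieces).replace(SPACE, " ").lstrip(" ")

stream = to_sp_stream("new york")           # "▁new▁york"
pieces = ["\u2581new", "\u2581york"]        # a plausible segmentation
assert from_sp_pieces(pieces) == "new york" # lossless round trip
```

Because the space is part of the token stream, a BPE-style learner running on top of this representation can merge across what would otherwise be word boundaries.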
Why does tokenization matter for cost and latency?
Tokenization directly affects both cost and latency because LLM APIs charge by token and inference time scales with token count. A poorly tokenized prompt can use two or three times as many tokens as a well-formatted one. Languages other than English often tokenize less efficiently, which is why Chinese or Arabic prompts tend to cost more per character than English prompts.
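The cost impact is simple arithmetic: billed price scales linearly with token counts. The helper below makes that concrete; the per-million-token prices are placeholders for illustration, not any provider's actual rates.

```python
# Back-of-the-envelope cost for one API call billed per token.
# The default prices are hypothetical, not a real provider's rates.
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float = 3.0,
              usd_per_m_output: float = 15.0) -> float:
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# A prompt that tokenizes to 3x as many tokens costs proportionally more:
lean = call_cost(1_000, 500)     # 0.0105
bloated = call_cost(3_000, 500)  # 0.0165
print(f"${lean:.4f} vs ${bloated:.4f}")
```

At scale the same linearity applies to latency: prefill time grows with input tokens, and decode time grows with output tokens.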
The context window is the maximum number of tokens an LLM can process in a single forward pass, including both the input prompt and the generated output. As of 2026, frontier models support context windows from 128 thousand up to several million tokens. Larger contexts enable longer conversations and document processing but cost more in compute and memory.
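Because the window covers input and output together, production code typically checks that the prompt plus the reserved output budget fits before sending a request. A minimal sketch, assuming a 128-thousand-token window as the example limit:

```python
# The context window bounds input plus output: a request fits only if the
# prompt tokens and the reserved output budget stay under the limit.
# 128_000 is an example window size, not a specific model's.
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    return input_tokens + max_output_tokens <= context_window

assert fits_context(120_000, 4_000)        # 124k total: fits
assert not fits_context(126_000, 4_000)    # 130k total: over the limit
```

When the check fails, typical responses are truncating or summarizing the prompt, or shrinking the output budget.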
Lost in the middle is the observation that LLMs often pay less attention to information placed in the middle of a long context compared to the beginning or end. This means that even models with large context windows do not use them uniformly. For RAG applications, this affects how you order retrieved chunks, often placing the most important content at the start or end.
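One common mitigation is a deterministic reorder: given chunks already sorted most-relevant-first, alternate them between the front and the back of the prompt so the least relevant material lands in the middle. This is one illustrative strategy, not the only one.

```python
# Mitigating "lost in the middle": place the highest-relevance chunks at the
# edges of the context, pushing low-relevance chunks toward the middle.
# Input must already be sorted most-relevant-first.
def edges_first(chunks_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(edges_first(["c1", "c2", "c3", "c4", "c5"]))
# -> ['c1', 'c3', 'c5', 'c4', 'c2']  (top-two chunks c1 and c2 at the edges)
```

The two most relevant chunks end up first and last, matching the ordering advice above.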
To count tokens accurately, use the tokenizer of the specific model you are targeting. OpenAI provides the tiktoken library, Anthropic provides a token counting endpoint, and Hugging Face tokenizers work for open models. Approximate rules of thumb like four characters per token or 0.75 words per token are unreliable for non-English text or code.
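To see why the rules of thumb are only a fallback, here is a sketch of a heuristic estimator built from those two ratios. It is deliberately crude: use it only when the real tokenizer is unavailable, and expect large errors on code and non-English text, as noted above.

```python
# Crude token-count estimate from the two rules of thumb in the text
# (~4 characters per token, ~0.75 words per token), averaged together.
# A fallback only; prefer the target model's real tokenizer.
def estimate_tokens(text: str) -> int:
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)

prose = "The quick brown fox jumps over the lazy dog"
print(estimate_tokens(prose))  # -> 11
```

For accurate counts, the exact counterpart would be the model's own tokenizer, e.g. tiktoken's `get_encoding(...).encode(text)` for OpenAI models or a Hugging Face tokenizer's `encode` for open models.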
Tokens are subword units that may be smaller or larger than words. A common English word like "the" is one token, but "tokenization" might be split into "token" and "ization". Punctuation and spaces are also typically separate tokens. On average, English text tokenizes to roughly 1.3 tokens per word, but this varies widely by content.
---