LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production
This section covers the foundational topics behind large language models. A working grasp of transformer architecture, self-attention, and tokenization is expected of any LLM engineer, and the material here lays the groundwork for the more advanced topics that follow.
Tokenization is the process of converting raw text into tokens, the discrete units that the model processes. Tokens are usually subword units, not full words or characters. The tokenizer assigns each token a numeric ID from a fixed vocabulary, typically containing 30 thousand to 200 thousand entries. Every input must be tokenized before the model can process it.
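The mapping from text to token IDs can be sketched in a few lines. This is a toy illustration, not a real model's tokenizer: the vocabulary below is hand-made and hypothetical, and real tokenizers use learned vocabularies with tens of thousands of entries rather than greedy longest-match over a six-entry dictionary.

```python
# Toy illustration of text -> token IDs. VOCAB is a hypothetical six-entry
# vocabulary; real vocabularies (30k-200k entries) are learned from data.
VOCAB = {"token": 0, "ization": 1, "the": 2, " ": 3, "izer": 4, "s": 5}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match segmentation of text into vocabulary IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at {text[i:]!r}")
    return ids

print(tokenize("tokenization"))   # -> [0, 1]  ("token" + "ization")
print(tokenize("the tokenizer"))  # -> [2, 3, 0, 4]
```

Note how "tokenization" becomes two IDs while "the" is a single ID, matching the subword behavior described above.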
Byte Pair Encoding (BPE) is a tokenization algorithm that starts with individual bytes and iteratively merges the most frequent adjacent pairs to form new tokens. The result is a vocabulary that captures common subwords as single tokens and breaks rare words into multiple pieces. BPE is used by GPT models, Llama, and most modern LLMs because it handles any input text, including unknown words.
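The merge loop can be sketched in plain Python. This version starts from characters rather than bytes for readability (byte-level BPE works the same way over byte values), and the tiny corpus is illustrative, not real training data.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    merged = []
    for symbols, freq in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

def train_bpe(corpus, num_merges):
    """Learn merge rules from a {word: frequency} corpus."""
    words = [(list(w), f) for w, f in corpus.items()]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe(corpus, 3))  # -> [('e', 's'), ('es', 't'), ('l', 'o')]
```

Frequent endings like "es" and "est" get merged into single tokens first, which is exactly how common subwords end up in the final vocabulary.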
What is the difference between BPE, WordPiece, and SentencePiece?
BPE merges the most frequent adjacent token pairs. WordPiece, used by BERT, merges based on likelihood improvement to a language model rather than raw frequency. SentencePiece is a tokenizer framework from Google that operates on raw text without pre-tokenization, treating spaces as regular characters. Llama uses SentencePiece with BPE; GPT uses tiktoken which is also BPE-based.
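SentencePiece's "spaces as regular characters" behavior can be illustrated with its meta-symbol convention: whitespace is mapped to a visible marker (U+2581, "▁") so segmentation needs no pre-tokenization and detokenization is lossless. The sketch below shows only that convention, not the SentencePiece library itself, and the example segmentation is a plausible one, not a real model's output.

```python
# SentencePiece-style space handling: spaces become the visible meta-symbol
# U+2581 so tokens can span word boundaries and the round trip is exact.
SPACE = "\u2581"  # the "▁" marker

def to_sp_stream(text: str) -> str:
    """Replace spaces with the marker; a leading marker mimics the usual
    dummy-prefix convention."""
    return SPACE + text.replace(" ", SPACE)

def from_sp_pieces(pieces: list[str]) -> str:
    """Detokenize: concatenate pieces and turn markers back into spaces."""
    return "".join(pieces).replace(SPACE, " ").lstrip(" ")

stream = to_sp_stream("new york")           # "▁new▁york"
pieces = ["\u2581new", "\u2581york"]        # a plausible segmentation
assert from_sp_pieces(pieces) == "new york" # lossless round trip
```

Because the space is part of the token stream, a BPE-style learner running on top of this representation can merge across what would otherwise be word boundaries.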
Why does tokenization matter for cost and latency?
Tokenization directly affects both cost and latency because LLM APIs charge by token and inference time scales with token count. A poorly tokenized prompt can use two or three times as many tokens as a well-formatted one. Languages other than English often tokenize less efficiently, which is why Chinese or Arabic prompts tend to cost more per character than English prompts.
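The cost impact is simple arithmetic: billed price scales linearly with token counts. The helper below makes that concrete; the per-million-token prices are placeholders for illustration, not any provider's actual rates.

```python
# Back-of-the-envelope cost for one API call billed per token.
# The default prices are hypothetical, not a real provider's rates.
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float = 3.0,
              usd_per_m_output: float = 15.0) -> float:
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# A prompt that tokenizes to 3x as many tokens costs proportionally more:
lean = call_cost(1_000, 500)     # 0.0105
bloated = call_cost(3_000, 500)  # 0.0165
print(f"${lean:.4f} vs ${bloated:.4f}")
```

At scale the same linearity applies to latency: prefill time grows with input tokens, and decode time grows with output tokens.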
The context window is the maximum number of tokens an LLM can process in a single forward pass, including both the input prompt and the generated output. As of 2026, frontier models support context windows from 128 thousand up to several million tokens. Larger contexts enable longer conversations and document processing but cost more in compute and memory.
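Because the window covers input and output together, production code typically checks that the prompt plus the reserved output budget fits before sending a request. A minimal sketch, assuming a 128-thousand-token window as the example limit:

```python
# The context window bounds input plus output: a request fits only if the
# prompt tokens and the reserved output budget stay under the limit.
# 128_000 is an example window size, not a specific model's.
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    return input_tokens + max_output_tokens <= context_window

assert fits_context(120_000, 4_000)        # 124k total: fits
assert not fits_context(126_000, 4_000)    # 130k total: over the limit
```

When the check fails, typical responses are truncating or summarizing the prompt, or shrinking the output budget.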
Lost in the middle is the observation that LLMs often pay less attention to information placed in the middle of a long context compared to the beginning or end. This means that even models with large context windows do not use them uniformly. For RAG applications, this affects how you order retrieved chunks, often placing the most important content at the start or end.
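One common mitigation is a deterministic reorder: given chunks already sorted most-relevant-first, alternate them between the front and the back of the prompt so the least relevant material lands in the middle. This is one illustrative strategy, not the only one.

```python
# Mitigating "lost in the middle": place the highest-relevance chunks at the
# edges of the context, pushing low-relevance chunks toward the middle.
# Input must already be sorted most-relevant-first.
def edges_first(chunks_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(edges_first(["c1", "c2", "c3", "c4", "c5"]))
# -> ['c1', 'c3', 'c5', 'c4', 'c2']  (top-two chunks c1 and c2 at the edges)
```

The two most relevant chunks end up first and last, matching the ordering advice above.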
To count tokens accurately, use the tokenizer of the specific model you are targeting. OpenAI provides the tiktoken library, Anthropic provides a token counting endpoint, and Hugging Face tokenizers work for open models. Approximate rules of thumb like four characters per token or 0.75 words per token are unreliable for non-English text or code.
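To see why the rules of thumb are only a fallback, here is a sketch of a heuristic estimator built from those two ratios. It is deliberately crude: use it only when the real tokenizer is unavailable, and expect large errors on code and non-English text, as noted above.

```python
# Crude token-count estimate from the two rules of thumb in the text
# (~4 characters per token, ~0.75 words per token), averaged together.
# A fallback only; prefer the target model's real tokenizer.
def estimate_tokens(text: str) -> int:
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)

prose = "The quick brown fox jumps over the lazy dog"
print(estimate_tokens(prose))  # -> 11
```

For accurate counts, the exact counterpart would be the model's own tokenizer, e.g. tiktoken's `get_encoding(...).encode(text)` for OpenAI models or a Hugging Face tokenizer's `encode` for open models.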
Tokens are subword units that may be smaller or larger than words. A common English word like "the" is one token, but "tokenization" might be split into "token" and "ization". Punctuation and spaces are also typically separate tokens. On average, English text tokenizes to roughly 1.3 tokens per word, but this varies widely by content.
---