LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production

What is the difference between BPE, WordPiece, and SentencePiece?


BPE merges the most frequent adjacent token pair at each training step. WordPiece, used by BERT, instead selects the merge that most improves the likelihood of the training data under a unigram language model, rather than raw pair frequency. SentencePiece is a tokenizer framework from Google that operates on raw text without pre-tokenization, treating spaces as regular characters (encoded as the ▁ meta-symbol), which makes it language-agnostic. Llama 1 and 2 use SentencePiece with BPE (Llama 3 moved to a tiktoken-style tokenizer); GPT models use tiktoken, which is also BPE-based.
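The BPE training loop described above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy corpus (the word frequencies and the three-merge limit are made up for the example), not a production tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols to its corpus frequency
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with one concatenated symbol
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word split into characters, with a frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):  # learn 3 merges
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, in order
```

Each learned merge becomes one vocabulary entry; at inference time the same merges are applied greedily in the order they were learned. WordPiece would differ only in the scoring line, ranking pairs by `count(a,b) / (count(a) * count(b))` instead of raw frequency.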