LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production
What is the difference between BPE, WordPiece, and SentencePiece?
BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair. WordPiece, used by BERT, instead picks the merge that most improves the likelihood of the training data under a language model, rather than raw pair frequency. SentencePiece is a tokenizer framework from Google that operates on raw text without pre-tokenization, treating spaces as regular characters; it supports both BPE and unigram algorithms. Llama uses SentencePiece with BPE; GPT models use tiktoken, which is also BPE-based.