LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production

What is the difference between tokens and words?


Tokens are subword units that may be smaller or larger than words. A common English word like "the" is one token, but "tokenization" might be split into "token" and "ization". Punctuation and spaces are also typically separate tokens. On average, English text tokenizes to roughly 1.3 tokens per word, but this varies widely by content.
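A toy sketch of the idea: real BPE tokenizers apply a learned merge table, but a greedy longest-match over a hand-picked subword vocabulary (an assumption here, not any production tokenizer's actual vocabulary) is enough to show why "the" stays whole while "tokenization" splits.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a hand-built illustration, not a real learned BPE vocabulary.
VOCAB = {"the", "token", "ization", " ", "."}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Fall back to a single character when nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("the", VOCAB))           # ['the'] — one token
print(tokenize("tokenization", VOCAB))  # ['token', 'ization'] — two tokens
```

Production tokenizers (tiktoken for OpenAI models, SentencePiece for many open models) build their vocabularies from data and handle whitespace and bytes more carefully, but the count-per-word intuition above carries over.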
platform.openai.com