LLM Engineer Interview: Tokenization, BPE, SentencePiece, and Token Counting in Production
What is tokenization in LLMs?
Tokenization is the process of converting raw text into tokens, the discrete units that the model processes. Tokens are usually subword units, not full words or characters. The tokenizer assigns each token a numeric ID from a fixed vocabulary, typically containing 30,000 to 200,000 entries. Every input must be tokenized before the model can process it.
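To make the idea concrete, here is a minimal sketch of subword tokenization in Python. The vocabulary, its IDs, and the greedy longest-match strategy are illustrative assumptions, not a real BPE or SentencePiece implementation; production tokenizers learn their vocabularies from data and use more sophisticated merge rules.

```python
# Toy subword tokenizer: a fixed vocabulary maps subword strings to
# integer IDs, and we greedily match the longest vocabulary entry at
# each position. (Hypothetical vocab for illustration only.)
TOY_VOCAB = {
    "token": 0,
    "ization": 1,
    "ize": 2,
    " ": 3,
    "a": 4,
    "s": 5,
}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to a list of token IDs by greedy longest match."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(tokenize("tokenization", TOY_VOCAB))      # → [0, 1]
print(tokenize("tokenize a token", TOY_VOCAB))  # → [0, 2, 3, 4, 3, 0]
```

Note how "tokenization" becomes two subword tokens rather than one word or twelve characters, which is exactly the middle ground the answer describes.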