RAG & Vector DB Interview: Chunking Strategies, Overlap, Size, Semantic Splitting

Should you chunk by token count or character count?


Token count is more accurate because embedding models have token-based limits, not character limits. A token is roughly three quarters of an English word, so a 1000-character chunk might be 200 to 300 tokens depending on vocabulary density. Using character count as a proxy works for rough prototypes but can overflow model limits on token-dense text like code or URLs, or underuse capacity on whitespace-heavy text. Production splitters use tiktoken or the target model's tokenizer to count tokens exactly.
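A minimal sketch of a token-aware splitter illustrating the idea. The `rough_token_count` heuristic (about four characters per token for English) is an assumption standing in for a real tokenizer; production code would swap in tiktoken or the target model's tokenizer for exact counts, as the answer notes.

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 characters per token for typical English text.
    # Swap in tiktoken or the model's own tokenizer for exact counts.
    return max(1, len(text) // 4)


def chunk_by_tokens(text: str, max_tokens: int = 256, count=rough_token_count):
    """Split text into chunks whose token count stays under max_tokens."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if count(candidate) > max_tokens and current:
            # Adding this word would overflow the limit; flush the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the limit is enforced in token space, the same `max_tokens` budget yields fewer characters per chunk on token-dense text (code, URLs) and more on ordinary prose, which is exactly the behavior a character-based splitter misses.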