RAG & Vector DB Interview: Chunking Strategies, Overlap, Size, Semantic Splitting
Should you chunk by token count or character count?
Token count is more accurate because embedding models have token-based limits, not character limits. A token is roughly three-quarters of an English word, or about four characters, so a 1000-character chunk might be 200 to 300 tokens depending on vocabulary density. Using character count as a proxy works for rough prototypes, but it can overflow model limits on token-dense text like code or URLs, or underuse capacity on whitespace-heavy text. Production splitters use tiktoken or the target model's tokenizer to count tokens exactly.
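A minimal sketch of a token-budgeted chunker with overlap. A production version would count tokens exactly with tiktoken or the target model's tokenizer; here a hypothetical ~4-characters-per-token heuristic stands in so the example runs without dependencies. The function names and the budget values are illustrative, not from any specific library.

```python
def approx_token_count(text: str) -> int:
    # Stand-in heuristic: 1 token is roughly 4 characters of English text.
    # Swap in a real tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

def chunk_by_tokens(text: str, max_tokens: int = 256, overlap_tokens: int = 32,
                    count=approx_token_count) -> list[str]:
    """Greedily pack words into chunks under a token budget, carrying a
    small tail of words into the next chunk as overlap."""
    words = text.split()
    chunks: list[str] = []
    current: list[str] = []
    for word in words:
        candidate = " ".join(current + [word])
        if current and count(candidate) > max_tokens:
            chunks.append(" ".join(current))
            # Keep trailing words whose combined count fits the overlap budget.
            tail: list[str] = []
            while current and count(" ".join([current[-1]] + tail)) <= overlap_tokens:
                tail.insert(0, current.pop())
            current = tail
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = ("lorem ipsum dolor sit amet " * 400).strip()
chunks = chunk_by_tokens(doc, max_tokens=100, overlap_tokens=10)
print(len(chunks), all(approx_token_count(c) <= 100 for c in chunks))
```

Counting with the model's own tokenizer matters most for token-dense text (code, URLs), where the 4-characters-per-token heuristic undercounts and a chunk can silently exceed the embedding model's limit.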