RAG & Vector DB Interview: Chunking Strategies, Overlap, Size, Semantic Splitting

Should you chunk by token count or character count?


Token count is more accurate because embedding models have token-based limits, not character limits. A token is roughly three quarters of an English word, so a 1000-character chunk might be 200 to 300 tokens depending on vocabulary density. Using character count as a proxy works for rough prototypes but can overflow model limits on token-dense text like code or URLs, or underuse capacity on whitespace-heavy text. Production splitters use tiktoken or the target model's tokenizer to count tokens exactly.
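A minimal sketch of a token-aware splitter illustrating the idea. The `rough_token_count` heuristic (about four characters per token for English) is an assumption standing in for a real tokenizer; production code would swap in tiktoken or the target model's tokenizer for exact counts, as the answer notes.

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 characters per token for typical English text.
    # Swap in tiktoken or the model's own tokenizer for exact counts.
    return max(1, len(text) // 4)


def chunk_by_tokens(text: str, max_tokens: int = 256, count=rough_token_count):
    """Split text into chunks whose token count stays under max_tokens."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if count(candidate) > max_tokens and current:
            # Adding this word would overflow the limit; flush the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the limit is enforced in token space, the same `max_tokens` budget yields fewer characters per chunk on token-dense text (code, URLs) and more on ordinary prose, which is exactly the behavior a character-based splitter misses.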