RAG & Vector DB Interview: Chunking Strategies, Overlap, Size, Semantic Splitting

Learn about embeddings and the metrics used for measuring similarity, and discover effective chunking strategies. This knowledge is essential for optimizing data retrieval and understanding vector databases.

What is chunking in RAG and why is it necessary?

Chunking is the process of splitting documents into smaller pieces before embedding them for retrieval. It is necessary for three reasons: embedding models have token limits, usually 512 to 8192 tokens; retrieval precision drops when a single embedding must represent too much content; and language model context windows cannot fit entire documents. Well-sized chunks let the retriever return focused passages directly relevant to the query rather than entire documents where the answer is buried among unrelated text.

What is the difference between fixed-size, recursive, and semantic chunking?

Fixed-size chunking splits text every N characters or tokens regardless of content boundaries, which is fast but often breaks sentences mid-thought. Recursive chunking tries a hierarchy of separators, splitting first on paragraphs, then sentences, then words, preserving structure where possible. Semantic chunking uses embedding similarity between adjacent sentences to find natural topic boundaries, grouping related content together. Recursive is the production default, while semantic chunking improves quality at higher compute cost during ingest.
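A minimal recursive splitter can be sketched in plain Python; the separator hierarchy and size limit here are illustrative, not tied to any particular library:

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", " ")):
    """Split text by trying coarser separators first, recursing on
    oversized pieces with the next finer separator."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to a hard fixed-size cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current = piece if not current else current + sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraphs that fit the budget stay whole; only oversized pieces are pushed down to the next, finer separator, which is the structure-preserving behavior described above.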

How do you choose the right chunk size for RAG?

Start with 512 tokens and adjust based on your domain and query patterns. Short chunks of 128 to 256 tokens give precise retrieval but often miss surrounding context, while long chunks of 1024 or more preserve context but dilute the embedding signal and waste context window space. Technical documentation and legal text benefit from larger chunks to retain definitions and clauses, while FAQ or chat-style content works with smaller chunks. Always measure recall and answer quality on real queries before committing to a size.

What is chunk overlap and why is it used?

Chunk overlap is the number of tokens or characters repeated between adjacent chunks, typically 10 to 20 percent of chunk size. It prevents losing information when a sentence or concept straddles a chunk boundary, which would otherwise split context across two chunks and hurt retrieval for queries about that exact topic. Overlap adds storage cost and creates near-duplicate results during retrieval, which reranking or deduplication can handle. A common default is 50 to 100 tokens overlap on a 512-token chunk.
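The sliding-window mechanics can be sketched as follows, operating on a pre-tokenized list; the 512/64 defaults are simply values in the range described above:

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Yield fixed-size windows that share `overlap` tokens with the
    previous window, so content straddling a boundary appears whole
    in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The tail of each chunk is repeated at the head of the next, which is exactly the redundancy that later needs deduplication or reranking at query time.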

What is late chunking and how does it improve retrieval?

Late chunking, introduced by Jina AI in 2024, embeds the entire document first using a long-context model, then derives chunk embeddings from the document-level token representations. This preserves cross-chunk context like pronouns and references to earlier sections that standard chunking destroys. The resulting chunk embeddings carry document-wide semantic context, which improves retrieval on queries that depend on information scattered across sections. It requires a long-context embedding model that supports this mode.
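The pooling step at the heart of late chunking can be illustrated with toy vectors standing in for the document-level token embeddings a long-context model would produce:

```python
def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool document-level token vectors over each chunk's span.

    token_embeddings: one vector per token, produced by embedding the
    WHOLE document in a single pass, so each vector already carries
    document-wide context.
    chunk_spans: (start, end) token index pairs, one per chunk.
    """
    chunk_vecs = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        dim = len(span[0])
        chunk_vecs.append([sum(v[d] for v in span) / len(span)
                           for d in range(dim)])
    return chunk_vecs
```

The difference from standard chunking is only in where the token vectors come from: because they were computed with full-document attention, pronouns and back-references are already resolved before pooling.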

Should you chunk by token count or character count?

Token count is more accurate because embedding models have token-based limits, not character limits. A token is roughly three quarters of an English word, so a 1000-character chunk might be 200 to 300 tokens depending on vocabulary density. Using character count as a proxy works for rough prototypes but can overflow model limits on token-dense text like code or URLs, or underuse capacity on whitespace-heavy text. Production splitters use tiktoken or the target model's tokenizer to count tokens exactly.
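As a rough stdlib-only illustration of why characters are only a proxy (production code would call tiktoken or the target model's tokenizer), the ~4-characters-per-token rule of thumb can be wrapped in a guard that leaves headroom; the function names and margin are hypothetical:

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate via the ~4-characters-per-token rule of
    thumb for English prose; token-dense text (code, URLs) runs closer
    to 2-3 characters per token, so this can undercount badly there."""
    return max(1, round(len(text) / chars_per_token))

def fits_model(text, model_limit=512, safety_margin=0.8):
    """Leave headroom under the model's hard token limit, since the
    character-based estimate is only approximate."""
    return estimate_tokens(text) <= model_limit * safety_margin
```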

How does chunking affect retrieval quality in RAG?

Chunking quality directly determines the retrieval ceiling because you cannot retrieve what your chunks have fragmented or muddled together. Too-large chunks dilute embeddings so queries return lukewarm matches on many topics at once, too-small chunks lose surrounding context so the language model cannot use them, and bad boundary placement splits answers in half. Chunking is often the single biggest lever in RAG quality, more impactful than embedding model choice or reranking, especially on long structured documents.

What is the difference between sentence, paragraph, and section chunking?

Sentence chunking gives very precise retrieval but loses all surrounding context, often producing passages too short to be useful. Paragraph chunking balances precision and context, matching how humans naturally organize ideas. Section chunking, splitting on headings or chapter markers, preserves topical coherence but produces uneven chunk sizes. Most production systems use a hybrid: recursive splitting that prefers paragraph boundaries, falls back to sentences when paragraphs exceed token limits, and keeps section headers as metadata.
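That hybrid can be sketched like this; the regex sentence split and the field names are simplifications of what a real segmenter and schema would provide:

```python
import re

def hybrid_chunks(section_title, text, max_chars=500):
    """Prefer whole paragraphs; split oversized paragraphs on sentence
    boundaries; tag every chunk with its section header as metadata."""
    chunks = []
    for para in text.split("\n\n"):
        if len(para) <= max_chars:
            pieces = [para]
        else:
            # Naive sentence split; production systems use a proper
            # sentence segmenter.
            sentences = re.split(r"(?<=[.!?])\s+", para)
            pieces, current = [], ""
            for s in sentences:
                if current and len(current) + 1 + len(s) > max_chars:
                    pieces.append(current)
                    current = s
                else:
                    current = s if not current else current + " " + s
            if current:
                pieces.append(current)
        for p in pieces:
            chunks.append({"text": p, "section": section_title})
    return chunks
```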

How do you chunk code files for RAG on a codebase?

Code chunking should respect syntactic structure, splitting on function, class, and module boundaries rather than line or token counts. Language-aware splitters use tree-sitter or language parsers to find these boundaries, preserving each function as an atomic chunk. Adding the file path, language, and enclosing class name as metadata lets queries filter to the right context. Over-splitting code breaks logical units like a function definition from its docstring, while under-splitting dilutes embeddings across unrelated functions in the same file.
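For Python sources specifically, the standard library's ast module can locate those boundaries; a minimal sketch (multi-language systems would use tree-sitter instead):

```python
import ast

def chunk_python_source(source, path="<memory>"):
    """Emit one chunk per top-level function or class, keeping the
    whole definition (including its docstring) as an atomic unit."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "text": text,
                "metadata": {"path": path, "language": "python",
                             "symbol": node.name},
            })
    return chunks
```

Each chunk stays a complete syntactic unit, and the symbol name travels along as metadata for filtered retrieval.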

What metadata should you attach to each chunk for RAG?

Attach metadata that supports filtering, attribution, and display: source document identifier or URL, page or section number, document title, author or owner, creation and modification dates, and domain-specific tags like product, language, or access-control group. Metadata lets you filter retrieval to specific sources, boost recent documents, restrict results by user permissions, and show users where each answer came from. Vector databases index metadata separately for fast filtered search without scanning every vector.
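A chunk record carrying this kind of metadata might look like the following; the field names and the permission check are illustrative, since every vector database defines its own payload schema:

```python
chunk_record = {
    "id": "doc-4217#chunk-03",
    "text": "Refunds are processed within 5 business days...",
    "embedding": [0.013, -0.087, 0.041],  # truncated for illustration
    "metadata": {
        "source_url": "https://example.com/help/refunds",
        "title": "Refund policy",
        "section": "Processing times",
        "page": 2,
        "modified_at": "2024-11-03",
        "product": "payments",
        "access_group": "public",
    },
}

def allowed(record, user_groups):
    """Permission-filter sketch: drop chunks the querying user may not
    see before any ranking happens."""
    return record["metadata"]["access_group"] in user_groups
```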

When should you use semantic chunking over recursive chunking?

Use semantic chunking when document structure is weak, such as transcripts, long essays, or OCR output without clear headings and paragraphs. It excels at finding topic boundaries in flowing prose where recursive splitting would break mid-idea. Avoid semantic chunking on well-structured documents like API docs or textbooks, where recursive splitting by section and paragraph gives similar results at a fraction of the compute cost. Semantic chunking adds an embedding pass during ingest, which can triple total indexing time on large corpora.
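The core loop can be sketched with a toy bag-of-words embedder standing in for a real embedding model; only the boundary-detection logic carries over to production:

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Group consecutive sentences; start a new chunk whenever the
    similarity to the previous sentence drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = [cur]
        else:
            current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

The extra embedding pass mentioned above is visible here: every sentence is embedded once at ingest just to place the boundaries.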

What is parent-child or small-to-big chunking?

Parent-child chunking indexes small chunks for precise retrieval but returns the larger parent chunk to the language model for rich context. At ingest, you split documents into small embedding chunks of 128 to 256 tokens for recall precision, while also storing larger parent chunks of 1024 to 2048 tokens keyed by chunk identifier. At query time, retrieve the best small chunks, look up their parents, and pass those to the generator. This pattern is standard in LlamaIndex and LangChain production setups.
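A minimal sketch of the pattern, with word overlap standing in for vector similarity and character counts standing in for token counts:

```python
def build_index(parents, child_size=60):
    """Split each parent document into small child chunks that
    remember their parent's id."""
    children = []
    for pid, text in parents.items():
        for i in range(0, len(text), child_size):
            children.append({"parent_id": pid,
                             "text": text[i:i + child_size]})
    return children

def retrieve_parents(query, children, parents, k=1):
    """Score children (word overlap stands in for vector similarity),
    then return the deduplicated parent texts for generation."""
    qwords = set(query.lower().split())
    scored = sorted(
        children,
        key=lambda c: len(qwords & set(c["text"].lower().split())),
        reverse=True)
    seen, results = set(), []
    for child in scored[:k * 2]:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            results.append(parents[child["parent_id"]])
        if len(results) == k:
            break
    return results
```

The small chunks do the matching; the parents do the answering.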