How do you chunk code files for RAG on a codebase?
RAG & Vector DB Interview: Chunking Strategies, Overlap, Size, Semantic Splitting
Audio flashcard · 0:28 · Nortren
Code chunking should respect syntactic structure, splitting on function, class, and module boundaries rather than at fixed line or token counts. Language-aware splitters use tree-sitter or a language-specific parser to find these boundaries and keep each function as an atomic chunk. Attaching the file path, language, and enclosing class name as metadata lets queries filter to the right context. Over-splitting breaks up logical units, for example separating a function definition from its docstring, while under-splitting dilutes the embedding across unrelated functions in the same file.
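As a minimal sketch of the idea, the snippet below splits a Python file into one chunk per function or method and attaches the metadata mentioned above. It uses Python's standard-library `ast` module in place of tree-sitter, so it only handles Python source. The `CodeChunk` class, the `chunk_python_source` helper, and the `src/greeter.py` path are hypothetical names chosen for illustration, not part of any library.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CodeChunk:
    """Hypothetical container for one chunk plus its retrieval metadata."""
    text: str
    file_path: str
    language: str
    symbol: str
    enclosing_class: Optional[str]


def chunk_python_source(source: str, file_path: str) -> List[CodeChunk]:
    """Return one chunk per top-level function or class method."""
    tree = ast.parse(source)
    chunks: List[CodeChunk] = []

    def visit(node: ast.AST, enclosing_class: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # The segment spans the whole definition, so the docstring
                # and any nested helpers stay in the same chunk.
                # Decorator lines are not part of the segment.
                text = ast.get_source_segment(source, child) or ""
                chunks.append(
                    CodeChunk(text, file_path, "python", child.name, enclosing_class)
                )
            elif isinstance(child, ast.ClassDef):
                # Descend into the class so each method becomes its own chunk.
                visit(child, child.name)

    visit(tree, None)
    return chunks


SAMPLE = '''
def top_level(x):
    """Double x."""
    return x * 2

class Greeter:
    def greet(self, name):
        """Say hello."""
        return f"Hello, {name}"
'''

# "src/greeter.py" is an illustrative path, not a real file.
chunks = chunk_python_source(SAMPLE, "src/greeter.py")
for chunk in chunks:
    print(chunk.symbol, chunk.enclosing_class, len(chunk.text))
```

This sketch skips module-level statements such as imports and constants, and it does not cap chunk size. A production splitter would likely handle both, for example by grouping leftover top-level code into its own chunk and subdividing very long functions.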
python.langchain.com