Question

How do you cache in a RAG system?

Accepted Answer

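Cache at multiple layers: embeddings for repeated queries and documents, retrieval results for identical or near-identical queries, reranker scores keyed by query-candidate pair, and generated answers for fully repeated requests. Build each key from the normalized query plus the configuration state that affects the result (for example, embedding model, index version, prompt version, and retrieval parameters such as top-k), so that changing any of them produces a different key. Exact-match caching serves power users who ask the same question repeatedly. Semantic caching, where a hit is declared when the new query's embedding is similar enough to a cached query's embedding, also catches paraphrases, at the cost of occasional wrong hits if the similarity threshold is too loose. Invalidate caches whenever the underlying index or prompts change so you do not serve stale answers.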
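A minimal sketch of the exact-match layer, assuming an in-memory dict and a hypothetical config dict with fields like `index_version` and `prompt_version` (these names are illustrative, not from any particular framework). Because the version fields are hashed into the key, bumping either one makes older entries unreachable, which is one simple way to get version-based invalidation.

```python
import hashlib
import json
import unicodedata
from typing import Any, Callable


def normalize_query(query: str) -> str:
    # Assumption: NFKC + case folding + whitespace collapsing is a safe
    # normalization for this corpus; adjust if case or spacing is meaningful.
    query = unicodedata.normalize("NFKC", query).casefold()
    return " ".join(query.split())


def cache_key(layer: str, query: str, config: dict) -> str:
    # The layer name keeps retrieval, rerank, and answer entries apart;
    # sort_keys makes the hash independent of dict insertion order.
    # Config values must be JSON-serializable.
    payload = {"layer": layer, "query": normalize_query(query), "config": config}
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory exact-match cache (hypothetical; a shared store would replace the dict)."""

    def __init__(self) -> None:
        self._store: dict = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, layer: str, query: str, config: dict,
                       compute: Callable[[], Any]) -> Any:
        key = cache_key(layer, query, config)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value


# Hypothetical config: version fields are part of the key, so bumping
# index_version or prompt_version makes older entries unreachable.
config = {"embed_model": "embed-v1", "index_version": 3, "prompt_version": "p7", "top_k": 5}
cache = ExactMatchCache()
cache.get_or_compute("answer", "What is RAG?", config, lambda: "answer A")      # miss, computes
cache.get_or_compute("answer", "  what is   RAG? ", config, lambda: "answer B")  # hit, returns "answer A"
```

Unreachable entries still occupy memory in this sketch; a real store would usually pair versioned keys with a TTL or an eviction policy.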
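Semantic caching can be sketched as a nearest-neighbor lookup over stored query embeddings with a cosine-similarity threshold. This is a hypothetical illustration: `embed` is assumed to be any function that maps text to a fixed-length vector (ideally the same model the retriever uses), and the default `threshold` of 0.95 is an arbitrary placeholder, not a recommended value.

```python
import math
from typing import Any, Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical linear-scan semantic cache.

    A lookup is a hit when the most similar stored query embedding has cosine
    similarity >= threshold with the new query's embedding. A real deployment
    would likely keep the vectors in an approximate-nearest-neighbor index.
    """

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self.embed = embed            # assumed: same embedding model as the retriever
        self.threshold = threshold    # placeholder default; tune on real paraphrase pairs
        self._entries: List[Tuple[List[float], Any]] = []

    def lookup(self, query: str) -> Optional[Any]:
        q = self.embed(query)
        best_score, best_value = -1.0, None
        for vec, value in self._entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_value = score, value
        # None means "miss"; this sketch therefore cannot cache None values.
        return best_value if best_score >= self.threshold else None

    def store(self, query: str, value: Any) -> None:
        self._entries.append((list(self.embed(query)), value))

    def invalidate(self) -> None:
        # Call when the index or prompts change so stale answers are never served.
        self._entries.clear()
```

A common arrangement is to try an exact-match lookup first and fall back to the semantic cache only on a miss, since exact hits are cheaper and carry no risk of matching a merely similar-looking question.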