RAG & Vector DB Interview: Production RAG, Latency, Caching, Cost, Monitoring

RAG & Vector DB Interview: Production RAG, Latency, Caching, Cost, Monitoring

Understand RAG evaluation metrics, common pitfalls, and production-level considerations such as latency and caching. These insights are critical for deploying robust RAG systems in practical scenarios.

12 audio · 5:50

Nortren·

What are the biggest sources of latency in a production RAG system?

0:30
The biggest latency sources are the language model generation call, typically 1 to 10 seconds depending on output length and model, the embedding call at query time, 50 to 500 milliseconds, the vector search itself, usually 10 to 100 milliseconds, and the reranker if present, 50 to 500 milliseconds. Network round trips between services add variable overhead. To reduce latency, stream generation output to the user, run embedding and retrieval in parallel when possible, cache common queries, and use smaller faster models where acceptable.

How do you cache in a RAG system?

0:31
Cache at multiple layers: embedding results for repeated queries or documents, retrieval results for identical or near-identical queries, reranker outputs by query-candidate pairs, and generated answers for fully repeated requests. Use a key based on the normalized query plus the relevant configuration state. Exact-match caching covers power users with repeated questions, while semantic caching, where embedding similarity decides cache hits, covers paraphrased queries. Invalidate caches when the underlying index or prompts change to avoid serving stale answers.

What is semantic caching and when is it useful?

0:28
Semantic caching embeds incoming queries and checks if any cached query is within a similarity threshold, returning the cached answer instead of re-running the pipeline. It works well for customer support, FAQ, and documentation assistants where users ask slightly different versions of common questions. The trade-off is false positives where two semantically similar queries expect different answers, requiring careful threshold tuning. Tools like GPTCache, Redis vector search, or direct integration with a vector database implement semantic caching.

How do you control cost in a production RAG system?

0:27
Control cost by choosing smaller embedding models and generation models where quality allows, caching to avoid repeat calls, batching embedding requests at ingest, using Matryoshka truncation to reduce storage and search costs, and limiting context window size to the minimum needed per query. Monitor cost per query and per user to identify abusive patterns. At scale, self-hosting embedding and reranker models on GPU infrastructure often beats hosted APIs, while generation usually remains hosted for the frontier model quality.

What should you monitor in a production RAG system?

0:29
Monitor retrieval quality via offline evaluation metrics re-run on a fixed evaluation set after every change, online signals like user feedback and follow-up-question patterns, latency at each pipeline stage, token usage per query for cost tracking, and error rates for each component. Track retrieval-specific metrics like average similarity score distributions, which shift when content drifts from query distributions. Log queries, retrieved documents, and generated answers with unique identifiers to enable post-hoc analysis when users report bad answers.

How do you handle updates to the knowledge base in production?

0:30
Handle updates through either incremental upserts, where you detect changed documents and re-embed only those, or periodic full rebuilds when the corpus is small enough or changes fundamentally. For large corpora, maintain document hashes or timestamps and re-process only modified items. Use stable document IDs so upserts overwrite cleanly. When embedding models change, rebuild is unavoidable since old and new embeddings are incompatible. Plan a migration strategy that queries both indexes during transition, or run indexing offline and swap atomically.

How do you handle documents larger than the context window?

0:30
Chunk them into manageable pieces at ingest, typically 256 to 1024 tokens per chunk with overlap, and retrieve only the most relevant chunks per query. When the top chunks do not fit in the context window, prioritize by relevance score or use parent-child retrieval to return small chunks with context links. For queries that need full-document understanding, summarize each document offline and index summaries for initial retrieval, then fetch full documents only for the most relevant matches. This two-stage approach handles arbitrarily large corpora.

What is context window management in RAG?

0:31
Context window management decides how many retrieved chunks, how much chat history, and how much system prompt fits in a single generation call. Token counting is essential because overflowing the window causes errors or silent truncation. Strategies include dynamically selecting top-k chunks until a token budget is reached, summarizing older chat history, and truncating long chunks with careful boundary selection. Larger context windows have increased available budgets since 2024, but lost-in-the-middle effects still penalize naive packing of many chunks.

How do you prevent prompt injection in RAG systems?

0:29
Prevent prompt injection by treating retrieved content as untrusted data, not as instructions. Use system prompts that clearly separate instructions from retrieved context, for example with explicit delimiters and statements that the model must not follow instructions in the context. For untrusted corpora like public web pages, consider using a smaller model for a classification pass that flags suspicious content before passing to the main generator. No defense is perfect, so limit the blast radius by restricting what actions the generated output can trigger.

How do you implement streaming in a RAG system?

0:27
Stream the language model's output tokens as they are generated so users see the answer appear progressively instead of waiting for the full response. Most language model APIs including OpenAI and Anthropic support server-sent events for token streaming. Before streaming starts, you must complete retrieval and reranking, which adds baseline latency. Some systems also stream retrieval progress indicators so users see that work is happening. Streaming does not reduce total time to last token but dramatically improves perceived latency.

What is the lost-in-the-middle problem in long-context RAG?

0:26
Lost-in-the-middle is the observation that language models pay less attention to information in the middle of long contexts than at the beginning or end, causing them to miss or ignore relevant passages. It was documented by Liu and colleagues in 2023 and applies to both leading proprietary and open-source models. For RAG, this means simply stuffing 50 chunks into a long context performs worse than carefully selecting and ordering 5 to 10 chunks. Put the most important context at the start or end, not buried in the middle.

How do you A/B test changes to a RAG system?

0:32
Use offline evaluation on a fixed labeled set for rapid iteration on retrieval and prompt changes, then roll out winning variants to a small percentage of production traffic for live A/B testing. Measure online signals like user ratings, follow-up patterns, session length, and task completion. Ensure both variants use identical user segments to avoid confounding. Changes to embedding models require shadow indexing before rollout since embeddings from different models are not comparable. Track retrieval quality and generation quality separately to attribute changes correctly. ---