RAG & Vector DB Interview: Common RAG Mistakes, Pitfalls, System Design Questions

Understand RAG evaluation metrics, common pitfalls, and production-level considerations such as latency and caching. These insights are critical for deploying robust RAG systems in practical scenarios.

12 audio · 5:35


What are the most common mistakes in building a RAG system?

0:28
Common mistakes include chunking too aggressively or too naively, using a generic embedding model when a domain-tuned one would win, skipping reranking even when it would lift precision substantially, trusting retrieval without evaluation on real queries, ignoring metadata filtering for tenant isolation or freshness, not handling documents larger than context windows, and building no feedback loop from user signals back into evaluation. Each of these silently degrades quality until users complain loudly.

Why does my RAG system retrieve irrelevant documents?

0:28
The usual causes are poorly tuned chunking that produces chunks too broad or too narrow to embed distinctly, an embedding model that does not match the domain vocabulary, missing metadata filters so queries mix unrelated content, queries phrased very differently from documents, causing vocabulary mismatch, or a single-stage pipeline with no reranker to correct first-stage errors. Diagnose by inspecting failed queries individually: which chunk should have been retrieved, where did it rank, and what scored higher? Patterns in the failures point to the fix.
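This per-query diagnosis can be sketched in a few lines. A minimal version using toy NumPy vectors in place of a real index (the function name and example embeddings are illustrative, not from any library):

```python
import numpy as np

def rank_of_expected(query_vec, chunk_vecs, expected_idx):
    """Return the 1-based rank of the expected chunk under cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)  # best first
    return int(np.where(order == expected_idx)[0][0]) + 1

# Toy example: the query is closest to chunk 2, so the expected chunk 1 ranks second.
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.6, 0.8])
print(rank_of_expected(query, chunks, expected_idx=1))  # → 2
```

Running this over a batch of failed queries and histogramming the ranks quickly shows whether the gold chunk is just below the cutoff (a reranker or larger k helps) or nowhere near the top (a chunking or embedding problem).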

Why does my RAG system hallucinate even with retrieved context?

0:27
Hallucinations with context usually mean the retrieved chunks do not actually contain the answer, or they contradict each other, or the prompt does not strongly instruct the model to answer only from context. Measure faithfulness on a sample of failed queries to confirm. Fix by improving retrieval quality so relevant chunks actually rank high, adding explicit instructions to the prompt like "answer only from the provided context and say you do not know if the context is insufficient," and using a reranker to filter irrelevant chunks from the top-k.
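The prompt-side fix is simple to implement. A minimal sketch of a grounded prompt builder (the function name and wording are illustrative):

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that instructs the model to answer only from context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer only from the provided context. If the context is "
        "insufficient, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt("What is the SLA?", ["Uptime target is 99.9%."])
print(prompt)
```

Numbering the chunks also makes it easy to ask the model for citations, which in turn makes faithfulness failures visible in logs.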

Why is my vector search slow in production?

0:28
Common causes are an index not fully in memory, forcing disk reads, too-high ef_search or similar parameters that over-explore the graph, too-large top-k that forces the index to find more candidates than needed, highly selective filters that force many graph nodes to be checked, and insufficient replicas for query concurrency. Profile with database metrics to find the bottleneck. Solutions include adding memory, tuning index parameters, reducing k, adding payload or metadata indexes for filter efficiency, and scaling read replicas.
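The filter problem is worth seeing concretely. When the database post-filters, the engine must over-fetch candidates to survive a selective filter; a payload index that pre-filters avoids this. A toy sketch with brute-force scoring standing in for an ANN index (all names are illustrative):

```python
def filtered_search(score, items, predicate, k, oversample=4):
    """Fetch k * oversample candidates, then apply the metadata filter.

    With a highly selective filter, oversampling compensates for the
    candidates the filter discards; a payload/metadata index that filters
    inside the database avoids this extra work entirely.
    """
    candidates = sorted(items, key=score, reverse=True)[: k * oversample]
    return [it for it in candidates if predicate(it)][:k]

items = [{"id": i, "tenant": "a" if i % 2 else "b", "score": i} for i in range(20)]
hits = filtered_search(lambda it: it["score"], items,
                       lambda it: it["tenant"] == "a", k=3)
print([h["id"] for h in hits])  # → [19, 17, 15]
```

If the filter matches only a tiny fraction of the corpus, even large oversampling can come back short, which is exactly the symptom of slow filtered queries on graph indexes.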

How would you design a RAG system for 100 million documents?

0:29
Use a distributed vector database like Milvus, Qdrant Cloud at scale, or Pinecone with proper sharding and replicas. Partition by tenant, date, or document type to keep individual index sizes manageable. Choose HNSW with scalar quantization for memory efficiency, or DiskANN for disk-based scale. Implement a three-stage retrieval pipeline of BM25 or sparse first stage for recall, dense vector retrieval as second stage, and cross-encoder reranking for top precision. Add caching, streaming, and monitoring at every layer.
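The three-stage pipeline can be sketched as a cascade where each stage is more expensive but runs over fewer candidates. A toy version with keyword overlap standing in for BM25 and a simple scoring function standing in for the cross-encoder (all names and data here are illustrative):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank_score(query_terms, doc):
    # Stand-in for a cross-encoder: reward exact term coverage.
    return len(query_terms & doc["terms"]) / (len(doc["terms"]) or 1)

def cascade_retrieve(query_terms, query_vec, docs, k1=100, k2=20, k3=5):
    """Cheap sparse recall -> dense scoring -> expensive reranking."""
    # Stage 1: keyword-overlap recall over the full corpus (BM25 stand-in).
    s1 = sorted(docs, key=lambda d: len(query_terms & d["terms"]), reverse=True)[:k1]
    # Stage 2: dense similarity, but only over stage-1 survivors.
    s2 = sorted(s1, key=lambda d: dot(query_vec, d["vec"]), reverse=True)[:k2]
    # Stage 3: reranker on the short list only.
    return sorted(s2, key=lambda d: rerank_score(query_terms, d), reverse=True)[:k3]

docs = [
    {"id": 1, "terms": {"rag", "chunking"}, "vec": [1.0, 0.0]},
    {"id": 2, "terms": {"rag", "eval"}, "vec": [0.9, 0.1]},
    {"id": 3, "terms": {"gardening"}, "vec": [0.0, 1.0]},
]
top = cascade_retrieve({"rag", "eval"}, [1.0, 0.0], docs, k1=3, k2=2, k3=1)
print(top[0]["id"])  # → 2
```

The budget shape (k1 ≫ k2 ≫ k3) is the whole point: the cross-encoder's cost is amortized over only the final handful of candidates.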

How would you design a multi-tenant RAG system?

0:28
For moderate tenant counts, use one index per tenant for strong isolation and independent scaling, accepting operational overhead. For high tenant counts, use a shared index with tenant ID in metadata and mandatory filtering on every query, using vector databases like Qdrant or Weaviate that support efficient filtered search. Enforce access control at the application layer before any query reaches the database. Consider per-tenant resource quotas to prevent noisy neighbors, and audit logging to satisfy compliance requirements.
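The "mandatory filtering on every query" rule is easiest to enforce with a wrapper that refuses unscoped queries. A minimal sketch over an in-memory stand-in for the shared index (the class and field names are illustrative):

```python
class TenantScopedIndex:
    """Shared-index wrapper: every query must carry a tenant filter."""

    def __init__(self, rows):
        self._rows = rows  # each row: {"tenant": str, "text": str, "score": float}

    def search(self, tenant_id, k=5):
        if not tenant_id:
            raise ValueError("tenant_id is mandatory on every query")
        scoped = [r for r in self._rows if r["tenant"] == tenant_id]
        return sorted(scoped, key=lambda r: r["score"], reverse=True)[:k]

index = TenantScopedIndex([
    {"tenant": "acme", "text": "acme pricing", "score": 0.90},
    {"tenant": "globex", "text": "globex pricing", "score": 0.95},
])
print([h["tenant"] for h in index.search("acme")])  # only acme rows, regardless of score
```

Making the filter structurally impossible to omit, rather than relying on every call site to remember it, is the point: a single forgotten filter is a cross-tenant data leak.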

How would you design a RAG system that handles both documents and structured data?

0:28
Route queries based on intent using a classifier or language model, sending structured queries to a SQL database or search index and unstructured queries to vector retrieval. For hybrid queries that need both, run the two paths in parallel and combine results in the prompt, clearly labeling which facts come from which source. A single agent can own both tools, text-to-SQL for database queries and vector search for documents, and decide which to invoke per query. Self-query retrievers handle structured filters over metadata within the vector store.

How do you handle conversational context in a RAG chatbot?

0:27
For each turn, rewrite the user's query using conversation history to produce a standalone question that can be retrieved against, for example resolving pronouns or implicit references. Run retrieval on the rewritten query and pass both the conversation history and retrieved context to the generator. Truncate history when it grows long using summarization or sliding window strategies, since embeddings from older turns may not match the current topic. Libraries like LangChain and LlamaIndex offer built-in conversational retrieval chains.
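The rewrite step plus sliding-window truncation fits in one small function. A sketch of the prompt builder (wording and names are illustrative; the rewritten question itself would come from a model call):

```python
def rewrite_prompt(history, question, max_turns=6):
    """Prompt asking the model for a standalone question, with a sliding window."""
    recent = history[-max_turns:]  # drop stale turns that may be off-topic
    transcript = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        "Rewrite the final user question as a standalone question, "
        "resolving pronouns and implicit references from the conversation.\n\n"
        f"{transcript}\nuser: {question}\nStandalone question:"
    )

history = [("user", "Tell me about Qdrant."),
           ("assistant", "Qdrant is a vector database.")]
print(rewrite_prompt(history, "How does it handle filters?"))
```

Retrieval then runs on the model's rewritten output ("How does Qdrant handle filters?"), not on the raw turn, which is what fixes the pronoun mismatch against document embeddings.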

How would you design a RAG system with low latency requirements?

0:26
Use a smaller, faster generation model or distill a larger one, skip the reranker or use a small one, pre-embed common queries, warm the vector index into memory, stream generation output for better perceived latency, and run independent stages such as sparse and dense retrieval in parallel where possible. Consider quantized embeddings to reduce search time. For extreme latency requirements, below one second to first token, pre-compute answers to anticipated queries and serve them from cache, falling back to live generation only for novel queries.
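The cache-then-fallback pattern is usually a semantic cache: serve a stored answer when the query embedding is close enough to a cached one. A minimal NumPy sketch (the class, threshold, and vectors are illustrative):

```python
import numpy as np

class SemanticCache:
    """Serve cached answers for queries whose embeddings are near a hit."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # cosine similarity cutoff; tune per corpus
        self.keys, self.answers = [], []

    def put(self, vec, answer):
        self.keys.append(vec / np.linalg.norm(vec))
        self.answers.append(answer)

    def get(self, vec):
        if not self.keys:
            return None
        v = vec / np.linalg.norm(vec)
        sims = np.array(self.keys) @ v
        best = int(np.argmax(sims))
        return self.answers[best] if sims[best] >= self.threshold else None

cache = SemanticCache()
cache.put(np.array([1.0, 0.0]), "cached answer")
print(cache.get(np.array([0.99, 0.01])))  # near-duplicate query → cache hit
print(cache.get(np.array([0.0, 1.0])))    # novel query → None, fall back to live RAG
```

The threshold is the key knob: too loose and users get stale or wrong answers for merely similar questions, too tight and the cache never hits.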

How would you handle a RAG system where the corpus updates hourly?

0:29
Use incremental ingest rather than rebuilds, maintaining document-level hashes or timestamps to detect changes. Queue and batch updates to amortize embedding and indexing cost. Vector databases like Pinecone, Qdrant, and Milvus support online upserts without downtime, so the main work is on the ingest side. Delete or mark stale documents to avoid serving outdated content. For time-sensitive queries like news, add recency boosts or filters using timestamp metadata. Monitor freshness latency from source update to query-time visibility.
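Hash-based change detection is a few lines with the standard library. A sketch assuming the previous run persisted a `{doc_id: hash}` snapshot (the function names are illustrative):

```python
import hashlib

def content_hash(text):
    """Stable content fingerprint for change detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(previous_hashes, current_docs):
    """Compare last run's {doc_id: hash} against {doc_id: text}; return work lists."""
    upserts = [
        doc_id for doc_id, text in current_docs.items()
        if content_hash(text) != previous_hashes.get(doc_id)
    ]
    deletes = [doc_id for doc_id in previous_hashes if doc_id not in current_docs]
    return upserts, deletes

prev = {"a": content_hash("old a"), "b": content_hash("same b"), "d": content_hash("gone")}
curr = {"a": "new a", "b": "same b", "c": "brand new"}
print(diff_corpus(prev, curr))  # → (['a', 'c'], ['d'])
```

Only the `upserts` list is re-embedded and upserted, and the `deletes` list is removed or tombstoned, which is exactly the incremental ingest the answer describes.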

How do you ensure RAG quality when the corpus contains contradictory information?

0:28
Acknowledge that retrieval will surface contradictions and design the generator prompt to report conflicting information rather than picking arbitrarily. For authoritative sources, use metadata to encode trust levels and prefer or filter by source during retrieval. For time-sensitive topics, prefer recent documents with timestamp metadata. In some domains, present multiple retrieved passages with citations and let the user judge, rather than collapsing to a single answer. Contradictions often reveal corpus quality issues that should be addressed at the data layer.
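The trust-and-recency preference can be applied as a post-retrieval sort over passage metadata. A minimal sketch (the trust table, field names, and data are illustrative):

```python
TRUST = {"official_docs": 0, "internal_wiki": 1, "forum": 2}  # lower = more trusted

def prefer_sources(passages):
    """Order retrieved passages by source trust level, then recency (newest first)."""
    return sorted(
        passages,
        key=lambda p: (TRUST.get(p["source"], len(TRUST)), -p["timestamp"]),
    )

passages = [
    {"source": "forum", "timestamp": 1700, "text": "old forum claim"},
    {"source": "official_docs", "timestamp": 1600, "text": "docs statement"},
    {"source": "forum", "timestamp": 1800, "text": "newer forum claim"},
]
print([p["source"] for p in prefer_sources(passages)])
```

Sorting rather than filtering keeps the contradicting passages available, so the generator can still report the conflict with citations instead of silently picking one side.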

What is a good interview answer to "how would you improve retrieval quality"?

0:29
Walk through the systematic process: first measure current retrieval with recall at k and precision at k on a real evaluation set to establish a baseline. Second, inspect failures manually to diagnose whether the issue is chunking, embedding, query mismatch, or lack of reranking. Third, experiment with specific interventions like hybrid search, a better embedding model, a cross-encoder reranker, query rewriting, or chunk strategy changes. Fourth, re-measure on the same evaluation set to validate improvements. Emphasize measurement over intuition at every step.
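The baseline metrics in step one are simple set operations, which is worth being able to write on a whiteboard. A minimal sketch (function names are the conventional ones, data is illustrative):

```python
def recall_at_k(results, relevant, k):
    """Fraction of relevant ids that appear in the top-k results."""
    hits = len(set(results[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(results[:k]) & set(relevant)) / k

results = ["a", "b", "c", "d"]    # retrieved ids, best first
relevant = {"a", "d"}             # labeled gold ids for this query
print(recall_at_k(results, relevant, 3))     # 1 of 2 relevant found → 0.5
print(precision_at_k(results, relevant, 3))  # 1 of 3 retrieved relevant
```

Averaging these over the evaluation set before and after each intervention is what turns "the retrieval feels better" into a defensible claim.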