Question

What are the biggest sources of latency in a production RAG system?

Accepted Answer

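The biggest source of latency is usually the language model generation call, typically 1 to 10 seconds depending on the model and output length. The other stages are smaller: the query-time embedding call (50 to 500 milliseconds), the vector search itself (usually 10 to 100 milliseconds), and the reranker, if present (50 to 500 milliseconds). Network round trips between services add variable overhead on top of these.

To reduce latency, stream generation output to the user, which cuts perceived latency even though total generation time is unchanged. Run independent steps in parallel where possible: a dense vector search has to wait for the query embedding, but a keyword search or a second retriever can run alongside the embedding call. Cache common queries, and use smaller, faster models where acceptable.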