Question

How do you control cost in a production RAG system?

Accepted Answer

Control cost by choosing the smallest embedding and generation models that still meet your quality bar, caching embeddings and responses to avoid repeat calls, batching embedding requests at ingest, truncating Matryoshka embeddings (for models trained to support it) to reduce storage and search costs, and limiting the retrieved context sent to the generator to the minimum needed per query. Monitor cost per query and per user to identify abusive patterns. At scale, self-hosting embedding and reranker models on GPU infrastructure is often cheaper than hosted APIs, while generation usually remains hosted for access to frontier-model quality.
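
As a rough illustration of the caching and batching points, here is a minimal Python sketch. The names embed_with_cache and embed_batch, and the plain dict used as a cache, are assumptions made for this example and do not come from any particular library. embed_batch stands in for whatever hosted API or self-hosted model you call; in production the dict would more likely be Redis or a database table.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def _key(text: str) -> str:
    # Cache key is a hash of the exact text. Normalizing the text
    # (whitespace, case) before hashing is a separate design choice.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical embedder
    cache: Dict[str, Vector],
    batch_size: int = 64,
) -> List[Vector]:
    """Return one vector per input text, paying only for cache misses."""
    # Collect texts that are not cached yet, deduplicated, in first-seen order.
    missing: Dict[str, str] = {}
    for text in texts:
        k = _key(text)
        if k not in cache and k not in missing:
            missing[k] = text

    # Send the misses to the embedder in batches instead of one call per text.
    keys = list(missing)
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        vectors = embed_batch([missing[k] for k in chunk])
        if len(vectors) != len(chunk):
            raise ValueError("embed_batch must return one vector per input")
        for k, vector in zip(chunk, vectors):
            cache[k] = vector

    # Every text is now in the cache; return vectors in input order.
    return [cache[_key(text)] for text in texts]
```

With this shape, re-ingesting a document whose chunks are mostly unchanged only pays for the chunks that actually changed, and duplicate chunks within one ingest run are embedded once.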