How to ship RAG in your company without burning the budget
Real architecture for RAG, recommended stack, common mistakes and how to compute ROI before investing.
Shipping Retrieval-Augmented Generation (RAG) sounds expensive and exotic until you sit with an architect and realize 90% of it is boring data plumbing. This post is the architecture we pull out when a client asks “we want AI that answers over our own documents”, without burning budget on what doesn’t move the needle.
The architecture on one page
[Documents / tickets / APIs]
│
▼
[Ingestion + normalization] ← this is what most underestimate
│
▼
[Contextual chunking]
│
▼
[Embeddings + vector index]
│
▼
[Hybrid retrieval (BM25 + vector)]
│
▼
[Re-ranking]
│
▼
[LLM with prompt + guardrails]
│
▼
[Client]
The pretty part of the chain is the LLM at the end, but the ROI is captured in the two top blocks: ingestion and chunking. Get those wrong and no Claude can save you.
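To make "contextual chunking" concrete, here is a minimal sketch in Python. It assumes the ingestion step has already normalized documents to Markdown-style text with `#` headings, which is our assumption and not something every pipeline does; `Chunk` and `chunk_by_section` are hypothetical names. Because it only splits at headings, table rows inside a section stay together.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # trail of headings above the text, e.g. ["Billing", "Refunds"]
    text: str


# Assumption: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_section(markdown: str) -> list[Chunk]:
    """Split normalized Markdown into one chunk per section, keeping the heading trail."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as a chunk under the current heading trail.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(heading_path=list(path), text=body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push the new one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

A real pipeline would also cap chunk size and split oversized sections at paragraph or table boundaries; this sketch leaves that out. Prepending the heading trail to the text before embedding is one common way to give each chunk its context.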
Stack we recommend (and why)
- Embeddings: OpenAI text-embedding-3-large, or Cohere embed-multilingual-v3 for Spanish-heavy content. Open-source alternative: bge-m3 if running self-hosted.
- Vector index: pgvector inside Postgres for most cases. You know it, you operate it, it sits next to your operational DB. Reserve Pinecone or Qdrant for >10M vectors or complex multi-tenant patterns.
- Hybrid retrieval: combine BM25 (Postgres tsvector or Elasticsearch) with vector similarity and merge the two ranked lists with Reciprocal Rank Fusion. This typically boosts hit rate by 15–30% over vector search alone.
- Re-ranking: Cohere Rerank 3 or bge-reranker-v2-m3. Marginal but important when retrieving top-20 to pass top-5 into the prompt.
- Generation LLM: Claude 3.5 Sonnet or GPT-4o. Detailed in our model comparison post.
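The Reciprocal Rank Fusion step is small enough to sketch in a few lines of Python. Each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The function name is hypothetical, and k = 60 is the conventional default from the RRF literature rather than a value tuned for any particular corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # illustrative IDs from the keyword search
vector_hits = ["c", "a", "d"]  # illustrative IDs from the vector search
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.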
Common mistakes (and the time you save by avoiding them)
- Fixed-size chunking that ignores structure. Slicing a PDF with tables every 512 tokens cuts rows in half. Invest in contextual chunking: by section, respecting headers and tables.
- Not versioning the embeddings. If you change the embedding model and don’t reindex, you get inconsistent results for months. Always store the embedding_model_version next to the vector.
- Skipping continuous evaluation. Without a small ground-truth query set (even 50 labeled examples), you can’t tell whether a change is an improvement.
- Infinite prompts. Stuffing 8,000 tokens of context “just in case” degrades the answer and raises cost. Target: 3–5 relevant chunks, not 20.
- Sourceless answers. Enterprise users need to verify. Always return retrieved sources with link/ID.
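For the continuous evaluation point, a hit rate over the labeled set is often enough to start. The sketch below assumes the ground truth is a mapping from each query to the set of document IDs that answer it, and that your retriever is a callable `retrieve(query, k)` returning ranked IDs. Both shapes are our assumptions for illustration, not a fixed interface.

```python
from typing import Callable


def hit_rate_at_k(
    labeled: dict[str, set[str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of labeled queries with at least one relevant document in the top k."""
    if not labeled:
        raise ValueError("labeled set is empty")
    hits = 0
    for query, relevant_ids in labeled.items():
        top_k = set(retrieve(query, k)[:k])
        if relevant_ids & top_k:
            hits += 1
    return hits / len(labeled)


# Illustrative stand-in for a real retriever: canned results per query.
canned = {"reset password": ["d3", "d1"], "refund policy": ["d2", "d4"]}


def fake_retrieve(query: str, k: int) -> list[str]:
    return canned.get(query, [])[:k]


ground_truth = {"reset password": {"d1"}, "refund policy": {"d9"}}
score = hit_rate_at_k(ground_truth, fake_retrieve, k=5)
```

Run it before and after every change to chunking, embeddings, or retrieval, and compare the two numbers.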
Operating cost
For a typical 10K queries/month over a 50K-document base:
- Weekly re-indexing of embeddings: ~US$10–30.
- Infrastructure (Postgres + pgvector on a mid-tier server): ~US$100–250.
- LLM inference (Sonnet): ~US$100–400.
- Observability and logs: ~US$30.
Total: roughly US$240–710 per month. One support agent’s salary already exceeds the cost of the entire system.
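If you want to redo this arithmetic with your own numbers, a short Python sketch is enough. The ranges are the line items listed above; the per-query figure is simply the monthly total divided by the 10K queries and is our addition for illustration.

```python
# Monthly cost ranges in USD as (low, high), taken from the line items above.
MONTHLY_COSTS_USD = {
    "weekly re-indexing of embeddings": (10, 30),
    "infrastructure (Postgres + pgvector)": (100, 250),
    "LLM inference": (100, 400),
    "observability and logs": (30, 30),
}
QUERIES_PER_MONTH = 10_000


def total_range(costs: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


low, high = total_range(MONTHLY_COSTS_USD)
cost_per_query = (low / QUERIES_PER_MONTH, high / QUERIES_PER_MONTH)
print(f"Monthly total: US${low}-{high}")
print(f"Per query: US${cost_per_query[0]:.3f}-{cost_per_query[1]:.3f}")
```

Swap in your own query volume and line items; LLM inference is usually the one that moves most with traffic.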
When NOT to recommend RAG
- If your documents change less than once a quarter and there are <50 of them, fine-tuning or few-shot can be simpler.
- If the question is mostly about structured data (relational), an agent with tool use over your DB often beats RAG over exported docs.
- If the case is natural conversation without a knowledge base, RAG is overkill.
Next step
If you want to see what this looks like for your case, bring 20 real questions from your support team and we’ll do a 30-minute session. You’ll walk out with an effort estimate and a realistic monthly operating cost.