
How to ship RAG in your company without burning the budget

Real architecture for RAG, recommended stack, common mistakes and how to compute ROI before investing.

March 28, 2026 · Dynecron

Shipping Retrieval-Augmented Generation (RAG) sounds expensive and exotic until you sit with an architect and realize 90% of it is boring data plumbing. This post is the architecture we pull out when a client asks “we want AI that answers over our own documents”, without burning budget on what doesn’t move the needle.

The architecture on one page

[Documents / tickets / APIs]
        ↓
[Ingestion + normalization]   ←  this is what most underestimate
        ↓
[Contextual chunking]
        ↓
[Embeddings + vector index]
        ↓
[Hybrid retrieval (BM25 + vector)]
        ↓
[Re-ranking]
        ↓
[LLM with prompt + guardrails]
        ↓
[Client]

The pretty part of the chain is the LLM at the end, but the ROI is captured in the two blocks right after the sources: ingestion and chunking. Get those wrong and no model, Claude included, can save you.
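To make "contextual chunking" concrete, here is a minimal sketch. It assumes ingestion has already normalized each document to text with Markdown-style `#` headings, which is our assumption for the example; real PDFs need a layout-aware parser first. `Chunk` and `chunk_by_section` are hypothetical names, not part of any library. Splits happen only at headers, so a table is never cut mid-row.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str  # ID of the originating document, kept so answers can cite it
    heading: str    # section title the chunk belongs to ("" before the first header)
    text: str       # section body, tables included, never split mid-row


# Matches Markdown-style headers such as "# Refunds" or "### Fees".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_section(source_id: str, text: str) -> list[Chunk]:
    """Split a normalized document into one chunk per section."""
    chunks: list[Chunk] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        content = "\n".join(body).strip()
        if content:
            chunks.append(Chunk(source_id, heading, content))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # a new header closes the previous section
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()  # close the last section
    return chunks


doc = (
    "# Refunds\n"
    "Refunds take 5 days.\n\n"
    "| Plan | Fee |\n| --- | --- |\n| Basic | 0 |\n\n"
    "# Shipping\n"
    "Ships in 48h.\n"
)
chunks = chunk_by_section("kb-001", doc)
for chunk in chunks:
    print(chunk.heading, "->", len(chunk.text), "chars")
```

This sketch does not cap chunk size. A very long section would still need a secondary split, ideally at paragraph or table boundaries rather than at a fixed token count.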

Stack we recommend (and why)

  • Embeddings: OpenAI text-embedding-3-large or Cohere embed-multilingual-v3 for Spanish-heavy content. Open-source alternative: bge-m3 if running self-hosted.
  • Vector index: pgvector inside Postgres for most cases. You know it, you operate it, it sits next to your operational DB. Reserve Pinecone or Qdrant for >10M vectors or complex multi-tenant patterns.
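  • Hybrid retrieval: combine lexical search (BM25 in Elasticsearch, or Postgres full-text search over tsvector, which ranks with ts_rank rather than true BM25) with vector similarity, and merge the two rankings with Reciprocal Rank Fusion. This boosts hit rate by 15–30% over vector search alone.
  • Re-ranking: Cohere Rerank 3 or bge-reranker-v2-m3. The gain is marginal on its own, but it matters when you retrieve the top 20 and pass only the top 5 into the prompt.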
  • Generation LLM: Claude 3.5 Sonnet or GPT-4o. Detailed in our model comparison post.
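Reciprocal Rank Fusion, mentioned in the hybrid retrieval bullet, is small enough to sketch in full. Each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The function name, the sample IDs and the choice of k = 60 (a commonly used default) are assumptions for illustration. In practice the two input lists would come from your lexical and vector queries.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
vector_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical vector results

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only rank positions, so you never have to normalize lexical scores against cosine distances. That is the main reason we prefer it to weighted score blending as a starting point.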

Common mistakes (avoiding them saves time)

  1. Fixed-size chunking that ignores structure. A PDF with tables sliced every 512 tokens splits rows. Invest in contextual chunking: by section, respecting headers and tables.
  2. Not versioning the embeddings. You change the embedding model and don't reindex, and you get inconsistent results for months. Always store the embedding_model_version next to the vector.
  3. Skipping continuous evaluation. Without a small ground-truth query set (even 50 labeled examples), you can’t tell whether a change is an improvement.
  4. Infinite prompts. Stuffing 8,000 tokens of context “just in case” degrades the answer and raises cost. Target: 3–5 relevant chunks, not 20.
  5. Sourceless answers. Enterprise users need to verify. Always return retrieved sources with link/ID.
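For the continuous evaluation point, a hit-rate check over a small labeled set can be this short. It is a sketch under assumptions: `hit_rate_at_k`, the sample queries and the toy retriever are hypothetical, and `retrieve` stands in for whatever function returns ranked document IDs in your system.

```python
def hit_rate_at_k(labeled_queries, retrieve, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, relevant_ids in labeled_queries:
        top_k = retrieve(query)[:k]
        if any(doc_id in relevant_ids for doc_id in top_k):
            hits += 1
    return hits / len(labeled_queries)


# Hypothetical ground truth: (query, set of document IDs that answer it).
ground_truth = [
    ("how do refunds work", {"kb-001"}),
    ("shipping time", {"kb-007"}),
    ("reset password", {"kb-020"}),
]

# Toy retriever standing in for the real retrieval pipeline.
fake_index = {
    "how do refunds work": ["kb-001", "kb-002"],
    "shipping time": ["kb-003", "kb-007"],
    "reset password": ["kb-009", "kb-010"],
}


def toy_retrieve(query):
    return fake_index.get(query, [])


print(f"hit rate@1: {hit_rate_at_k(ground_truth, toy_retrieve, k=1):.2f}")
print(f"hit rate@2: {hit_rate_at_k(ground_truth, toy_retrieve, k=2):.2f}")
```

Run the same set before and after every change to chunking, embeddings or retrieval. If the number does not move, the change probably did not either.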

Operating cost

For a typical 10K queries/month over a 50K-document base:

  • Weekly embedding re-indexing: ~US$10–30.
  • Infrastructure (Postgres + pgvector on a mid-tier server): ~US$100–250.
  • LLM inference (Sonnet): ~US$100–400.
  • Observability and logs: ~US$30.

Total: US$240–710 per month. A single support agent's salary already exceeds the cost of the entire system.
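The arithmetic is easy to rerun with your own numbers. The sketch below sums the line items as listed (the low ends add up to US$240, the high ends to US$710) and derives a cost per query. The dictionary keys and variable names are ours, chosen for illustration.

```python
# Monthly cost ranges in US$, (low, high), taken from the list above.
monthly_cost_usd = {
    "weekly embedding re-indexing": (10, 30),
    "infrastructure (Postgres + pgvector)": (100, 250),
    "LLM inference": (100, 400),
    "observability and logs": (30, 30),
}
queries_per_month = 10_000

low = sum(item_low for item_low, _ in monthly_cost_usd.values())
high = sum(item_high for _, item_high in monthly_cost_usd.values())

print(f"Monthly total: US${low}-{high}")
print(f"Per query: US${low / queries_per_month:.3f}-{high / queries_per_month:.3f}")
```

At this volume the system costs roughly 2 to 7 US cents per query. For the ROI calculation, compare that against what the same question costs when a person answers it.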

When NOT to recommend RAG

  • If your documents change less than once a quarter and there are <50 of them, fine-tuning or few-shot can be simpler.
  • If the question is mostly about structured data (relational), an agent with tool use over your DB often beats RAG over exported docs.
  • If the case is natural conversation without a knowledge base, RAG is overkill.

Next step

If you want to see what this looks like for your case, bring 20 real questions from your support team and we’ll do a 30-minute session. You’ll walk out with an effort estimate and a realistic monthly operating cost.
