AI & Systems · 8 min read

Building Production RAG Systems: Lessons from the Field

What I learned designing RAG architectures for enterprise clients—from vector database optimization to achieving 40% better query accuracy.


Anshuman Parmar

December 2025


Introduction

Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications. After designing and deploying multiple RAG systems for enterprise clients at Sazag Infotech, I've learned that building a demo is easy—building a production system that delivers consistent, accurate results is an entirely different challenge.

In this article, I'll share the key lessons I learned while improving query accuracy by 40% and building systems that handle real enterprise workloads.

The Gap Between Demo and Production

Most RAG tutorials show you how to:

  1. Load documents into a vector database
  2. Embed a query
  3. Retrieve similar chunks
  4. Pass them to an LLM
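
Condensed to code, the tutorial pipeline really is just a few lines. Here's a toy sketch with a hypothetical bag-of-words "embedder" standing in for a real embedding model and vector database:

```python
# Toy demo pipeline: the embed() function below is a stand-in for a real
# embedding model, and the list acts as the "vector database".
def embed(text):
    vocab = ["upload", "limit", "error", "login"]
    return [text.lower().count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

docs = ["Upload limit is 50MB", "Login requires SSO", "Error codes reference"]
index = [(doc, embed(doc)) for doc in docs]               # 1. load + embed documents
query_vec = embed("what is the upload limit?")            # 2. embed the query
top = max(index, key=lambda d: cosine(d[1], query_vec))   # 3. retrieve the most similar chunk
prompt = f"Context: {top[0]}\n\nQuestion: what is the upload limit?"  # 4. pass to an LLM
print(top[0])  # Upload limit is 50MB
```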

This works great for demos. But in production, you'll face:

  • Inconsistent retrieval quality: Sometimes the most relevant chunks aren't the most semantically similar
  • Context window limitations: Enterprise documents are long; you can't just stuff everything into the prompt
  • Latency requirements: Users expect sub-second responses
  • Cost management: GPT-4 calls add up quickly at scale

Lesson 1: Chunking Strategy Matters More Than You Think

The default "split by 500 tokens" approach fails for structured documents. Here's what actually works:

Semantic Chunking

Instead of fixed-size chunks, split documents at natural boundaries:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: fixed-size chunks that ignore document structure
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Better: split at headings and paragraph boundaries first
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
```

Document-Aware Chunking

For technical documentation, maintain context by including headers:

```python
def chunk_with_headers(document):
    chunks = []
    current_header = ""

    for section in document.sections:
        if section.is_header:
            current_header = section.text
        else:
            # Prefix each chunk with its most recent header so the
            # embedding carries the section's context
            chunk_text = f"{current_header}\n\n{section.text}"
            chunks.append(chunk_text)

    return chunks
```

This simple change improved our retrieval accuracy by 15%.

Lesson 2: Hybrid Search is Non-Negotiable

Pure vector similarity search has a critical flaw: it can miss exact matches. When a user searches for "error code E-4502", semantic search might return chunks about error handling in general, missing the specific error code documentation.

We use a combination of:

  1. Dense retrieval (vector similarity)
  2. Sparse retrieval (BM25/keyword matching)
  3. Reciprocal Rank Fusion to combine results

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Create the two retrievers
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine them; weights slightly favor dense retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
```

This hybrid approach improved our query accuracy by 25%.
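
Reciprocal Rank Fusion itself is tiny. A self-contained sketch, using `k=60` as the commonly used smoothing constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of document IDs per retriever.
    # Each document scores 1 / (k + rank), summed across lists,
    # so documents ranked well by multiple retrievers rise to the top.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # order from vector similarity
sparse = ["d1", "d9", "d3"]  # order from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd3', 'd9', 'd7']
```

Note how `d1`, ranked highly by both retrievers, beats `d3`, which topped only one list.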

Lesson 3: Vector Database Choice Matters

We evaluated ChromaDB, Pinecone, and Weaviate for different use cases:

| Database | Best For | Trade-offs |
|----------|----------|------------|
| ChromaDB | Prototyping, small datasets | Limited scalability |
| Pinecone | Production, managed infrastructure | Cost at scale |
| Weaviate | Self-hosted, hybrid search | Operational overhead |

For most enterprise clients, we settled on Pinecone for managed deployments and Weaviate for on-premise requirements.

Optimization: Metadata Filtering

Don't just rely on vector similarity. Use metadata to pre-filter:

```python
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={
        "document_type": "technical_spec",
        "version": {"$gte": "2.0"},
        "department": user_department
    }
)
```

This reduces the search space and improves both accuracy and latency.

Lesson 4: Query Understanding Changes Everything

Users don't always ask perfect questions. A production RAG system needs query preprocessing:

Query Expansion

```python
def expand_query(original_query: str, llm) -> list[str]:
    prompt = f"""Given this search query, generate 3 alternative
phrasings that might help find relevant information:

Query: {original_query}

Return only the alternative queries, one per line."""

    # Drop blank lines the model may emit between alternatives
    alternatives = [
        line.strip() for line in llm.invoke(prompt).split("\n") if line.strip()
    ]
    return [original_query] + alternatives
```

Intent Classification

Before retrieval, classify the query intent:

```python
intents = ["factual_lookup", "how_to", "troubleshooting", "comparison"]

def classify_intent(query: str) -> str:
    # Lightweight heuristic; in production, use a small classifier or an LLM.
    # The intent helps select the right retrieval strategy.
    q = query.lower()
    if any(word in q for word in ("error", "fail", "broken", "fix")):
        return "troubleshooting"
    if q.startswith(("how do", "how to", "how can")):
        return "how_to"
    if " vs " in q or "compare" in q:
        return "comparison"
    return "factual_lookup"
```

Lesson 5: Evaluation is Continuous

You can't improve what you can't measure. We built a continuous evaluation pipeline:

Metrics We Track

  1. Retrieval Precision@K: Are the retrieved chunks relevant?
  2. Answer Correctness: Does the final answer match ground truth?
  3. Faithfulness: Is the answer grounded in retrieved context?
  4. Latency P95: What's the worst-case response time?
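
Precision@K in particular is cheap to compute once you know which documents are relevant to each test query. A minimal implementation:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved chunks that are actually relevant.
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=4))  # 0.5
```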

Automated Testing

```python
test_cases = [
    {
        "query": "What is the maximum file size for uploads?",
        "expected_answer": "50MB",
        "relevant_doc_ids": ["doc_123", "doc_456"]
    },
    # ... more test cases
]

def evaluate_rag_system(rag_chain, test_cases):
    # check_retrieval, check_answer, and aggregate_metrics are project helpers
    results = []
    for case in test_cases:
        response = rag_chain.invoke(case["query"])
        results.append({
            "retrieval_hit": check_retrieval(response, case),
            "answer_correct": check_answer(response, case),
            "latency": response.latency
        })
    return aggregate_metrics(results)
```

Results: 40% Improvement in Query Accuracy

By implementing these lessons, we achieved:

  • 40% improvement in query accuracy (measured by answer correctness)
  • 60% reduction in "I don't know" responses
  • Sub-500ms P95 latency for most queries
  • 30% cost reduction through better caching and retrieval
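
Much of the caching benefit comes from memoizing embedding calls, since the same queries and chunks recur constantly. A minimal sketch of the idea; the hash-based stand-in vector below is purely illustrative, not a real embedding:

```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple:
    # In production this body would call the embedding API; the cache
    # ensures repeated queries and re-indexed chunks never pay twice.
    digest = hashlib.sha256(text.encode()).digest()
    return tuple(b / 255 for b in digest[:8])  # stand-in vector

v1 = cached_embedding("what is the upload limit?")
v2 = cached_embedding("what is the upload limit?")  # served from cache
print(cached_embedding.cache_info().hits)  # 1
```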

Key Takeaways

  1. Chunking is foundational: Invest time in document-aware chunking strategies
  2. Hybrid search is essential: Don't rely on vector similarity alone
  3. Preprocess queries: Users ask imperfect questions; help them
  4. Measure everything: Build evaluation into your pipeline from day one
  5. Iterate continuously: RAG systems improve through constant refinement

Building production RAG systems is challenging, but the payoff—accurate, helpful AI assistants that actually work—is worth the investment.


Have questions about building RAG systems? Feel free to reach out on LinkedIn or GitHub.


WRITTEN BY

Anshuman Parmar

Senior Full Stack Developer specializing in AI systems, browser automation, and scalable web applications. Building production-grade solutions that deliver measurable business impact.
