Introduction
Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications. After designing and deploying multiple RAG systems for enterprise clients at Sazag Infotech, I've learned that building a demo is easy—building a production system that delivers consistent, accurate results is an entirely different challenge.
In this article, I'll share the key lessons I learned while improving query accuracy by 40% and building systems that handle real enterprise workloads.
The Gap Between Demo and Production
Most RAG tutorials show you how to:
- Load documents into a vector database
- Embed a query
- Retrieve similar chunks
- Pass them to an LLM
This works great for demos. But in production, you'll face:
- Inconsistent retrieval quality: Sometimes the most relevant chunks aren't the most semantically similar
- Context window limitations: Enterprise documents are long; you can't just stuff everything into the prompt
- Latency requirements: Users expect sub-second responses
- Cost management: GPT-4 calls add up quickly at scale
Lesson 1: Chunking Strategy Matters More Than You Think
The default "split by 500 tokens" approach fails for structured documents. Here's what actually works:
Semantic Chunking
Instead of fixed-size chunks, split documents at natural boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: fixed-size chunks
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Better: respect document structure
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
```

Document-Aware Chunking
For technical documentation, maintain context by including headers:
```python
def chunk_with_headers(document):
    chunks = []
    current_header = ""

    for section in document.sections:
        if section.is_header:
            current_header = section.text
        else:
            chunk_text = f"{current_header}\n\n{section.text}"
            chunks.append(chunk_text)

    return chunks
```

This simple change improved our retrieval accuracy by 15%.
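To make the header-prefixed chunking above concrete, here is a standalone sketch. The `Section` and `Document` classes are hypothetical stand-ins for whatever document model your parser produces; the chunking logic is repeated so the example runs on its own:

```python
from dataclasses import dataclass

@dataclass
class Section:
    text: str
    is_header: bool = False  # hypothetical flag set by the document parser

@dataclass
class Document:
    sections: list

def chunk_with_headers(document):
    # Same logic as above, repeated so this example is self-contained
    chunks = []
    current_header = ""
    for section in document.sections:
        if section.is_header:
            current_header = section.text
        else:
            chunks.append(f"{current_header}\n\n{section.text}")
    return chunks

doc = Document(sections=[
    Section("Upload Limits", is_header=True),
    Section("Files may not exceed 50MB."),
    Section("Supported Formats", is_header=True),
    Section("PDF, DOCX, and TXT are accepted."),
])

chunks = chunk_with_headers(doc)
# Each chunk now carries its section header as retrievable context
```

A query like "maximum upload size" now matches a chunk that contains both the header "Upload Limits" and the body text, rather than an orphaned sentence.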
Lesson 2: Hybrid Search is Non-Negotiable
Pure vector similarity search has a critical flaw: it can miss exact matches. When a user searches for "error code E-4502", semantic search might return chunks about error handling in general, missing the specific error code documentation.
Implementing Hybrid Search
We use a combination of:
- Dense retrieval (vector similarity)
- Sparse retrieval (BM25/keyword matching)
- Reciprocal Rank Fusion to combine results
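Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. This is a minimal, generic implementation of the standard RRF formula (score = sum of 1/(k + rank) over each ranked list), not the exact internals of any particular library; the document IDs are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one ranking.

    Each list contributes 1 / (k + rank) per document; the constant k
    dampens the influence of top ranks (60 is the commonly used default).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # e.g. vector-similarity order
sparse = ["d7", "d3", "d9"]  # e.g. BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists ("d3", "d7") outrank documents that only one retriever found, which is exactly the behavior you want from hybrid search.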
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Create retrievers
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
```

This hybrid approach improved our query accuracy by 25%.
Lesson 3: Vector Database Choice Matters
We evaluated ChromaDB, Pinecone, and Weaviate for different use cases:
| Database | Best For | Trade-offs |
|---|---|---|
| ChromaDB | Prototyping, small datasets | Limited scalability |
| Pinecone | Production, managed infrastructure | Cost at scale |
| Weaviate | Self-hosted, hybrid search | Operational overhead |
For most enterprise clients, we settled on Pinecone for managed deployments and Weaviate for on-premise requirements.
Optimization: Metadata Filtering
Don't just rely on vector similarity. Use metadata to pre-filter:
```python
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={
        "document_type": "technical_spec",
        "version": {"$gte": "2.0"},
        "department": user_department
    }
)
```

This reduces the search space and improves both accuracy and latency.
Lesson 4: Query Understanding Changes Everything
Users don't always ask perfect questions. A production RAG system needs query preprocessing:
Query Expansion
```python
def expand_query(original_query: str, llm) -> list[str]:
    prompt = f"""Given this search query, generate 3 alternative
    phrasings that might help find relevant information:

    Query: {original_query}

    Return only the alternative queries, one per line."""

    alternatives = llm.invoke(prompt).split("\n")
    return [original_query] + alternatives
```

Intent Classification
Before retrieval, classify the query intent:
```python
intents = ["factual_lookup", "how_to", "troubleshooting", "comparison"]

def classify_intent(query: str) -> str:
    # Lightweight keyword heuristic; in production, swap in a small
    # classifier or an LLM call. The intent helps select the right
    # retrieval strategy.
    q = query.lower()
    if any(word in q for word in ("error", "fail", "broken", "fix")):
        return "troubleshooting"
    if q.startswith("how") or "how to" in q:
        return "how_to"
    if " vs " in q or "compare" in q or "difference" in q:
        return "comparison"
    return "factual_lookup"
```

Lesson 5: Evaluation is Continuous
You can't improve what you can't measure. We built a continuous evaluation pipeline:
Metrics We Track
- Retrieval Precision@K: Are the retrieved chunks relevant?
- Answer Correctness: Does the final answer match ground truth?
- Faithfulness: Is the answer grounded in retrieved context?
- Latency P95: What's the worst-case response time?
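Two of these metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions (Precision@K as the fraction of relevant chunks in the top K, and a nearest-rank 95th percentile for latency); the document IDs and latency numbers are illustrative, not from our system:

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k retrieved chunks that are actually relevant
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def p95_latency(latencies_ms):
    # Nearest-rank 95th percentile: smallest value with >= 95% of
    # observations at or below it
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

precision = precision_at_k(
    ["doc_123", "doc_999", "doc_456", "doc_777"],
    {"doc_123", "doc_456"},
    k=4,
)
worst_case = p95_latency([120, 450, 300, 980, 210, 640, 150, 330, 270, 510])
```

Tracking these per deployment, rather than as one-off numbers, is what makes the evaluation continuous.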
Automated Testing
```python
test_cases = [
    {
        "query": "What is the maximum file size for uploads?",
        "expected_answer": "50MB",
        "relevant_doc_ids": ["doc_123", "doc_456"]
    },
    # ... more test cases
]

def evaluate_rag_system(rag_chain, test_cases):
    results = []
    for case in test_cases:
        response = rag_chain.invoke(case["query"])
        results.append({
            "retrieval_hit": check_retrieval(response, case),
            "answer_correct": check_answer(response, case),
            "latency": response.latency
        })
    return aggregate_metrics(results)
```

Results: 40% Improvement in Query Accuracy
By implementing these lessons, we achieved:
- 40% improvement in query accuracy (measured by answer correctness)
- 60% reduction in "I don't know" responses
- Sub-500ms P95 latency for most queries
- 30% cost reduction through better caching and retrieval
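The caching mentioned above can start very simply. This is a minimal sketch, not our production cache: `run_rag_pipeline` is a hypothetical stand-in for the expensive retrieval-plus-LLM call, and `functools.lru_cache` substitutes for whatever shared cache (e.g. Redis) a real deployment would use:

```python
from functools import lru_cache

def run_rag_pipeline(query: str) -> str:
    # Hypothetical stand-in for the real retrieval + LLM call
    return f"answer for: {query}"

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivial variants share a cache entry
    return " ".join(query.lower().split())

@lru_cache(maxsize=4096)
def cached_answer(query_key: str) -> str:
    return run_rag_pipeline(query_key)

def answer(query: str) -> str:
    return cached_answer(normalize(query))
```

Even this level of normalization means repeated questions from different users skip the LLM call entirely, which is where much of the cost saving comes from.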
Key Takeaways
- Chunking is foundational: Invest time in document-aware chunking strategies
- Hybrid search is essential: Don't rely on vector similarity alone
- Preprocess queries: Users ask imperfect questions; help them
- Measure everything: Build evaluation into your pipeline from day one
- Iterate continuously: RAG systems improve through constant refinement
Building production RAG systems is challenging, but the payoff—accurate, helpful AI assistants that actually work—is worth the investment.
Have questions about building RAG systems? Feel free to reach out on LinkedIn or GitHub.