Introduction
Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications. After designing and deploying multiple RAG systems for enterprise clients at Sazag Infotech, I've learned that building a demo is easy—building a production system that delivers consistent, accurate results is an entirely different challenge.
In this article, I'll share the key lessons I learned while improving query accuracy by 40% and building systems that handle real enterprise workloads.
The Gap Between Demo and Production
Most RAG tutorials show you how to:
- Load documents into a vector database
- Embed a query
- Retrieve similar chunks
- Pass them to an LLM
This works great for demos. But in production, you'll face:
- Inconsistent retrieval quality: Sometimes the most relevant chunks aren't the most semantically similar
- Context window limitations: Enterprise documents are long; you can't just stuff everything into the prompt
- Latency requirements: Users expect sub-second responses
- Cost management: GPT-4 calls add up quickly at scale
Lesson 1: Chunking Strategy Matters More Than You Think
The default "split by 500 tokens" approach fails for structured documents. Here's what actually works:
Semantic Chunking
Instead of fixed-size chunks, split documents at natural boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Bad: fixed-size chunks
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Better: respect document structure
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
```

Document-Aware Chunking
For technical documentation, maintain context by including headers:
```python
def chunk_with_headers(document):
    chunks = []
    current_header = ""

    for section in document.sections:
        if section.is_header:
            current_header = section.text
        else:
            chunk_text = f"{current_header}\n\n{section.text}"
            chunks.append(chunk_text)

    return chunks
```

This simple change improved our retrieval accuracy by 15%.
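To make the header-prefixed chunking above concrete, here is a standalone sketch. The `Section` and `Document` classes are hypothetical stand-ins for whatever document model your parser produces; the chunking logic is repeated so the example runs on its own:

```python
from dataclasses import dataclass

@dataclass
class Section:
    text: str
    is_header: bool = False  # hypothetical flag set by the document parser

@dataclass
class Document:
    sections: list

def chunk_with_headers(document):
    # Same logic as above, repeated so this example is self-contained
    chunks = []
    current_header = ""
    for section in document.sections:
        if section.is_header:
            current_header = section.text
        else:
            chunks.append(f"{current_header}\n\n{section.text}")
    return chunks

doc = Document(sections=[
    Section("Upload Limits", is_header=True),
    Section("Files may not exceed 50MB."),
    Section("Supported Formats", is_header=True),
    Section("PDF, DOCX, and TXT are accepted."),
])

chunks = chunk_with_headers(doc)
# Each chunk now carries its section header as retrievable context
```

A query like "maximum upload size" now matches a chunk that contains both the header "Upload Limits" and the body text, rather than an orphaned sentence.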
Lesson 2: Hybrid Search is Non-Negotiable
Pure vector similarity search has a critical flaw: it can miss exact matches. When a user searches for "error code E-4502", semantic search might return chunks about error handling in general, missing the specific error code documentation.
Implementing Hybrid Search
We use a combination of:
- Dense retrieval (vector similarity)
- Sparse retrieval (BM25/keyword matching)
- Reciprocal Rank Fusion to combine results
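Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. This is a minimal, generic implementation of the standard RRF formula (score = sum of 1/(k + rank) over each ranked list), not the exact internals of any particular library; the document IDs are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one ranking.

    Each list contributes 1 / (k + rank) per document; the constant k
    dampens the influence of top ranks (60 is the commonly used default).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # e.g. vector-similarity order
sparse = ["d7", "d3", "d9"]  # e.g. BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists ("d3", "d7") outrank documents that only one retriever found, which is exactly the behavior you want from hybrid search.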
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Create retrievers
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
```

This hybrid approach improved our query accuracy by 25%.
Lesson 3: Vector Database Choice Matters
We evaluated ChromaDB, Pinecone, and Weaviate for different use cases:
| Database | Best For | Trade-offs |
|---|---|---|
| ChromaDB | Prototyping, small datasets | Limited scalability |
| Pinecone | Production, managed infrastructure | Cost at scale |
| Weaviate | Self-hosted, hybrid search | Operational overhead |
For most enterprise clients, we settled on Pinecone for managed deployments and Weaviate for on-premise requirements.
Optimization: Metadata Filtering
Don't just rely on vector similarity. Use metadata to pre-filter:
```python
results = vectorstore.similarity_search(
    query,
    k=10,
    filter={
        "document_type": "technical_spec",
        "version": {"$gte": "2.0"},
        "department": user_department
    }
)
```

This reduces the search space and improves both accuracy and latency.
Lesson 4: Query Understanding Changes Everything
Users don't always ask perfect questions. A production RAG system needs query preprocessing:
Query Expansion
```python
def expand_query(original_query: str, llm) -> list[str]:
    prompt = f"""Given this search query, generate 3 alternative
    phrasings that might help find relevant information:

    Query: {original_query}

    Return only the alternative queries, one per line."""

    alternatives = llm.invoke(prompt).split("\n")
    return [original_query] + alternatives
```

Intent Classification
Before retrieval, classify the query intent:
```python
intents = ["factual_lookup", "how_to", "troubleshooting", "comparison"]

def classify_intent(query: str) -> str:
    # Lightweight keyword heuristic; in production, swap in a small
    # classifier or an LLM call. The intent helps select the right
    # retrieval strategy.
    q = query.lower()
    if any(word in q for word in ("error", "fail", "broken", "fix")):
        return "troubleshooting"
    if q.startswith("how") or "how to" in q:
        return "how_to"
    if " vs " in q or "compare" in q or "difference" in q:
        return "comparison"
    return "factual_lookup"
```

Lesson 5: Evaluation is Continuous
You can't improve what you can't measure. We built a continuous evaluation pipeline:
Metrics We Track
- Retrieval Precision@K: Are the retrieved chunks relevant?
- Answer Correctness: Does the final answer match ground truth?
- Faithfulness: Is the answer grounded in retrieved context?
- Latency P95: What's the worst-case response time?
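Two of these metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions (Precision@K as the fraction of relevant chunks in the top K, and a nearest-rank 95th percentile for latency); the document IDs and latency numbers are illustrative, not from our system:

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k retrieved chunks that are actually relevant
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def p95_latency(latencies_ms):
    # Nearest-rank 95th percentile: smallest value with >= 95% of
    # observations at or below it
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

precision = precision_at_k(
    ["doc_123", "doc_999", "doc_456", "doc_777"],
    {"doc_123", "doc_456"},
    k=4,
)
worst_case = p95_latency([120, 450, 300, 980, 210, 640, 150, 330, 270, 510])
```

Tracking these per deployment, rather than as one-off numbers, is what makes the evaluation continuous.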
Automated Testing
```python
test_cases = [
    {
        "query": "What is the maximum file size for uploads?",
        "expected_answer": "50MB",
        "relevant_doc_ids": ["doc_123", "doc_456"]
    },
    # ... more test cases
]

def evaluate_rag_system(rag_chain, test_cases):
    results = []
    for case in test_cases:
        response = rag_chain.invoke(case["query"])
        results.append({
            "retrieval_hit": check_retrieval(response, case),
            "answer_correct": check_answer(response, case),
            "latency": response.latency
        })
    return aggregate_metrics(results)
```

Results: 40% Improvement in Query Accuracy
By implementing these lessons, we achieved:
- 40% improvement in query accuracy (measured by answer correctness)
- 60% reduction in "I don't know" responses
- Sub-500ms P95 latency for most queries
- 30% cost reduction through better caching and retrieval
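The caching mentioned above can start very simply. This is a minimal sketch, not our production cache: `run_rag_pipeline` is a hypothetical stand-in for the expensive retrieval-plus-LLM call, and `functools.lru_cache` substitutes for whatever shared cache (e.g. Redis) a real deployment would use:

```python
from functools import lru_cache

def run_rag_pipeline(query: str) -> str:
    # Hypothetical stand-in for the real retrieval + LLM call
    return f"answer for: {query}"

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivial variants share a cache entry
    return " ".join(query.lower().split())

@lru_cache(maxsize=4096)
def cached_answer(query_key: str) -> str:
    return run_rag_pipeline(query_key)

def answer(query: str) -> str:
    return cached_answer(normalize(query))
```

Even this level of normalization means repeated questions from different users skip the LLM call entirely, which is where much of the cost saving comes from.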
Key Takeaways
- Chunking is foundational: Invest time in document-aware chunking strategies
- Hybrid search is essential: Don't rely on vector similarity alone
- Preprocess queries: Users ask imperfect questions; help them
- Measure everything: Build evaluation into your pipeline from day one
- Iterate continuously: RAG systems improve through constant refinement
Building production RAG systems is challenging, but the payoff—accurate, helpful AI assistants that actually work—is worth the investment.
Have questions about building RAG systems? Feel free to reach out on LinkedIn or GitHub.