Introduction
Integrating LLMs into production systems involves far more than making API calls. After deploying AI-powered automation systems at Thunder Marketing and building agentic AI architectures at Sazag Infotech, I've learned that the real challenges are reliability, cost management, and orchestration.
This article shares practical lessons from integrating GPT-4, Claude, and Gemini into production systems.
The Multi-Provider Strategy
Relying on a single LLM provider is risky:
- Outages happen: OpenAI has had multiple significant outages
- Rate limits: Heavy usage can hit limits unexpectedly
- Cost variation: Different providers excel at different tasks
- Capability differences: Claude handles long contexts better; GPT-4 excels at reasoning
We use a multi-provider approach:
```python
from enum import Enum
from typing import Protocol


class LLMProvider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GOOGLE = "google"


class LLMClient(Protocol):
    async def complete(self, prompt: str, **kwargs) -> str:
        ...


class MultiProviderLLM:
    def __init__(self):
        self.providers = {
            LLMProvider.OPENAI: OpenAIClient(),
            LLMProvider.ANTHROPIC: AnthropicClient(),
            LLMProvider.GOOGLE: GoogleClient(),
        }
        self.fallback_order = [
            LLMProvider.OPENAI,
            LLMProvider.ANTHROPIC,
            LLMProvider.GOOGLE,
        ]

    async def complete(
        self,
        prompt: str,
        preferred_provider: LLMProvider | None = None,
        **kwargs,
    ) -> str:
        providers = (
            [preferred_provider] + self.fallback_order
            if preferred_provider
            else self.fallback_order
        )

        for provider in providers:
            try:
                return await self.providers[provider].complete(prompt, **kwargs)
            except (RateLimitError, ServiceUnavailable) as e:
                logger.warning(f"{provider} failed: {e}")
                continue

        raise AllProvidersFailedError()
```
Provider Selection: When to Use What
Based on our production experience:
| Use Case | Best Provider | Why |
|---|---|---|
| Complex reasoning | GPT-4 | Best logical capabilities |
| Long documents | Claude | 200K context window |
| Code generation | GPT-4 / Claude | Both excellent |
| Fast, cheap tasks | GPT-3.5 / Gemini Flash | Cost-effective |
| Vision tasks | GPT-4V / Claude 3 | Best multimodal |
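The table above can also be captured as a simple lookup keyed by use case. This is a sketch: the use-case labels and model names below are illustrative placeholders, not identifiers from any provider SDK.

```python
# Illustrative routing map derived from the use-case table.
# Keys and model names are placeholders chosen for this sketch.
MODEL_ROUTING = {
    "complex_reasoning": "gpt-4",
    "long_documents": "claude-3-opus",
    "code_generation": "gpt-4",
    "cheap_tasks": "gemini-flash",
    "vision": "gpt-4-vision",
}


def pick_model(use_case: str, default: str = "gpt-4") -> str:
    """Return the preferred model for a use case, falling back to a default."""
    return MODEL_ROUTING.get(use_case, default)
```

A plain dict keeps the routing policy in one place, so updating it when providers ship new models is a one-line change rather than a code edit scattered across call sites.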
Dynamic Provider Selection
```python
from dataclasses import dataclass


# Minimal Task shape assumed by select_provider
@dataclass
class Task:
    requires_vision: bool = False
    context_length: int = 0
    complexity: str = "complex"


def select_provider(task: Task) -> LLMProvider:
    if task.requires_vision:
        return LLMProvider.OPENAI  # GPT-4V

    if task.context_length > 100_000:
        return LLMProvider.ANTHROPIC  # Claude's long context

    if task.complexity == "simple":
        return LLMProvider.GOOGLE  # Gemini Flash for cost

    return LLMProvider.OPENAI  # GPT-4 as default
```
Cost Optimization
LLM costs can explode quickly. Here's how we keep them manageable.
1. Prompt Caching
Many prompts are repeated. Cache them:
```python
import hashlib


class CachedLLM:
    def __init__(self, llm: LLMClient, cache: Redis):
        self.llm = llm
        self.cache = cache

    async def complete(self, prompt: str, **kwargs) -> str:
        # Create cache key from prompt + params
        cache_key = hashlib.sha256(
            f"{prompt}:{kwargs}".encode()
        ).hexdigest()

        # Check cache
        cached = await self.cache.get(cache_key)
        if cached:
            return cached

        # Call LLM
        result = await self.llm.complete(prompt, **kwargs)

        # Cache result (1 hour TTL)
        await self.cache.setex(cache_key, 3600, result)

        return result
```
2. Tiered Model Usage
Use cheaper models when possible:
```python
async def smart_complete(prompt: str, task_type: str) -> str:
    if task_type in ["classification", "extraction", "simple_qa"]:
        # Use cheaper model
        return await gpt35_client.complete(prompt)

    if task_type in ["summarization", "translation"]:
        # Medium tier
        return await claude_instant_client.complete(prompt)

    # Complex tasks get GPT-4
    return await gpt4_client.complete(prompt)
```
3. Prompt Optimization
Shorter prompts = lower costs:
```python
# Bad: Verbose prompt
prompt = """
You are a helpful assistant that extracts information from text.
Your task is to carefully read the following document and extract
all the key information including names, dates, and amounts.
Please be thorough and accurate in your extraction.
Here is the document:
{document}
"""

# Good: Concise prompt
prompt = """Extract names, dates, and amounts from this document:
{document}

Return as JSON: {{"names": [], "dates": [], "amounts": []}}"""
```
This reduced our token usage by 30%.
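To sanity-check savings like this, count tokens before and after rewriting. Billing-accurate counts come from the provider's tokenizer (e.g. tiktoken for OpenAI models); for a quick dependency-free comparison, the common rule of thumb of roughly four characters per token is enough:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token rule of thumb.
    Use the provider's real tokenizer for billing-accurate counts."""
    return max(1, len(text) // 4)


verbose = (
    "You are a helpful assistant that extracts information from text. "
    "Your task is to carefully read the following document and extract "
    "all the key information including names, dates, and amounts."
)
concise = "Extract names, dates, and amounts from this document:"

# Relative savings from the rewrite (instruction portion only)
savings = 1 - approx_tokens(concise) / approx_tokens(verbose)
print(f"Estimated prompt-token savings: {savings:.0%}")
```

The estimate only needs to be directionally right here; the point is to measure a rewrite's effect before shipping it, not to replicate the billing meter.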
Orchestration with LangChain
For complex workflows, LangChain provides excellent abstractions:
```python
from langchain.chains import LLMChain, SequentialChain
from langchain.prompts import PromptTemplate

# Step 1: Extract key points
extract_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["document"],
        template="Extract key points from: {document}"
    ),
    output_key="key_points"
)

# Step 2: Generate summary
summary_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["key_points"],
        template="Summarize these points: {key_points}"
    ),
    output_key="summary"
)

# Combine into pipeline
pipeline = SequentialChain(
    chains=[extract_chain, summary_chain],
    input_variables=["document"],
    output_variables=["summary"]
)

result = await pipeline.arun(document=doc)
```
Agentic AI with LangGraph
For complex decision-making, we use LangGraph:
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict


class AgentState(TypedDict):
    task: str
    research: str
    plan: str
    result: str


def should_continue(state: AgentState) -> str:
    if state.get("result"):
        return END
    if state.get("plan"):
        return "execute"
    if state.get("research"):
        return "plan"
    return "research"


# Build the graph
workflow = StateGraph(AgentState)

workflow.add_node("research", research_node)
workflow.add_node("plan", planning_node)
workflow.add_node("execute", execution_node)

workflow.add_conditional_edges(
    "research",
    should_continue,
    {"plan": "plan", END: END}
)
workflow.add_conditional_edges(
    "plan",
    should_continue,
    {"execute": "execute", END: END}
)
workflow.add_conditional_edges(
    "execute",
    should_continue,
    {END: END}
)

workflow.set_entry_point("research")
agent = workflow.compile()
```
Reliability Patterns
Structured Outputs
Force consistent outputs with Pydantic:
```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel


class ExtractedData(BaseModel):
    names: list[str]
    dates: list[str]
    amounts: list[float]


parser = PydanticOutputParser(pydantic_object=ExtractedData)

prompt = f"""Extract data from this document:
{document}

{parser.get_format_instructions()}"""

response = await llm.complete(prompt)
data = parser.parse(response)  # Validated ExtractedData object
```
Retry with Backoff
```python
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
async def robust_llm_call(prompt: str) -> str:
    return await llm.complete(prompt)
```
Monitoring and Observability
```python
from prometheus_client import Counter, Histogram

llm_requests = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['provider', 'model', 'status']
)

llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    ['provider', 'model']
)

llm_tokens = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['provider', 'model', 'type']  # type: prompt/completion
)
```
Results
Our LLM integration strategy delivered:
- 85% task automation accuracy in production
- 99.5% availability with multi-provider fallbacks
- 40% cost reduction through caching and tiered models
- Sub-2s latency for most requests
- Zero vendor lock-in with abstraction layers
Key Takeaways
- Multi-provider is essential: Don't depend on a single LLM provider
- Match model to task: Use cheaper models for simple tasks
- Cache aggressively: Many prompts repeat; cache the results
- Structure your outputs: Pydantic parsers ensure consistency
- Monitor everything: Track costs, latency, and success rates
LLMs are powerful tools, but production integration requires careful architecture. The patterns in this article have proven reliable across multiple enterprise deployments.
Building with LLMs? Let's connect on LinkedIn or explore my projects on GitHub.