Introduction
Integrating LLMs into production systems involves far more than making API calls. After deploying AI-powered automation systems at Thunder Marketing and building agentic AI architectures at Sazag Infotech, I've learned that the real challenges are reliability, cost management, and orchestration.
This article shares practical lessons from integrating GPT-4, Claude, and Gemini into production systems.
The Multi-Provider Strategy
Relying on a single LLM provider is risky:
- Outages happen: OpenAI has had multiple significant outages
- Rate limits: Heavy usage can hit limits unexpectedly
- Cost variation: Different providers excel at different tasks
- Capability differences: Claude handles long contexts better; GPT-4 excels at reasoning
We use a multi-provider approach:
```python
from enum import Enum
from typing import Protocol


class LLMProvider(Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GOOGLE = "google"


class LLMClient(Protocol):
    async def complete(self, prompt: str, **kwargs) -> str:
        ...


class MultiProviderLLM:
    def __init__(self):
        self.providers = {
            LLMProvider.OPENAI: OpenAIClient(),
            LLMProvider.ANTHROPIC: AnthropicClient(),
            LLMProvider.GOOGLE: GoogleClient(),
        }
        self.fallback_order = [
            LLMProvider.OPENAI,
            LLMProvider.ANTHROPIC,
            LLMProvider.GOOGLE,
        ]

    async def complete(
        self,
        prompt: str,
        preferred_provider: LLMProvider | None = None,
        **kwargs,
    ) -> str:
        providers = (
            [preferred_provider] + self.fallback_order
            if preferred_provider
            else self.fallback_order
        )

        for provider in providers:
            try:
                return await self.providers[provider].complete(prompt, **kwargs)
            except (RateLimitError, ServiceUnavailable) as e:
                logger.warning(f"{provider} failed: {e}")
                continue

        raise AllProvidersFailedError()
```
Provider Selection: When to Use What
Based on our production experience:
| Use Case | Best Provider | Why |
|---|---|---|
| Complex reasoning | GPT-4 | Best logical capabilities |
| Long documents | Claude | 200K context window |
| Code generation | GPT-4 / Claude | Both excellent |
| Fast, cheap tasks | GPT-3.5 / Gemini Flash | Cost-effective |
| Vision tasks | GPT-4V / Claude 3 | Best multimodal |
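The table above can also be captured as a simple lookup keyed by use case. This is a sketch: the use-case labels and model names below are illustrative placeholders, not identifiers from any provider SDK.

```python
# Illustrative routing map derived from the use-case table.
# Keys and model names are placeholders chosen for this sketch.
MODEL_ROUTING = {
    "complex_reasoning": "gpt-4",
    "long_documents": "claude-3-opus",
    "code_generation": "gpt-4",
    "cheap_tasks": "gemini-flash",
    "vision": "gpt-4-vision",
}


def pick_model(use_case: str, default: str = "gpt-4") -> str:
    """Return the preferred model for a use case, falling back to a default."""
    return MODEL_ROUTING.get(use_case, default)
```

A plain dict keeps the routing policy in one place, so updating it when providers ship new models is a one-line change rather than a code edit scattered across call sites.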
Dynamic Provider Selection
```python
from dataclasses import dataclass


# Minimal Task shape assumed by select_provider
@dataclass
class Task:
    requires_vision: bool = False
    context_length: int = 0
    complexity: str = "complex"


def select_provider(task: Task) -> LLMProvider:
    if task.requires_vision:
        return LLMProvider.OPENAI  # GPT-4V

    if task.context_length > 100_000:
        return LLMProvider.ANTHROPIC  # Claude's long context

    if task.complexity == "simple":
        return LLMProvider.GOOGLE  # Gemini Flash for cost

    return LLMProvider.OPENAI  # GPT-4 as default
```
Cost Optimization
LLM costs can explode quickly. Here's how we keep them manageable.
1. Prompt Caching
Many prompts are repeated. Cache them:
```python
import hashlib


class CachedLLM:
    def __init__(self, llm: LLMClient, cache: Redis):
        self.llm = llm
        self.cache = cache

    async def complete(self, prompt: str, **kwargs) -> str:
        # Create cache key from prompt + params
        cache_key = hashlib.sha256(
            f"{prompt}:{kwargs}".encode()
        ).hexdigest()

        # Check cache
        cached = await self.cache.get(cache_key)
        if cached:
            return cached

        # Call LLM
        result = await self.llm.complete(prompt, **kwargs)

        # Cache result (1 hour TTL)
        await self.cache.setex(cache_key, 3600, result)

        return result
```
2. Tiered Model Usage
Use cheaper models when possible:
```python
async def smart_complete(prompt: str, task_type: str) -> str:
    if task_type in ["classification", "extraction", "simple_qa"]:
        # Use cheaper model
        return await gpt35_client.complete(prompt)

    if task_type in ["summarization", "translation"]:
        # Medium tier
        return await claude_instant_client.complete(prompt)

    # Complex tasks get GPT-4
    return await gpt4_client.complete(prompt)
```
3. Prompt Optimization
Shorter prompts = lower costs:
```python
# Bad: Verbose prompt
prompt = """
You are a helpful assistant that extracts information from text.
Your task is to carefully read the following document and extract
all the key information including names, dates, and amounts.
Please be thorough and accurate in your extraction.
Here is the document:
{document}
"""

# Good: Concise prompt
prompt = """Extract names, dates, and amounts from this document:
{document}

Return as JSON: {{"names": [], "dates": [], "amounts": []}}"""
```
This reduced our token usage by 30%.
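To sanity-check savings like this, count tokens before and after rewriting. Billing-accurate counts come from the provider's tokenizer (e.g. tiktoken for OpenAI models); for a quick dependency-free comparison, the common rule of thumb of roughly four characters per token is enough:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token rule of thumb.
    Use the provider's real tokenizer for billing-accurate counts."""
    return max(1, len(text) // 4)


verbose = (
    "You are a helpful assistant that extracts information from text. "
    "Your task is to carefully read the following document and extract "
    "all the key information including names, dates, and amounts."
)
concise = "Extract names, dates, and amounts from this document:"

# Relative savings from the rewrite (instruction portion only)
savings = 1 - approx_tokens(concise) / approx_tokens(verbose)
print(f"Estimated prompt-token savings: {savings:.0%}")
```

The estimate only needs to be directionally right here; the point is to measure a rewrite's effect before shipping it, not to replicate the billing meter.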
Orchestration with LangChain
For complex workflows, LangChain provides excellent abstractions:
```python
from langchain.chains import LLMChain, SequentialChain
from langchain.prompts import PromptTemplate

# Step 1: Extract key points
extract_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["document"],
        template="Extract key points from: {document}"
    ),
    output_key="key_points"
)

# Step 2: Generate summary
summary_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["key_points"],
        template="Summarize these points: {key_points}"
    ),
    output_key="summary"
)

# Combine into pipeline
pipeline = SequentialChain(
    chains=[extract_chain, summary_chain],
    input_variables=["document"],
    output_variables=["summary"]
)

result = await pipeline.arun(document=doc)
```
Agentic AI with LangGraph
For complex decision-making, we use LangGraph:
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict


class AgentState(TypedDict):
    task: str
    research: str
    plan: str
    result: str


def should_continue(state: AgentState) -> str:
    if state.get("result"):
        return END
    if state.get("plan"):
        return "execute"
    if state.get("research"):
        return "plan"
    return "research"


# Build the graph
workflow = StateGraph(AgentState)

workflow.add_node("research", research_node)
workflow.add_node("plan", planning_node)
workflow.add_node("execute", execution_node)

workflow.add_conditional_edges(
    "research",
    should_continue,
    {"plan": "plan", END: END}
)
workflow.add_conditional_edges(
    "plan",
    should_continue,
    {"execute": "execute", END: END}
)
workflow.add_conditional_edges(
    "execute",
    should_continue,
    {END: END}
)

workflow.set_entry_point("research")
agent = workflow.compile()
```
Reliability Patterns
Structured Outputs
Force consistent outputs with Pydantic:
```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel


class ExtractedData(BaseModel):
    names: list[str]
    dates: list[str]
    amounts: list[float]


parser = PydanticOutputParser(pydantic_object=ExtractedData)

prompt = f"""Extract data from this document:
{document}

{parser.get_format_instructions()}"""

response = await llm.complete(prompt)
data = parser.parse(response)  # Validated ExtractedData object
```
Retry with Backoff
```python
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
async def robust_llm_call(prompt: str) -> str:
    return await llm.complete(prompt)
```
Monitoring and Observability
```python
from prometheus_client import Counter, Histogram

llm_requests = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['provider', 'model', 'status']
)

llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    ['provider', 'model']
)

llm_tokens = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['provider', 'model', 'type']  # type: prompt/completion
)
```
Results
Our LLM integration strategy delivered:
- 85% task automation accuracy in production
- 99.5% availability with multi-provider fallbacks
- 40% cost reduction through caching and tiered models
- Sub-2s latency for most requests
- Zero vendor lock-in with abstraction layers
Key Takeaways
- Multi-provider is essential: Don't depend on a single LLM provider
- Match model to task: Use cheaper models for simple tasks
- Cache aggressively: Many prompts repeat; cache the results
- Structure your outputs: Pydantic parsers ensure consistency
- Monitor everything: Track costs, latency, and success rates
LLMs are powerful tools, but production integration requires careful architecture. The patterns in this article have proven reliable across multiple enterprise deployments.
Building with LLMs? Let's connect on LinkedIn or explore my projects on GitHub.