Stop Losing LangGraph Progress to 429 Errors
How to Scale Agents Without Burning Out Engineers
Part 1 of 2: Making LangGraph workflows production-ready
Why Your Agents Don't Scale
I've seen genuinely nice people become assholes because they get paged every weekend. I've seen organizations play Hunger Games when leadership asks who caused the post-mortem.
The reason your agents don't scale is the same reason serverless doesn't scale.
Serverless doesn't mean operationless.
You still need retry logic. You still need rate limit handling. You still need coordination across workers. You still need someone to wake up at 3am when it breaks.
LangGraph handles state management, workflow orchestration, and complex agent logic beautifully.
But when OpenRouter returns 429 at step 7 of your workflow, LangGraph can't help you. Your workflow crashes. You restart from step 1. Your engineers debug why 100 workers created a retry storm.
At some point, someone suggests: "Let's build a queue."
The Queue You'll Eventually Build
If you want agents to scale without churning through engineers, you'll need some mechanism for queuing. Not optional. It's a real infrastructure problem.
The right architecture is queue-per-URL. Each external dependency gets its own queue with its own rate limits. Stripe gets 100 RPS. OpenAI gets 50 RPS. They don't interfere with each other.
This is doable. It's not magic. It's ~2000 lines of code plus distributed state management plus health checking plus monitoring.
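Even a toy version shows the shape of it. Here's a minimal sketch of queue-per-URL, assuming asyncio and a simple per-host pacer; every name in it is illustrative, not a real library:

import asyncio
import time

class HostQueue:
    """One queue and one rate limit per external dependency."""

    def __init__(self, rps: float):
        self.rps = rps
        self.jobs: asyncio.Queue = asyncio.Queue()
        self._last_send = 0.0

    async def worker(self):
        while True:
            job = await self.jobs.get()   # job: async callable that fires the HTTP request
            # Pace this host: never exceed `rps` requests per second
            delay = self._last_send + 1.0 / self.rps - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_send = time.monotonic()
            await job()
            self.jobs.task_done()

# Each dependency gets its own queue with its own limit
queues = {
    "api.stripe.com": HostQueue(rps=100),
    "api.openai.com": HostQueue(rps=50),
}

That's the easy part. The distributed state, health checks, and cross-worker coordination are where the rest of those ~2000 lines come from.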
But here's the part nobody mentions: it's not the time to write it that kills you. It's the ongoing maintenance.
Those queues need to scale as your business grows. They need debugging when they break. They need someone on-call when they fail at 2am. They need a team.
You can build this. Many companies do.
But now you're in the infrastructure business, not the AI agent business.
Netflix didn't become Netflix by managing data centers. They specialized in streaming video and let AWS handle infrastructure.
Same principle here.
What You Have Today
Here's what most LangGraph workflows look like:
from langgraph.graph import StateGraph
from litellm import completion
from litellm.exceptions import RateLimitError  # litellm maps provider 429s to this

def call_llm_node(state):
    try:
        response = completion(
            model="anthropic/claude-3.5-sonnet",
            messages=state["messages"],
            fallbacks=["openai/gpt-4"]
        )
        return {"messages": state["messages"] + [response]}
    except RateLimitError:
        raise  # Workflow crashes
What happens when OpenRouter rate limits at step 7:
- Sequential fallback: Claude times out (5s), THEN you try GPT-4 (5s) = 10s wasted
- Limited to your account: All fallbacks hit YOUR quota
- No coordination: 100 workers retry independently (retry storm)
- Progress lost: Restart from step 1
This works fine at 10 requests/day. It breaks at 1000 requests/day.
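The retry storm is mechanical, not bad luck: 100 workers that see the same 429 at the same moment back off on the same schedule and come back at the same moment. A toy simulation (hypothetical numbers, plain exponential backoff, no jitter, no coordination) makes it obvious:

import collections

def simulate(workers: int = 100, attempts: int = 4):
    """Count how many retries land in each second when every worker
    backs off on an identical schedule."""
    hits = collections.Counter()
    for _ in range(workers):
        t = 0
        for attempt in range(attempts):
            t += 2 ** attempt   # retry 1s, 2s, 4s, 8s after the 429
            hits[t] += 1
    return dict(hits)

print(simulate())   # {1: 100, 3: 100, 7: 100, 15: 100} - a burst of 100, every time

Jitter softens the spikes, but every worker is still retrying against its own private view of the provider. That's the gap coordination fills.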
What You Actually Want
Multi-provider, multi-account fallbacks that race instead of waiting sequentially.
When your primary OpenRouter account hits rate limits, you want the system to automatically try:
- Your backup OpenRouter account
- Direct Anthropic API
- Direct OpenAI API
- Whichever other providers you've configured
All racing simultaneously. Fastest response wins.
Coordinated retries across all your workers so 100 instances don't create a retry storm.
Webhook-based resumption so your LangGraph workflow doesn't block waiting for responses.
Idempotent execution so a 429 at step 7 resumes at step 7, not step 1.
Here's what that looks like:
def call_llm_node(state):
    # `ez` is the client from your coordination-layer setup (not shown here);
    # OPENROUTER_KEY and StepType come from the same setup.
    result = (
        Step(ez)
        .url("https://openrouter.ai/api/v1/chat/completions")
        .method("POST")
        .headers({"Authorization": f"Bearer {OPENROUTER_KEY}"})
        .body({
            "model": "anthropic/claude-3.5-sonnet",
            "messages": state["messages"]
        })
        .type(StepType.PERFORMANCE)
        .fallback_on_error([429, 500, 503])
        .webhooks([{"url": "https://yourapp.com/langgraph-resume"}])
        .idempotent_key(f"workflow_{state['workflow_id']}_step_{state['step']}")
        .execute()
    )
    return {"job_id": result["job_id"], "status": "waiting"}
Behind the scenes, this coordinates retries across all workers, races multiple providers and accounts, and delivers results via webhook when ready.
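"Coordinates retries across all workers" means the fleet shares one rate-limit budget per host instead of each worker keeping its own. A minimal sketch of that idea, assuming Redis as the shared counter (key names and limits are illustrative, and it ignores backoff, fairness, and failover):

import time
import redis

r = redis.Redis()

def acquire_slot(host: str, limit_per_sec: int) -> bool:
    """Every worker increments the same per-host counter, so the whole
    fleet stays under the limit no matter how many workers are retrying."""
    key = f"rl:{host}:{int(time.time())}"   # one counter per host, per second
    count = r.incr(key)
    r.expire(key, 2)                        # let old windows expire
    return count <= limit_per_sec

# Inside a worker's retry loop:
# if not acquire_slot("openrouter.ai", limit_per_sec=50):
#     back off; another worker already spent this second's budget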
You could build this coordination yourself. Or you could ship agents.
Fallback Racing
Sequential fallbacks waste time. You want racing.
# Define fallback chain
anthropic = Step(ez).url("https://api.anthropic.com/v1/messages")

openai = (
    Step(ez)
    .url("https://api.openai.com/v1/chat/completions")
    .fallback(anthropic, trigger_on_timeout=3000)  # Race after 3s
)

result = (
    Step(ez)
    .url("https://openrouter.ai/...")
    .fallback(openai, trigger_on_error=[429, 500])
    .execute()
)
Timeline when OpenRouter returns 429:
0ms: OpenRouter tries
100ms: OpenRouter 429 → OpenAI fallback fires
100ms: OpenRouter retrying + OpenAI both racing
3100ms: OpenAI slow → Anthropic fires
3100ms: All three racing
3200ms: Anthropic wins, others cancelled
All providers race after their triggers fire. Fastest wins.
You can't do this with client-side retries. They're sequential by design.
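Conceptually, what fires after the triggers is a race with cancellation. Here's a simplified asyncio sketch of that mechanic; it is not the coordination layer's actual code, and it leaves out the cross-worker, cross-account bookkeeping that makes racing safe at scale:

import asyncio

async def race(callables):
    """Start every triggered provider call, return the first clean
    response, cancel the rest."""
    tasks = {asyncio.create_task(c()) for c in callables}
    try:
        while tasks:
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:   # fastest successful provider wins
                    return task.result()
        raise RuntimeError("every provider failed")
    finally:
        for task in tasks:
            task.cancel()                      # losers get cancelled

# result = await race([call_openrouter, call_openai, call_anthropic])

The loop is the easy part. Knowing, across 100 workers and 20 accounts, which providers are already saturated before you fire the request is the part that needs shared state.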
Resuming Workflows with Webhooks
Your workflow doesn't block. It continues, and webhooks resume it when ready.
import json

from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/langgraph-resume")
async def resume_workflow(request: Request, background_tasks: BackgroundTasks):
    data = await request.json()
    workflow_id = data["metadata"]["workflow_id"]
    if data["status"] == "success":
        llm_response = json.loads(data["response"]["body"])
        # Resume in background (don't block the webhook response)
        background_tasks.add_task(
            continue_workflow,
            workflow_id,
            llm_response
        )
    return {"ok": True}

async def continue_workflow(workflow_id: str, llm_response: dict):
    # `agent` is your compiled LangGraph graph with a checkpointer;
    # the workflow_id doubles as the checkpoint thread_id
    config = {"configurable": {"thread_id": workflow_id}}
    # Update LangGraph state
    agent.update_state(config, {
        "messages": [..., llm_response],
        "status": "complete"
    })
    # Continue from the next step (input=None resumes from the checkpoint)
    await agent.ainvoke(None, config)
The pattern:
- Submit to coordination layer → returns immediately
- Workflow continues with other work
- Webhook fires when LLM responds
- Resume workflow from checkpoint
No blocking. No retry storms. No lost progress.
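For the resume to work, the graph has to be compiled with a checkpointer and invoked with a thread_id. A minimal sketch of that wiring, using LangGraph's built-in in-memory checkpointer (swap in a durable one for production) and the call_llm_node from earlier:

from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class WorkflowState(TypedDict):
    workflow_id: str
    messages: list
    status: str

builder = StateGraph(WorkflowState)
builder.add_node("call_llm", call_llm_node)   # the node defined earlier
builder.set_entry_point("call_llm")
builder.add_edge("call_llm", END)

# The checkpointer is what makes "resume at step 7" possible
agent = builder.compile(checkpointer=MemorySaver())

# First run: the thread_id ties every checkpoint to this workflow
config = {"configurable": {"thread_id": "workflow-123"}}
agent.invoke({"workflow_id": "workflow-123", "messages": [], "status": "running"}, config)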
What the Industry Actually Needs
The industry needs agents that can be trusted to run for months and years without human intervention.
That means Layer 7 (HTTP) needs to be automated. Retries, rate limits, failover - all handled at the infrastructure layer, not in application code.
Right now, most teams write retry logic in every service. When it breaks, engineers get paged. When traffic spikes, retry storms happen. When providers have outages, everything falls over.
This doesn't scale. Not the technology - the people.
You can build coordination infrastructure yourself. You can dedicate a team to maintaining it. Some companies do.
Or you can treat it like AWS treats compute: infrastructure you don't manage.
The Choice
Build it yourself:
- Queue per URL/dependency (the right architecture)
- Distributed state coordination
- Health checking and failover
- Ongoing maintenance as you scale
- A team to own it
Or:
- Focus on agents
- Let infrastructure handle reliability
- Go home at 5pm
Netflix chose streaming over data centers. What will you choose?
Getting Started
If you want the patterns above without building infrastructure:
Free tier: 1M requests/month at ezthrottle.network
The coordination layer handles 20 accounts across 4 providers working like one pool.
Or build it yourself: Architecture details
My Mission
I'm working to help the industry write scalable serverless software with minimal operations, without having to turn on more servers to get there.
Engineers shouldn't wake up at 3am because OpenRouter rate limited. They shouldn't lose weekends debugging retry storms. They shouldn't sacrifice time with family maintaining infrastructure that leadership calls "good enough."
Layer 7 should be automated. Agents should run for months without human intervention. Engineers should go home at 5pm.
That's what I'm building toward.
Use it or don't. Build it yourself or don't.
But please: stop letting infrastructure steal your time.
Find me on X: @RahmiPruitt
Coming next: Part 2 - Surviving Regional Failures and Partial Outages
🦞