How to Handle Multi-Region API Failures in LangGraph
Part 2 of 2: Making LangGraph workflows production-ready
US-EAST-1 Went Down. Somebody Got Fired.
A single misconfiguration took the region down, and millions of dollars went with it.
This happens every year. AWS, Azure, GCP - everyone has regional failures.
The question isn't "will a region fail?" It's "what happens to your LangGraph workflows when it does?"
Cloudflare's global CDN automatically finds the next healthiest node when a region fails. Your static content stays up.
But your API calls? Your LangGraph workflows calling OpenRouter, Anthropic, OpenAI?
They crash.
What if API failures were automatically rerouted the same way Cloudflare routes traffic - without you doing anything?
This is Part 2 of making LangGraph workflows production-ready. Part 1 covered 429 errors and rate limits. This covers the harder problem: regional failures you can't see coming.
The Problem With "Not Down"
The worst outages aren't the dramatic ones where everything is on fire.
They're the invisible ones.
Your LangGraph workflow isn't crashing. It's just... slow.
Step 3 takes 30 seconds instead of 2. Step 5 times out after 60 seconds. Step 7 eventually succeeds, but takes 45 seconds.
No errors in the logs. No 429s. No 500s. Just slow. Painfully slow.
This is a partial outage.
DNS routed you to api.openai.com → US-East. US-East isn't DOWN (health checks pass). US-East is just SLOW (10s response instead of 2s).
Why client-side retries can't help:
```python
from litellm import completion
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def call_openai():
    return completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "..."}]
    )

# What happens when US-East is slow:
# Request 1 → US-East (slow, 10s)
# Retry 1   → US-East again (DNS still routes here, slow, 10s)
# Retry 2   → US-East again (still slow, 10s)
# Total: 30 seconds, then maybe succeeds or times out
```
You can't detect this with:
- Status codes (200 OK, just slow)
- Health checks (endpoint responds, just slowly)
- Retries (keep hitting the same slow region)
The trap: "Everything is working. It's just slow. Must be our code."
It's not your code. It's infrastructure you can't see.
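The arithmetic behind that trap is worth making explicit. A toy calculation, using the illustrative 10-second degraded / 2-second healthy figures from this section (the function names are mine, not any real API):

```python
# Toy model: client-side retries keep landing on the same slow region,
# while region-aware rerouting switches after the first slow attempt.
DEGRADED_LATENCY = 10.0  # seconds per request in the degraded region
HEALTHY_LATENCY = 2.0    # seconds per request in a healthy region

def naive_retry_latency(attempts: int) -> float:
    """DNS pins every attempt to the same degraded region."""
    return attempts * DEGRADED_LATENCY

def reroute_latency() -> float:
    """One slow attempt, then one attempt against a healthy region."""
    return DEGRADED_LATENCY + HEALTHY_LATENCY

print(naive_retry_latency(3))  # 30.0 -- the 30-second trace above
print(reroute_latency())       # 12.0
```

Same retry budget, wildly different worst case: the only variable is whether attempt 2 is allowed to land somewhere else.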
What You're Actually Fighting
Modern LLM APIs deploy across multiple regions:
- api.openai.com → US-East, US-West, EU-West
- api.anthropic.com → US-East, US-West, EU
- openrouter.ai → Global anycast routing
DNS decides which region you hit. You don't control this. Your code doesn't see it.
When one region degrades:
- Some of your workers hit the slow region (bad experience)
- Some hit healthy regions (everything's fine)
- Your monitoring shows inconsistent latency
- Support tickets: "It's slow for some requests but not others"
You can't fix this with application code alone.
You need:
- Visibility into which regions are healthy
- Ability to route around degraded regions
- Automatic failover without manual intervention
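The first of those, visibility, can be sketched in a few lines: track recent latencies per region and prefer the healthiest one. A minimal illustration only; the region names, window size, and class are assumptions, not part of any real library:

```python
from collections import defaultdict, deque

class RegionHealth:
    """Track recent latencies per region and prefer the fastest one."""

    def __init__(self, window: int = 20):
        # Rolling window of the last `window` latency samples per region
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, region: str, latency_s: float) -> None:
        self.samples[region].append(latency_s)

    def avg(self, region: str) -> float:
        s = self.samples[region]
        return sum(s) / len(s) if s else 0.0

    def healthiest(self) -> str:
        # Lowest average recent latency wins
        return min(self.samples, key=self.avg)

health = RegionHealth()
for _ in range(5):
    health.record("iad", 10.0)  # degraded: 10s responses
    health.record("lax", 2.0)   # healthy: 2s responses
print(health.healthiest())  # lax
```

The hard part isn't this loop; it's running it from enough vantage points that one worker's bad network path doesn't condemn a healthy region.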
The traditional solution:
- Set up health checks in multiple regions
- Monitor API response times globally
- Manually update DNS or routing rules when problems detected
- Wake up at 3am to make routing decisions
But that's REACTIVE. Problem happens, then you scramble to fix it.
You want PROACTIVE. System detects and routes around problems automatically.
Automatic Regional Rerouting
Here's the pattern: try one region. If it fails or times out, automatically try another.
```python
from langgraph.graph import StateGraph
from ezthrottle import EZThrottle, Step, StepType

ez = EZThrottle(api_key="your_key")

def call_llm_node(state):
    result = (
        Step(ez)
        .url("https://api.openai.com/v1/chat/completions")
        .method("POST")
        .headers({
            "Authorization": f"Bearer {OPENAI_KEY}",
            "Content-Type": "application/json"
        })
        .body({
            "model": "gpt-4",
            "messages": state["messages"]
        })
        .type(StepType.PERFORMANCE)
        .regions(["iad", "lax", "ord"])  # US-East, US-West, Chicago
        .region_policy("fallback")       # Try one, reroute on error
        .webhooks([{"url": "https://yourapp.com/langgraph-resume"}])
        .idempotent_key(f"workflow_{state['workflow_id']}_step_{state['step']}")
        .execute()
    )
    return {"job_id": result["job_id"], "status": "waiting"}

# What happens if US-East (iad) returns 500:
#    0ms: Try iad
#  100ms: iad returns 500
#  100ms: Automatically try lax
# 2100ms: lax succeeds
# Workflow continues from step checkpoint

# Cost: 1 request normally, 2 only on errors
# Your workflow survived a regional failure automatically
```
No manual intervention. No DNS changes. No on-call pages.
The coordination layer detects the failure and routes around it.
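Conceptually, a fallback region policy is just a loop: try regions in order and move on when one errors or times out. A plain-Python sketch of the idea, not EZThrottle's actual implementation; `send` is a stand-in for the real HTTP call:

```python
import time

def call_with_region_fallback(regions, send):
    """Try regions in order; move on when one errors or times out.

    `send` is any callable that takes a region name and either
    returns a response or raises.
    """
    errors = {}
    for region in regions:
        start = time.monotonic()
        try:
            response = send(region)
            elapsed = time.monotonic() - start
            return {"region": region, "response": response, "elapsed": elapsed}
        except Exception as exc:  # 5xx, timeout, connection reset, ...
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")

# Simulate iad returning 500 while lax is healthy:
def fake_send(region):
    if region == "iad":
        raise RuntimeError("HTTP 500")
    return {"ok": True}

result = call_with_region_fallback(["iad", "lax", "ord"], fake_send)
print(result["region"])  # lax -- ord never fires
```

The value of doing this in a coordination layer rather than in your own process is that the layer sees failures across all tenants, so it can skip a known-bad region before your request ever touches it.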
Webhook handler (resumes workflow):
```python
import json

from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/langgraph-resume")
async def resume_workflow(request: Request, background_tasks: BackgroundTasks):
    data = await request.json()
    workflow_id = data["metadata"]["workflow_id"]
    if data["status"] == "success":
        llm_response = json.loads(data["response"]["body"])
        # Resume in background (don't block the webhook response)
        background_tasks.add_task(
            continue_workflow,
            workflow_id,
            llm_response
        )
    return {"ok": True}

async def continue_workflow(workflow_id: str, llm_response: dict):
    # Resume the checkpointed graph under its thread id
    config = {"configurable": {"thread_id": workflow_id}}
    agent.update_state(config, {
        "messages": [..., llm_response],
        "status": "complete"
    })
    await agent.ainvoke(None, config)
```
Combining Regional Rerouting + Provider Fallbacks
The most resilient pattern: regional rerouting for your primary API, provider fallback if all regions fail.
```python
# Fallback to Anthropic if all OpenAI regions fail
anthropic_fallback = (
    Step(ez)
    .url("https://api.anthropic.com/v1/messages")
    .method("POST")
    .headers({
        "x-api-key": ANTHROPIC_KEY,
        "anthropic-version": "2023-06-01"
    })
    .body({
        "model": "claude-3-5-sonnet-20241022",
        "messages": state["messages"]
    })
)

result = (
    Step(ez)
    .url("https://api.openai.com/v1/chat/completions")
    # .method/.headers/.body as in the earlier example
    .regions(["iad", "lax", "ord"])
    .region_policy("fallback")  # Try one region, reroute on error
    .fallback(anthropic_fallback, trigger_on_error=[500, 502, 503])
    .webhooks([{"url": "https://yourapp.com/resume"}])
    .execute()
)

# What happens if OpenAI US-East is down:
#   Try iad → 500
#   Try lax → succeeds
#   Anthropic never fires (not needed)

# What happens if ALL OpenAI regions fail:
#   Try iad → 500
#   Try lax → 500
#   Try ord → 500
#   Fallback to Anthropic → succeeds

# Cost: Only pays for what's needed (1-4 requests)
# Your workflow survived both regional AND provider failures
```
This gives you multi-layer resilience without burning through quota.
Advanced: Regional Racing (When Queue Depth Matters)
There's a second pattern - regional racing - where you send the same request to multiple regions simultaneously and take the fastest response.
When this helps:
If you're using a shared coordination layer (like the EZThrottle community instance), different regions might have different queue depths at any moment:
IAD queue: 200 jobs waiting
LAX queue: 50 jobs waiting
ORD queue: 500 jobs waiting
Without racing:
- Routed to IAD (closest)
- Waits behind 200 jobs
- Takes 60+ seconds
With racing:
- Fires to all 3 regions
- LAX has shortest queue
- LAX completes in 15 seconds
- Cancel IAD and ORD (best effort)
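That flow can be sketched with `asyncio`: fire every region at once, take the first result, cancel the rest. The sketch below simulates the calls with `asyncio.sleep` instead of real HTTP, so the delays stand in for the queue depths above:

```python
import asyncio

async def race_regions(requests):
    """Fire the same request to every region; first completion wins,
    the rest are cancelled (best effort, as described above)."""
    tasks = {asyncio.create_task(coro): region
             for region, coro in requests.items()}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # best-effort cancel of the slower regions
    winner = done.pop()
    return tasks[winner], winner.result()

# Stand-in for a real API call; delay plays the role of queue depth
async def fake_call(delay, payload):
    await asyncio.sleep(delay)
    return payload

async def main():
    region, result = await race_regions({
        "iad": fake_call(0.20, "iad result"),  # 200 jobs queued
        "lax": fake_call(0.05, "lax result"),  # 50 jobs queued
        "ord": fake_call(0.50, "ord result"),  # 500 jobs queued
    })
    return region

print(asyncio.run(main()))  # lax
```

A real implementation needs more care than this sketch: a winner that *failed* fastest shouldn't win (check the result before cancelling the others), and cancellation only saves money if the provider actually stops billing the cancelled request.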
The tradeoff:
Racing costs 2-3× in API calls (you might pay for multiple responses even though you only use one, since cancellation is best-effort).
Use racing when:
- Latency is critical (user-facing chat)
- You have quota headroom (not near rate limits)
- Shared infrastructure creates variable queue depths
Use rerouting when:
- Cost matters (quota-constrained)
- Background processing (not user-facing)
- You want reliability without burning quota
Most LangGraph users should start with rerouting, not racing.
What This Means For Operations
Without regional coordination:
- Engineers debug "why is it slow for some workflows?"
- Manual intervention to route around bad regions
- On-call pages when regions degrade
- Post-mortems about "undetectable slowness in US-East"
- Lost agent progress due to timeouts
With regional coordination:
- System automatically routes around slow/failing regions
- No manual intervention needed
- No on-call pages for regional issues
- Dashboard shows: "US-East was down 2-4pm, auto-routed to US-West"
- Workflows survive without restarting
This is Layer 7 automation.
Regional failures become boring. Not "all hands on deck." Just: "System compensated automatically, fix when convenient."
Engineers go home at 5pm. Infrastructure handles it.
The Choice
Build it yourself:
- Health checks across multiple regions
- Routing logic to detect and avoid bad regions
- Coordination layer for automatic rerouting
- Webhook delivery infrastructure
- Ongoing maintenance as your agents scale
Or:
- Focus on your agents
- Let infrastructure handle reliability
- Go home at 5pm
Same choice as Part 1. Different problem.
Getting Started
If you want regional rerouting without building infrastructure:
Free tier: 1M requests/month at ezthrottle.network
Or build it yourself: Architecture details
Related reading:
- Part 1: Stop Losing LangGraph Progress to 429 Errors
- Deep dive: Making Failure Boring Again
- Advanced workflows: Serverless 2.0: RIP Operations
My Mission
US-EAST-1 will go down again. So will US-WEST-2. So will every region.
The question isn't "will it happen?" It's "what happens to your workflows when it does?"
Right now: Engineers debug at 3am. Workflows restart from step 1. Progress lost.
This is what Layer 7 automation means:
Regional failures handled automatically. Traffic reroutes without human decisions. Workflows survive without manual intervention.
Infrastructure that just works. Agents that run for months without pages. Engineers who go home at 5pm.
That's what I'm building toward.
Use it or build it yourself.
Just stop making engineers babysit regional failures.
Find me on X: @RahmiPruitt