How to Handle Multi-Region API Failures in LangGraph
Part 2 of 2: Making LangGraph workflows production-ready
US-EAST-1 Went Down. Somebody Got Fired.
A single misconfiguration took the region down, and millions of dollars went with it.
This happens every year. AWS, Azure, GCP - everyone has regional failures.
The question isn't "will a region fail?" It's "what happens to your LangGraph workflows when it does?"
Cloudflare's global CDN automatically finds the next healthiest node when a region fails. Your static content stays up.
But your API calls? Your LangGraph workflows calling OpenRouter, Anthropic, OpenAI?
They crash.
What if API failures were automatically rerouted the same way Cloudflare routes traffic - without you doing anything?
This is Part 2 of making LangGraph workflows production-ready. Part 1 covered 429 errors and rate limits. This covers the harder problem: regional failures you can't see coming.
The Problem With "Not Down"
The worst outages aren't the dramatic ones where everything is on fire.
They're the invisible ones.
Your LangGraph workflow isn't crashing. It's just... slow.
Step 3 takes 30 seconds instead of 2. Step 5 times out after 60 seconds. Step 7 eventually succeeds, but takes 45 seconds.
No errors in the logs. No 429s. No 500s. Just slow. Painfully slow.
This is a partial outage.
DNS routed you to api.openai.com → US-East. US-East isn't DOWN (health checks pass). US-East is just SLOW (10s response instead of 2s).
Why client-side retries can't help:
```python
from litellm import completion
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def call_openai():
    return completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "..."}]
    )

# What happens when US-East is slow:
# Request 1 → US-East (slow, 10s)
# Retry 1   → US-East again (DNS still routes here, slow, 10s)
# Retry 2   → US-East again (still slow, 10s)
# Total: 30 seconds, then maybe succeeds or times out
```
You can't detect this with:
- Status codes (200 OK, just slow)
- Health checks (endpoint responds, just slowly)
- Retries (keep hitting the same slow region)
The trap: "Everything is working. It's just slow. Must be our code."
It's not your code. It's infrastructure you can't see.
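The arithmetic behind that trap is worth making explicit. A toy calculation, using the illustrative 10-second degraded / 2-second healthy figures from this section (the function names are mine, not any real API):

```python
# Toy model: client-side retries keep landing on the same slow region,
# while region-aware rerouting switches after the first slow attempt.
DEGRADED_LATENCY = 10.0  # seconds per request in the degraded region
HEALTHY_LATENCY = 2.0    # seconds per request in a healthy region

def naive_retry_latency(attempts: int) -> float:
    """DNS pins every attempt to the same degraded region."""
    return attempts * DEGRADED_LATENCY

def reroute_latency() -> float:
    """One slow attempt, then one attempt against a healthy region."""
    return DEGRADED_LATENCY + HEALTHY_LATENCY

print(naive_retry_latency(3))  # 30.0 -- the 30-second trace above
print(reroute_latency())       # 12.0
```

Same retry budget, wildly different worst case: the only variable is whether attempt 2 is allowed to land somewhere else.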
What You're Actually Fighting
Modern LLM APIs deploy across multiple regions:
- api.openai.com → US-East, US-West, EU-West
- api.anthropic.com → US-East, US-West, EU
- openrouter.ai → Global anycast routing
DNS decides which region you hit. You don't control this. Your code doesn't see it.
When one region degrades:
- Some of your workers hit the slow region (bad experience)
- Some hit healthy regions (everything's fine)
- Your monitoring shows inconsistent latency
- Support tickets: "It's slow for some requests but not others"
You can't fix this with application code alone.
You need:
- Visibility into which regions are healthy
- Ability to route around degraded regions
- Automatic failover without manual intervention
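The first of those, visibility, can be sketched in a few lines: track recent latencies per region and prefer the healthiest one. A minimal illustration only; the region names, window size, and class are assumptions, not part of any real library:

```python
from collections import defaultdict, deque

class RegionHealth:
    """Track recent latencies per region and prefer the fastest one."""

    def __init__(self, window: int = 20):
        # Rolling window of the last `window` latency samples per region
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, region: str, latency_s: float) -> None:
        self.samples[region].append(latency_s)

    def avg(self, region: str) -> float:
        s = self.samples[region]
        return sum(s) / len(s) if s else 0.0

    def healthiest(self) -> str:
        # Lowest average recent latency wins
        return min(self.samples, key=self.avg)

health = RegionHealth()
for _ in range(5):
    health.record("iad", 10.0)  # degraded: 10s responses
    health.record("lax", 2.0)   # healthy: 2s responses
print(health.healthiest())  # lax
```

The hard part isn't this loop; it's running it from enough vantage points that one worker's bad network path doesn't condemn a healthy region.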
The traditional solution:
- Set up health checks in multiple regions
- Monitor API response times globally
- Manually update DNS or routing rules when problems detected
- Wake up at 3am to make routing decisions
But that's REACTIVE. Problem happens, then you scramble to fix it.
You want PROACTIVE. System detects and routes around problems automatically.
Automatic Regional Rerouting
Here's the pattern: try one region. If it fails or times out, automatically try another.
```python
from langgraph.graph import StateGraph
from ezthrottle import EZThrottle, Step, StepType

ez = EZThrottle(api_key="your_key")

def call_llm_node(state):
    result = (
        Step(ez)
        .url("https://api.openai.com/v1/chat/completions")
        .method("POST")
        .headers({
            "Authorization": f"Bearer {OPENAI_KEY}",
            "Content-Type": "application/json"
        })
        .body({
            "model": "gpt-4",
            "messages": state["messages"]
        })
        .type(StepType.PERFORMANCE)
        .regions(["iad", "lax", "ord"])  # US-East, US-West, Chicago
        .region_policy("fallback")       # Try one, reroute on error
        .webhooks([{"url": "https://yourapp.com/langgraph-resume"}])
        .idempotent_key(f"workflow_{state['workflow_id']}_step_{state['step']}")
        .execute()
    )
    return {"job_id": result["job_id"], "status": "waiting"}

# What happens if US-East (iad) returns 500:
#    0ms: Try iad
#  100ms: iad returns 500
#  100ms: Automatically try lax
# 2100ms: lax succeeds
# Workflow continues from step checkpoint

# Cost: 1 request normally, 2 only on errors
# Your workflow survived a regional failure automatically
```
No manual intervention. No DNS changes. No on-call pages.
The coordination layer detects the failure and routes around it.
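Conceptually, a fallback region policy is just a loop: try regions in order and move on when one errors or times out. A plain-Python sketch of the idea, not EZThrottle's actual implementation; `send` is a stand-in for the real HTTP call:

```python
import time

def call_with_region_fallback(regions, send):
    """Try regions in order; move on when one errors or times out.

    `send` is any callable that takes a region name and either
    returns a response or raises.
    """
    errors = {}
    for region in regions:
        start = time.monotonic()
        try:
            response = send(region)
            elapsed = time.monotonic() - start
            return {"region": region, "response": response, "elapsed": elapsed}
        except Exception as exc:  # 5xx, timeout, connection reset, ...
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")

# Simulate iad returning 500 while lax is healthy:
def fake_send(region):
    if region == "iad":
        raise RuntimeError("HTTP 500")
    return {"ok": True}

result = call_with_region_fallback(["iad", "lax", "ord"], fake_send)
print(result["region"])  # lax -- ord never fires
```

The value of doing this in a coordination layer rather than in your own process is that the layer sees failures across all tenants, so it can skip a known-bad region before your request ever touches it.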
Webhook handler (resumes workflow):
```python
import json

from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/langgraph-resume")
async def resume_workflow(request: Request, background_tasks: BackgroundTasks):
    data = await request.json()
    workflow_id = data["metadata"]["workflow_id"]
    if data["status"] == "success":
        llm_response = json.loads(data["response"]["body"])
        # Resume in background (don't block the webhook response)
        background_tasks.add_task(
            continue_workflow,
            workflow_id,
            llm_response
        )
    return {"ok": True}

async def continue_workflow(workflow_id: str, llm_response: dict):
    # Resume the checkpointed graph under its thread id
    config = {"configurable": {"thread_id": workflow_id}}
    agent.update_state(config, {
        "messages": [..., llm_response],
        "status": "complete"
    })
    await agent.ainvoke(None, config)
```
Combining Regional Rerouting + Provider Fallbacks
The most resilient pattern: regional rerouting for your primary API, provider fallback if all regions fail.
```python
# Fallback to Anthropic if all OpenAI regions fail
anthropic_fallback = (
    Step(ez)
    .url("https://api.anthropic.com/v1/messages")
    .method("POST")
    .headers({
        "x-api-key": ANTHROPIC_KEY,
        "anthropic-version": "2023-06-01"
    })
    .body({
        "model": "claude-3-5-sonnet-20241022",
        "messages": state["messages"]
    })
)

result = (
    Step(ez)
    .url("https://api.openai.com/v1/chat/completions")
    # .method/.headers/.body as in the earlier example
    .regions(["iad", "lax", "ord"])
    .region_policy("fallback")  # Try one region, reroute on error
    .fallback(anthropic_fallback, trigger_on_error=[500, 502, 503])
    .webhooks([{"url": "https://yourapp.com/resume"}])
    .execute()
)

# What happens if OpenAI US-East is down:
#   Try iad → 500
#   Try lax → succeeds
#   Anthropic never fires (not needed)

# What happens if ALL OpenAI regions fail:
#   Try iad → 500
#   Try lax → 500
#   Try ord → 500
#   Fallback to Anthropic → succeeds

# Cost: Only pays for what's needed (1-4 requests)
# Your workflow survived both regional AND provider failures
```
This gives you multi-layer resilience without burning through quota.
Advanced: Regional Racing (When Queue Depth Matters)
There's a second pattern - regional racing - where you send the same request to multiple regions simultaneously and take the fastest response.
When this helps:
If you're using a shared coordination layer (like the EZThrottle community instance), different regions might have different queue depths at any moment:
IAD queue: 200 jobs waiting
LAX queue: 50 jobs waiting
ORD queue: 500 jobs waiting
Without racing:
- Routed to IAD (closest)
- Waits behind 200 jobs
- Takes 60+ seconds
With racing:
- Fires to all 3 regions
- LAX has shortest queue
- LAX completes in 15 seconds
- Cancel IAD and ORD (best effort)
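That flow can be sketched with `asyncio`: fire every region at once, take the first result, cancel the rest. The sketch below simulates the calls with `asyncio.sleep` instead of real HTTP, so the delays stand in for the queue depths above:

```python
import asyncio

async def race_regions(requests):
    """Fire the same request to every region; first completion wins,
    the rest are cancelled (best effort, as described above)."""
    tasks = {asyncio.create_task(coro): region
             for region, coro in requests.items()}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # best-effort cancel of the slower regions
    winner = done.pop()
    return tasks[winner], winner.result()

# Stand-in for a real API call; delay plays the role of queue depth
async def fake_call(delay, payload):
    await asyncio.sleep(delay)
    return payload

async def main():
    region, result = await race_regions({
        "iad": fake_call(0.20, "iad result"),  # 200 jobs queued
        "lax": fake_call(0.05, "lax result"),  # 50 jobs queued
        "ord": fake_call(0.50, "ord result"),  # 500 jobs queued
    })
    return region

print(asyncio.run(main()))  # lax
```

A real implementation needs more care than this sketch: a winner that *failed* fastest shouldn't win (check the result before cancelling the others), and cancellation only saves money if the provider actually stops billing the cancelled request.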
The tradeoff:
Racing costs 2-3× in API calls (you might pay for multiple responses even though you only use one, since cancellation is best-effort).
Use racing when:
- Latency is critical (user-facing chat)
- You have quota headroom (not near rate limits)
- Shared infrastructure creates variable queue depths
Use rerouting when:
- Cost matters (quota-constrained)
- Background processing (not user-facing)
- You want reliability without burning quota
Most LangGraph users should start with rerouting, not racing.
What This Means For Operations
Without regional coordination:
- Engineers debug "why is it slow for some workflows?"
- Manual intervention to route around bad regions
- On-call pages when regions degrade
- Post-mortems about "undetectable slowness in US-East"
- Lost agent progress due to timeouts
With regional coordination:
- System automatically routes around slow/failing regions
- No manual intervention needed
- No on-call pages for regional issues
- Dashboard shows: "US-East was down 2-4pm, auto-routed to US-West"
- Workflows survive without restarting
This is Layer 7 automation.
Regional failures become boring. Not "all hands on deck." Just: "System compensated automatically, fix when convenient."
Engineers go home at 5pm. Infrastructure handles it.
The Choice
Build it yourself:
- Health checks across multiple regions
- Routing logic to detect and avoid bad regions
- Coordination layer for automatic rerouting
- Webhook delivery infrastructure
- Ongoing maintenance as your agents scale
Or:
- Focus on your agents
- Let infrastructure handle reliability
- Go home at 5pm
Same choice as Part 1. Different problem.
Getting Started
If you want regional rerouting without building infrastructure:
Free tier: 1M requests/month at ezthrottle.network
Or build it yourself: Architecture details
Related reading:
- Part 1: Stop Losing LangGraph Progress to 429 Errors
- Deep dive: Making Failure Boring Again
- Advanced workflows: Serverless 2.0: RIP Operations
My Mission
US-EAST-1 will go down again. So will US-WEST-2. So will every region.
The question isn't "will it happen?" It's "what happens to your workflows when it does?"
Right now: Engineers debug at 3am. Workflows restart from step 1. Progress lost.
This is what Layer 7 automation means:
Regional failures handled automatically. Traffic reroutes without human decisions. Workflows survive without manual intervention.
Infrastructure that just works. Agents that run for months without pages. Engineers who go home at 5pm.
That's what I'm building toward.
Use it or build it yourself.
Just stop making engineers babysit regional failures.
Find me on X: @RahmiPruitt