
How to Handle Multi-Region API Failures in LangGraph

Part 2 of 2: Making LangGraph workflows production-ready

By @RahmiPruitt

US-EAST-1 Went Down. Somebody Got Fired.

Millions were lost due to a misconfiguration.

This happens every year. AWS, Azure, GCP - everyone has regional failures.

The question isn't "will a region fail?" It's "what happens to your LangGraph workflows when it does?"

Cloudflare's global CDN automatically finds the next healthiest node when a region fails. Your static content stays up.

But your API calls? Your LangGraph workflows calling OpenRouter, Anthropic, OpenAI?

They crash.

What if API failures were automatically rerouted the same way Cloudflare routes traffic - without you doing anything?

This is Part 2 of making LangGraph workflows production-ready. Part 1 covered 429 errors and rate limits. This covers the harder problem: regional failures you can't see coming.

The Problem With "Not Down"

The worst outages aren't the dramatic ones where everything is on fire.

They're the invisible ones.

Your LangGraph workflow isn't crashing. It's just... slow.

Step 3 takes 30 seconds instead of 2. Step 5 times out after 60 seconds. Step 7 eventually succeeds, but takes 45 seconds.

No errors in the logs. No 429s. No 500s. Just slow. Painfully slow.

This is a partial outage.

DNS routed you to api.openai.com → US-East. US-East isn't DOWN (health checks pass). US-East is just SLOW (10s response instead of 2s).

Why client-side retries can't help:

from litellm import completion
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def call_openai():
    return completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "..."}]
    )

# What happens when US-East is slow:
# Request 1 → US-East (slow, 10s)
# Retry 1 → US-East again (DNS still routes here, slow, 10s)
# Retry 2 → US-East again (still slow, 10s)
# Total: 30 seconds, then maybe succeeds or times out

You can't detect this with status-code alerts or error logs: there's nothing to alert on.

The trap: "Everything is working. It's just slow. Must be our code."

It's not your code. It's infrastructure you can't see.

What You're Actually Fighting

Modern LLM APIs deploy across multiple regions.

DNS decides which region you hit. You don't control this. Your code doesn't see it.

When one region degrades, requests routed there slow down or fail, but DNS keeps sending you to it.

You can't fix this with application code alone.

You need:

  1. Visibility into which regions are healthy
  2. Ability to route around degraded regions
  3. Automatic failover without manual intervention
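The visibility piece can be sketched in a few lines. This is a minimal, hypothetical health tracker (the window size and slowness threshold are illustrative assumptions, not anything the coordination layer actually uses): record per-region latencies and flag a region as degraded when its recent average gets slow.

```python
from collections import defaultdict, deque

class RegionHealth:
    """Tracks recent latencies per region and flags degraded ones."""

    def __init__(self, window=20, slow_threshold_s=5.0):
        self.slow_threshold_s = slow_threshold_s
        # Keep only the last `window` samples per region.
        self.latencies = defaultdict(lambda: deque(maxlen=window))

    def record(self, region: str, latency_s: float):
        self.latencies[region].append(latency_s)

    def is_degraded(self, region: str) -> bool:
        samples = self.latencies[region]
        if not samples:
            return False  # no data yet: assume healthy
        return sum(samples) / len(samples) > self.slow_threshold_s

    def healthy_regions(self, regions):
        return [r for r in regions if not self.is_degraded(r)]

health = RegionHealth()
health.record("iad", 10.2)  # US-East responding slowly
health.record("lax", 1.8)
print(health.healthy_regions(["iad", "lax"]))  # ["lax"]
```

The hard part isn't this logic; it's running it somewhere that sees traffic across all your workers, which is why it belongs in a coordination layer rather than each client.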

The traditional solution: status-page monitoring, on-call pages, manual DNS failover.

But that's REACTIVE. Problem happens, then you scramble to fix it.

You want PROACTIVE. System detects and routes around problems automatically.

Automatic Regional Rerouting

Here's the pattern: try one region. If it fails or times out, automatically try another.

from langgraph.graph import StateGraph
from ezthrottle import EZThrottle, Step, StepType

ez = EZThrottle(api_key="your_key")

def call_llm_node(state):
    result = (
        Step(ez)
        .url("https://api.openai.com/v1/chat/completions")
        .method("POST")
        .headers({
            "Authorization": f"Bearer {OPENAI_KEY}",
            "Content-Type": "application/json"
        })
        .body({
            "model": "gpt-4",
            "messages": state["messages"]
        })
        .type(StepType.PERFORMANCE)
        .regions(["iad", "lax", "ord"])  # US-East, US-West, Chicago
        .region_policy("fallback")  # Try one, reroute on error
        .webhooks([{"url": "https://yourapp.com/langgraph-resume"}])
        .idempotent_key(f"workflow_{state['workflow_id']}_step_{state['step']}")
        .execute()
    )

    return {"job_id": result["job_id"], "status": "waiting"}

# What happens if US-East (iad) returns 500:
# 0ms:   Try iad
# 100ms: iad returns 500
# 100ms: Automatically try lax
# 2100ms: lax succeeds
# Workflow continues from step checkpoint

# Cost: 1 request normally, 2 only on errors
# Your workflow survived a regional failure automatically

No manual intervention. No DNS changes. No on-call pages.

The coordination layer detects the failure and routes around it.
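Conceptually, the fallback policy is just a loop over region endpoints: try one, and on an error or timeout move to the next. Here's a client-side sketch of that idea using `requests` (the per-region URLs are hypothetical; in practice the rerouting happens inside the coordination layer, not your code):

```python
import requests

# Hypothetical per-region endpoints for illustration only.
REGION_ENDPOINTS = {
    "iad": "https://iad.example-gateway.com/v1/chat/completions",
    "lax": "https://lax.example-gateway.com/v1/chat/completions",
    "ord": "https://ord.example-gateway.com/v1/chat/completions",
}

def call_with_region_fallback(payload, headers, regions=("iad", "lax", "ord")):
    last_error = None
    for region in regions:
        try:
            resp = requests.post(
                REGION_ENDPOINTS[region],
                json=payload,
                headers=headers,
                timeout=10,  # treat a slow region like a failed one
            )
            if resp.status_code < 500:
                return region, resp  # success, or a client error worth surfacing
            last_error = resp.status_code  # 5xx: try the next region
        except requests.RequestException as exc:
            last_error = exc  # connection error / timeout: try the next region
    raise RuntimeError(f"all regions failed: {last_error}")
```

Note the `timeout=10`: without it, the partial-outage scenario from earlier (slow, not down) never triggers a failover at all.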

Webhook handler (resumes workflow):

import json

from fastapi import FastAPI, Request, BackgroundTasks

app = FastAPI()

@app.post("/langgraph-resume")
async def resume_workflow(request: Request, background_tasks: BackgroundTasks):
    data = await request.json()
    workflow_id = data["metadata"]["workflow_id"]

    if data["status"] == "success":
        llm_response = json.loads(data["response"]["body"])

        # Resume in background (don't block webhook)
        background_tasks.add_task(
            continue_workflow,
            workflow_id,
            llm_response
        )

    return {"ok": True}

async def continue_workflow(workflow_id: str, llm_response: dict):
    agent.update_state(workflow_id, {
        "messages": [..., llm_response],
        "status": "complete"
    })

    await agent.ainvoke({"workflow_id": workflow_id})

Combining Regional Rerouting + Provider Fallbacks

The most resilient pattern: regional rerouting for your primary API, provider fallback if all regions fail.

# Fallback to Anthropic if all OpenAI regions fail
anthropic_fallback = (
    Step(ez)
    .url("https://api.anthropic.com/v1/messages")
    .method("POST")
    .headers({
        "x-api-key": ANTHROPIC_KEY,
        "anthropic-version": "2023-06-01"
    })
    .body({
        "model": "claude-3-5-sonnet-20241022",
        "messages": state["messages"]
    })
)

result = (
    Step(ez)
    .url("https://api.openai.com/v1/chat/completions")
    .regions(["iad", "lax", "ord"])
    .region_policy("fallback")  # Try one region, reroute on error
    .fallback(anthropic_fallback, trigger_on_error=[500, 502, 503])
    .webhooks([{"url": "https://yourapp.com/resume"}])
    .execute()
)

# What happens if OpenAI US-East is down:
# Try iad → 500
# Try lax → succeeds
# Anthropic never fires (not needed)

# What happens if ALL OpenAI regions fail:
# Try iad → 500
# Try lax → 500
# Try ord → 500
# Fallback to Anthropic → succeeds

# Cost: Only pays for what's needed (1-4 requests)
# Your workflow survived both regional AND provider failures

This gives you multi-layer resilience without burning through quota.

Advanced: Regional Racing (When Queue Depth Matters)

There's a second pattern - regional racing - where you send the same request to multiple regions simultaneously and take the fastest response.

When this helps:

If you're using a shared coordination layer (like the EZThrottle community instance), different regions might have different queue depths at any moment:

IAD queue: 200 jobs waiting
LAX queue: 50 jobs waiting
ORD queue: 500 jobs waiting

Without racing:
- Routed to IAD (closest)
- Waits behind 200 jobs
- Takes 60+ seconds

With racing:
- Fires to all 3 regions
- LAX has shortest queue
- LAX completes in 15 seconds
- Cancel IAD and ORD (best effort)

The tradeoff:

Racing costs 2-3× in API calls (you might pay for multiple responses even though you only use one, since cancellation is best-effort).
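The racing pattern itself is a few lines of asyncio: fire one task per region, take the first to finish, cancel the rest. The `call_region` coroutine below is a stand-in for a real HTTP request, with the sleep modeling that region's queue depth, and the cancellation is best-effort just like the caveat above.

```python
import asyncio

async def call_region(region: str, queue_delay_s: float) -> str:
    # Stand-in for a real HTTP request; the delay models queue depth.
    await asyncio.sleep(queue_delay_s)
    return f"response from {region}"

async def race(regions: dict) -> str:
    tasks = {
        asyncio.create_task(call_region(region, delay))
        for region, delay in regions.items()
    }
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # best-effort: the API may still bill these requests
    return done.pop().result()

# LAX has the shortest queue, so it wins the race.
result = asyncio.run(race({"iad": 0.6, "lax": 0.15, "ord": 1.0}))
print(result)  # response from lax
```

The cost caveat lives in that `task.cancel()` line: by the time you cancel, the other regions may have already accepted (and billed) the request.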

Use racing when latency is critical and you can absorb 2-3× the API cost.

Use rerouting when you want resilience at normal cost: extra requests fire only on errors.

Most LangGraph users should start with rerouting, not racing.

What This Means For Operations

Without regional coordination: engineers get paged at 3am, workflows restart from step 1, progress is lost.

With regional coordination: traffic reroutes automatically, workflows resume from their checkpoint, and the failure becomes a ticket instead of an incident.

This is Layer 7 automation.

Regional failures become boring. Not "all hands on deck." Just: "System compensated automatically, fix when convenient."

Engineers go home at 5pm. Infrastructure handles it.

The Choice

Build it yourself: per-region health checks, routing logic, failover, best-effort cancellation, idempotency.

Or: let a coordination layer handle it.

Same choice as Part 1. Different problem.

Getting Started

If you want regional rerouting without building infrastructure:

SDKs: Python | Node.js | Go

Free tier: 1M requests/month at ezthrottle.network

Or build it yourself: Architecture details

Related reading: Part 1, on surviving 429 errors and rate limits.

My Mission

US-EAST-1 will go down again. So will US-WEST-2. So will every region.

The question isn't "will it happen?" It's "what happens to your workflows when it does?"

Right now: Engineers debug at 3am. Workflows restart from step 1. Progress lost.

This is what Layer 7 automation means:

Regional failures handled automatically. Traffic reroutes without human decisions. Workflows survive without manual intervention.

Infrastructure that just works. Agents that run for months without pages. Engineers who go home at 5pm.

That's what I'm building toward.

Use it or build it yourself.

Just stop making engineers babysit regional failures.

Find me on X: @RahmiPruitt


Built on BEAM by a solo founder who believes engineers deserve to sleep at night.