Retry and Fallback Strategies
LLM APIs are inherently unreliable: rate limits, timeouts, server errors, and capacity issues are facts of life. A production application must handle these failures gracefully. This lesson covers exponential backoff, circuit breakers, model fallback chains, and designing for degraded operation.
Exponential Backoff
When an LLM API returns a transient error (429 rate limit, 500 server error, or timeout), retrying immediately usually makes things worse. Exponential backoff increases the wait time between retries, giving the service time to recover. Add jitter to prevent thundering herd problems when multiple clients retry simultaneously.
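The delay schedule itself is simple to express. Here is a minimal sketch of the computation, assuming "full jitter" (the function name and signature below are illustrative, not part of any library):

```typescript
// Compute the wait before retry `attempt` (1-based) using
// exponential backoff with full jitter.
function backoffDelayMs(
  attempt: number,
  initialDelayMs = 1000,
  maxDelayMs = 30000,
): number {
  // Double the base delay each attempt: 1s, 2s, 4s, ... capped at maxDelayMs
  const exponential = Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs);
  // Full jitter: pick a random point in [0, exponential] so clients
  // that failed together do not all retry at the same instant.
  return Math.random() * exponential;
}
```

With full jitter the returned delay is at most the exponential cap for that attempt, never more, which keeps the worst-case wait predictable while still spreading retries out.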
import { createAgent } from '@waymakerai/aicofounder-core';

const agent = createAgent({
  model: 'gpt-4o',
  retry: {
    maxRetries: 3,
    strategy: 'exponential',
    initialDelayMs: 1000, // First retry after 1s
    maxDelayMs: 30000,    // Cap at 30s
    jitter: true,         // Add randomness to prevent thundering herd
    retryableErrors: [429, 500, 502, 503, 504, 'TIMEOUT'],
    onRetry: (attempt, error, delayMs) => {
      console.warn(
        `Retry ${attempt}: ${error.message}, waiting ${delayMs}ms`
      );
    },
  },
});

// The agent automatically retries on transient failures
const result = await agent.run('Summarize this document');

Circuit Breaker Pattern
A circuit breaker prevents your application from repeatedly calling a failing service. After a threshold of consecutive failures, the circuit "opens" and all requests fail immediately for a cool-down period. This protects both your application (from hanging on slow failures) and the service (from being overwhelmed during recovery).
import { createAgent, CircuitBreaker } from '@waymakerai/aicofounder-core';

const breaker = new CircuitBreaker({
  failureThreshold: 5,  // Open after 5 consecutive failures
  resetTimeoutMs: 60000, // Try again after 60 seconds
  halfOpenRequests: 2,   // Allow 2 test requests in half-open state
  onStateChange: (from, to) => {
    console.log(`Circuit breaker: ${from} -> ${to}`);
    if (to === 'open') {
      alertOps('LLM circuit breaker opened');
    }
  },
});

const agent = createAgent({
  model: 'gpt-4o',
  circuitBreaker: breaker,
});

try {
  const result = await agent.run('Hello');
} catch (error) {
  if (error.code === 'CIRCUIT_OPEN') {
    // Serve cached response or show friendly error
    return getCachedResponse() || 'Service temporarily unavailable';
  }
}

Model Fallback Chains
When your primary model is unavailable or too slow, falling back to an alternative model keeps your application running. CoFounder supports fallback chains that automatically try the next model when the current one fails. You can configure different fallback chains for different quality requirements.
A typical chain might be: GPT-4o (primary, highest quality) to Claude 3.5 Sonnet (first fallback, different provider) to GPT-4o-mini (second fallback, faster and cheaper). The key insight is to fall back across providers, not just models, so a single provider outage does not take down your application.
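The chain logic itself is straightforward to sketch in plain TypeScript. The types and helper below are illustrative stand-ins, not CoFounder's actual API:

```typescript
type ModelCall = (prompt: string) => Promise<string>;

// Try each model in order; return the first success.
// If every model in the chain fails, surface the last error.
async function runWithFallbacks(
  chain: { name: string; call: ModelCall }[],
  prompt: string,
): Promise<{ model: string; output: string }> {
  let lastError: unknown;
  for (const { name, call } of chain) {
    try {
      return { model: name, output: await call(prompt) };
    } catch (err) {
      lastError = err; // A real implementation would log the fallback here
    }
  }
  throw lastError;
}
```

Ordering the chain across providers, as described above, means a `runWithFallbacks` call survives a full outage of the primary provider as long as any later entry is healthy.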
Degraded Mode Operation
When all LLM providers are down, your application should still function, just with reduced capabilities. Degraded mode strategies include: serving cached responses for common queries, using rule-based fallbacks for simple tasks, queuing requests (with a notification to the user) and processing them when the service recovers, or routing to a simpler local model.
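As a concrete illustration, a degraded-mode request handler might check a cache first and queue anything it cannot answer. The in-memory cache and queue below are simplified stand-ins for whatever stores your application actually uses:

```typescript
const responseCache = new Map<string, string>();
const pendingQueue: string[] = [];

// Answer from cache when providers are down; otherwise queue the
// request and tell the user it will be processed on recovery.
function handleInDegradedMode(query: string): string {
  const cached = responseCache.get(query);
  if (cached !== undefined) {
    return cached; // Serve a previously cached answer immediately
  }
  pendingQueue.push(query); // Replay once a provider recovers
  return 'The service is temporarily degraded; your request has been queued.';
}
```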
Design your UI to communicate the degraded state clearly. Users are more forgiving of reduced functionality when they understand what is happening. CoFounder's health check system monitors provider availability and triggers degraded mode automatically, while exposing status indicators for your UI.