Retry and Fallback Strategies
LLM APIs are inherently unreliable: rate limits, timeouts, server errors, and capacity issues are facts of life. A production application must handle these failures gracefully. This lesson covers exponential backoff, circuit breakers, model fallback chains, and designing for degraded operation.
Exponential Backoff
When an LLM API returns a transient error (429 rate limit, 500 server error, or timeout), retrying immediately usually makes things worse. Exponential backoff increases the wait time between retries, giving the service time to recover. Add jitter to prevent thundering herd problems when multiple clients retry simultaneously.
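The delay schedule itself is simple to express. Here is a minimal sketch of the computation, assuming "full jitter" (the function name and signature below are illustrative, not part of any library):

```typescript
// Compute the wait before retry `attempt` (1-based) using
// exponential backoff with full jitter.
function backoffDelayMs(
  attempt: number,
  initialDelayMs = 1000,
  maxDelayMs = 30000,
): number {
  // Double the base delay each attempt: 1s, 2s, 4s, ... capped at maxDelayMs
  const exponential = Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs);
  // Full jitter: pick a random point in [0, exponential] so clients
  // that failed together do not all retry at the same instant.
  return Math.random() * exponential;
}
```

With full jitter the returned delay is at most the exponential cap for that attempt, never more, which keeps the worst-case wait predictable while still spreading retries out.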
import { createAgent } from '@waymakerai/aicofounder-core';

const agent = createAgent({
  model: 'gpt-4o',
  retry: {
    maxRetries: 3,
    strategy: 'exponential',
    initialDelayMs: 1000, // First retry after 1s
    maxDelayMs: 30000,    // Cap at 30s
    jitter: true,         // Add randomness to prevent thundering herd
    retryableErrors: [429, 500, 502, 503, 504, 'TIMEOUT'],
    onRetry: (attempt, error, delayMs) => {
      console.warn(
        `Retry ${attempt}: ${error.message}, waiting ${delayMs}ms`
      );
    },
  },
});

// The agent automatically retries on transient failures
const result = await agent.run('Summarize this document');

Circuit Breaker Pattern
A circuit breaker prevents your application from repeatedly calling a failing service. After a threshold of consecutive failures, the circuit "opens" and all requests fail immediately for a cool-down period. This protects both your application (from hanging on slow failures) and the service (from being overwhelmed during recovery).
import { createAgent, CircuitBreaker } from '@waymakerai/aicofounder-core';

const breaker = new CircuitBreaker({
  failureThreshold: 5,  // Open after 5 consecutive failures
  resetTimeoutMs: 60000, // Try again after 60 seconds
  halfOpenRequests: 2,   // Allow 2 test requests in half-open state
  onStateChange: (from, to) => {
    console.log(`Circuit breaker: ${from} -> ${to}`);
    if (to === 'open') {
      alertOps('LLM circuit breaker opened');
    }
  },
});

const agent = createAgent({
  model: 'gpt-4o',
  circuitBreaker: breaker,
});

try {
  const result = await agent.run('Hello');
} catch (error) {
  if (error.code === 'CIRCUIT_OPEN') {
    // Serve cached response or show friendly error
    return getCachedResponse() || 'Service temporarily unavailable';
  }
}

Model Fallback Chains
When your primary model is unavailable or too slow, falling back to an alternative model keeps your application running. CoFounder supports fallback chains that automatically try the next model when the current one fails. You can configure different fallback chains for different quality requirements.
A typical chain might be: GPT-4o (primary, highest quality) to Claude 3.5 Sonnet (first fallback, different provider) to GPT-4o-mini (second fallback, faster and cheaper). The key insight is to fall back across providers, not just models, so a single provider outage does not take down your application.
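The chain logic itself is straightforward to sketch in plain TypeScript. The types and helper below are illustrative stand-ins, not CoFounder's actual API:

```typescript
type ModelCall = (prompt: string) => Promise<string>;

// Try each model in order; return the first success.
// If every model in the chain fails, surface the last error.
async function runWithFallbacks(
  chain: { name: string; call: ModelCall }[],
  prompt: string,
): Promise<{ model: string; output: string }> {
  let lastError: unknown;
  for (const { name, call } of chain) {
    try {
      return { model: name, output: await call(prompt) };
    } catch (err) {
      lastError = err; // A real implementation would log the fallback here
    }
  }
  throw lastError;
}
```

Ordering the chain across providers, as described above, means a `runWithFallbacks` call survives a full outage of the primary provider as long as any later entry is healthy.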
Degraded Mode Operation
When all LLM providers are down, your application should still function, just with reduced capabilities. Degraded mode strategies include: serving cached responses for common queries, using rule-based fallbacks for simple tasks, queuing requests (with a notification to the user) and processing them when the service recovers, or routing to a simpler local model.
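As a concrete illustration, a degraded-mode request handler might check a cache first and queue anything it cannot answer. The in-memory cache and queue below are simplified stand-ins for whatever stores your application actually uses:

```typescript
const responseCache = new Map<string, string>();
const pendingQueue: string[] = [];

// Answer from cache when providers are down; otherwise queue the
// request and tell the user it will be processed on recovery.
function handleInDegradedMode(query: string): string {
  const cached = responseCache.get(query);
  if (cached !== undefined) {
    return cached; // Serve a previously cached answer immediately
  }
  pendingQueue.push(query); // Replay once a provider recovers
  return 'The service is temporarily degraded; your request has been queued.';
}
```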
Design your UI to communicate the degraded state clearly. Users are more forgiving of reduced functionality when they understand what is happening. CoFounder's health check system monitors provider availability and triggers degraded mode automatically, while exposing status indicators for your UI.