Scaling Strategies
AI agent applications face unique scaling challenges: LLM calls are slow and expensive, database connections are limited, and agent executions can hold resources for seconds or minutes. This lesson covers the patterns that let your CoFounder application handle thousands of concurrent users without falling over.
Horizontal Scaling
Scale your application horizontally by running multiple instances behind a load balancer. On Vercel, this happens automatically. On AWS or Kubernetes, configure auto-scaling based on CPU, memory, or custom metrics like active agent count:
```yaml
# Kubernetes HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cofounder-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cofounder-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: cofounder_active_agents
        target:
          type: AverageValue
          averageValue: "5" # Scale up when avg active agents per pod > 5
```
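The `cofounder_active_agents` metric above is application-defined: something in your app must count in-flight executions and expose the number for a metrics adapter to scrape. A minimal hand-rolled sketch of that counter (in practice you would likely use the `prom-client` package; the class and helper names here are illustrative assumptions, not part of the lesson's stack):

```typescript
// A tiny in-process gauge rendered in Prometheus text exposition format.
class ActiveAgentsGauge {
  private count = 0;

  increment(): void { this.count += 1; }
  decrement(): void { this.count = Math.max(0, this.count - 1); }

  // Prometheus text format: HELP/TYPE comment lines plus one sample line.
  render(): string {
    return [
      '# HELP cofounder_active_agents Number of agent executions in flight',
      '# TYPE cofounder_active_agents gauge',
      `cofounder_active_agents ${this.count}`,
    ].join('\n');
  }
}

const activeAgents = new ActiveAgentsGauge();

// Wrap agent execution so the gauge always tracks in-flight work,
// even when the agent throws.
async function withAgentGauge<T>(run: () => Promise<T>): Promise<T> {
  activeAgents.increment();
  try {
    return await run();
  } finally {
    activeAgents.decrement();
  }
}
```

An HTTP `/metrics` route would return `activeAgents.render()` for the Prometheus scraper, which a metrics adapter then surfaces to the HorizontalPodAutoscaler.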
Connection Pooling with PgBouncer

Serverless functions and auto-scaled containers open and close database connections rapidly, which can exhaust PostgreSQL's connection limit. PgBouncer sits between your application and database, pooling connections efficiently. Supabase includes PgBouncer on port 6543:
```typescript
// lib/db.ts — connection pooling configuration
import { Pool } from 'pg';

// Direct connection (for migrations, admin tasks)
export const directPool = new Pool({
  connectionString: process.env.DATABASE_URL, // port 5432
  max: 5,
});

// Pooled connection via PgBouncer (for application queries)
export const pooledPool = new Pool({
  connectionString: process.env.DATABASE_URL_POOLED, // port 6543
  max: 20, // Can be higher since PgBouncer manages the actual DB connections
});

// With Supabase, use the pooled URL for all application queries:
// DATABASE_URL_POOLED=postgresql://postgres:password@db.project.supabase.co:6543/postgres

// For Prisma, set the connection in schema.prisma:
// datasource db {
//   provider  = "postgresql"
//   url       = env("DATABASE_URL_POOLED")
//   directUrl = env("DATABASE_URL")
// }
```
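Whichever pool a query goes through, the checkout/release discipline is the same: acquire a client, run the query, and release in a `finally` block so an error can never leak a pooled connection. A sketch of such a helper, with a `MinimalPool` interface standing in for pg's `Pool` so the example is self-contained:

```typescript
// Minimal shape of what we need from pg's Pool, for illustration only.
interface MinimalClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
  release(): void;
}
interface MinimalPool {
  connect(): Promise<MinimalClient>;
}

// Run a unit of work against a checked-out client, always returning
// the client to the pool afterwards.
export async function withClient<T>(
  pool: MinimalPool,
  fn: (client: MinimalClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await fn(client);
  } finally {
    client.release(); // never leak a pooled connection
  }
}
```

With pg's real `Pool`, a one-off query can also just call `pool.query(...)`, which does this checkout/release internally; the explicit form matters when several statements must share one connection (e.g., a transaction).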
Redis Cluster and Caching

Redis serves multiple roles in a CoFounder application: session storage, rate limiting, LLM response caching, and pub/sub for real-time updates. As traffic grows, move from a single Redis instance to a cluster:
```typescript
// lib/redis.ts
import Redis from 'ioredis';

// Single instance (development / small scale)
const redis = new Redis(process.env.REDIS_URL!);

// Redis Cluster (production at scale)
// const redis = new Redis.Cluster([
//   { host: 'redis-1.example.com', port: 6379 },
//   { host: 'redis-2.example.com', port: 6379 },
//   { host: 'redis-3.example.com', port: 6379 },
// ]);

// Cache LLM responses for identical inputs
export async function cachedLLMCall<T>(
  cacheKey: string,
  ttlSeconds: number,
  callFn: () => Promise<T>
): Promise<T> {
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached) as T;
  }
  const result = await callFn();
  await redis.setex(cacheKey, ttlSeconds, JSON.stringify(result));
  return result;
}

// Rate limiting with a sliding window over a sorted set
export async function checkRateLimit(
  userId: string,
  maxRequests: number,
  windowSeconds: number
): Promise<boolean> {
  const key = `ratelimit:${userId}`;
  const now = Date.now();
  const pipe = redis.pipeline();
  pipe.zremrangebyscore(key, 0, now - windowSeconds * 1000); // drop entries outside the window
  pipe.zadd(key, now, `${now}-${Math.random()}`); // unique member so same-millisecond requests all count
  pipe.zcard(key);
  pipe.expire(key, windowSeconds);
  const results = await pipe.exec();
  const count = results?.[2]?.[1] as number;
  return count <= maxRequests;
}
```
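`cachedLLMCall` is only as good as its cache key: it must match exactly when, and only when, the inputs match. One way to build such a key (an illustrative assumption, not part of the lesson's API) is to hash every input that affects the model's output:

```typescript
import { createHash } from 'node:crypto';

// Derive a deterministic cache key from everything that affects the LLM output.
export function llmCacheKey(
  model: string,
  prompt: string,
  params: Record<string, unknown> = {},
): string {
  // Pass sorted keys as the JSON.stringify replacer so { a, b } and
  // { b, a } serialize identically.
  const canonicalParams = JSON.stringify(params, Object.keys(params).sort());
  const digest = createHash('sha256')
    .update(`${model}\n${prompt}\n${canonicalParams}`)
    .digest('hex');
  return `llm:${model}:${digest.slice(0, 32)}`;
}
```

Remember that caching only pays off for deterministic calls; with a nonzero temperature, identical inputs are expected to produce different outputs, so include sampling parameters in the key and cache selectively.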
Queue-Based Agent Execution

For agents that take more than a few seconds to complete, move execution to a background queue. This frees up your API servers and lets you control concurrency. Use BullMQ with Redis or AWS SQS:
```typescript
// lib/agent-queue.ts
import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

// extractTraceContext, getUserPriority, and executeAgentWithContext
// are app-specific helpers defined elsewhere.
const connection = new Redis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

// Create the queue
export const agentQueue = new Queue('agent-execution', { connection });

// API route: enqueue agent work
export async function POST(req: Request) {
  const { agentId, input, userId } = await req.json();
  const job = await agentQueue.add('execute', {
    agentId,
    input,
    userId,
    traceContext: extractTraceContext(), // Propagate OTel context
  }, {
    priority: getUserPriority(userId),
    attempts: 2,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: { count: 1000 },
    removeOnFail: { count: 5000 },
  });
  return Response.json({ jobId: job.id, status: 'queued' });
}

// Worker process (runs separately or in a dedicated container)
const worker = new Worker('agent-execution', async (job) => {
  const { agentId, input, userId, traceContext } = job.data;
  const result = await executeAgentWithContext(agentId, input, traceContext);
  // Store result for polling or push via WebSocket
  await connection.setex(`result:${job.id}`, 3600, JSON.stringify(result));
  return result;
}, {
  connection,
  concurrency: 5, // Process 5 agents concurrently per worker
  limiter: { max: 10, duration: 60_000 }, // Max 10 jobs per minute
});
```
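On the client side, the `jobId` returned by the enqueue route has to be redeemed for a result. One option is simple polling; the sketch below is generic over a `fetchStatus` function because the status endpoint's shape is an assumption, not something this lesson defines. In production you would often push the result over WebSocket instead, as the worker comment notes:

```typescript
export type JobStatus<T> =
  | { status: 'queued' | 'running' }
  | { status: 'done'; result: T };

// Poll a status-fetching function until the job completes or attempts run out.
export async function pollJobResult<T>(
  fetchStatus: () => Promise<JobStatus<T>>,
  { intervalMs = 1000, maxAttempts = 30 } = {},
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status.status === 'done') return status.result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Job did not complete in time');
}
```

A hypothetical status route backing `fetchStatus` would read the `result:${jobId}` key the worker wrote and report `done` once it exists.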
CDN and Static Asset Optimization

Offload static assets (JS bundles, images, fonts) to a CDN so your servers focus on dynamic agent requests. On Vercel this is automatic. On AWS, use CloudFront in front of S3. Set long cache headers for hashed assets and short TTLs for HTML:
- `_next/static/**` — immutable, cache for 1 year.
- `public/**` — cache for 1 day with stale-while-revalidate.
- HTML pages — cache for 60 seconds or use ISR (Incremental Static Regeneration).
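On a self-hosted Next.js deployment, those TTLs can be expressed as Cache-Control response headers in the Next config. A sketch, assuming Next.js 15+ (which accepts `next.config.ts`; on older versions the same object goes in `next.config.js`); the `/images/:path*` route is an illustrative assumption, and Next.js already serves `_next/static` with immutable caching by default, so that rule mainly matters when a proxy in front strips headers:

```typescript
// next.config.ts — cache headers for a self-hosted deployment (sketch)
const config = {
  async headers() {
    return [
      {
        // Hashed build assets never change between deploys
        source: '/_next/static/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, max-age=31536000, immutable' },
        ],
      },
      {
        // Example public-asset path; adjust to your layout
        source: '/images/:path*',
        headers: [
          { key: 'Cache-Control', value: 'public, max-age=86400, stale-while-revalidate=604800' },
        ],
      },
    ];
  },
};

export default config;
```

HTML caching is usually better handled by ISR (`revalidate` on the page or fetch) than by hand-set headers, since ISR keeps the CDN and the rendered content in sync.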