Setting Up Monitoring
AI agent applications have unique monitoring needs beyond traditional web apps. You need to track LLM latency, token consumption, cost per request, error rates across providers, and agent step counts. This lesson shows how to build a comprehensive monitoring stack.
Application Metrics for AI Agents
Standard web metrics (request latency, error rate, throughput) remain important, but AI agents introduce additional dimensions you must track. Here are the key metrics every CoFounder production deployment should capture:
- LLM call latency — p50, p95, p99 per model and provider.
- Token usage — input tokens, output tokens, and total per request.
- Cost per request — calculated from token usage and model pricing.
- Agent steps — how many tool-use iterations before completion.
- Error rate by type — rate limits, context length exceeded, provider outages.
- Cache hit rate — for Redis-cached LLM responses.
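As a worked example of the cost-per-request arithmetic, here is a sketch with illustrative per-1K-token prices (`requestCostUSD` and the price table are hypothetical helpers for illustration, not part of the codebase):

```typescript
// Cost = tokens / 1000 * per-1K price, summed over input and output.
// Prices are illustrative for a gpt-4o-class model; check your provider's current list.
const PRICE_PER_1K = { input: 0.0025, output: 0.01 };

function requestCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * PRICE_PER_1K.input +
    (outputTokens / 1000) * PRICE_PER_1K.output
  );
}

// 1,200 input tokens + 300 output tokens: 0.003 + 0.003 = $0.006
console.log(requestCostUSD(1200, 300).toFixed(4));
```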
These metrics can be defined with prom-client, registered against a dedicated registry:

```ts
// lib/metrics.ts
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

export const registry = new Registry();

export const llmLatency = new Histogram({
  name: 'cofounder_llm_latency_seconds',
  help: 'LLM call latency in seconds',
  labelNames: ['provider', 'model', 'status'],
  buckets: [0.5, 1, 2, 5, 10, 30],
  registers: [registry],
});

export const tokenUsage = new Counter({
  name: 'cofounder_token_usage_total',
  help: 'Total tokens consumed',
  labelNames: ['provider', 'model', 'direction'], // direction: input | output
  registers: [registry],
});

export const agentSteps = new Histogram({
  name: 'cofounder_agent_steps',
  help: 'Number of steps per agent execution',
  labelNames: ['agent_type'],
  buckets: [1, 2, 3, 5, 8, 10, 15, 20],
  registers: [registry],
});

export const requestCost = new Counter({
  name: 'cofounder_request_cost_usd',
  help: 'Estimated cost in USD per request',
  labelNames: ['provider', 'model', 'user_tier'],
  registers: [registry],
});

export const activeAgents = new Gauge({
  name: 'cofounder_active_agents',
  help: 'Number of currently executing agents',
  registers: [registry],
});
```

Instrumenting LLM Calls
Wrap your LLM client calls with metric recording. This middleware pattern works with any provider and captures latency, tokens, and cost automatically:
```ts
// lib/llm-instrumented.ts
import { llmLatency, tokenUsage, requestCost } from './metrics';

// Prices per 1K tokens; keep this table in sync with your providers' current pricing.
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'claude-sonnet-4-20250514': { input: 0.003, output: 0.015 },
};

export async function instrumentedLLMCall(
  provider: string,
  model: string,
  callFn: () => Promise<any>
) {
  const timer = llmLatency.startTimer({ provider, model });
  try {
    const result = await callFn();
    timer({ status: 'success' });

    // Record token usage (OpenAI-style usage shape; adapt the field names per provider)
    const usage = result.usage;
    if (usage) {
      tokenUsage.inc({ provider, model, direction: 'input' }, usage.prompt_tokens);
      tokenUsage.inc({ provider, model, direction: 'output' }, usage.completion_tokens);

      // Calculate cost from token counts and per-1K pricing
      const pricing = MODEL_PRICING[model];
      if (pricing) {
        const cost =
          (usage.prompt_tokens / 1000) * pricing.input +
          (usage.completion_tokens / 1000) * pricing.output;
        requestCost.inc({ provider, model, user_tier: 'standard' }, cost);
      }
    }
    return result;
  } catch (error) {
    timer({ status: 'error' });
    throw error;
  }
}
```

Prometheus and Grafana Setup
Expose a /api/metrics endpoint that Prometheus can scrape, then visualize everything in Grafana. Add both services to your docker-compose for local development:
```ts
// app/api/metrics/route.ts
import { registry } from '@/lib/metrics';
import { NextResponse } from 'next/server';

export async function GET() {
  const metrics = await registry.metrics();
  return new NextResponse(metrics, {
    headers: { 'Content-Type': registry.contentType },
  });
}
```
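Once Prometheus is scraping this endpoint, Grafana panels can be driven by PromQL queries along these lines (sketches written against the metric names defined earlier):

```promql
# p95 LLM latency over 5m, per model
histogram_quantile(0.95, sum by (le, model) (rate(cofounder_llm_latency_seconds_bucket[5m])))

# Token throughput, split by direction (input vs output)
sum by (direction) (rate(cofounder_token_usage_total[5m]))

# Estimated spend over the last 24 hours
sum(increase(cofounder_request_cost_usd[24h]))
```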
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'cofounder-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/api/metrics'
```

Alerting Rules
Configure alerts for conditions that indicate real problems, not noise. AI-specific alerts are critical because a silent provider outage or runaway agent can burn through your budget quickly:
```yaml
# prometheus-alerts.yml
groups:
  - name: cofounder-alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(cofounder_llm_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM p95 latency above 10s for 5 minutes"

      # sum() both sides so the error subset matches against total traffic;
      # without it, PromQL's label matching would pair error series only with
      # themselves and the ratio would always be 1.
      - alert: HighErrorRate
        expr: sum(rate(cofounder_llm_latency_seconds_count{status="error"}[5m])) / sum(rate(cofounder_llm_latency_seconds_count[5m])) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate above 10%"

      # Note: counters reset on process restart, so this is cumulative spend
      # since startup, not a true calendar-month total; for a strict monthly
      # budget, sum increase() over the month or use a recording rule.
      - alert: BudgetThresholdReached
        expr: sum(cofounder_request_cost_usd) > 400
        labels:
          severity: warning
        annotations:
          summary: "LLM cost has passed $400 (80% of the $500 monthly budget)"

      - alert: AgentStepCountHigh
        expr: histogram_quantile(0.95, rate(cofounder_agent_steps_bucket[15m])) > 12
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agents are taking too many steps (possible loop detected)"
```
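To run this stack locally, the Prometheus and Grafana services mentioned earlier can be added to Docker Compose. A minimal sketch, where the image tags, host ports, and the `app` service name are assumptions to adapt to your setup:

```yaml
# docker-compose.yml additions (monitoring services only)
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus-alerts.yml:/etc/prometheus/prometheus-alerts.yml:ro
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3001:3000" # Grafana UI on host port 3001; the app keeps 3000
    depends_on:
      - prometheus
```

Note that for the alert rules to load, prometheus.yml must also reference the alert file via a `rule_files:` entry pointing at the mounted path (e.g. `/etc/prometheus/prometheus-alerts.yml`).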