Lesson 6 of 10

Setting Up Monitoring

AI agent applications have unique monitoring needs beyond traditional web apps. You need to track LLM latency, token consumption, cost per request, error rates across providers, and agent step counts. This lesson shows how to build a comprehensive monitoring stack.

Application Metrics for AI Agents

Standard web metrics (request latency, error rate, throughput) remain important, but AI agents introduce additional dimensions you must track. Here are the key metrics every CoFounder production deployment should capture:

  • LLM call latency — p50, p95, p99 per model and provider.
  • Token usage — input tokens, output tokens, and total per request.
  • Cost per request — calculated from token usage and model pricing.
  • Agent steps — how many tool-use iterations before completion.
  • Error rate by type — rate limits, context length exceeded, provider outages.
  • Cache hit rate — for Redis-cached LLM responses.

Define these metrics with prom-client and register them on a dedicated registry so the scrape endpoint can expose them all at once:

// lib/metrics.ts
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

export const registry = new Registry();

export const llmLatency = new Histogram({
  name: 'cofounder_llm_latency_seconds',
  help: 'LLM call latency in seconds',
  labelNames: ['provider', 'model', 'status'],
  buckets: [0.5, 1, 2, 5, 10, 30],
  registers: [registry],
});

export const tokenUsage = new Counter({
  name: 'cofounder_token_usage_total',
  help: 'Total tokens consumed',
  labelNames: ['provider', 'model', 'direction'], // direction: input | output
  registers: [registry],
});

export const agentSteps = new Histogram({
  name: 'cofounder_agent_steps',
  help: 'Number of steps per agent execution',
  labelNames: ['agent_type'],
  buckets: [1, 2, 3, 5, 8, 10, 15, 20],
  registers: [registry],
});

export const requestCost = new Counter({
  name: 'cofounder_request_cost_usd',
  help: 'Estimated cost in USD per request',
  labelNames: ['provider', 'model', 'user_tier'],
  registers: [registry],
});

export const activeAgents = new Gauge({
  name: 'cofounder_active_agents',
  help: 'Number of currently executing agents',
  registers: [registry],
});

Instrumenting LLM Calls

Wrap your LLM client calls with metric recording. This wrapper works with any provider whose response includes a usage object, and it captures latency, tokens, and cost automatically:

// lib/llm-instrumented.ts
import { llmLatency, tokenUsage, requestCost } from './metrics';

const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0025, output: 0.01 },         // per 1K tokens
  'claude-sonnet-4-20250514': { input: 0.003, output: 0.015 },
};

export async function instrumentedLLMCall<
  T extends { usage?: { prompt_tokens: number; completion_tokens: number } }
>(
  provider: string,
  model: string,
  callFn: () => Promise<T>
): Promise<T> {
  const timer = llmLatency.startTimer({ provider, model });

  try {
    const result = await callFn();
    timer({ status: 'success' });

    // Record token usage
    const usage = result.usage;
    if (usage) {
      tokenUsage.inc({ provider, model, direction: 'input' }, usage.prompt_tokens);
      tokenUsage.inc({ provider, model, direction: 'output' }, usage.completion_tokens);

      // Calculate cost
      const pricing = MODEL_PRICING[model];
      if (pricing) {
        const cost =
          (usage.prompt_tokens / 1000) * pricing.input +
          (usage.completion_tokens / 1000) * pricing.output;
        requestCost.inc({ provider, model, user_tier: 'standard' }, cost);
      }
    }

    return result;
  } catch (error) {
    timer({ status: 'error' });
    throw error;
  }
}
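
The cost arithmetic inside the try block can also be factored into a small pure helper, which is easy to unit-test in isolation. estimateCostUSD is a hypothetical name; the pricing table mirrors the one above (per 1K tokens):

```typescript
// Hypothetical helper mirroring MODEL_PRICING above (USD per 1K tokens)
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'claude-sonnet-4-20250514': { input: 0.003, output: 0.015 },
};

export function estimateCostUSD(
  model: string,
  promptTokens: number,
  completionTokens: number
): number | null {
  const pricing = PRICING[model];
  if (!pricing) return null; // unknown model: let the caller decide how to handle
  return (
    (promptTokens / 1000) * pricing.input +
    (completionTokens / 1000) * pricing.output
  );
}
```

For example, 1,000 prompt tokens plus 500 completion tokens on gpt-4o comes out to about $0.0075. Returning null for an unknown model keeps silent mispricing out of your metrics.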

Prometheus and Grafana Setup

Expose a /api/metrics endpoint that Prometheus can scrape, then visualize everything in Grafana. Add both services to your docker-compose for local development:

// app/api/metrics/route.ts
import { registry } from '@/lib/metrics';
import { NextResponse } from 'next/server';

// Opt out of static caching so each scrape returns fresh metrics
export const dynamic = 'force-dynamic';

export async function GET() {
  const metrics = await registry.metrics();
  return new NextResponse(metrics, {
    headers: { 'Content-Type': registry.contentType },
  });
}

# --- prometheus.yml ---
scrape_configs:
  - job_name: 'cofounder-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/api/metrics'
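
For local development, a minimal docker-compose fragment wiring in both services might look like the following; the image tags, host ports, and Grafana admin password are placeholder choices, and prometheus.yml is assumed to sit in the project root:

```yaml
# docker-compose.yml (monitoring services only -- illustrative)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change outside local dev
    ports:
      - '3001:3000'  # Grafana UI on host port 3001, since the app uses 3000
    depends_on:
      - prometheus
```

In Grafana, add Prometheus as a data source pointing at http://prometheus:9090 and build dashboards from the cofounder_* metrics.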

Alerting Rules

Configure alerts for conditions that indicate real problems, not noise. AI-specific alerts are critical because a silent provider outage or runaway agent can burn through your budget quickly:

# prometheus-alerts.yml
groups:
  - name: cofounder-alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, sum by (le, provider, model) (rate(cofounder_llm_latency_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM p95 latency above 10s for 5 minutes"

      - alert: HighErrorRate
        expr: rate(cofounder_llm_latency_seconds_count{status="error"}[5m]) / rate(cofounder_llm_latency_seconds_count[5m]) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate above 10%"

      - alert: BudgetThresholdReached
        # counters reset on process restart, so compare a 30-day increase
        # rather than the raw cumulative sum
        expr: sum(increase(cofounder_request_cost_usd[30d])) > 400
        labels:
          severity: warning
        annotations:
          summary: "LLM cost over the last 30 days approaching $500 budget"

      - alert: AgentStepCountHigh
        expr: histogram_quantile(0.95, sum by (le, agent_type) (rate(cofounder_agent_steps_bucket[15m]))) > 12
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agents are taking too many steps — possible loop detected"
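
Firing rules only become notifications once Prometheus is pointed at an Alertmanager instance that routes by severity. A minimal routing sketch, with placeholder Slack webhook URLs and channel names:

```yaml
# alertmanager.yml -- illustrative routing sketch
route:
  receiver: slack-warnings        # default for severity: warning
  routes:
    - match:
        severity: critical
      receiver: slack-critical    # page-worthy alerts go to a separate channel

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        channel: '#cofounder-alerts'
  - name: slack-critical
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        channel: '#cofounder-incidents'
```

Splitting warning and critical receivers keeps budget nudges out of the channel your team treats as a pager.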