Lesson 11 of 15

Cost Optimization Techniques

LLM costs can spiral quickly at scale. A well-optimized application can reduce costs by 60-80% without sacrificing quality. This lesson covers practical techniques: prompt compression, intelligent model routing, caching, token budgets, and using CoFounder's cost tracker to identify optimization opportunities.

Prompt Compression

Every token in your prompt costs money, and context windows have limits. Prompt compression reduces the size of your inputs without losing meaning. Techniques include: summarizing conversation history instead of sending full transcripts, stripping unnecessary formatting and whitespace, using abbreviations in system prompts, and compressing RAG context to only the most relevant passages.

import { createAgent, compressHistory } from '@waymakerai/aicofounder-core';

const agent = createAgent({
  model: 'gpt-4o',
  contextManagement: {
    // Compress conversation history when it exceeds 4k tokens
    maxHistoryTokens: 4000,
    compressionStrategy: 'summarize',
    // Keep the last 3 messages verbatim, summarize older ones
    keepRecentMessages: 3,
  },
});

// Manual compression for custom scenarios
const longHistory = await getConversationHistory(sessionId);
const compressed = await compressHistory(longHistory, {
  targetTokens: 2000,
  preserveToolCalls: true,  // Keep tool call/results intact
  preserveSystemContext: true,
});

const result = await agent.run('Continue our conversation', {
  history: compressed,
});

console.log(`Saved ${longHistory.tokenCount - compressed.tokenCount} tokens`);

Model Routing by Complexity

Not every request needs GPT-4o. Simple questions, formatting tasks, and classification can use smaller, cheaper models. A complexity router analyzes the incoming request and routes it to the most cost-effective model that can handle it well.

import { createAgent, ModelRouter } from '@waymakerai/aicofounder-core';

const router = new ModelRouter({
  rules: [
    {
      // Simple factual queries -> cheapest model
      condition: (prompt) => prompt.length < 100 && !prompt.includes('code'),
      model: 'gpt-4o-mini',
      costPerMToken: 0.15,
    },
    {
      // Code generation -> best model
      condition: (prompt) => prompt.includes('code') || prompt.includes('function'),
      model: 'gpt-4o',
      costPerMToken: 2.50,
    },
    {
      // Long analysis tasks -> balanced model
      condition: (prompt) => prompt.length > 500,
      model: 'claude-3-5-sonnet',
      costPerMToken: 3.00,
    },
  ],
  default: 'gpt-4o-mini',
});

const agent = createAgent({
  modelRouter: router,
});

// Automatically routed to the best model for the task
const result = await agent.run('What is 2+2?');
console.log(result.model); // 'gpt-4o-mini' - cheapest option
console.log(result.cost);  // $0.0001

Token Budgets

Set hard limits on how many tokens a single request, user, or session can consume. Token budgets prevent runaway costs from verbose prompts, recursive agent loops, or malicious input. CoFounder enforces budgets at multiple levels: per-request, per-session, per-user daily, and global monthly.

When a budget is exceeded, the agent can fail with a clear error, truncate its response at the budget limit, or fall back to a cheaper model to stay within budget. The right strategy depends on your application's requirements.

CoFounder Cost Tracker

CoFounder's cost tracker records every LLM call with its token usage, model, latency, and calculated cost. This data is stored in Supabase and exposed through a dashboard API. Use it to identify your most expensive prompts, track cost trends over time, and set up alerts when spending exceeds thresholds.

The cost tracker integrates with the caching layer to show you exactly how much you are saving. It also provides per-user cost breakdowns, letting you implement usage-based billing or identify users who are disproportionately expensive to serve.
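The record shape the tracker stores can be sketched as below. The field names, price table, and helper are assumptions for illustration (check current per-model prices against your provider), but the cost arithmetic is the standard input/output split:

```typescript
// Illustrative per-call cost record; not CoFounder's actual Supabase schema.
interface CostRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
}

// Example per-million-token prices (input / output); verify against your provider.
const PRICES: Record<string, { inPerM: number; outPerM: number }> = {
  'gpt-4o': { inPerM: 2.5, outPerM: 10.0 },
  'gpt-4o-mini': { inPerM: 0.15, outPerM: 0.6 },
};

function toCostRecord(
  model: string,
  inputTokens: number,
  outputTokens: number,
  latencyMs: number,
): CostRecord {
  const p = PRICES[model];
  if (!p) throw new Error(`Unknown model: ${model}`);
  const costUsd =
    (inputTokens / 1_000_000) * p.inPerM +
    (outputTokens / 1_000_000) * p.outPerM;
  return { model, inputTokens, outputTokens, latencyMs, costUsd };
}

const rec = toCostRecord('gpt-4o', 1_200, 400, 850);
console.log(rec.costUsd.toFixed(6)); // "0.007000" — $0.003 input + $0.004 output
```

Note that output tokens are typically several times more expensive than input tokens, so prompts that elicit concise responses save more than raw input compression alone.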

Practical Cost Reduction Checklist

Apply these optimizations in order of impact:

1. Implement caching (40-60% cost reduction for typical applications).
2. Route simple queries to smaller models (20-30% additional savings).
3. Compress conversation history (10-20% savings on long conversations).
4. Set token budgets to prevent outlier costs.
5. Audit your system prompts quarterly to remove unnecessary instructions that consume tokens on every request.
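These savings compound multiplicatively rather than adding up, since each technique applies to the cost left over after the previous one. Taking rough midpoints of the ranges above (an assumption for illustration):

```typescript
// Compounding the checklist's savings estimates, using assumed midpoints:
// caching ~50%, model routing ~25%, history compression ~15%.
const reductions = [0.5, 0.25, 0.15];

// Each optimization reduces the cost remaining after the previous one.
const remaining = reductions.reduce((cost, r) => cost * (1 - r), 1.0);

console.log(`~${Math.round((1 - remaining) * 100)}% total reduction`);
// prints "~68% total reduction"
```

That lands squarely in the 60-80% range quoted at the start of this lesson, without requiring any single technique to do all the work.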