Caching Strategies
LLM calls are expensive and often slow. A well-designed caching strategy can dramatically reduce costs and latency. This lesson covers Redis integration for caching agent responses, TTL strategies for different content types, cache invalidation patterns, and CoFounder's built-in caching layer.
Redis Cache Integration
Redis is the ideal backend for caching LLM responses: it is fast, supports TTL natively, and can handle the concurrent access patterns typical of web applications. CoFounder provides a Redis cache adapter that plugs into the agent pipeline transparently.
import { createAgent, RedisCacheAdapter } from '@waymakerai/aicofounder-core';
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
const cache = new RedisCacheAdapter({
  client: redis,
  prefix: 'agent:cache:',
  defaultTTL: 3600, // 1 hour
});
const agent = createAgent({
  model: 'gpt-4o',
  cache: {
    adapter: cache,
    // Cache key is derived from the prompt + model + system prompt
    keyStrategy: 'content-hash',
    // Only cache successful completions
    cacheErrors: false,
  },
});
// First call hits the LLM
const result1 = await agent.run('What is the capital of France?');
console.log(result1.cached); // false
// Second identical call returns from cache
const result2 = await agent.run('What is the capital of France?');
console.log(result2.cached); // true
console.log(result2.latencyMs); // ~2ms vs ~800ms

TTL Strategies
Different types of content need different TTL values. Factual knowledge queries can be cached for hours or days. Creative outputs should have shorter TTLs or not be cached at all, since users expect unique responses. Time-sensitive queries (weather, stock prices) need very short TTLs or should bypass the cache entirely.
CoFounder lets you define TTL strategies per prompt category. The cache adapter accepts a ttlResolver function that examines the prompt and response to determine the appropriate TTL. This lets you implement intelligent caching that adapts to the content.
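As a sketch of the idea, a ttlResolver can be a pure function that classifies the prompt and returns a TTL in seconds (with 0 meaning "do not cache"). The signature and the category rules below are illustrative assumptions, not CoFounder's exact API:

```typescript
// Illustrative ttlResolver: maps prompt categories to TTLs in seconds.
// The signature (prompt in, seconds out, 0 = bypass cache) is an assumption
// based on the lesson text; the real adapter may also receive the response.
type TtlResolver = (prompt: string) => number;

const ttlResolver: TtlResolver = (prompt) => {
  const p = prompt.toLowerCase();
  // Time-sensitive queries (weather, prices): bypass the cache entirely
  if (/\b(weather|stock|price|today|now)\b/.test(p)) return 0;
  // Creative outputs: short TTL so users still see variety
  if (/\b(write|compose|brainstorm|poem|story)\b/.test(p)) return 300; // 5 min
  // Factual knowledge: safe to cache for a day
  return 86_400;
};

console.log(ttlResolver('What is the capital of France?')); // 86400
console.log(ttlResolver("What's the weather now in Paris?")); // 0
```

In practice the keyword heuristics would be replaced by whatever classification your application already has (intent labels, route names, or an explicit category passed alongside the prompt).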
Cache Invalidation
Cache invalidation is famously one of the two hard problems in computer science. For LLM caches, the main triggers for invalidation are: model updates (when the provider ships a new model version), context changes (when the system prompt or tools change), and data freshness (when underlying data the agent references is updated).
import { RedisCacheAdapter } from '@waymakerai/aicofounder-core';
const cache = new RedisCacheAdapter({ client: redis, prefix: 'agent:cache:' });
// Invalidate by pattern when system prompt changes
await cache.invalidateByPattern('agent:cache:gpt-4o:sys-v2:*');
// Invalidate specific keys when source data changes
await cache.invalidateByTag('datasource:products');
// Version-based invalidation: include version in cache key
const agent = createAgent({
  model: 'gpt-4o',
  cache: {
    adapter: cache,
    keyStrategy: 'content-hash',
    version: 'v3', // Bump this when system prompt changes
  },
});
// Scheduled cache warming for common queries
async function warmCache(commonQueries: string[]) {
  for (const query of commonQueries) {
    await agent.run(query); // Populates cache
  }
}

CoFounder's Built-in Cache Layer
CoFounder ships with a two-tier caching system: an in-memory LRU cache for the current process and an optional Redis layer for distributed caching. The in-memory cache handles repeated queries within a single server instance with sub-millisecond latency, while Redis provides cache sharing across instances in a horizontally scaled deployment.
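The two-tier lookup can be sketched as an in-process LRU in front of a shared store. The shared tier is simulated with a Map here; in CoFounder it would be the Redis adapter. The class and method names are illustrative, not the library's actual API:

```typescript
// Sketch of a two-tier cache: local LRU first, shared store second.
// JavaScript Maps preserve insertion order, which makes a simple LRU easy.
class TieredCache {
  private lru = new Map<string, string>();
  constructor(private shared: Map<string, string>, private maxLocal = 1000) {}

  get(key: string): string | undefined {
    // Tier 1: in-process LRU (sub-millisecond)
    if (this.lru.has(key)) {
      const value = this.lru.get(key)!;
      this.lru.delete(key);
      this.lru.set(key, value); // re-insert to mark as most recently used
      return value;
    }
    // Tier 2: shared store; promote hits into the local tier
    const value = this.shared.get(key);
    if (value !== undefined) this.set(key, value);
    return value;
  }

  set(key: string, value: string): void {
    if (this.lru.size >= this.maxLocal) {
      // Evict the least recently used entry (first key in insertion order)
      this.lru.delete(this.lru.keys().next().value!);
    }
    this.lru.set(key, value); // write-through to both tiers
    this.shared.set(key, value);
  }
}
```

Writes go through to both tiers, so a horizontally scaled deployment sees new entries via the shared tier while each instance keeps its own hot set locally.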
The built-in cache is content-addressable: the cache key is a hash of the prompt, system message, model identifier, temperature, and tool definitions. This means identical requests always hit the cache, regardless of which user or session made the request. For personalized responses, you can include user-specific context in the cache key to prevent cross-user cache contamination.
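Content-addressable key derivation can be sketched as a SHA-256 over the fields the lesson lists. The exact field set, serialization, and key format here are assumptions; CoFounder's internal hashing may differ:

```typescript
// Illustrative content-hash key derivation. Field names are assumptions
// based on the lesson; only the general technique (hash all inputs that
// affect the response) is the point.
import { createHash } from 'node:crypto';

interface CacheKeyInput {
  prompt: string;
  systemPrompt: string;
  model: string;
  temperature: number;
  toolDefinitions?: string[];
  userContext?: string; // include to partition the cache per user
}

function cacheKey(input: CacheKeyInput): string {
  const hash = createHash('sha256')
    .update(JSON.stringify(input))
    .digest('hex');
  return `agent:cache:${input.model}:${hash}`;
}
```

Identical inputs always hash to the same key, and adding userContext changes the hash, which is exactly what prevents cross-user cache contamination for personalized responses.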
Measuring Cache Effectiveness
CoFounder exposes cache metrics through its cost tracking system: hit rate, miss rate, average latency saved, and estimated cost savings. Monitor these metrics to tune your TTL values and identify opportunities for cache warming. A well-tuned cache typically achieves 40-60% hit rates for production applications, translating directly to cost and latency reductions.
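The arithmetic behind those metrics is simple enough to sketch directly. The counter names below are illustrative; CoFounder's cost-tracking API may expose them differently:

```typescript
// Back-of-the-envelope cache effectiveness metrics from raw counters.
interface CacheStats {
  hits: number;
  misses: number;
  avgLlmLatencyMs: number;   // typical uncached completion latency
  avgCacheLatencyMs: number; // typical cache-hit latency
  costPerLlmCallUsd: number; // average cost of one uncached call
}

function summarize(s: CacheStats) {
  const total = s.hits + s.misses;
  return {
    hitRate: total === 0 ? 0 : s.hits / total,
    latencySavedMs: s.hits * (s.avgLlmLatencyMs - s.avgCacheLatencyMs),
    costSavedUsd: s.hits * s.costPerLlmCallUsd,
  };
}

const report = summarize({
  hits: 450, misses: 550,
  avgLlmLatencyMs: 800, avgCacheLatencyMs: 2,
  costPerLlmCallUsd: 0.01,
});
console.log(report.hitRate); // 0.45
```

At the 40-60% hit rates the lesson cites, even a modest per-call cost compounds quickly: in the example above, 450 cached hits save roughly $4.50 and about six minutes of cumulative latency.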