Technical · AI · Production · Architecture · Lessons Learned

Running AI Agents in Production: Hard-Won Lessons

What I learned deploying AI agents at scale and the mistakes that cost me thousands

4 min read • 773 words
Max Techera
@maxtechera

After burning through $10k in API costs in a single weekend, I learned that running AI agents in production is nothing like local development.

Here's what I wish I'd known before deploying my first agent swarm.

The $10k Weekend

It started innocently:

# What could go wrong?
async def run_agents():
    while True:
        task = get_next_task()
        result = await agent.complete(task)
        if result.needs_clarification:
            clarification = await agent.clarify(result)
            # Recursive clarification... what could go wrong? 🤦

Spoiler: Everything could go wrong.

Lesson 1: Always Set Limits

Token Limits

const MAX_TOKENS_PER_REQUEST = 4000;
const MAX_REQUESTS_PER_MINUTE = 20;
const MAX_COST_PER_HOUR = 50; // USD
 
async function callAgent(prompt: string) {
  // checkLimits() returns true while we're under every limit
  if (await checkLimits()) {
    return await agent.complete(prompt, { 
      maxTokens: MAX_TOKENS_PER_REQUEST 
    });
  }
  
  throw new Error("Rate limit exceeded");
}
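
The checkLimits call above is doing the real work. Here's a minimal in-memory sketch of it, assuming the constants from the snippet and a single-process deployment (back it with Redis or similar once you run multiple instances):

// Minimal sketch of checkLimits: sliding windows for request count and spend.
// Call recordCost() after every completion with the real cost in USD.
const requestTimes: number[] = [];
let hourlySpend = 0;
let hourStart = Date.now();

function recordCost(usd: number) {
  // Reset the spend window every hour
  if (Date.now() - hourStart > 60 * 60 * 1000) {
    hourlySpend = 0;
    hourStart = Date.now();
  }
  hourlySpend += usd;
}

async function checkLimits(): Promise<boolean> {
  const now = Date.now();
  // Drop requests older than one minute from the window
  while (requestTimes.length && now - requestTimes[0] > 60_000) {
    requestTimes.shift();
  }
  if (requestTimes.length >= MAX_REQUESTS_PER_MINUTE) return false;
  if (hourlySpend >= MAX_COST_PER_HOUR) return false;
  requestTimes.push(now);
  return true;
}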

Recursion Depth

MAX_CLARIFICATION_DEPTH = 3
 
async def get_clarification(result, depth=0):
    if depth >= MAX_CLARIFICATION_DEPTH:
        return result  # Give up gracefully
    
    clarification = await agent.clarify(result)
    if still_unclear(clarification):
        return await get_clarification(clarification, depth + 1)
    
    return clarification

Lesson 2: Context Window Management

The biggest production killer? Context overflow.

The Problem

  • User starts conversation
  • Context grows with each message
  • Eventually hits token limit
  • Agent forgets the earlier messages mid-conversation

The Solution

class ContextManager {
  private maxTokens = 8000;
  private buffer = 1000; // Safety margin
  
  async pruneContext(messages: Message[]) {
    const totalTokens = this.countTokens(messages);
    
    if (totalTokens > this.maxTokens - this.buffer) {
      // Smart pruning strategies
      return this.intelligentPrune(messages);
    }
    
    return messages;
  }
  
  private intelligentPrune(messages: Message[]) {
    // 1. Summarize old conversations
    // 2. Keep recent messages intact
    // 3. Preserve critical context
    return this.hybridStrategy(messages);
  }
}
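
countTokens and intelligentPrune are deliberately abstract above. Here's a rough sketch of the hybrid strategy as standalone functions, assuming the same Message shape and a summarize() helper that calls a cheap model; the characters-divided-by-4 token estimate is a crude stand-in for a real tokenizer like tiktoken:

// Rough token count: ~4 characters per token for English text
function countTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

async function hybridPrune(messages: Message[]): Promise<Message[]> {
  const KEEP_RECENT = 10; // keep the last 10 messages verbatim
  const old = messages.slice(0, -KEEP_RECENT);
  const recent = messages.slice(-KEEP_RECENT);
  if (old.length === 0) return recent;

  // Collapse all older history into a single summary message
  const summary: Message = {
    role: "system",
    content: await summarize(old), // assumed helper: cheap-model summarization
  };
  return [summary, ...recent];
}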

Lesson 3: Error Handling That Actually Works

In production, errors aren't exceptionsβ€”they're the norm.

Common Failure Modes

  1. API Rate Limits: 429 errors everywhere
  2. Timeout Issues: Long-running completions
  3. Malformed Responses: AI hallucinates JSON
  4. Service Outages: When the API is down

Robust Error Handling

class ResilientAgent {
  async complete(prompt: string, retries = 3) {
    for (let i = 0; i < retries; i++) {
      try {
        return await this.tryComplete(prompt);
      } catch (error) {
        if (error.code === 429) {
          // Rate limit - exponential backoff
          await this.sleep(Math.pow(2, i) * 1000);
        } else if (error.code === 'TIMEOUT') {
          // Timeout - try with smaller context
          prompt = this.reducePrompt(prompt);
        } else if (error.code === 'MALFORMED') {
          // Bad response - try different model
          return await this.fallbackModel(prompt);
        } else {
          // Unknown error - don't retry blindly
          throw error;
        }
      }
    }
    
    // All retries failed
    return this.gracefulDegradation(prompt);
  }
}
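
What gracefulDegradation returns depends on your product, but the principle is: return something rather than nothing. A sketch, assuming a cache like the SmartCache from Lesson 5 below (shown standalone here; in practice it's a method on ResilientAgent):

// Serve a stale cached answer if we have one, otherwise a canned reply
async function gracefulDegradation(prompt: string, cache: SmartCache) {
  const stale = await cache.get(prompt); // a stale answer beats no answer
  if (stale) return { text: stale, degraded: true };

  return {
    text: "I'm having trouble right now. Please try again in a minute.",
    degraded: true,
  };
}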

Lesson 4: Monitoring and Observability

You can't fix what you can't see.

Key Metrics

interface AgentMetrics {
  // Performance
  responseTime: number;
  tokensUsed: number;
  costPerRequest: number;
  
  // Reliability
  successRate: number;
  errorRate: number;
  timeoutRate: number;
  
  // Business
  taskCompletionRate: number;
  userSatisfaction: number;
}

Real-time Dashboards

Build dashboards that show:

  • Current API spend rate
  • Error rates by type
  • Response time percentiles
  • Context window usage
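
The spend rate is the number that would have saved me that weekend. A minimal sketch of how to feed it, with illustrative per-token prices (check your provider's current pricing page):

// Minimal spend tracking to feed a dashboard. Prices are illustrative
// assumptions, not your provider's actual rates.
const PRICE_PER_1K_TOKENS: Record<string, number> = {
  "gpt-4": 0.03,
  "gpt-3.5-turbo": 0.002,
};

interface SpendSample {
  timestamp: number;
  costUsd: number;
}

const samples: SpendSample[] = [];

function recordUsage(model: string, tokens: number) {
  const rate = PRICE_PER_1K_TOKENS[model] ?? 0.03; // default to worst case
  samples.push({ timestamp: Date.now(), costUsd: (tokens / 1000) * rate });
}

// Dollars per hour, extrapolated from the trailing 5 minutes
function spendRatePerHour(): number {
  const cutoff = Date.now() - 5 * 60 * 1000;
  const recent = samples.filter((s) => s.timestamp >= cutoff);
  const total = recent.reduce((sum, s) => sum + s.costUsd, 0);
  return total * 12; // 5 minutes x 12 = 1 hour
}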

Lesson 5: Cost Optimization

Running agents efficiently requires constant optimization.

Strategies That Work

1. Model Selection

function selectModel(task: Task) {
  if (task.complexity === 'simple') {
    return 'gpt-3.5-turbo'; // 10x cheaper
  } else if (task.requiresReasoning) {
    return 'gpt-4'; // When you need it
  } else if (task.requiresSpeed) {
    return 'claude-instant'; // Fast and capable
  }
  return 'gpt-3.5-turbo'; // Default to the cheap option
}

2. Caching

class SmartCache {
  async get(prompt: string) {
    // Check exact match
    const exact = await this.exactMatch(prompt);
    if (exact) return exact;
    
    // Check semantic similarity
    const similar = await this.semanticSearch(prompt);
    if (similar && similar.score > 0.95) {
      return similar.response;
    }
    
    return null;
  }
}
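
Wiring the cache in is plain cache-aside. A sketch, assuming SmartCache also exposes a set() and reusing callAgent from Lesson 1:

// Cache-aside wrapper around the agent call
const cache = new SmartCache();

async function cachedComplete(prompt: string) {
  const hit = await cache.get(prompt);
  if (hit) return hit; // no API call, no cost

  const response = await callAgent(prompt);
  await cache.set(prompt, response); // set() is assumed alongside get()
  return response;
}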

3. Batch Processing

async function batchProcess(tasks: Task[]) {
  // Group similar tasks
  const batches = groupBySimilarity(tasks);
  
  // Process each batch with shared context
  return Promise.all(
    batches.map(batch => processBatch(batch))
  );
}
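
groupBySimilarity can get as fancy as embedding clustering, but a cheap first pass is bucketing by task type so each batch shares a system prompt. A sketch, assuming tasks carry a type field:

// Cheapest useful grouping: bucket by task type. Embedding-based
// clustering is a later optimization.
function groupBySimilarity(tasks: Task[]): Task[][] {
  const buckets = new Map<string, Task[]>();
  for (const task of tasks) {
    const bucket = buckets.get(task.type) ?? [];
    bucket.push(task);
    buckets.set(task.type, bucket);
  }
  return [...buckets.values()];
}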

The Architecture That Scales

After many iterations, here's what works:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client    │────▢│ Rate Limiter │────▢│   Queue    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                               β”‚
                                               β–Ό
                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚   Router    β”‚
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                          β”‚    β”‚    β”‚
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β–Ό               β–Ό    β–Ό    β–Ό               β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ Agent 1  β”‚   β”‚ Agent 2  β”‚  β”‚ Agent 3  β”‚ β”‚ Fallback β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Takeaways

  1. Production !== Development: What works locally will break at scale
  2. Limits Everything: Tokens, recursion, time, cost
  3. Monitor Obsessively: You need to see problems before users do
  4. Plan for Failure: It's not if, but when
  5. Optimize Continuously: Small improvements compound

What's Next?

I'm working on open-sourcing our agent orchestration framework. It includes:

  • Automatic retry logic
  • Cost controls
  • Context management
  • Performance monitoring

Follow for updates!


Building reliable AI systems, one failure at a time. What production lessons have you learned?
