
Production Tracing

Deploy AI applications with confidence using Evaligo's production tracing. Monitor performance, catch issues early, and understand real-world usage patterns through comprehensive observability.

Production tracing bridges the gap between development evaluation and live performance. While experiments help you build great AI features, tracing ensures they stay great once deployed to real users with real data at real scale.

This guide walks you through setting up tracing for your AI application, from initial SDK integration to advanced monitoring and alerting. You'll instrument your code, configure sampling, and create dashboards that help you maintain quality in production.

Tracing works with any LLM provider, any application architecture, and any deployment environment. Whether you're running a simple chatbot or a complex multi-agent system, these patterns will help you understand what's happening in production.

Why Production Tracing Matters

Even the best laboratory evaluation can't predict every real-world scenario. Users find edge cases you didn't anticipate, model performance shifts over time, and infrastructure issues can degrade quality in subtle ways.

Production tracing captures this reality, giving you data-driven insights into how your AI performs with actual users. It's your early warning system for quality degradation, cost spikes, and emerging edge cases.

Screenshot: production tracing dashboard overview with real-time performance monitoring

Prerequisites

Before setting up tracing, ensure you have the following components ready for integration.

1. Evaligo Project: Create a project in Evaligo to organize your production traces.

2. API Key: Generate a production API key with tracing permissions from your project settings.

3. Application Access: Ensure you can modify your AI application code to add SDK instrumentation.

Info
Start with a subset of production traffic (10-20%) to validate your tracing setup before instrumenting all requests. This helps you tune sampling rates and verify data quality.

Step 1: Install and Initialize the SDK

The Evaligo SDK provides automatic instrumentation for popular LLM providers and frameworks. Install it in your application environment and configure it with your project credentials.

The SDK automatically captures request/response data, timing information, token usage, and errors. You can extend this with custom metadata to track user sessions, feature flags, or business-specific context.

# Install the Evaligo SDK
npm install @evaligo/tracing
# or
pip install evaligo-tracing

// Basic initialization in your application
import { EvaligoTracer } from '@evaligo/tracing'

const tracer = new EvaligoTracer({
  apiKey: process.env.EVALIGO_API_KEY,
  projectId: process.env.EVALIGO_PROJECT_ID,
  environment: 'production', // or 'staging', 'dev'
  serviceName: 'customer-support-bot',
  version: '1.2.0'
})

// Initialize tracing (call once at app startup)
await tracer.init()

Step 2: Instrument Your LLM Calls

Wrap your existing LLM calls with Evaligo's tracing decorators. This captures the complete request lifecycle including prompts, responses, metadata, and performance metrics.

The SDK supports auto-instrumentation for OpenAI, Anthropic, AWS Bedrock, and other major providers. For custom providers or complex workflows, use manual instrumentation to capture exactly what you need; a sketch follows the auto-instrumentation example below.

Include relevant metadata with each trace to enable powerful filtering and analysis later. User IDs, session identifiers, feature flags, and request types all help you understand patterns in your data.

// Auto-instrumentation (recommended)
import { OpenAI } from 'openai'
import { instrument } from '@evaligo/tracing'

const openai = instrument(new OpenAI({ 
  apiKey: process.env.OPENAI_API_KEY 
}))

// Your existing code works unchanged
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: userQuery }],
  // Evaligo metadata (optional)
  metadata: {
    userId: req.user.id,
    sessionId: req.sessionId,
    feature: 'customer-support',
    priority: 'high'
  }
})
Screenshot: trace detail view showing an instrumented call with its metadata
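
For custom providers, a manual span can capture the same lifecycle. The sketch below is illustrative only: the startSpan(), setAttributes(), recordError(), and end() methods, along with inHouseClient, are assumed names, so check the SDK reference for the actual manual-instrumentation API.

// Manual instrumentation sketch for a provider without auto-instrumentation.
// The tracer span methods and inHouseClient here are hypothetical names.
async function callCustomModel(prompt: string): Promise<string> {
  // Open a span around the custom call (uses the tracer from Step 1)
  const span = tracer.startSpan('custom-llm-call', {
    metadata: { provider: 'in-house', feature: 'customer-support' }
  })
  try {
    const result = await inHouseClient.complete(prompt) // your existing call
    span.setAttributes({
      inputTokens: result.usage.inputTokens,
      outputTokens: result.usage.outputTokens
    })
    return result.text
  } catch (err) {
    span.recordError(err) // attach the failure to the trace
    throw err
  } finally {
    span.end() // always close the span, even when the call throws
  }
}

Closing the span in the finally block guarantees the trace completes even when the provider call fails, which also prevents the incomplete-data issue described under Troubleshooting.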

Step 3: Configure Sampling and Performance

Production systems generate large volumes of traces. Smart sampling balances observability needs with performance and cost constraints while ensuring you capture representative data.

Evaligo supports multiple sampling strategies: percentage-based for uniform coverage, rate-limiting for high-volume endpoints, and intelligent sampling that prioritizes errors and outliers.

Configure different sampling rates for different parts of your application. Critical user flows might trace at 100%, while background processes might sample at 1%.

// Configure sampling strategies
const tracer = new EvaligoTracer({
  // ... other config
  sampling: {
    // Default sampling rate (10% of all requests)
    defaultRate: 0.1,
    
    // Always trace errors and slow requests
    alwaysTrace: {
      errors: true,
      slowRequests: true, // > 5 seconds
      highCost: true      // > $0.10 per request
    },
    
    // Per-endpoint configuration
    rules: [
      { pattern: '/api/support/*', rate: 0.5 },    // High-value user flows
      { pattern: '/api/internal/*', rate: 0.01 },  // Background jobs
      { pattern: '/health', rate: 0 }              // Health checks
    ]
  }
})
Warning
Tracing adds minimal overhead (typically <1ms per request), but high sampling rates on high-volume endpoints can impact performance. Start conservative and increase sampling as needed.

Step 4: Verify and Monitor Your Traces

Once tracing is deployed, verify data is flowing correctly and set up monitoring to catch issues proactively. The Evaligo dashboard provides real-time visibility into your application's performance.

Create saved queries for common investigations like error patterns, slow requests, high-cost operations, and user-specific issues. Set up alerts to notify you when key metrics exceed thresholds.
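
Alerts can be created in the dashboard; if you prefer to manage them in code, the snippet below is a rough sketch that assumes a REST-style alerts endpoint. The URL, payload shape, and field names are all assumptions rather than the confirmed API, so verify them against your project's API reference.

// Hypothetical sketch: creating an alert over a REST API.
// The endpoint URL and payload fields are assumed, not confirmed.
const res = await fetch('https://api.evaligo.com/v1/alerts', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.EVALIGO_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    name: 'error-rate-spike',
    query: 'environment:production AND status:error', // saved-query syntax assumed
    condition: { metric: 'error_rate', operator: '>', threshold: 0.05 },
    window: '5m',
    notify: ['slack:#ai-oncall']
  })
})
if (!res.ok) throw new Error(`Alert creation failed: ${res.status}`)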

Video: setting up production monitoring dashboards and alerts

Screenshot: real-time trace monitoring dashboard
Screenshot: alert configuration and saved queries

Troubleshooting Common Issues

Production tracing setups can encounter various issues. Here are solutions to the most common problems teams face when getting started.

Missing Traces

Check API key permissions, network connectivity, and sampling configuration. Verify the SDK is initialized before making LLM calls.
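
A quick way to isolate the problem is a short script that makes one traced call with verbose logging and an explicit flush. The debug option and flush() method below are assumptions about the SDK surface; confirm their names in the reference docs.

// Debugging sketch: verbose export logging plus an explicit flush.
// The debug flag and flush() method are assumed SDK features.
import { EvaligoTracer } from '@evaligo/tracing'

const tracer = new EvaligoTracer({
  apiKey: process.env.EVALIGO_API_KEY,
  projectId: process.env.EVALIGO_PROJECT_ID,
  environment: 'staging',
  serviceName: 'trace-smoke-test',
  debug: true // log every export attempt and any API rejections
})
await tracer.init()

// ... make a single instrumented LLM call here ...

// Flush buffered spans before exit so short-lived scripts don't drop traces.
await tracer.flush()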

High Overhead

Reduce sampling rates, increase batch sizes, or implement more selective tracing based on request characteristics.
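
As a sketch of what that tuning could look like, assuming the SDK exposes batching knobs along these lines (the export option names are illustrative):

// Illustrative overhead tuning: lower default sampling, larger batches.
// The export.* option names are assumptions about the SDK config.
const tracer = new EvaligoTracer({
  // ... other config
  sampling: { defaultRate: 0.02 }, // sample 2% instead of 10%
  export: {
    maxBatchSize: 512,     // send more spans per network request
    flushIntervalMs: 10000 // flush buffered spans less often
  }
})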

Incomplete Data

Ensure all LLM providers are instrumented and custom spans are properly closed. Check for exceptions that might prevent trace completion.
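
For example, an app that calls both OpenAI and Anthropic needs each client wrapped; assuming the same instrument() helper from Step 2 covers the Anthropic SDK:

// Every provider client must be wrapped, or its calls never produce traces.
// Assumes instrument() supports the Anthropic client as described in Step 2.
import Anthropic from '@anthropic-ai/sdk'
import { instrument } from '@evaligo/tracing'

const anthropic = instrument(new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
}))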

Tip
Use trace data to identify optimization opportunities. Common patterns include expensive prompts, redundant LLM calls, and inefficient data retrieval that only become apparent at production scale.

Next Steps

With production tracing operational, you can leverage this data for continuous improvement, automated quality gates, and proactive issue detection.

- Query and Analyze Traces: investigate patterns and debug issues.
- Cost Tracking: monitor and optimize LLM costs.
- Create Dashboards: build custom monitoring views.
- Set Up Alerts: get notified of issues proactively.