
Log experiment runs

Integrate experiment tracking directly into your development workflow by logging runs programmatically. Keep all experiment results centralized while maintaining the flexibility to run experiments from anywhere in your infrastructure.

Programmatic experiment logging enables teams to embed evaluation into existing workflows, from local development scripts to production CI/CD pipelines. This ensures that all experiments are tracked consistently, regardless of where they originate.

By logging experiments through the SDK, you maintain a complete audit trail of all model changes and their performance impacts. This centralized approach makes it easy to compare results across different environments and track improvements over time.

Experiment dashboard showing programmatic runs alongside UI-initiated experiments with metadata and status

SDK Integration

The Evaligo SDK provides a simple interface for logging experiment runs from any Python environment. Whether you're running experiments in notebooks, scripts, or automated pipelines, the SDK ensures consistent data capture and formatting.

  1. Install and configure: Set up the SDK with your API key and workspace configuration (a minimal setup sketch follows this list).

  2. Define experiment parameters: Specify model settings, prompt variants, and evaluation criteria before starting the run.

  3. Execute and log results: Run your experiment and automatically capture all outputs, metrics, and metadata.

  4. Review in dashboard: Access all programmatic runs through the same interface as UI-based experiments.
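
For step 1, the setup can be as small as constructing a client with an API key read from the environment. The snippet below is a minimal sketch: the EVALIGO_API_KEY variable name is an illustrative assumption, and workspace selection is only noted in a comment, so adjust both to your actual account configuration.

Install and configure (sketch)
# Install the SDK first, e.g.: pip install evaligo
import os

import evaligo

# Keep credentials out of source control by reading them from the environment.
# EVALIGO_API_KEY is an assumed variable name used here for illustration;
# workspace selection (if your account uses one) would be configured similarly.
client = evaligo.Client(api_key=os.environ["EVALIGO_API_KEY"])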

Basic experiment logging
import evaligo

# Initialize the client
client = evaligo.Client(api_key="your-api-key")

# Create an experiment run
experiment = client.experiments.create_run(
    name="gpt-4-customer-support-v2",
    dataset_id="customer-qa-dataset",
    model_config={
        "provider": "openai",
        "model": "gpt-4-turbo",
        "temperature": 0.3,
        "max_tokens": 1000
    },
    prompt_template="""
    You are a helpful customer support agent.
    Customer question: {question}
    Context: {context}
    
    Provide a helpful, accurate response.
    """,
    metadata={
        "git_commit": "abc123def",
        "environment": "staging",
        "dataset_version": "2.1.0"
    }
)

# Log individual results as they complete.
# experiment_results is assumed here to be an iterable produced by your own
# inference loop, with each item exposing input, output, duration, and tokens.
for result in experiment_results:
    experiment.log_result(
        input_data=result.input,
        output=result.output,
        duration_ms=result.duration,
        token_usage=result.tokens
    )

Metadata and Context

Rich metadata is crucial for making experiments comparable and auditable. Capture environmental context, code versions, and configuration details to ensure results can be properly interpreted and reproduced.

Consistent metadata practices enable powerful filtering and analysis capabilities. You can quickly identify experiments from specific code branches, compare performance across environments, or track the impact of configuration changes over time.

Experiment metadata view showing git commit, environment details, dataset versions, and custom tags
Info

Metadata Best Practices: Include git commit hashes, dataset versions, environment names, and any relevant business context. This makes it easy to correlate experiment results with code changes and environmental factors.

Comprehensive metadata logging
import git
import os
from datetime import datetime

# Automatically capture environment context
repo = git.Repo(search_parent_directories=True)
git_info = {
    "commit_hash": repo.head.commit.hexsha,
    "branch": repo.active_branch.name,
    "is_dirty": repo.is_dirty(),
    "author": repo.head.commit.author.name
}

# Create experiment with rich metadata
experiment = client.experiments.create_run(
    name=f"prompt-optimization-{datetime.now().strftime('%Y%m%d-%H%M')}",
    dataset_id="production-samples-v3",
    metadata={
        **git_info,
        "environment": os.getenv("DEPLOY_ENV", "development"),
        "dataset_version": "3.2.1",
        "experiment_type": "prompt_optimization",
        "business_context": "q4-performance-improvement",
        "model_baseline": "gpt-3.5-turbo",
        "hypothesis": "Adding examples improves accuracy by 10%",
        "requester": "product-team",
        "priority": "high"
    },
    tags=["prompt-engineering", "q4-roadmap", "accuracy-focus"]
)
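
Once runs carry consistent metadata, they can be retrieved by those same fields for comparison. The sketch below assumes a hypothetical list_runs method with filters and tags parameters; the real query interface may differ, so treat this as an illustration of the pattern rather than the exact API.

Filtering runs by metadata (sketch)
# Hypothetical query: list_runs, filters, and tags are assumed names used to
# illustrate metadata-based filtering; consult the SDK reference for the
# actual interface.
runs = client.experiments.list_runs(
    filters={
        "environment": "staging",
        "experiment_type": "prompt_optimization",
    },
    tags=["prompt-engineering"],
)

for run in runs:
    print(run.name, run.metadata.get("commit_hash"), run.metadata.get("dataset_version"))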

CI/CD Integration

Embed experiment logging into your continuous integration pipelines to automatically evaluate model changes with every code commit. This creates a safety net that catches performance regressions before they reach production.

CI integration enables automated quality gates where deployments are blocked if experiment results fall below defined thresholds. This shift-left approach to AI quality ensures that only improvements make it to production systems.

Video: CI/CD Experiment Integration (4m 30s)
Learn how to set up automated experiment runs in GitHub Actions and integrate results into your deployment pipeline.
GitHub Actions workflow
name: AI Model Evaluation
on:
  pull_request:
    paths: ['prompts/**', 'models/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          
      - name: Install dependencies
        run: |
          pip install "evaligo[all]"
          
      - name: Run experiment evaluation
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
          GITHUB_SHA: ${{ github.sha }}
        run: |
          python scripts/evaluate_changes.py \
            --baseline-commit ${{ github.event.pull_request.base.sha }} \
            --current-commit ${{ github.sha }} \
            --dataset production-regression-tests \
            --threshold 0.95
            
      - name: Comment PR with results
        uses: actions/github-script@v6
        with:
          script: |
            const results = require('./experiment_results.json');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## 🧪 Experiment Results
              
              **Accuracy:** ${results.accuracy} (${results.accuracy_delta > 0 ? '+' : ''}${results.accuracy_delta}%)
              **Latency:** ${results.avg_latency}ms (${results.latency_delta > 0 ? '+' : ''}${results.latency_delta}ms)
              
              [View detailed results](${results.dashboard_url})
              `
            });
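
The workflow above delegates the actual gate to scripts/evaluate_changes.py, which is not shown on this page. The sketch below outlines one way such a script could work under assumed SDK calls (create_run as shown earlier, plus a hypothetical wait_for_completion helper and aggregate result fields): it writes experiment_results.json for the PR comment step and exits non-zero when accuracy falls below the threshold, which fails the job and blocks the merge.

Example quality-gate script (sketch)
# scripts/evaluate_changes.py -- illustrative sketch, not a verified implementation.
import argparse
import json
import os
import sys

import evaligo


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline-commit", required=True)
    parser.add_argument("--current-commit", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    args = parser.parse_args()

    client = evaligo.Client(api_key=os.environ["EVALIGO_API_KEY"])

    # create_run is shown elsewhere on this page; wait_for_completion and the
    # aggregate result fields below are assumed helpers for illustration.
    experiment = client.experiments.create_run(
        name=f"ci-eval-{args.current_commit[:8]}",
        dataset_id=args.dataset,
        metadata={
            "git_commit": args.current_commit,
            "baseline_commit": args.baseline_commit,
            "environment": "ci",
        },
    )
    results = experiment.wait_for_completion()

    # Persist results for the "Comment PR with results" step in the workflow.
    payload = {
        "accuracy": results.accuracy,
        "accuracy_delta": round((results.accuracy - results.baseline_accuracy) * 100, 1),
        "avg_latency": results.avg_latency_ms,
        "latency_delta": results.avg_latency_ms - results.baseline_latency_ms,
        "dashboard_url": results.dashboard_url,
    }
    with open("experiment_results.json", "w") as fh:
        json.dump(payload, fh)

    # Exit non-zero to fail the job and block the merge when the gate is breached.
    if results.accuracy < args.threshold:
        print(f"Accuracy {results.accuracy:.3f} is below threshold {args.threshold}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())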

Notifications and Monitoring

Set up webhooks and notifications to stay informed about experiment completion and any significant results. This enables rapid iteration cycles and ensures that important findings are communicated to relevant stakeholders promptly.

Info

Webhook Events: Configure webhooks for experiment completion, threshold breaches, and error conditions to integrate with your existing notification systems like Slack, PagerDuty, or custom dashboards.

Notification settings showing webhook configuration for Slack integration with experiment status updates
Webhook configuration
# Configure experiment with webhook notifications
experiment = client.experiments.create_run(
    name="production-readiness-test",
    dataset_id="critical-paths-dataset",
    webhooks={
        "on_completion": "https://hooks.slack.com/services/...",
        "on_threshold_breach": "https://api.pagerduty.com/...",
        "on_error": "https://your-monitoring-system.com/webhook"
    },
    alert_thresholds={
        "min_accuracy": 0.90,
        "max_latency_p95": 2000,
        "max_cost_per_request": 0.05
    }
)

# Automatic notifications will be sent based on results
experiment.start()
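
On the receiving end, each webhook is an HTTP POST describing the experiment outcome. The sketch below uses Flask and assumes a JSON payload with experiment_name, status, and metrics keys; inspect a real delivery before depending on specific field names.

Minimal webhook receiver (sketch)
# Minimal webhook receiver sketch. The payload fields (experiment_name,
# status, metrics) are assumptions for illustration; verify the real schema.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/webhooks/evaligo", methods=["POST"])
def handle_experiment_event():
    payload = request.get_json(force=True)

    name = payload.get("experiment_name", "unknown")
    status = payload.get("status", "unknown")
    metrics = payload.get("metrics", {})

    # Route the event to Slack, PagerDuty, or an internal dashboard here.
    print(f"Experiment {name} finished with status {status}: {metrics}")

    # Acknowledge quickly; do any heavy processing asynchronously.
    return jsonify({"received": True}), 200


if __name__ == "__main__":
    app.run(port=8080)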

Related Documentation

Run Experiments
Learn the basics of experiment setup and execution
Compare Results
Analyze and compare experiment outcomes
CI/CD for Experiments
Advanced CI/CD integration patterns
Run Evaluations with Code
Programmatic evaluation workflows