Log experiment runs
Integrate experiment tracking directly into your development workflow by logging runs programmatically. Keep all experiment results centralized while maintaining the flexibility to run experiments from anywhere in your infrastructure.
Programmatic experiment logging enables teams to embed evaluation into existing workflows, from local development scripts to production CI/CD pipelines. This ensures that all experiments are tracked consistently, regardless of where they originate.
By logging experiments through the SDK, you maintain a complete audit trail of all model changes and their performance impacts. This centralized approach makes it easy to compare results across different environments and track improvements over time.

SDK Integration
The Evaligo SDK provides a simple interface for logging experiment runs from any Python environment. Whether you're running experiments in notebooks, scripts, or automated pipelines, the SDK ensures consistent data capture and formatting.
1. Install and configure: Set up the SDK with your API key and workspace configuration.
2. Define experiment parameters: Specify model settings, prompt variants, and evaluation criteria before starting the run.
3. Execute and log results: Run your experiment and automatically capture all outputs, metrics, and metadata.
4. Review in dashboard: Access all programmatic runs through the same interface as UI-based experiments.
import evaligo

# Initialize the client
client = evaligo.Client(api_key="your-api-key")

# Create an experiment run
experiment = client.experiments.create_run(
    name="gpt-4-customer-support-v2",
    dataset_id="customer-qa-dataset",
    model_config={
        "provider": "openai",
        "model": "gpt-4-turbo",
        "temperature": 0.3,
        "max_tokens": 1000
    },
    prompt_template="""
You are a helpful customer support agent.

Customer question: {question}
Context: {context}

Provide a helpful, accurate response.
""",
    metadata={
        "git_commit": "abc123def",
        "environment": "staging",
        "dataset_version": "2.1.0"
    }
)
# Log individual results as they complete.
# experiment_results stands in for the output of your own generation loop
# (see the sketch below for producing and logging results inline).
for result in experiment_results:
    experiment.log_result(
        input_data=result.input,
        output=result.output,
        duration_ms=result.duration,
        token_usage=result.tokens
    )
Metadata and Context
Rich metadata is crucial for making experiments comparable and auditable. Capture environmental context, code versions, and configuration details to ensure results can be properly interpreted and reproduced.
Consistent metadata practices enable powerful filtering and analysis capabilities. You can quickly identify experiments from specific code branches, compare performance across environments, or track the impact of configuration changes over time.

Metadata Best Practices: Include git commit hashes, dataset versions, environment names, and any relevant business context. This makes it easy to correlate experiment results with code changes and environmental factors.
import os
from datetime import datetime

import git  # GitPython: pip install GitPython

# Automatically capture environment context
repo = git.Repo(search_parent_directories=True)
git_info = {
    "commit_hash": repo.head.commit.hexsha,
    "branch": repo.active_branch.name,
    "is_dirty": repo.is_dirty(),
    "author": repo.head.commit.author.name
}

# Create experiment with rich metadata
experiment = client.experiments.create_run(
    name=f"prompt-optimization-{datetime.now().strftime('%Y%m%d-%H%M')}",
    dataset_id="production-samples-v3",
    metadata={
        **git_info,
        "environment": os.getenv("DEPLOY_ENV", "development"),
        "dataset_version": "3.2.1",
        "experiment_type": "prompt_optimization",
        "business_context": "q4-performance-improvement",
        "model_baseline": "gpt-3.5-turbo",
        "hypothesis": "Adding examples improves accuracy by 10%",
        "requester": "product-team",
        "priority": "high"
    },
    tags=["prompt-engineering", "q4-roadmap", "accuracy-focus"]
)
CI/CD Integration
Embed experiment logging into your continuous integration pipelines to automatically evaluate model changes with every code commit. This creates a safety net that catches performance regressions before they reach production.
CI integration enables automated quality gates where deployments are blocked if experiment results fall below defined thresholds. This shift-left approach to AI quality ensures that only improvements make it to production systems.

name: AI Model Evaluation

on:
  pull_request:
    paths: ['prompts/**', 'models/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install "evaligo[all]"

      - name: Run experiment evaluation
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
          GITHUB_SHA: ${{ github.sha }}
        run: |
          python scripts/evaluate_changes.py \
            --baseline-commit ${{ github.event.pull_request.base.sha }} \
            --current-commit ${{ github.sha }} \
            --dataset production-regression-tests \
            --threshold 0.95

      - name: Comment PR with results
        uses: actions/github-script@v6
        with:
          script: |
            const results = require('./experiment_results.json');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## 🧪 Experiment Results
            **Accuracy:** ${results.accuracy} (${results.accuracy_delta > 0 ? '+' : ''}${results.accuracy_delta}%)
            **Latency:** ${results.avg_latency}ms (${results.latency_delta > 0 ? '+' : ''}${results.latency_delta}ms)
            [View detailed results](${results.dashboard_url})
            `
            });
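The workflow calls scripts/evaluate_changes.py, which is not shown above. The sketch below illustrates one way such a quality gate might look: run the evaluation, write results for the PR-comment step, and exit non-zero if accuracy falls below the threshold. The experiment.wait_for_completion() call and the experiment.metrics attribute are assumptions about the SDK surface, and the delta calculation against the baseline run is omitted for brevity.

#!/usr/bin/env python
"""Sketch of a CI quality gate: fail the job if accuracy drops below a threshold."""
import argparse
import json
import os
import sys

import evaligo


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline-commit", required=True)
    parser.add_argument("--current-commit", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--threshold", type=float, default=0.95)
    args = parser.parse_args()

    client = evaligo.Client(api_key=os.environ["EVALIGO_API_KEY"])

    # Create a run tied to the commit under review.
    experiment = client.experiments.create_run(
        name=f"ci-eval-{args.current_commit[:8]}",
        dataset_id=args.dataset,
        metadata={
            "git_commit": args.current_commit,
            "baseline_commit": args.baseline_commit,
            "environment": "ci",
        },
    )

    # Assumed SDK surface: block until the run finishes, then read aggregate metrics.
    experiment.start()
    experiment.wait_for_completion()
    metrics = experiment.metrics  # e.g. {"accuracy": 0.93, "avg_latency": 850, ...}

    # Write results for the PR-comment step in the workflow above.
    with open("experiment_results.json", "w") as fh:
        json.dump(metrics, fh)

    if metrics["accuracy"] < args.threshold:
        print(f"Accuracy {metrics['accuracy']:.3f} is below threshold {args.threshold}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())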
Notifications and Monitoring
Set up webhooks and notifications to stay informed about experiment completion and any significant results. This enables rapid iteration cycles and ensures that important findings are communicated to relevant stakeholders promptly.
Webhook Events: Configure webhooks for experiment completion, threshold breaches, and error conditions to integrate with your existing notification systems like Slack, PagerDuty, or custom dashboards.

# Configure experiment with webhook notifications
experiment = client.experiments.create_run(
    name="production-readiness-test",
    dataset_id="critical-paths-dataset",
    webhooks={
        "on_completion": "https://hooks.slack.com/services/...",
        "on_threshold_breach": "https://api.pagerduty.com/...",
        "on_error": "https://your-monitoring-system.com/webhook"
    },
    alert_thresholds={
        "min_accuracy": 0.90,
        "max_latency_p95": 2000,
        "max_cost_per_request": 0.05
    }
)
# Automatic notifications will be sent based on results
experiment.start()
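If a webhook points at your own service rather than Slack or PagerDuty, that service needs an endpoint that accepts the POSTed event. The sketch below is a minimal Flask receiver; the payload fields it reads (event, experiment_name, metrics) are assumptions, so confirm them against the actual webhook body before wiring up alerting.

from flask import Flask, request

app = Flask(__name__)

@app.route("/evaligo/webhook", methods=["POST"])
def handle_evaligo_event():
    # Payload shape is assumed for illustration; inspect a real delivery to confirm.
    payload = request.get_json(force=True)
    event = payload.get("event")

    if event == "threshold_breach":
        # Forward to your paging or alerting system here.
        print(f"ALERT: {payload.get('experiment_name')} breached {payload.get('metrics')}")
    elif event == "completed":
        print(f"Run finished: {payload.get('experiment_name')}")

    return {"ok": True}, 200

if __name__ == "__main__":
    app.run(port=8080)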