CI/CD for experiments
Build robust deployment pipelines by integrating experiment evaluation directly into your CI/CD workflow. Automatically catch performance regressions, enforce quality standards, and maintain historical baselines across all code changes.
CI/CD integration transforms evaluation from a manual process into an automated safety net that runs with every code change. This shift-left approach to AI quality catches regressions before they reach production, reducing the risk of deploying models that perform worse than existing baselines.
Automated experiment pipelines enable faster iteration cycles while maintaining rigorous quality standards. Teams can confidently make changes knowing that comprehensive evaluation will catch any unintended consequences before they impact users.

Pipeline Integration Patterns
Design CI/CD workflows that automatically trigger experiments when AI-related code changes, from prompt modifications to model updates. Different integration patterns serve different team needs and risk tolerance levels.
1. Pull request validation: Run lightweight experiments on code changes to catch obvious regressions before merge.
2. Merge-to-main evaluation: Execute comprehensive experiments after merge to validate integration quality.
3. Pre-deployment testing: Run full evaluation suites before promoting changes to staging or production environments.
4. Continuous monitoring: Ongoing evaluation in production to detect drift and performance degradation over time.
Layered Approach: Use fast, lightweight checks for PR validation and comprehensive evaluations for deployment gates. This balances speed with thoroughness.
The GitHub Actions workflow below implements the first two patterns: a lightweight check on pull requests that touch AI-related paths, and a comprehensive run (plus baseline update) after merge to main.

```yaml
name: AI Model Quality Gate

on:
  pull_request:
    paths: ['prompts/**', 'models/**', 'evaluators/**']
  push:
    branches: [main]

jobs:
  quick-validation:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Quick Evaluation
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
        run: |
          evaligo experiment run \
            --config .evaligo/pr-validation.yaml \
            --dataset smoke-tests \
            --baseline main \
            --timeout 10m \
            --fail-on-regression

  comprehensive-evaluation:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full Evaluation Suite
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
        run: |
          evaligo experiment run \
            --config .evaligo/comprehensive.yaml \
            --dataset production-regression-tests \
            --parallel 4 \
            --generate-report \
            --slack-notify "#ai-team"
      - name: Update Baseline
        if: success()
        run: |
          evaligo baselines update \
            --experiment ${{ github.sha }} \
            --environment staging
```
Quality Gates and Thresholds
Define automated decision criteria that determine whether changes can proceed through the pipeline. Quality gates provide objective, consistent evaluation criteria that reduce the need for manual review while maintaining high standards.
Effective threshold management balances sensitivity with practicality. Thresholds that are too strict block legitimate improvements, while thresholds that are too lenient let meaningful regressions through. Establish thresholds based on historical performance and business impact analysis.
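One way to ground these numbers is to derive them from the run-to-run variance of recent baseline evaluations rather than choosing them by intuition. The sketch below is illustrative only: the historical scores are placeholders, and the two-standard-deviation rule is an example policy, not an Evaligo feature.

```python
import statistics

# Accuracy of the last N baseline runs, e.g. exported from your experiment history.
# These numbers are placeholders for illustration.
historical_accuracy = [0.912, 0.908, 0.915, 0.910, 0.906, 0.913, 0.909]

mean_acc = statistics.mean(historical_accuracy)
stdev_acc = statistics.stdev(historical_accuracy)

# Example policy: only flag a regression when accuracy drops more than two
# standard deviations below the historical mean, so normal run-to-run noise
# stays inside the band.
max_decrease = 2 * stdev_acc

print(f"historical mean accuracy: {mean_acc:.3f}")
print(f"run-to-run stdev:         {stdev_acc:.4f}")
print(f"suggested max_decrease:   {max_decrease:.3%}")  # candidate value for the gate config
```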

```yaml
# .evaligo/quality-gates.yaml
gates:
  pr_validation:
    description: "Fast checks for pull requests"
    thresholds:
      accuracy_regression:
        max_decrease: 2%
        confidence_level: 90%
      latency_p95:
        max_increase: 200ms
      cost_per_request:
        max_increase: 15%
    evaluators:
      - groundedness_fast
      - toxicity_basic
      - coherence_simple

  deployment_gate:
    description: "Comprehensive pre-deployment validation"
    thresholds:
      accuracy_regression:
        max_decrease: 1%
        confidence_level: 95%
      hallucination_rate:
        max_rate: 0.05
      user_satisfaction_proxy:
        min_score: 0.85
      latency_p99:
        max_value: 3000ms
      cost_efficiency:
        min_accuracy_per_dollar: 10.0
    evaluators:
      - groundedness_comprehensive
      - toxicity_advanced
      - domain_specific_quality
      - user_intent_alignment
    failure_action: "block_deployment"
    success_action: "update_baseline"
```
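Conceptually, a gate is a set of comparisons between the candidate experiment's metrics and the configured limits, followed by a single pass/fail decision. The standalone sketch below mirrors a few of the deployment_gate thresholds above using hard-coded example numbers; it is not the Evaligo implementation, and it treats max_decrease as absolute percentage points.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str

def check_deployment_gate(candidate: dict, baseline: dict) -> list[GateResult]:
    """Illustrative re-implementation of a few deployment_gate thresholds."""
    results = []

    # accuracy_regression: max_decrease 1% (interpreted here as absolute points)
    acc_drop = baseline["accuracy"] - candidate["accuracy"]
    results.append(GateResult(
        "accuracy_regression",
        acc_drop <= 0.01,
        f"accuracy changed by {-acc_drop:+.3f}",
    ))

    # hallucination_rate: max_rate 0.05
    results.append(GateResult(
        "hallucination_rate",
        candidate["hallucination_rate"] <= 0.05,
        f"rate = {candidate['hallucination_rate']:.3f}",
    ))

    # latency_p99: max_value 3000ms
    results.append(GateResult(
        "latency_p99",
        candidate["latency_p99_ms"] <= 3000,
        f"p99 = {candidate['latency_p99_ms']} ms",
    ))
    return results

# Example metrics; in a real pipeline these come from the experiment run.
baseline = {"accuracy": 0.91}
candidate = {"accuracy": 0.905, "hallucination_rate": 0.03, "latency_p99_ms": 2400}

results = check_deployment_gate(candidate, baseline)
for r in results:
    print(f"{'PASS' if r.passed else 'FAIL'}  {r.name}: {r.detail}")

if all(r.passed for r in results):
    print("gate passed -> update_baseline")
else:
    print("gate failed -> block_deployment")
```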
Baseline Management
Maintain rolling baselines that adapt to legitimate improvements while catching genuine regressions. Baseline management ensures that quality comparisons remain meaningful as your AI system evolves over time.
Automated baseline updates prevent drift where gradually declining performance becomes the new normal. By systematically tracking and updating baselines, teams maintain awareness of their system's true performance trajectory.

1. Environment-specific baselines: Maintain separate baselines for development, staging, and production to account for environmental differences.
2. Automated updates: Update baselines automatically when deployments pass all quality gates and show sustained improvement.
3. Rollback protection: Preserve baseline history to enable rollback when new changes cause unexpected regressions.
4. Trend monitoring: Track baseline evolution over time to identify gradual performance changes and system drift.
```python
# Automated baseline update script
import evaligo
from datetime import datetime, timedelta

client = evaligo.Client()

# Get recent successful deployments
recent_deployments = client.deployments.list(
    status="success",
    since=datetime.now() - timedelta(days=7),
    environment="production"
)

# Find candidates for baseline update
for deployment in recent_deployments:
    experiment = deployment.experiment
    current_baseline = client.baselines.get_current("production")

    # Check if the improvement is sustained
    improvement_metrics = experiment.compare_to_baseline(current_baseline)

    if (improvement_metrics.accuracy.improvement > 0.02 and
            improvement_metrics.statistical_significance > 0.95 and
            deployment.uptime_hours > 72):  # 3 days stable

        # Update the baseline
        new_baseline = client.baselines.create(
            name=f"production-baseline-{deployment.version}",
            experiment_id=experiment.id,
            metrics=experiment.get_metrics(),
            metadata={
                "deployment_date": deployment.created_at,
                "improvement_summary": improvement_metrics.summary,
                "previous_baseline": current_baseline.id
            }
        )

        # Notify the team
        client.notifications.send(
            channel="#ai-team",
            message=f"🎯 Baseline updated! New production baseline shows {improvement_metrics.accuracy.improvement:.1%} accuracy improvement."
        )
```
Notification and Alerting
Configure intelligent notifications that provide actionable information when quality gates fail or performance trends change. Effective alerting reduces noise while ensuring that important issues receive immediate attention.
Alert Fatigue: Configure notifications carefully to avoid overwhelming teams with false positives. Use different channels and urgency levels for different types of issues.

```yaml
# .evaligo/notifications.yaml
channels:
  slack:
    webhook_url: ${{ secrets.SLACK_WEBHOOK }}
    default_channel: "#ai-alerts"
  email:
    smtp_config: ${{ secrets.SMTP_CONFIG }}
  pagerduty:
    integration_key: ${{ secrets.PAGERDUTY_KEY }}

notification_rules:
  regression_detected:
    severity: "high"
    channels: ["slack", "email"]
    template: |
      🚨 Quality regression detected in {{experiment.name}}

      **Metrics affected:**
      {{#each failed_metrics}}
      - {{name}}: {{current_value}} ({{change}} from baseline)
      {{/each}}

      **Failing examples:** {{failing_examples_count}}
      [View detailed results]({{experiment.url}})

      **Action required:** Review changes in {{pr.url}}

  deployment_blocked:
    severity: "critical"
    channels: ["slack", "pagerduty"]
    escalation:
      - after: "30m"
        to: ["email:tech-leads@company.com"]
      - after: "2h"
        to: ["pagerduty:critical"]

  performance_trend:
    severity: "medium"
    channels: ["slack"]
    frequency: "daily_digest"
    conditions:
      - metric_degradation_over_days: 7
      - threshold: 5%
```
Reporting and Visibility
Generate automated reports that provide stakeholders with clear visibility into AI system performance trends, deployment success rates, and quality metrics over time. Good reporting builds confidence in automated processes and enables data-driven decisions.
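A simple starting point is a scheduled job that assembles a digest from the same APIs used above. The sketch below builds a weekly Markdown summary; treat the field names (deployment.status and the keys returned by get_metrics()) as assumptions to adapt to your own schema.

```python
# Weekly digest sketch: deployment success rate plus headline quality metrics.
# deployment.status and the get_metrics() keys are assumptions for illustration.
from datetime import datetime, timedelta
import evaligo

client = evaligo.Client()

deployments = client.deployments.list(
    since=datetime.now() - timedelta(days=7),
    environment="production",
)

total = len(deployments)
succeeded = sum(1 for d in deployments if d.status == "success")

lines = [
    "# Weekly AI quality report",
    f"Deployments: {total}, success rate: {succeeded / max(total, 1):.0%}",
    "",
    "| Version | Accuracy | p95 latency (ms) |",
    "|---------|----------|------------------|",
]
for d in deployments:
    metrics = d.experiment.get_metrics()
    lines.append(f"| {d.version} | {metrics['accuracy']:.3f} | {metrics['latency_p95']} |")

report = "\n".join(lines)
client.notifications.send(channel="#ai-team", message=report)
```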