CI/CD for experiments

Build robust deployment pipelines by integrating experiment evaluation directly into your CI/CD workflow. Automatically catch performance regressions, enforce quality standards, and maintain historical baselines across all code changes.

CI/CD integration transforms evaluation from a manual process into an automated safety net that runs with every code change. This shift-left approach to AI quality catches regressions before they reach production, reducing the risk of deploying models that perform worse than existing baselines.

Automated experiment pipelines enable faster iteration cycles while maintaining rigorous quality standards. Teams can make changes with confidence, knowing that comprehensive evaluation will surface unintended regressions before they impact users.

CI/CD pipeline dashboard showing experiment results, quality gates, and deployment status across multiple environments

Pipeline Integration Patterns

Design CI/CD workflows that automatically trigger experiments when AI-related code changes, from prompt modifications to model updates. Different integration patterns serve different team needs and risk tolerance levels.

1. Pull request validation: Run lightweight experiments on code changes to catch obvious regressions before merge.

2. Merge-to-main evaluation: Execute comprehensive experiments after merge to validate integration quality.

3. Pre-deployment testing: Run full evaluation suites before promoting changes to staging or production environments.

4. Continuous monitoring: Ongoing evaluation in production to detect drift and performance degradation over time.

Info

Layered Approach: Use fast, lightweight checks for PR validation and comprehensive evaluations for deployment gates. This balances speed with thoroughness.

GitHub Actions workflow example
name: AI Model Quality Gate
on:
  pull_request:
    paths: ['prompts/**', 'models/**', 'evaluators/**']
  push:
    branches: [main]

jobs:
  quick-validation:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Quick Evaluation
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
        run: |
          evaligo experiment run \
            --config .evaligo/pr-validation.yaml \
            --dataset smoke-tests \
            --baseline main \
            --timeout 10m \
            --fail-on-regression

  comprehensive-evaluation:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Full Evaluation Suite
        env:
          EVALIGO_API_KEY: ${{ secrets.EVALIGO_API_KEY }}
        run: |
          evaligo experiment run \
            --config .evaligo/comprehensive.yaml \
            --dataset production-regression-tests \
            --parallel 4 \
            --generate-report \
            --slack-notify "#ai-team"

      - name: Update Baseline
        if: success()
        run: |
          evaligo baselines update \
            --experiment ${{ github.sha }} \
            --environment staging

Quality Gates and Thresholds

Define automated decision criteria that determine whether changes can proceed through the pipeline. Quality gates provide objective, consistent evaluation criteria that reduce the need for manual review while maintaining high standards.

Effective threshold management balances sensitivity with practicality. Thresholds that are too strict block legitimate improvements, while thresholds that are too lenient fail to catch meaningful regressions. Establish thresholds based on historical performance and business impact analysis.
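
One way to ground thresholds in data is to derive them from the normal run-to-run variation of recent experiments rather than picking numbers by feel. The sketch below illustrates that idea in Python; the `client.experiments.list` call and the `accuracy` metric accessor follow the style of the baseline script later on this page, but treat them as assumptions about the SDK rather than its documented surface.

Threshold calibration sketch
# Sketch: derive a regression threshold from historical run-to-run variation.
# client.experiments.list() and the accuracy field are assumed names, chosen
# to match the other examples on this page; adjust to your SDK version.
import statistics
from datetime import datetime, timedelta

import evaligo

client = evaligo.Client()

# Pull the last 30 days of completed production experiments.
runs = client.experiments.list(
    status="completed",
    environment="production",
    since=datetime.now() - timedelta(days=30),
)

accuracies = [run.get_metrics().accuracy for run in runs]

# Run-to-run deltas show how much accuracy normally moves between
# successive experiments when nothing has actually regressed.
deltas = [abs(a - b) for a, b in zip(accuracies, accuracies[1:])]

# Set max_decrease slightly above normal noise (mean + 2 standard deviations),
# so the gate only flags changes larger than ordinary variation.
noise_floor = statistics.mean(deltas) + 2 * statistics.stdev(deltas)
print(f"Suggested accuracy_regression.max_decrease: {noise_floor:.1%}")

The resulting value can then be copied into `accuracy_regression.max_decrease` in the quality gate configuration below.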

Quality gate configuration interface showing thresholds for accuracy, latency, cost, and custom metrics with pass/fail status
Quality gate configuration
# .evaligo/quality-gates.yaml
gates:
  pr_validation:
    description: "Fast checks for pull requests"
    thresholds:
      accuracy_regression: 
        max_decrease: 2%
        confidence_level: 90%
      latency_p95:
        max_increase: 200ms
      cost_per_request:
        max_increase: 15%
    evaluators:
      - groundedness_fast
      - toxicity_basic
      - coherence_simple
    
  deployment_gate:
    description: "Comprehensive pre-deployment validation"
    thresholds:
      accuracy_regression:
        max_decrease: 1%
        confidence_level: 95%
      hallucination_rate:
        max_rate: 0.05
      user_satisfaction_proxy:
        min_score: 0.85
      latency_p99:
        max_value: 3000ms
      cost_efficiency:
        min_accuracy_per_dollar: 10.0
    evaluators:
      - groundedness_comprehensive
      - toxicity_advanced  
      - domain_specific_quality
      - user_intent_alignment
    failure_action: "block_deployment"
    success_action: "update_baseline"
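
Under the hood, a gate check reduces to comparing the new experiment's metrics against the configured thresholds and failing the CI job with a non-zero exit code when any of them are breached. The sketch below shows that logic in Python; the `client.experiments.get` and `compare_to_baseline` calls mirror the baseline automation example further down the page, and the threshold values echo `deployment_gate` above, but both should be read as assumptions rather than a documented API.

Quality gate check sketch
# Sketch: reduce a deployment gate decision to a pass/fail exit code.
# The SDK calls mirror the baseline example on this page and are assumptions,
# not a documented API; the thresholds echo deployment_gate above.
import sys

import evaligo

THRESHOLDS = {
    "max_accuracy_decrease": 0.01,   # accuracy_regression.max_decrease: 1%
    "max_hallucination_rate": 0.05,  # hallucination_rate.max_rate
    "min_satisfaction": 0.85,        # user_satisfaction_proxy.min_score
}

client = evaligo.Client()
experiment = client.experiments.get(sys.argv[1])   # experiment ID passed by CI
baseline = client.baselines.get_current("production")
comparison = experiment.compare_to_baseline(baseline)
metrics = experiment.get_metrics()

failures = []
if comparison.accuracy.improvement < -THRESHOLDS["max_accuracy_decrease"]:
    failures.append("accuracy regressed beyond the allowed 1%")
if metrics.hallucination_rate > THRESHOLDS["max_hallucination_rate"]:
    failures.append("hallucination rate above 0.05")
if metrics.user_satisfaction_proxy < THRESHOLDS["min_satisfaction"]:
    failures.append("user satisfaction proxy below 0.85")

if failures:
    print("Deployment blocked:", "; ".join(failures))
    sys.exit(1)  # non-zero exit fails the CI job and blocks the deployment

print("All quality gates passed")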

Baseline Management

Maintain rolling baselines that adapt to legitimate improvements while catching genuine regressions. Baseline management ensures that quality comparisons remain meaningful as your AI system evolves over time.

Automated baseline updates prevent drift where gradually declining performance becomes the new normal. By systematically tracking and updating baselines, teams maintain awareness of their system's true performance trajectory.

Video

Baseline Management Strategy (6m 10s): Learn how to set up automated baseline updates and manage performance expectations across different environments.

1. Environment-specific baselines: Maintain separate baselines for development, staging, and production to account for environmental differences.

2. Automated updates: Update baselines automatically when deployments pass all quality gates and show sustained improvement.

3. Rollback protection: Preserve baseline history to enable rollback when new changes cause unexpected regressions.

4. Trend monitoring: Track baseline evolution over time to identify gradual performance changes and system drift (see the drift-check sketch after the automation example below).

Baseline management automation
# Automated baseline update script
import evaligo
from datetime import datetime, timedelta

client = evaligo.Client()

# Get recent successful deployments
recent_deployments = client.deployments.list(
    status="success",
    since=datetime.now() - timedelta(days=7),
    environment="production"
)

# Find candidates for baseline update
for deployment in recent_deployments:
    experiment = deployment.experiment
    current_baseline = client.baselines.get_current("production")
    
    # Check if improvement is sustained
    improvement_metrics = experiment.compare_to_baseline(current_baseline)
    
    if (improvement_metrics.accuracy.improvement > 0.02 and 
        improvement_metrics.statistical_significance > 0.95 and
        deployment.uptime_hours > 72):  # 3 days stable
        
        # Update baseline
        new_baseline = client.baselines.create(
            name=f"production-baseline-{deployment.version}",
            experiment_id=experiment.id,
            metrics=experiment.get_metrics(),
            metadata={
                "deployment_date": deployment.created_at,
                "improvement_summary": improvement_metrics.summary,
                "previous_baseline": current_baseline.id
            }
        )
        
        # Notify team
        client.notifications.send(
            channel="#ai-team",
            message=f"🎯 Baseline updated! New production baseline shows {improvement_metrics.accuracy.improvement:.1%} accuracy improvement."
        )
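
The automation above covers item 2 (automated updates). For item 4, trend monitoring, a lightweight scheduled check can walk the stored baseline history and flag a slow drift that no single comparison would catch. The sketch below assumes a `client.baselines.list` call and an `accuracy` entry in the stored baseline metrics; both are illustrative names rather than guaranteed SDK fields.

Baseline drift check sketch
# Sketch: detect gradual drift across the last few production baselines.
# client.baselines.list() and metrics["accuracy"] are assumed names used
# for illustration; adapt them to the fields you actually store.
import evaligo

client = evaligo.Client()

# Most recent baselines first, then put them back in chronological order.
history = client.baselines.list(environment="production", limit=6)
history = list(reversed(history))

accuracies = [b.metrics["accuracy"] for b in history]
total_drift = accuracies[-1] - accuracies[0]

# A series of small, individually acceptable changes can still add up to a
# meaningful decline; alert when the cumulative drop exceeds 3 points.
if total_drift < -0.03:
    client.notifications.send(
        channel="#ai-team",
        message=(
            f"📉 Accuracy has drifted {total_drift:.1%} across the last "
            f"{len(history)} production baselines. Review the trend."
        ),
    )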

Notification and Alerting

Configure intelligent notifications that provide actionable information when quality gates fail or performance trends change. Effective alerting reduces noise while ensuring that important issues receive immediate attention.

Info

Alert Fatigue: Configure notifications carefully to avoid overwhelming teams with false positives. Use different channels and urgency levels for different types of issues.

Notification settings showing different alert types, channels, and escalation rules for various failure scenarios
Notification configuration
# .evaligo/notifications.yaml
channels:
  slack:
    webhook_url: ${{ secrets.SLACK_WEBHOOK }}
    default_channel: "#ai-alerts"
    
  email:
    smtp_config: ${{ secrets.SMTP_CONFIG }}
    
  pagerduty:
    integration_key: ${{ secrets.PAGERDUTY_KEY }}

notification_rules:
  regression_detected:
    severity: "high"
    channels: ["slack", "email"]
    template: |
      🚨 Quality regression detected in {{experiment.name}}
      
      **Metrics affected:**
      {{#each failed_metrics}}
      - {{name}}: {{current_value}} ({{change}} from baseline)
      {{/each}}
      
      **Failing examples:** {{failing_examples_count}}
      [View detailed results]({{experiment.url}})
      
      **Action required:** Review changes in {{pr.url}}
    
  deployment_blocked:
    severity: "critical"
    channels: ["slack", "pagerduty"]
    escalation:
      - after: "30m"
        to: ["email:tech-leads@company.com"]
      - after: "2h" 
        to: ["pagerduty:critical"]
        
  performance_trend:
    severity: "medium"
    channels: ["slack"]
    frequency: "daily_digest"
    conditions:
      - metric_degradation_over_days: 7
      - threshold: 5%
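
The `performance_trend` rule above batches low-severity signals into a daily digest instead of paging anyone. If you prefer to drive that logic yourself from a scheduled job, a small script like the following can apply the same 7-day, 5% condition before anything is sent; the `client.metrics.history` call is a hypothetical name used purely for illustration.

Daily digest sketch
# Sketch: apply the performance_trend conditions (7-day window, 5% threshold)
# in a scheduled job before sending a single daily digest message.
# client.metrics.history() is a hypothetical call used for illustration.
from datetime import datetime, timedelta

import evaligo

client = evaligo.Client()

window_start = datetime.now() - timedelta(days=7)
history = client.metrics.history(
    metric="accuracy",
    environment="production",
    since=window_start,
)

# Compare the start and end of the window rather than alerting on every dip.
change = (history[-1].value - history[0].value) / history[0].value

if change <= -0.05:  # matches threshold: 5% degradation over 7 days
    client.notifications.send(
        channel="#ai-alerts",
        message=(
            f"Daily digest: accuracy is down {abs(change):.1%} over the last "
            f"7 days. No single run failed a gate, but the trend is negative."
        ),
    )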

Reporting and Visibility

Generate automated reports that provide stakeholders with clear visibility into AI system performance trends, deployment success rates, and quality metrics over time. Good reporting builds confidence in automated processes and enables data-driven decisions.
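
There is no single required format for these reports; a scheduled job that aggregates recent experiments into a short summary is often enough to start. The sketch below assembles a weekly Markdown digest, assuming `client.experiments.list`, a `passed_quality_gates` flag, and per-experiment metric accessors in the style of the earlier examples on this page; treat the exact names as placeholders.

Weekly report sketch
# Sketch: build a weekly Markdown report from recent experiment runs.
# client.experiments.list(), passed_quality_gates, and the metric fields are
# assumed names consistent with the other examples here, not guaranteed API.
from datetime import datetime, timedelta

import evaligo

client = evaligo.Client()

experiments = client.experiments.list(
    environment="production",
    since=datetime.now() - timedelta(days=7),
)

passed = [e for e in experiments if e.passed_quality_gates]
lines = [
    "# Weekly AI Quality Report",
    f"Experiments run: {len(experiments)}",
    f"Quality gates passed: {len(passed)} / {len(experiments)}",
    "",
    "| Experiment | Accuracy | p95 latency | Cost / request |",
    "| --- | --- | --- | --- |",
]
for exp in experiments:
    m = exp.get_metrics()
    lines.append(
        f"| {exp.name} | {m.accuracy:.1%} | {m.latency_p95_ms} ms | ${m.cost_per_request:.4f} |"
    )

# Write the digest where a CI artifact step or docs publisher can pick it up.
with open("weekly-quality-report.md", "w") as f:
    f.write("\n".join(lines))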

Related Documentation

Log Experiment Runs
Programmatic experiment tracking and metadata management
Compare Results
Automated comparison and decision-making workflows
Monitors
Set up continuous monitoring for production systems
Custom Evaluations
Build domain-specific quality checks for your pipeline