Compare results

Make informed decisions about model and prompt changes by comparing experiment results side by side. Understand performance differences across metrics, drill into specific examples, and share findings with stakeholders.

Effective comparison workflows help teams move beyond intuition to data-driven decision making. By systematically analyzing differences between experiments, you can identify which changes truly improve performance and avoid regressions that might not be immediately obvious.

Evaligo's comparison tools enable both high-level metric analysis and detailed, row-by-row examination of results. This multi-level approach ensures you catch both statistical trends and nuanced changes in output quality that aggregate metrics might miss.

Evaluator Scores Comparison

Visualize how each evaluator scores your AI outputs across multiple criteria, enabling granular quality analysis and targeted improvements.

Metric-Level Comparison

Start with high-level metrics to understand overall performance differences between experiments. Look for statistically significant changes in accuracy, latency, cost, and custom evaluation scores to guide your analysis.

  1. Select experiments to compare
     Choose baseline and candidate experiments, ensuring they use the same dataset for a fair comparison.

  2. Review aggregate metrics
     Examine accuracy, F1 scores, latency percentiles, and cost per request to identify significant differences.

  3. Check statistical significance
     Use confidence intervals and p-values to determine whether observed differences are meaningful.

  4. Identify performance trade-offs
     Understand the relationships between accuracy, cost, and latency to make informed decisions.

Tip

Statistical Significance: Look for confidence intervals that don't overlap and p-values below your significance threshold (typically 0.05) to ensure observed differences aren't due to random variation.

Programmatic comparison analysis
# Compare two experiments programmatically
baseline = client.experiments.get("gpt-35-baseline-v1")
candidate = client.experiments.get("gpt-4-optimized-v2")

comparison = client.experiments.compare(
    baseline_id=baseline.id,
    candidate_id=candidate.id,
    metrics=["accuracy", "latency_p95", "cost_per_request"],
    confidence_level=0.95
)

# Check for significant improvements
if comparison.accuracy.is_significant and comparison.accuracy.delta > 0.05:
    print(f"Significant accuracy improvement: {comparison.accuracy.delta:.2%}")
    print(f"Confidence interval: {comparison.accuracy.confidence_interval}")

# Analyze trade-offs
if comparison.cost_per_request.delta > 0.20 and comparison.accuracy.delta > 0:  # >20% cost increase
    # Relative cost increase per point of accuracy gained (both deltas are fractions)
    cost_per_accuracy = comparison.cost_per_request.delta / comparison.accuracy.delta
    print(f"Cost increase per accuracy point gained: {cost_per_accuracy:.2f}x")

Row-Level Analysis

Drill down into individual test cases to understand how changes affect specific types of inputs. This qualitative analysis often reveals patterns that aggregate metrics miss and helps identify edge cases where the model struggles.

Row-level comparison is particularly valuable for understanding changes in output style, reasoning quality, or handling of edge cases. By examining specific examples where performance differs, you can gain insights into the underlying model behavior changes.

AI-Generated Prompt Optimization

Leverage Evaligo to generate optimized prompts tailored to your task, boosting evaluation scores and model performance with AI-guided suggestions.

  1. Filter to interesting cases
     Focus on rows where experiments disagree or where evaluator scores differ significantly.

  2. Examine output differences
     Compare response quality, style, factual accuracy, and reasoning between variants.

  3. Review evaluator feedback
     Understand why evaluators scored outputs differently and identify patterns in the feedback.

  4. Categorize findings
     Group similar differences to understand systematic changes in model behavior.

Info

Sampling Strategy: Focus on rows with the largest score differences, disagreements between evaluators, or cases that represent critical business scenarios for your application.
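
One way to apply this sampling strategy programmatically is to pull row-level results and rank them by how far apart the two experiments' scores are. The snippet below is a sketch only: the rows() accessor and the field names on each row are illustrative assumptions, not a documented part of the client.

Surfacing rows with the largest score disagreements (sketch)
# Assumed accessor returning per-row results for both experiments
rows = comparison.rows()

# Rank rows by the absolute gap between baseline and candidate evaluator scores
disagreements = sorted(
    rows,
    key=lambda r: abs(r.candidate_score - r.baseline_score),
    reverse=True,
)

# Review the 20 largest disagreements first
for row in disagreements[:20]:
    print(f"Input: {row.input[:80]}")
    print(f"  Baseline score:  {row.baseline_score:.2f}")
    print(f"  Candidate score: {row.candidate_score:.2f}")
    print(f"  Evaluator feedback: {row.evaluator_feedback}")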

Video: Row-Level Analysis Workflow (3m 45s)
Learn how to effectively drill into individual test cases and identify patterns in model behavior changes.

Sharing and Documentation

Export comparison results and create shareable reports for stakeholders. Document your findings, decisions, and rationale to maintain transparency and enable future teams to understand the evolution of your AI system.

Well-documented comparisons serve as valuable historical records that help teams understand why certain decisions were made and provide context for future optimization efforts. This documentation is especially important for regulatory compliance and audit requirements.

Comparison report export interface with options for PDF, CSV, and interactive dashboard sharing
Generating comparison reports
# Generate a comprehensive comparison report
report = comparison.generate_report(
    title="GPT-4 Migration Impact Analysis",
    description="Evaluating accuracy and cost trade-offs for customer support automation",
    include_sections=[
        "executive_summary",
        "metric_comparison", 
        "row_level_analysis",
        "recommendations"
    ],
    sample_size=50,  # Include 50 example rows
    format="pdf"
)

# Share with stakeholders
report.share_with([
    "product@company.com",
    "engineering@company.com", 
    "ai-team@company.com"
])

# Attach to pull request (Markdown is left unindented so GitHub renders it correctly)
github_integration.comment_on_pr(
    pr_number=123,
    content=f"""## 🔬 Experiment Comparison Results

**Accuracy Change:** {comparison.accuracy.delta:+.1%} ({comparison.accuracy.significance})
**Cost Impact:** {comparison.cost_per_request.delta:+.1%}
**Latency Change:** {comparison.latency_p95.delta:+.0f}ms

[View detailed analysis]({report.url})

**Recommendation:** {report.recommendation}
"""
)

Automated Decision Making

Set up automated rules and thresholds to streamline decision making for routine comparisons. This enables faster iteration cycles while maintaining quality standards through systematic evaluation criteria.

Warning

Automation Guidelines: Start with conservative thresholds and gradually refine them based on historical performance. Always include manual review steps for critical decisions or significant changes.

Automated decision rules
# Define decision criteria
decision_rules = {
    "auto_approve": {
        "accuracy_improvement": ">= 2%",
        "cost_increase": "<= 10%", 
        "latency_increase": "<= 200ms",
        "statistical_significance": "p < 0.05"
    },
    "auto_reject": {
        "accuracy_decrease": ">= 1%",
        "cost_increase": ">= 25%",
        "latency_increase": ">= 500ms"
    },
    "requires_review": {
        "mixed_results": True,
        "edge_case_regressions": "> 5 cases"
    }
}

# Apply rules to comparison
decision = comparison.apply_rules(decision_rules)

if decision.outcome == "auto_approve":
    print(f"✅ Automatically approved: {decision.reasoning}")
    deployment.promote_to_production(candidate.id)
elif decision.outcome == "auto_reject":
    print(f"❌ Automatically rejected: {decision.reasoning}")
else:
    print(f"⏸️ Manual review required: {decision.reasoning}")
    slack.notify_team(f"Experiment comparison needs review: {comparison.url}")
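
If you gate deployments in CI, one pattern is a small script that applies the same rules and converts the outcome into an exit code, so the pipeline fails closed unless the comparison is auto-approved. The sketch below assumes the comparison and decision_rules objects from the example above; the exit-code convention is one reasonable choice, not a prescribed one (see CI/CD for Experiments for a full integration).

CI gate based on the decision rules (sketch)
import sys

decision = comparison.apply_rules(decision_rules)

if decision.outcome == "auto_approve":
    print(f"✅ Gate passed: {decision.reasoning}")
    sys.exit(0)  # pipeline continues to deployment
elif decision.outcome == "auto_reject":
    print(f"❌ Gate failed: {decision.reasoning}")
    sys.exit(1)  # pipeline stops
else:
    # Mixed results: fail the automated gate and hand off to a human reviewer
    print(f"⏸️ Manual review required: {decision.reasoning}")
    sys.exit(2)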

Related Documentation

Run Experiments
Learn how to set up and execute experiments
Log Experiment Runs
Programmatically track experiment results
CI/CD for Experiments
Integrate comparison into your deployment pipeline
Custom Evaluations
Build domain-specific quality metrics