Run an experiment
Experiments are the heart of systematic AI improvement. Instead of guessing which prompts or models work better, experiments provide objective, data-driven comparisons that help you optimize performance with confidence.
Effective experimentation goes beyond just testing different prompts. It involves strategic variant design, careful parameter configuration, comprehensive evaluation, and thoughtful analysis that leads to actionable insights for your AI application.
This guide covers the complete experimental workflow from hypothesis formation to results analysis, helping you build a disciplined approach to AI optimization that scales from initial development through production deployment.
Whether you're optimizing response quality, reducing costs, improving consistency, or exploring new capabilities, systematic experimentation provides the evidence you need to make informed decisions about your AI application.
Experimental Design Principles
Good experiments start with clear hypotheses and controlled comparisons. Define what you want to test, what you expect to happen, and how you'll measure success before running your experiment.
Change one variable at a time when possible to isolate the impact of specific modifications. This makes results easier to interpret and insights more actionable for future optimization efforts.
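As a minimal sketch of the one-variable-at-a-time principle, the configuration below (using the same shape as the full examples later in this guide) compares two variants that differ only in the prompt's instruction style; the model, temperature, and dataset stay fixed so any score difference can be attributed to the prompt change.
// Illustrative sketch: two variants that differ in exactly one variable (the prompt wording)
{
  "name": "Instruction Style - Single Variable Test",
  "hypothesis": "A step-by-step instruction improves accuracy on multi-part questions",
  "variants": [
    { "name": "baseline", "prompt": "Answer this question: {{input}}" },
    { "name": "step-by-step", "prompt": "Think through the question step by step, then answer it: {{input}}" }
  ],
  "model": { "provider": "openai", "name": "gpt-4", "temperature": 0.3, "max_tokens": 500 }
}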


Setting Up Your First Experiment
Experiment setup involves selecting datasets, configuring variants, choosing evaluation metrics, and setting execution parameters. Each decision affects the quality and interpretability of your results.
1. Select Dataset: Choose representative test cases that align with your experimental goals and hypotheses.
2. Design Variants: Create prompt or model variations that test specific hypotheses about performance.
3. Configure Evaluation: Select metrics and evaluators that measure the qualities you care about most.
4. Set Execution Parameters: Configure sampling, parallelism, and cost controls for reliable results.
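The skeleton below sketches how these four decisions map onto an experiment configuration. The dataset, variants, and evaluator blocks follow the shapes used in the full examples later in this guide; the execution block and its field names are illustrative assumptions rather than a documented schema.
// Illustrative skeleton: one block per setup step (the "execution" field names are assumptions)
{
  "name": "My First Experiment",
  "dataset": { "id": "my-dataset-id", "sampling": { "strategy": "random", "size": 50 } },  // step 1
  "variants": [
    { "name": "baseline", "prompt": "..." },                                               // step 2
    { "name": "candidate", "prompt": "..." }
  ],
  "evaluators": [ { "name": "factual-accuracy", "type": "built-in" } ],                    // step 3
  "execution": { "parallelism": 5, "max_total_cost_usd": 10 }                              // step 4
}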
Dataset Selection and Sampling
Choose datasets that represent the scenarios you want to optimize for. Different datasets can reveal different aspects of model performance, so select thoughtfully based on your experimental goals.
For quick iteration, start with smaller challenge sets (20-50 examples) that include your most important or challenging cases. For comprehensive evaluation, use larger datasets that provide statistical confidence.
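For example, a quick-iteration run might draw a small random sample, in contrast to the larger stratified sample shown in the next section. The sampling block below mirrors the shape of that example, but treat the "random" strategy as an illustrative assumption.
// Illustrative sketch: small random sample for fast, inexpensive iteration
{
  "dataset": {
    "id": "customer-support-q4-2024",
    "sampling": { "strategy": "random", "size": 30 }
  }
}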
Sampling Strategies
When working with large datasets, strategic sampling can provide reliable insights while controlling costs and execution time.
// Example experiment configuration
{
  "name": "Prompt Optimization - Customer Support",
  "hypothesis": "Adding examples improves response quality for complex queries",
  "dataset": {
    "id": "customer-support-q4-2024",
    "sampling": {
      "strategy": "stratified",
      "size": 100,
      "strata": ["difficulty", "category"],
      "filters": {
        "difficulty": ["medium", "hard"],
        "category": ["billing", "technical"]
      }
    }
  },
  "variants": [
    {
      "name": "baseline",
      "prompt": "You are a helpful customer support agent. Answer this question: {{input}}"
    },
    {
      "name": "with-examples",
      "prompt": "You are a helpful customer support agent. Here are some examples:\n{{examples}}\n\nAnswer this question: {{input}}"
    }
  ],
  "model": {
    "provider": "openai",
    "name": "gpt-4",
    "temperature": 0.3,
    "max_tokens": 500
  }
}


Designing Effective Variants
Variants should test specific hypotheses about what improves performance. Each variant should represent a different approach to solving the same problem, enabling direct comparison of effectiveness.
Start with simple comparisons before testing complex variations. A baseline variant establishes your current performance, while test variants explore specific improvements or alternatives.
Common Variant Types
Different types of variants test different aspects of AI performance and behavior.
Prompt Variants
Test different instruction styles, example inclusion, formatting approaches, or reasoning strategies to optimize prompt effectiveness.
Model Variants
Compare different models (GPT-4 vs Claude vs Llama) to understand performance-cost tradeoffs for your specific use case.
Parameter Variants
Experiment with temperature, top-p, max tokens, or other generation parameters to fine-tune model behavior.
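The sketch below shows one way model and parameter variants might be expressed. The examples earlier in this guide define the model at the top level of the experiment, so per-variant model and temperature overrides are an assumption about the configuration schema, shown purely for illustration.
// Illustrative sketch: model and parameter variants (per-variant model overrides are assumed, not documented)
{
  "variants": [
    { "name": "gpt-4-low-temp",  "model": { "provider": "openai",    "name": "gpt-4",           "temperature": 0.2 } },
    { "name": "gpt-4-high-temp", "model": { "provider": "openai",    "name": "gpt-4",           "temperature": 0.9 } },
    { "name": "claude",          "model": { "provider": "anthropic", "name": "claude-3-sonnet", "temperature": 0.2 } }
  ]
}
Comparing gpt-4-low-temp against gpt-4-high-temp isolates the temperature parameter, while comparing it against claude isolates the choice of model.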
Evaluation Configuration
Choose evaluation metrics that align with your quality goals and business requirements. Different metrics reveal different aspects of performance, so select a comprehensive set that covers your most important quality dimensions.
Combine automatic evaluators (for consistency and scale) with human evaluation (for nuanced judgment) to get a complete picture of quality across different dimensions.
// Example evaluation configuration
{
  "evaluators": [
    {
      "name": "factual-accuracy",
      "type": "built-in",
      "weight": 0.4,
      "config": {
        "reference_field": "expected_output",
        "threshold": 0.8
      }
    },
    {
      "name": "helpfulness",
      "type": "llm-judge",
      "weight": 0.3,
      "config": {
        "judge_model": "gpt-4",
        "criteria": "Rate how helpful this response is for solving the user's problem",
        "scale": "1-10"
      }
    },
    {
      "name": "customer-service-quality",
      "type": "custom",
      "weight": 0.3,
      "config": {
        "evaluator_id": "cs-quality-v2",
        "check_empathy": true,
        "check_solution_focus": true
      }
    }
  ]
}
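To combine automatic scoring with human judgment, a human-review entry could sit alongside the automatic evaluators above. The "human" evaluator type and its fields in this sketch are illustrative assumptions, not a documented schema.
// Illustrative sketch: human review alongside automatic evaluators ("human" type and fields are assumptions)
{
  "name": "human-spot-check",
  "type": "human",
  "config": {
    "sample_rate": 0.1,
    "instructions": "Flag responses that are technically correct but unhelpful in tone",
    "scale": "1-5"
  }
}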
Execution and Monitoring
Experiments execute in parallel with automatic retries and comprehensive tracking of costs, performance, and errors. Monitor progress in real-time and investigate any issues that arise during execution.
Evaligo handles the technical complexity of running experiments at scale while providing visibility into execution progress, cost accumulation, and any errors that occur.

Cost Management
Set cost limits and monitoring to prevent unexpected expenses. Experiments can consume significant API credits, especially when testing multiple variants against large datasets.
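A cost guard might look like the sketch below; the execution block and these field names are assumptions shown for illustration, so check your project settings for the actual controls available.
// Illustrative sketch: cost controls (field names are assumptions)
{
  "execution": {
    "max_total_cost_usd": 25,
    "alert_at_cost_usd": 20,
    "parallelism": 5
  }
}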
Error Handling
Monitor for API errors, timeout issues, and evaluation failures. Evaligo provides automatic retries for transient failures while flagging persistent issues for investigation.
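Retry and timeout behavior might be expressed along these lines; again, the field names are illustrative assumptions rather than a documented schema.
// Illustrative sketch: retry and timeout settings (field names are assumptions)
{
  "execution": {
    "max_retries": 3,
    "retry_backoff": "exponential",
    "request_timeout_seconds": 60
  }
}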


Results Analysis and Interpretation
Effective analysis goes beyond looking at aggregate scores. Drill into specific examples, segment results by metadata categories, and understand why certain variants perform better in different contexts.
Look for patterns in the data that inform future optimization efforts. Which types of inputs benefit most from your changes? Are there edge cases where performance degrades? What do the failure modes tell you?
Quantitative Analysis
Compare aggregate scores, statistical significance, performance distributions, and cost efficiency across variants.
Qualitative Analysis
Review specific examples to understand quality differences, identify failure patterns, and generate hypotheses for future experiments.
Segmented Analysis
Break down results by metadata categories to understand how performance varies across different types of inputs or user scenarios.
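One way to plan a segmented review is to note which variants to compare, which metadata fields to slice by, and how many examples to read per segment. The structure below is purely illustrative and not an Evaligo schema.
// Illustrative sketch: an analysis plan for segmenting results (not an Evaligo schema)
{
  "analysis_plan": {
    "compare": ["baseline", "with-examples"],
    "segment_by": ["category", "difficulty"],
    "metrics": ["factual-accuracy", "helpfulness"],
    "review_samples_per_segment": 5
  }
}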
Iterative Improvement Process
Use experiment results to inform your next optimization cycle. Successful variants become new baselines, while insights from failures guide future experimental directions.
Document your findings and share insights with your team. Experimental knowledge is valuable organizational learning that should be preserved and built upon over time.
Next Steps
With systematic experimentation in place, you can confidently optimize your AI application's performance and build a culture of data-driven decision making.