Run an experiment
Experiments are the heart of systematic AI improvement. Instead of guessing which prompts or models work better, experiments provide objective, data-driven comparisons that help you optimize performance with confidence.
Effective experimentation goes beyond just testing different prompts. It involves strategic variant design, careful parameter configuration, comprehensive evaluation, and thoughtful analysis that leads to actionable insights for your AI application.
This guide covers the complete experimental workflow from hypothesis formation to results analysis, helping you build a disciplined approach to AI optimization that scales from initial development through production deployment.
Whether you're optimizing response quality, reducing costs, improving consistency, or exploring new capabilities, systematic experimentation provides the evidence you need to make informed decisions about your AI application.
Experimental Design Principles
Good experiments start with clear hypotheses and controlled comparisons. Define what you want to test, what you expect to happen, and how you'll measure success before running your experiment.
Change one variable at a time when possible to isolate the impact of specific modifications. This makes results easier to interpret and insights more actionable for future optimization efforts.
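As a minimal sketch of the one-variable-at-a-time principle, the configuration below (using the same shape as the full examples later in this guide) compares two variants that differ only in the prompt's instruction style; the model, temperature, and dataset stay fixed so any score difference can be attributed to the prompt change.
// Illustrative sketch: two variants that differ in exactly one variable (the prompt wording)
{
  "name": "Instruction Style - Single Variable Test",
  "hypothesis": "A step-by-step instruction improves accuracy on multi-part questions",
  "variants": [
    { "name": "baseline", "prompt": "Answer this question: {{input}}" },
    { "name": "step-by-step", "prompt": "Think through the question step by step, then answer it: {{input}}" }
  ],
  "model": { "provider": "openai", "name": "gpt-4", "temperature": 0.3, "max_tokens": 500 }
}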


Setting Up Your First Experiment
Experiment setup involves selecting datasets, configuring variants, choosing evaluation metrics, and setting execution parameters. Each decision affects the quality and interpretability of your results.
1. Select Dataset: Choose representative test cases that align with your experimental goals and hypotheses.
2. Design Variants: Create prompt or model variations that test specific hypotheses about performance.
3. Configure Evaluation: Select metrics and evaluators that measure the qualities you care about most.
4. Set Execution Parameters: Configure sampling, parallelism, and cost controls for reliable results.
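The skeleton below sketches how these four decisions map onto an experiment configuration. The dataset, variants, and evaluator blocks follow the shapes used in the full examples later in this guide; the execution block and its field names are illustrative assumptions rather than a documented schema.
// Illustrative skeleton: one block per setup step (the "execution" field names are assumptions)
{
  "name": "My First Experiment",
  "dataset": { "id": "my-dataset-id", "sampling": { "strategy": "random", "size": 50 } },  // step 1
  "variants": [
    { "name": "baseline", "prompt": "..." },                                               // step 2
    { "name": "candidate", "prompt": "..." }
  ],
  "evaluators": [ { "name": "factual-accuracy", "type": "built-in" } ],                    // step 3
  "execution": { "parallelism": 5, "max_total_cost_usd": 10 }                              // step 4
}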
Dataset Selection and Sampling
Choose datasets that represent the scenarios you want to optimize for. Different datasets can reveal different aspects of model performance, so select thoughtfully based on your experimental goals.
For quick iteration, start with smaller challenge sets (20-50 examples) that include your most important or challenging cases. For comprehensive evaluation, use larger datasets that provide statistical confidence.
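For example, a quick-iteration run might draw a small random sample, in contrast to the larger stratified sample shown in the next section. The sampling block below mirrors the shape of that example, but treat the "random" strategy as an illustrative assumption.
// Illustrative sketch: small random sample for fast, inexpensive iteration
{
  "dataset": {
    "id": "customer-support-q4-2024",
    "sampling": { "strategy": "random", "size": 30 }
  }
}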
Sampling Strategies
When working with large datasets, strategic sampling can provide reliable insights while controlling costs and execution time.
// Example experiment configuration
{
  "name": "Prompt Optimization - Customer Support",
  "hypothesis": "Adding examples improves response quality for complex queries",
  "dataset": {
    "id": "customer-support-q4-2024",
    "sampling": {
      "strategy": "stratified",
      "size": 100,
      "strata": ["difficulty", "category"],
      "filters": {
        "difficulty": ["medium", "hard"],
        "category": ["billing", "technical"]
      }
    }
  },
  "variants": [
    {
      "name": "baseline",
      "prompt": "You are a helpful customer support agent. Answer this question: {{input}}"
    },
    {
      "name": "with-examples",
      "prompt": "You are a helpful customer support agent. Here are some examples:\n{{examples}}\n\nAnswer this question: {{input}}"
    }
  ],
  "model": {
    "provider": "openai",
    "name": "gpt-4",
    "temperature": 0.3,
    "max_tokens": 500
  }
}


Designing Effective Variants
Variants should test specific hypotheses about what improves performance. Each variant should represent a different approach to solving the same problem, enabling direct comparison of effectiveness.
Start with simple comparisons before testing complex variations. A baseline variant establishes your current performance, while test variants explore specific improvements or alternatives.
Common Variant Types
Different types of variants test different aspects of AI performance and behavior.
Prompt Variants
Test different instruction styles, example inclusion, formatting approaches, or reasoning strategies to optimize prompt effectiveness.
Model Variants
Compare different models (GPT-4 vs Claude vs Llama) to understand performance-cost tradeoffs for your specific use case.
Parameter Variants
Experiment with temperature, top-p, max tokens, or other generation parameters to fine-tune model behavior.
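The sketch below shows one way model and parameter variants might be expressed. The examples earlier in this guide define the model at the top level of the experiment, so per-variant model and temperature overrides are an assumption about the configuration schema, shown purely for illustration.
// Illustrative sketch: model and parameter variants (per-variant model overrides are assumed, not documented)
{
  "variants": [
    { "name": "gpt-4-low-temp",  "model": { "provider": "openai",    "name": "gpt-4",           "temperature": 0.2 } },
    { "name": "gpt-4-high-temp", "model": { "provider": "openai",    "name": "gpt-4",           "temperature": 0.9 } },
    { "name": "claude",          "model": { "provider": "anthropic", "name": "claude-3-sonnet", "temperature": 0.2 } }
  ]
}
Comparing gpt-4-low-temp against gpt-4-high-temp isolates the temperature parameter, while comparing it against claude isolates the choice of model.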
Evaluation Configuration
Choose evaluation metrics that align with your quality goals and business requirements. Different metrics reveal different aspects of performance, so select a comprehensive set that covers your most important quality dimensions.
Combine automatic evaluators (for consistency and scale) with human evaluation (for nuanced judgment) to get a complete picture of quality across different dimensions.
// Example evaluation configuration
{
  "evaluators": [
    {
      "name": "factual-accuracy",
      "type": "built-in",
      "weight": 0.4,
      "config": {
        "reference_field": "expected_output",
        "threshold": 0.8
      }
    },
    {
      "name": "helpfulness",
      "type": "llm-judge",
      "weight": 0.3,
      "config": {
        "judge_model": "gpt-4",
        "criteria": "Rate how helpful this response is for solving the user's problem",
        "scale": "1-10"
      }
    },
    {
      "name": "customer-service-quality",
      "type": "custom",
      "weight": 0.3,
      "config": {
        "evaluator_id": "cs-quality-v2",
        "check_empathy": true,
        "check_solution_focus": true
      }
    }
  ]
}
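To combine automatic scoring with human judgment, a human-review entry could sit alongside the automatic evaluators above. The "human" evaluator type and its fields in this sketch are illustrative assumptions, not a documented schema.
// Illustrative sketch: human review alongside automatic evaluators ("human" type and fields are assumptions)
{
  "name": "human-spot-check",
  "type": "human",
  "config": {
    "sample_rate": 0.1,
    "instructions": "Flag responses that are technically correct but unhelpful in tone",
    "scale": "1-5"
  }
}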
Execution and Monitoring
Experiments execute in parallel with automatic retries and comprehensive tracking of costs, performance, and errors. Monitor progress in real-time and investigate any issues that arise during execution.
Evaligo handles the technical complexity of running experiments at scale while providing visibility into execution progress, cost accumulation, and any errors that occur.

Cost Management
Set cost limits and monitoring to prevent unexpected expenses. Experiments can consume significant API credits, especially when testing multiple variants against large datasets.
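A cost guard might look like the sketch below; the execution block and these field names are assumptions shown for illustration, so check your project settings for the actual controls available.
// Illustrative sketch: cost controls (field names are assumptions)
{
  "execution": {
    "max_total_cost_usd": 25,
    "alert_at_cost_usd": 20,
    "parallelism": 5
  }
}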
Error Handling
Monitor for API errors, timeout issues, and evaluation failures. Evaligo provides automatic retries for transient failures while flagging persistent issues for investigation.
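Retry and timeout behavior might be expressed along these lines; again, the field names are illustrative assumptions rather than a documented schema.
// Illustrative sketch: retry and timeout settings (field names are assumptions)
{
  "execution": {
    "max_retries": 3,
    "retry_backoff": "exponential",
    "request_timeout_seconds": 60
  }
}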


Results Analysis and Interpretation
Effective analysis goes beyond looking at aggregate scores. Drill into specific examples, segment results by metadata categories, and understand why certain variants perform better in different contexts.
Look for patterns in the data that inform future optimization efforts. Which types of inputs benefit most from your changes? Are there edge cases where performance degrades? What do the failure modes tell you?
Quantitative Analysis
Compare aggregate scores, statistical significance, performance distributions, and cost efficiency across variants.
Qualitative Analysis
Review specific examples to understand quality differences, identify failure patterns, and generate hypotheses for future experiments.
Segmented Analysis
Break down results by metadata categories to understand how performance varies across different types of inputs or user scenarios.
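One way to plan a segmented review is to note which variants to compare, which metadata fields to slice by, and how many examples to read per segment. The structure below is purely illustrative and not an Evaligo schema.
// Illustrative sketch: an analysis plan for segmenting results (not an Evaligo schema)
{
  "analysis_plan": {
    "compare": ["baseline", "with-examples"],
    "segment_by": ["category", "difficulty"],
    "metrics": ["factual-accuracy", "helpfulness"],
    "review_samples_per_segment": 5
  }
}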
Iterative Improvement Process
Use experiment results to inform your next optimization cycle. Successful variants become new baselines, while insights from failures guide future experimental directions.
Document your findings and share insights with your team. Experimental knowledge is valuable organizational learning that should be preserved and built upon over time.
Next Steps
With systematic experimentation in place, you can confidently optimize your AI application's performance and build a culture of data-driven decision making.