Accelerate prompt development by comparing multiple variants side-by-side in real-time. See exactly how different approaches perform on the same inputs, making it easy to identify the best solutions quickly and objectively.
Side-by-side comparison eliminates the guesswork from prompt optimization by providing immediate, visual feedback on how different approaches perform. This parallel testing approach is especially valuable during rapid iteration phases where small changes can have significant impacts.
The synchronized comparison environment ensures fair testing by maintaining identical conditions across all variants. This controlled approach helps teams make confident decisions based on direct performance comparisons rather than sequential testing that can be influenced by changing conditions.

Setting Up Comparisons
Configure comparison sessions by selecting variants to test and defining the inputs you want to evaluate. The interface provides flexible options for testing different aspects of your prompts while maintaining experimental rigor.
1. Select prompt variants: Choose 2-6 variants to compare from existing saved views, or create new ones on the fly.
2. Configure test inputs: Define test cases manually or use existing datasets to ensure comprehensive coverage of your use case.
3. Lock shared parameters: Synchronize model settings, temperature, and other parameters across variants to ensure a fair comparison.
4. Set evaluation criteria: Choose evaluators and metrics that will highlight the differences that matter for your use case (see the sketch below).
Variant Organization: Label your variants clearly (e.g., "baseline", "with-examples", "formal-tone") to make comparison results easier to interpret and share with team members.
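The snippet below sketches steps 3 and 4 in code, using the same illustrative `client.playground.create_comparison` call shown later on this page. The `evaluators` argument and the evaluator names are assumptions for illustration only; adjust them to the evaluators available in your workspace.

# A minimal setup sketch, assuming the illustrative client object used elsewhere on this page
comparison = client.playground.create_comparison(
    name="tone-variant-setup",
    variants=[
        {"name": "baseline", "prompt": "You are a customer support agent..."},
        {"name": "with-examples", "prompt": "You are a customer support agent... Example: ..."},
    ],
    # Step 3: lock model settings so every variant runs under identical conditions
    shared_parameters={
        "model": "gpt-4-turbo",
        "temperature": 0.3,
        "max_tokens": 300,
    },
    # Step 4: attach the evaluators whose scores should appear in the comparison view
    # (the `evaluators` argument is an assumed parameter for illustration)
    evaluators=["accuracy", "helpfulness", "appropriateness"],
)
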

Real-time Analysis
Observe outputs as they are generated in real time, so you can spot differences immediately and adjust your testing approach. Real-time feedback enables rapid iteration and helps you focus on the most promising directions.
Synchronized execution ensures that all variants process the same inputs under identical conditions, eliminating variables that could skew results. Running variants in parallel rather than sequentially keeps the comparison as controlled as possible.

# Set up a side-by-side comparison programmatically
comparison = client.playground.create_comparison(
    name="customer-support-tone-variants",
    variants=[
        {
            "name": "formal-professional",
            "prompt": "You are a professional customer service representative...",
            "temperature": 0.3,
            "model": "gpt-4-turbo"
        },
        {
            "name": "friendly-casual",
            "prompt": "You're a helpful and friendly support agent...",
            "temperature": 0.5,
            "model": "gpt-4-turbo"
        },
        {
            "name": "empathy-focused",
            "prompt": "You are an empathetic customer support specialist...",
            "temperature": 0.4,
            "model": "gpt-4-turbo"
        }
    ],
    shared_parameters={
        "max_tokens": 300,
        "top_p": 0.9,
        "frequency_penalty": 0.1
    },
    test_inputs=[
        {"customer_query": "I've been charged twice for my subscription"},
        {"customer_query": "My order hasn't arrived and I need it urgently"},
        {"customer_query": "I want to cancel my account immediately"}
    ]
)

# Run the comparison
results = comparison.execute()

# Analyze differences
for input_case in results.inputs:
    print(f"Input: {input_case.query}")
    for variant in input_case.outputs:
        print(f"  {variant.name}: {variant.output[:100]}...")
        print(f"  Quality scores: {variant.evaluation_scores}")
    print("---")

Difference Highlighting
Automatic highlighting identifies key differences between variant outputs, helping you quickly understand how changes in prompts affect response quality, style, and content. This visual analysis accelerates the identification of optimal approaches.
Intelligent difference detection goes beyond simple text comparison to highlight semantic differences, tone variations, and factual discrepancies. This analysis helps teams understand not just what changed, but how those changes impact user experience.
1. Content differences: Highlight variations in factual content, recommendations, and specific details across responses.
2. Style variations: Identify differences in tone, formality, length, and communication approach between variants.
3. Quality indicators: Visual cues show which responses score higher on key metrics like accuracy, helpfulness, and appropriateness.
4. Pattern recognition: Identify systematic differences that emerge across multiple test cases for deeper insights.
Evaluation Integration: Automatic evaluators run in the background during comparison, providing real-time quality scores that complement your qualitative analysis.
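For a quick look at surface-level differences outside the playground UI, a plain text diff can approximate the content-difference view. The sketch below uses Python's standard difflib on the illustrative results object from the example above; it captures only wording changes, not the semantic, tone, or factual analysis described here.

import difflib

# Compare the first two variants' answers to the first test input
# (attribute names follow the illustrative results object above)
case = results.inputs[0]
first = case.outputs[0].output
second = case.outputs[1].output

# Word-level diff: '-' marks text only in the first variant,
# '+' marks text only in the second
diff = difflib.unified_diff(
    first.split(),
    second.split(),
    fromfile=case.outputs[0].name,
    tofile=case.outputs[1].name,
    lineterm="",
)
print("\n".join(diff))

# A similarity ratio (0-1) gives a rough sense of how far two variants diverge
ratio = difflib.SequenceMatcher(None, first, second).ratio()
print(f"Surface similarity: {ratio:.2f}")
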
Statistical Analysis
Beyond visual comparison, the platform provides statistical analysis of performance differences to help you make data-driven decisions about which variants perform best across your test cases.

# Analyze statistical significance of variant differences
analysis = comparison.analyze_statistical_significance(
    metrics=["accuracy", "helpfulness", "appropriateness"],
    confidence_level=0.95
)

# Review results
for metric in analysis.metrics:
    print(f"\n{metric.name} Analysis:")
    print(f"Best performer: {metric.best_variant} (score: {metric.best_score:.3f})")

    # Check if differences are statistically significant
    for comparison_pair in metric.pairwise_comparisons:
        if comparison_pair.p_value < 0.05:
            print(f"  {comparison_pair.variant_a} vs {comparison_pair.variant_b}:")
            print(f"    Difference: {comparison_pair.effect_size:.3f}")
            print(f"    Confidence: {comparison_pair.confidence_interval}")
            print(f"    Significance: p = {comparison_pair.p_value:.4f}")

# Generate recommendation
recommendation = analysis.get_recommendation(
    criteria={
        "accuracy": 0.4,         # 40% weight
        "helpfulness": 0.35,     # 35% weight
        "appropriateness": 0.25  # 25% weight
    }
)

print(f"\nRecommendation: Use '{recommendation.variant}' variant")
print(f"Confidence: {recommendation.confidence:.1%}")
print(f"Reasoning: {recommendation.reasoning}")

Export and Sharing
Export comparison results in multiple formats for stakeholder review, documentation, or integration into experiment tracking systems. Share insights effectively with both technical and non-technical team members.
Data Sensitivity: When sharing comparison results, be mindful of any sensitive data in your test inputs and consider anonymizing or filtering before external sharing.
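If you need results outside the platform, the structure of the illustrative results object above can be flattened into a CSV for stakeholder review. The helper below is a local sketch, not a platform API; it replaces raw customer queries with a case index, in line with the data-sensitivity note above.

import csv

# Flatten comparison results into rows for a shareable CSV.
# Attribute names follow the illustrative results object shown earlier.
with open("comparison_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["case_id", "variant", "output_preview", "scores"])
    for case_id, input_case in enumerate(results.inputs):
        for variant in input_case.outputs:
            writer.writerow([
                case_id,                  # index instead of the raw query
                variant.name,
                variant.output[:200],     # truncated preview of the response
                variant.evaluation_scores # written as-is; format as needed
            ])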

Integration with Experiments
Seamlessly transition winning variants from playground comparison to formal experiments for comprehensive evaluation on larger datasets. This workflow ensures that promising directions discovered in the playground receive thorough validation.
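One way this handoff could look in code, assuming a hypothetical client.experiments API that is not documented on this page: take the recommended variant from the statistical analysis above and run it against a larger dataset.

# Promote the recommended variant to a formal experiment.
# `comparison.variants`, `client.experiments.create`, and the dataset name are
# assumptions for illustration; `recommendation` comes from the analysis above.
winning = next(
    v for v in comparison.variants if v["name"] == recommendation.variant
)

experiment = client.experiments.create(
    name="customer-support-tone-validation",
    prompt=winning["prompt"],
    model=winning["model"],
    dataset="customer-support-queries-full",  # larger held-out dataset (assumed name)
    evaluators=["accuracy", "helpfulness", "appropriateness"],
)
experiment.run()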
