Accelerate prompt development by comparing multiple variants side by side in real time. See exactly how different approaches perform on the same inputs, making it easy to identify the best solutions quickly and objectively.

Side-by-side comparison eliminates the guesswork from prompt optimization by providing immediate, visual feedback on how different approaches perform. This parallel testing approach is especially valuable during rapid iteration phases where small changes can have significant impacts.

The synchronized comparison environment ensures fair testing by maintaining identical conditions across all variants. This controlled approach helps teams make confident decisions based on direct performance comparisons rather than sequential testing that can be influenced by changing conditions.

Side-by-side comparison interface showing three prompt variants with synchronized inputs and real-time output comparison

Setting Up Comparisons

Configure comparison sessions by selecting variants to test and defining the inputs you want to evaluate. The interface provides flexible options for testing different aspects of your prompts while maintaining experimental rigor.

  1. Select prompt variants: Choose 2-6 variants to compare, drawing on existing saved views or creating new ones on the fly.

  2. Configure test inputs: Define the test cases or use existing datasets to ensure comprehensive coverage of your use case.

  3. Lock shared parameters: Synchronize model settings, temperature, and other parameters across variants to ensure a fair comparison.

  4. Set evaluation criteria: Choose the evaluators and metrics that will highlight the differences that matter for your use case (see the sketch after the note below).

Info

Variant Organization: Label your variants clearly (e.g., "baseline", "with-examples", "formal-tone") to make comparison results easier to interpret and share with team members.
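
If you prefer to configure these steps in code, the sketch below shows one way step 4 might look. It builds on the client object used throughout this guide; the add_evaluators method and the evaluator definitions are illustrative assumptions rather than confirmed API.

Attaching evaluation criteria
# A minimal sketch of step 4. The add_evaluators method and the
# evaluator definitions below are illustrative assumptions, not
# confirmed parts of the API.
comparison = client.playground.create_comparison(
    name="tone-variants-draft",
    variants=[
        {"name": "baseline", "prompt": "You are a helpful support agent..."},
        {"name": "with-examples", "prompt": "You are a helpful support agent. Here are two examples..."}
    ]
)

# Hypothetical call: register the metrics that should score each output
comparison.add_evaluators([
    {"name": "accuracy", "type": "llm-judge"},
    {"name": "helpfulness", "type": "llm-judge"},
    {"name": "appropriateness", "type": "rubric"}
])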

Video

Setting Up Effective Comparisons
Learn how to configure side-by-side comparisons for maximum insight, including variant selection and parameter synchronization.
4m 45s

Real-time Analysis

Observe outputs as they are generated in real time, allowing you to spot differences immediately and adjust your testing approach. Real-time feedback enables rapid iteration and helps you focus on the most promising directions.

The synchronized execution ensures that all variants process the same inputs under identical conditions, eliminating variables that could otherwise skew results and keeping the comparison like-for-like.

Real-time output comparison showing responses generating simultaneously with difference highlighting and quality indicators
Programmatic comparison setup
# Set up a side-by-side comparison programmatically
comparison = client.playground.create_comparison(
    name="customer-support-tone-variants",
    variants=[
        {
            "name": "formal-professional",
            "prompt": "You are a professional customer service representative...",
            "temperature": 0.3,
            "model": "gpt-4-turbo"
        },
        {
            "name": "friendly-casual", 
            "prompt": "You're a helpful and friendly support agent...",
            "temperature": 0.5,
            "model": "gpt-4-turbo"
        },
        {
            "name": "empathy-focused",
            "prompt": "You are an empathetic customer support specialist...",
            "temperature": 0.4,
            "model": "gpt-4-turbo"
        }
    ],
    shared_parameters={
        "max_tokens": 300,
        "top_p": 0.9,
        "frequency_penalty": 0.1
    },
    test_inputs=[
        {"customer_query": "I've been charged twice for my subscription"},
        {"customer_query": "My order hasn't arrived and I need it urgently"},
        {"customer_query": "I want to cancel my account immediately"}
    ]
)

# Run the comparison
results = comparison.execute()

# Analyze differences
for input_case in results.inputs:
    print(f"Input: {input_case.query}")
    for variant in input_case.outputs:
        print(f"  {variant.name}: {variant.output[:100]}...")
        print(f"  Quality scores: {variant.evaluation_scores}")
    print("---")

Difference Highlighting

Automatic highlighting identifies key differences between variant outputs, helping you quickly understand how changes in prompts affect response quality, style, and content. This visual analysis accelerates the identification of optimal approaches.

Intelligent difference detection goes beyond simple text comparison to highlight semantic differences, tone variations, and factual discrepancies. This analysis helps teams understand not just what changed, but how those changes impact user experience.

  1. Content differences: Highlight variations in factual content, recommendations, and specific details across responses.

  2. Style variations: Identify differences in tone, formality, length, and communication approach between variants.

  3. Quality indicators: Visual cues show which responses score higher on key metrics like accuracy, helpfulness, and appropriateness.

  4. Pattern recognition: Identify systematic differences that emerge across multiple test cases for deeper insights.
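
The snippet below sketches how these highlights might be pulled out programmatically from the results object created earlier; the get_differences method and its fields are assumed names used for illustration, not confirmed API.

Inspecting highlighted differences
# Sketch: get_differences() and the fields below are assumed names,
# shown to illustrate the difference categories described above.
differences = results.get_differences(kinds=["content", "style", "quality"])

for diff in differences:
    print(f"{diff.variant_a} vs {diff.variant_b} (input {diff.input_index})")
    print(f"  Kind: {diff.kind}")        # content, style, or quality
    print(f"  Summary: {diff.summary}")  # short description of the difference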

Info

Evaluation Integration: Automatic evaluators run in the background during comparison, providing real-time quality scores that complement your qualitative analysis.

Statistical Analysis

Beyond visual comparison, the platform provides statistical analysis of performance differences to help you make data-driven decisions about which variants perform best across your test cases.

Statistical analysis dashboard showing performance metrics, confidence intervals, and significance testing across variants
Statistical comparison analysis
# Analyze statistical significance of variant differences
analysis = comparison.analyze_statistical_significance(
    metrics=["accuracy", "helpfulness", "appropriateness"],
    confidence_level=0.95
)

# Review results
for metric in analysis.metrics:
    print(f"\n{metric.name} Analysis:")
    print(f"Best performer: {metric.best_variant} (score: {metric.best_score:.3f})")
    
    # Check if differences are statistically significant
    for comparison_pair in metric.pairwise_comparisons:
        if comparison_pair.p_value < 0.05:
            print(f"  {comparison_pair.variant_a} vs {comparison_pair.variant_b}:")
            print(f"    Difference: {comparison_pair.effect_size:.3f}")
            print(f"    Confidence: {comparison_pair.confidence_interval}")
            print(f"    Significance: p = {comparison_pair.p_value:.4f}")

# Generate recommendation
recommendation = analysis.get_recommendation(
    criteria={
        "accuracy": 0.4,  # 40% weight
        "helpfulness": 0.35,  # 35% weight  
        "appropriateness": 0.25  # 25% weight
    }
)

print(f"\nRecommendation: Use '{recommendation.variant}' variant")
print(f"Confidence: {recommendation.confidence:.1%}")
print(f"Reasoning: {recommendation.reasoning}")

Export and Sharing

Export comparison results in multiple formats for stakeholder review, documentation, or integration into experiment tracking systems. Share insights effectively with both technical and non-technical team members.

Info

Data Sensitivity: When sharing comparison results, be mindful of any sensitive data in your test inputs and consider anonymizing or filtering before external sharing.
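
As a rough sketch of the export workflow, the example below writes a comparison report to disk and masks a sensitive input field before sharing; the export method, its parameters, and the format names are assumptions for illustration, not confirmed API.

Exporting comparison results
# Sketch only: export() and its parameters are assumed names used to
# illustrate the workflow, not a confirmed API.
report = comparison.export(
    format="csv",                          # assumed formats: csv, json, pdf
    include_outputs=True,
    anonymize_fields=["customer_query"]    # mask sensitive test inputs before sharing
)

# Write the exported report to a local file for distribution
with open("tone-variant-comparison.csv", "wb") as f:
    f.write(report.content)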

Video

Sharing Comparison Insights
Best practices for documenting and sharing side-by-side comparison results with different stakeholder groups.
3m 20s

Integration with Experiments

Seamlessly transition winning variants from playground comparison to formal experiments for comprehensive evaluation on larger datasets. This workflow ensures that promising directions discovered in the playground receive thorough validation.

Integration workflow showing how playground comparisons can be promoted to formal experiments with expanded datasets
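
As an illustrative sketch of that hand-off, the example below promotes the recommended variant from the playground comparison into an experiment over a larger dataset; the promote_to_experiment method and the dataset name are assumptions, not confirmed API.

Promoting a winning variant to an experiment
# Sketch: promote_to_experiment() and the dataset name below are
# assumed for illustration; they are not confirmed parts of the API.
experiment = comparison.promote_to_experiment(
    variant=recommendation.variant,          # winner from the statistical analysis above
    dataset="customer-support-full-eval",    # assumed larger evaluation dataset
    evaluators=["accuracy", "helpfulness", "appropriateness"]
)

# Run the full experiment and report its results for tracking
experiment_results = experiment.run()
print(f"Experiment complete: {experiment_results.summary}")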

Related Documentation

Save and Manage Views: Preserve successful variants as reusable configurations
Optimize Prompts: Use guided optimization to improve your best variants
Run Experiments: Scale up successful variants to comprehensive experiments
Compare Results: Formal experiment comparison and analysis workflows