Combine the power of Evaligo's prompt evaluation system with automated workflows to create a continuous improvement cycle. Evaluate flow outputs, identify issues, and iterate on prompts—all in one platform.

The Continuous Improvement Cycle

1. Build Flow
   ↓ Create workflow with prompts
   ↓ Deploy for production use
   
2. Collect Results
   ↓ Flow processes data
   ↓ Saves outputs to dataset
   
3. Evaluate Quality
   ↓ Run evaluators on outputs
   ↓ Measure success metrics
   
4. Analyze & Improve
   ↓ Identify failure patterns
   ↓ Refine prompts in Playground
   ↓ Update flow with better prompts
   
5. Repeat
   ↓ Continuous monitoring
   ↓ Ongoing optimization

Setting Up Evaluation

Save Flow Outputs

Flow with Dataset Sink:
  Prompt Node
    → out: AI-generated content
  Dataset Sink
    → Save: input, output, metadata
    
Dataset contains:
  - input: Original data
  - output: AI response
  - timestamp: When processed
  - flow_version: Which flow version
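
For concreteness, here is a minimal Python sketch of what one saved record could look like; the save_record helper and the JSONL file are illustrative stand-ins for the Dataset Sink, not part of the Evaligo API.

import json
from datetime import datetime, timezone

def save_record(path, record):
    # Hypothetical sink: append one JSON record per line to a local file,
    # standing in for the Dataset Sink node.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

record = {
    "input": "Original data passed into the flow",
    "output": "AI response produced by the Prompt Node",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "flow_version": "v12",
}
save_record("flow_outputs.jsonl", record)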

Create Evaluators

In the Evaluation section, create evaluators for your flow outputs:

Evaluator: "Summary Quality"
  Input: output (from flow)
  Criteria:
    - Is factually accurate
    - Captures key points
    - Appropriate length
    - Professional tone
  Score: 1-5
  
Evaluator: "Response Relevance"
  Prompt: "Does this response address the input?"
  Input: input, output
  Score: Pass/Fail
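
Below is a minimal Python sketch of the pass/fail relevance evaluator described above; call_llm is a placeholder for whichever model client you use, and the rubric wording simply mirrors the evaluator prompt.

def call_llm(prompt: str) -> str:
    # Placeholder: send the prompt to your model provider and return its text reply.
    raise NotImplementedError

def relevance_evaluator(input_text: str, output_text: str) -> bool:
    # Pass/fail check: does the response address the input?
    prompt = (
        "Does this response address the input? Answer PASS or FAIL.\n\n"
        f"Input:\n{input_text}\n\nResponse:\n{output_text}"
    )
    return call_llm(prompt).strip().upper().startswith("PASS")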

Run Evaluations

1. Select dataset with flow outputs
2. Choose evaluators to run
3. Execute evaluation
4. Review scores and feedback

Tip: Create evaluators that match your quality criteria. Good evaluators catch issues before your users do.

Evaluation Patterns

Batch Evaluation

Run evaluations on flow output datasets:

Flow processes 100 items
  → Saves to "Flow Outputs" dataset
  
Periodic evaluation:
  → Run evaluators on "Flow Outputs"
  → Generate quality report
  → Identify low-scoring items
  → Analyze failure patterns
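
A rough sketch of that periodic step in Python, assuming the outputs were saved to a JSONL file as in the earlier example and that score_item wraps one of your evaluators (both names are illustrative).

import json

def score_item(record: dict) -> float:
    # Illustrative stand-in for running a 1-5 evaluator on one saved output.
    raise NotImplementedError

def batch_report(path: str, threshold: float = 3.0) -> dict:
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    scored = [(r, score_item(r)) for r in records]
    low = [r for r, s in scored if s < threshold]
    return {
        "count": len(scored),
        "average_score": sum(s for _, s in scored) / len(scored),
        "low_scoring": low,  # candidates for failure-pattern analysis
    }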

Inline Evaluation

Add evaluation directly in the flow:

Prompt 1: Generate content
  ↓
Prompt 2: Evaluate quality
  → "Rate this content 1-5"
  ↓
Conditional: If score < 3
  → Regenerate with different approach
  ↓
Dataset Sink: Save only high-quality outputs
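
Expressed as code, the inline pattern looks roughly like this; generate, rate, and the list of fallback approaches are assumptions, not Evaligo node names.

def generate(prompt: str, approach: str) -> str:
    raise NotImplementedError  # Prompt 1: generate content with a given approach

def rate(content: str) -> int:
    raise NotImplementedError  # Prompt 2: "Rate this content 1-5"

def generate_with_check(prompt: str, max_attempts: int = 3) -> str | None:
    approaches = ["default", "more detail", "simpler wording"]
    for attempt in range(max_attempts):
        content = generate(prompt, approaches[attempt % len(approaches)])
        if rate(content) >= 3:
            return content  # only high-quality outputs reach the Dataset Sink
    return None  # every attempt scored below 3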

Sampling Strategy

For high-volume flows:
  Save: Every output
  Evaluate: Random 10% sample
  
Sufficient for:
  - Monitoring trends
  - Detecting quality issues

Benefit: Lower evaluation costs
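
A small sketch of that sampling step, assuming the saved outputs are available as a list of records.

import random

def sample_for_evaluation(records: list[dict], rate: float = 0.10, seed: int = 42) -> list[dict]:
    # Save every output, but evaluate only a reproducible random sample.
    rng = random.Random(seed)
    k = max(1, int(len(records) * rate))
    return rng.sample(records, k)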

Quality Metrics

Track Over Time

Week 1: 3.2/5
Week 2: 3.8/5 (↑ improved prompt)
Week 3: 4.1/5 (↑ refined again)
Week 4: 4.2/5 (✓ stable)

Key Metrics

  • Average Score: Overall quality
  • Pass Rate: % meeting threshold
  • Failure Patterns: Common issues
  • Edge Cases: Unexpected inputs
  • Cost per Quality Point: Efficiency
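
For concreteness, these metrics could be computed from a list of scored runs as sketched below; the field names (score, cost), the pass threshold of 4, and this particular reading of cost per quality point are all assumptions.

def key_metrics(runs: list[dict], pass_threshold: float = 4.0) -> dict:
    scores = [r["score"] for r in runs]
    avg_score = sum(scores) / len(scores)
    avg_cost = sum(r["cost"] for r in runs) / len(runs)
    return {
        "average_score": avg_score,
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        # One reading of "cost per quality point": average cost per run
        # divided by the average score it buys.
        "cost_per_quality_point": avg_cost / avg_score,
    }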

Dashboard View

Website Analyzer: Flow Performance Overview

  Total Executions: 1,234
  Average Quality Score: 4.2/5
  Pass Rate: 87%
  Cost per Run: $0.42

  Top Issues:
    - Timeout on slow sites: 8%
    - Incomplete summaries: 5%
    - Incorrect categorization: 3%

Info: Set quality thresholds for your flows. Alert when scores drop below acceptable levels.

Improvement Workflow

Identify Issues

Evaluation shows:
  - 15% of outputs score < 3/5
  - Common issue: Missing key details
  - Pattern: Short input texts
  
Root cause:
  Prompt doesn't handle brief inputs well

Refine in Playground

1. Export low-scoring examples
2. Load into Playground
3. Test prompt variations
4. Run experiments with evaluators
5. Find better prompt
6. Measure improvement

Update Flow

1. Update prompt in Flow
2. Test with sample data
3. Deploy new version
4. Monitor quality metrics
5. Verify improvement

A/B Testing

Run two flow versions:
  Version A: Original prompt (50% traffic)
  Version B: New prompt (50% traffic)
  
Compare results:
  Version A: 3.8/5 average
  Version B: 4.3/5 average ✓
  
Roll out: Deploy Version B to 100%
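
A sketch of the comparison step, assuming each logged result carries a version tag and a quality score; the even split and field names are assumptions.

import random
from collections import defaultdict

def assign_version() -> str:
    # 50/50 traffic split between the two flow versions.
    return random.choice(["A", "B"])

def compare_versions(results: list[dict]) -> dict:
    # Average quality score per version.
    by_version = defaultdict(list)
    for r in results:
        by_version[r["version"]].append(r["score"])
    return {v: sum(s) / len(s) for v, s in by_version.items()}

A half-point gap over a reasonable sample, as in the numbers above, is a clear win; for smaller gaps, collect more traffic before rolling out.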

Automated Quality Checks

Self-Evaluation Flow

Prompt 1: "Generate product description"
  out: description
  ↓
Prompt 2: "Rate this description 1-5"
  input: description
  out: score, feedback
  ↓
If score >= 4:
  → Dataset Sink (approved)
If score < 4:
  → Dataset Sink (needs review)

Multi-Stage Quality Gates

Stage 1: Generate content
Stage 2: Check grammar/spelling
Stage 3: Verify factual accuracy
Stage 4: Assess tone/style
  
Only outputs passing all gates are saved
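
As a sketch, the gates can be modelled as a list of checks that all have to pass before an output is saved; the individual check functions are placeholders for whatever evaluators or prompts implement each stage.

def check_grammar(text: str) -> bool:
    raise NotImplementedError  # Stage 2: grammar/spelling check

def check_facts(text: str) -> bool:
    raise NotImplementedError  # Stage 3: factual accuracy check

def check_tone(text: str) -> bool:
    raise NotImplementedError  # Stage 4: tone/style check

GATES = [check_grammar, check_facts, check_tone]

def passes_all_gates(text: str) -> bool:
    # Only outputs passing every gate are saved.
    return all(gate(text) for gate in GATES)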

Human-in-the-Loop

Flow outputs
  → Auto-evaluate
  → If score < threshold:
      → Flag for human review
      → Save to review queue
  → If score >= threshold:
      → Auto-approve
      → Save to production dataset

Warning: Don't rely solely on automated evaluation for critical use cases. Combine AI evaluation with periodic human review.
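
The routing logic above could be sketched like this; auto_evaluate, the in-memory queues, and the 4.0 threshold are placeholders, with flagged records feeding the human review the warning recommends.

REVIEW_QUEUE: list[dict] = []
PRODUCTION_DATASET: list[dict] = []

def auto_evaluate(output: str) -> float:
    raise NotImplementedError  # e.g. an LLM-as-judge score from 1 to 5

def route(record: dict, threshold: float = 4.0) -> None:
    score = auto_evaluate(record["output"])
    record["score"] = score
    if score >= threshold:
        PRODUCTION_DATASET.append(record)  # auto-approve
    else:
        REVIEW_QUEUE.append(record)        # flag for human review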

Cost-Quality Tradeoffs

Optimization Strategies

Option 1: Highest Quality
  Model: GPT-4
  Temperature: 0.3
  Evaluation: Comprehensive
  Cost: $1.50 per run
  Quality: 4.5/5
  
Option 2: Balanced
  Model: GPT-4
  Temperature: 0.7
  Evaluation: Key metrics only
  Cost: $0.75 per run
  Quality: 4.2/5
  
Option 3: Cost-Optimized
  Model: GPT-3.5
  Temperature: 0.7
  Evaluation: Sampling
  Cost: $0.15 per run
  Quality: 3.8/5

Finding the Sweet Spot

  • Measure quality vs cost
  • Identify minimum acceptable quality
  • Optimize prompt to reduce tokens
  • Use cheaper models where possible
  • Evaluate only what matters

Real-World Example

Customer Support Automation

Flow: Ticket Response Generator
  Input: Customer question
  Output: Suggested response
  
Initial Performance:
  Quality: 3.5/5
  Approval Rate: 60%
  Issue: Too generic, misses details
  
Improvement Cycle:
  Week 1: Add context to prompt
    → Quality: 3.9/5
  Week 2: Use examples in prompt
    → Quality: 4.2/5
  Week 3: Add validation step
    → Quality: 4.4/5
  Week 4: Refine edge cases
    → Quality: 4.5/5
  
Final Result:
  Approval Rate: 85%
  Response Time: 2s (was 2 hours)
  Cost: $0.08 per ticket
  Customer Satisfaction: +15%

Best Practices

1. Evaluate Early and Often

  • Don't wait until production
  • Test with evaluators during development
  • Catch issues before deployment
  • Iterate quickly

2. Use Multiple Evaluators

  • Accuracy evaluator
  • Completeness evaluator
  • Tone/style evaluator
  • Format validator

3. Monitor Production Quality

Set up alerts:
  - Quality drops below 4.0
  - Pass rate < 80%
  - Error rate > 5%
  - Cost increase > 20%
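
One way to express those alert rules in code; the metric names and the send_alert helper are illustrative, and in practice the alert would go to email, Slack, or a pager rather than stdout.

ALERT_RULES = {
    "average_score": lambda v: v < 4.0,
    "pass_rate": lambda v: v < 0.80,
    "error_rate": lambda v: v > 0.05,
    "cost_increase": lambda v: v > 0.20,  # relative to the previous period
}

def send_alert(metric: str, value: float) -> None:
    print(f"ALERT: {metric} = {value}")  # stand-in for a real notification channel

def check_alerts(metrics: dict) -> None:
    for name, breached in ALERT_RULES.items():
        if name in metrics and breached(metrics[name]):
            send_alert(name, metrics[name])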

4. Version Everything

  • Track prompt versions
  • Link evaluations to flow versions
  • Enable rollback if needed
  • Document changes and improvements

The best AI systems are continuously evaluated and improved. Make evaluation a core part of your workflow, not an afterthought.

Related Documentation

  • Using Prompts in Flows: Integrate tested prompts
  • Custom Evaluations: Build quality checks
  • Using Datasets: Store and evaluate outputs