Combine the power of Evaligo's prompt evaluation system with automated workflows to create a continuous improvement cycle. Evaluate flow outputs, identify issues, and iterate on prompts—all in one platform.
The Continuous Improvement Cycle
1. Build Flow
↓ Create workflow with prompts
↓ Deploy for production use
2. Collect Results
↓ Flow processes data
↓ Saves outputs to dataset
3. Evaluate Quality
↓ Run evaluators on outputs
↓ Measure success metrics
4. Analyze & Improve
↓ Identify failure patterns
↓ Refine prompts in Playground
↓ Update flow with better prompts
5. Repeat
↓ Continuous monitoring
↓ Ongoing optimization
Setting Up Evaluation
Save Flow Outputs
Flow with Dataset Sink:
Prompt Node
→ out: AI-generated content
Dataset Sink
→ Save: input, output, metadata
Dataset contains:
- input: Original data
- output: AI response
- timestamp: When processed
- flow_version: Which flow version
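If you assemble records yourself instead of wiring up a Dataset Sink node, the record shape is the same. A minimal Python sketch, assuming a plain list stands in for the dataset and the write call is whatever client or sink your flow actually uses (hypothetical, not Evaligo's SDK):

```python
from datetime import datetime, timezone

# Local stand-in for the "Flow Outputs" dataset; a real flow writes through
# a Dataset Sink or the platform's client instead.
flow_outputs = []

def save_flow_output(dataset, input_text, output_text, flow_version):
    record = {
        "input": input_text,                                  # original data
        "output": output_text,                                # AI response
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when processed
        "flow_version": flow_version,                         # which flow version
    }
    dataset.append(record)
    return record

save_flow_output(flow_outputs, "https://example.com", "Summary of the page...", "v3")
```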
Create Evaluators
In the Evaluation section, create evaluators for your flow outputs:
Evaluator: "Summary Quality"
Input: output (from flow)
Criteria:
- Is factually accurate
- Captures key points
- Appropriate length
- Professional tone
Score: 1-5
Evaluator: "Response Relevance"
Prompt: "Does this response address the input?"
Input: input, output
Score: Pass/Fail
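An evaluator definition boils down to a name, the fields it reads, its criteria or prompt, and a score type. A sketch of the two evaluators above as plain data; the field names are illustrative, not Evaligo's actual schema:

```python
# Illustrative representation of the evaluators described above.
evaluators = [
    {
        "name": "Summary Quality",
        "inputs": ["output"],
        "criteria": [
            "Is factually accurate",
            "Captures key points",
            "Appropriate length",
            "Professional tone",
        ],
        "score_type": "scale_1_to_5",
    },
    {
        "name": "Response Relevance",
        "inputs": ["input", "output"],
        "prompt": "Does this response address the input?",
        "score_type": "pass_fail",
    },
]
```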
Run Evaluations
1. Select dataset with flow outputs
2. Choose evaluators to run
3. Execute evaluation
4. Review scores and feedback
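Conceptually, an evaluation run applies each evaluator to each record and collects scores and feedback. A self-contained local sketch with a stubbed scorer (in practice the score would come from an LLM-based evaluator, not a word count):

```python
dataset = [
    {"input": "Customer asks about refunds", "output": "You can request a refund within 30 days of purchase."},
    {"input": "Password reset request", "output": "Use the 'Forgot password' link on the login page."},
]
evaluators = ["Summary Quality", "Response Relevance"]

def run_evaluator(evaluator_name, record):
    """Stub scorer: rates 1-5 by output length. Swap in a real evaluator call."""
    return min(5, max(1, len(record["output"].split()) // 3))

def run_evaluation(dataset, evaluators):
    results = []
    for record in dataset:
        for name in evaluators:
            results.append({"input": record["input"], "evaluator": name,
                            "score": run_evaluator(name, record)})
    return results

for row in run_evaluation(dataset, evaluators):
    print(row)
```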
Tip
Create evaluators that match your quality criteria. Good evaluators catch issues before your users do.
Evaluation Patterns
Batch Evaluation
Run evaluations on flow output datasets:
Flow processes 100 items
→ Saves to "Flow Outputs" dataset
Periodic evaluation:
→ Run evaluators on "Flow Outputs"
→ Generate quality report
→ Identify low-scoring items
→ Analyze failure patterns
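The report step of a periodic run is simple aggregation: average the scores and surface the low-scoring items for pattern analysis. A sketch assuming evaluation results shaped like the rows above, with an arbitrary threshold of 3:

```python
def quality_report(results, low_threshold=3):
    scores = [r["score"] for r in results]
    low = [r for r in results if r["score"] < low_threshold]
    return {
        "count": len(scores),
        "average_score": round(sum(scores) / len(scores), 2),
        "low_scoring_rate": round(len(low) / len(scores), 2),
        "low_scoring_items": low,   # feed these into failure-pattern analysis
    }

results = [
    {"input": "long article ...", "score": 4},
    {"input": "tweet-length note", "score": 2},
    {"input": "product page ...", "score": 5},
]
print(quality_report(results))
```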
Inline Evaluation
Add evaluation directly in the flow:
Prompt 1: Generate content
↓
Prompt 2: Evaluate quality
→ "Rate this content 1-5"
↓
Conditional: If score < 3
→ Regenerate with different approach
↓
Dataset Sink: Save only high-quality outputs
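The same gate expressed in code: generate, score, and retry with a different attempt when the score falls below 3, saving only outputs that clear the bar. `generate` and `rate` are stand-ins for the two prompt nodes above, not real model calls:

```python
import random

def generate(text, attempt):
    """Stand-in for Prompt 1; a real flow would call the model here."""
    return f"Draft {attempt} summarizing: {text}"

def rate(content):
    """Stand-in for Prompt 2 ('Rate this content 1-5')."""
    return random.randint(1, 5)

def run_with_quality_gate(text, min_score=3, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        content = generate(text, attempt)
        score = rate(content)
        if score >= min_score:
            return {"output": content, "score": score}  # save: high quality only
    return None  # nothing cleared the bar; nothing is saved

print(run_with_quality_gate("Quarterly results announcement"))
```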
Sampling Strategy
For high-volume flows:
Save: Every output
Evaluate: Random 10% sample
Sufficient for:
- Monitoring trends
- Detecting quality issues
- Lower evaluation costs
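Save everything, evaluate a random slice. A sketch of selecting a 10% sample, assuming saved records are plain dicts:

```python
import random

def sample_for_evaluation(dataset, fraction=0.10, seed=None):
    """Pick a random fraction of saved outputs to evaluate (at least one item)."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

dataset = [{"id": i, "output": f"response {i}"} for i in range(100)]
to_evaluate = sample_for_evaluation(dataset, fraction=0.10, seed=42)
print(len(to_evaluate), "of", len(dataset), "items selected for evaluation")
```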
Quality Metrics
Track Over Time
Week 1: 3.2/5
Week 2: 3.8/5 (↑ improved prompt)
Week 3: 4.1/5 (↑ refined again)
Week 4: 4.2/5 (✓ stable)
Key Metrics
- Average Score: Overall quality
- Pass Rate: % meeting threshold
- Failure Patterns: Common issues
- Edge Cases: Unexpected inputs
- Cost per Quality Point: Efficiency
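These metrics fall out of raw evaluation results with a few lines of aggregation. A sketch; the pass threshold and cost figure are illustrative:

```python
def key_metrics(results, pass_threshold=4, total_cost=None):
    scores = [r["score"] for r in results]
    metrics = {
        "average_score": round(sum(scores) / len(scores), 2),
        "pass_rate": round(sum(s >= pass_threshold for s in scores) / len(scores), 2),
    }
    if total_cost is not None:
        # cost per quality point: total spend divided by total quality delivered
        metrics["cost_per_quality_point"] = round(total_cost / sum(scores), 4)
    return metrics

results = [{"score": 5}, {"score": 4}, {"score": 3}, {"score": 4}]
print(key_metrics(results, total_cost=1.68))
```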
Dashboard View
Website Analyzer: Flow Performance Overview
- Total Executions: 1,234
- Average Quality Score: 4.2/5
- Pass Rate: 87%
- Cost per Run: $0.42
Top Issues:
- Timeout on slow sites: 8%
- Incomplete summaries: 5%
- Incorrect categorization: 3%
Info
Set quality thresholds for your flows. Alert when scores drop below acceptable levels.
Improvement Workflow
Identify Issues
Evaluation shows:
- 15% of outputs score < 3/5
- Common issue: Missing key details
- Pattern: Short input texts
Root cause:
Prompt doesn't handle brief inputs well
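Checking a suspected pattern like "short inputs score worse" is a simple grouping exercise over the evaluation results. A sketch with an arbitrary 50-word cutoff:

```python
def average(scores):
    return round(sum(scores) / len(scores), 2) if scores else None

def compare_by_input_length(results, cutoff_words=50):
    short = [r["score"] for r in results if len(r["input"].split()) < cutoff_words]
    long_ = [r["score"] for r in results if len(r["input"].split()) >= cutoff_words]
    return {"short_inputs_avg": average(short), "long_inputs_avg": average(long_)}

results = [
    {"input": "brief note", "score": 2},
    {"input": "short memo text", "score": 3},
    {"input": " ".join(["word"] * 120), "score": 4},
    {"input": " ".join(["word"] * 200), "score": 5},
]
print(compare_by_input_length(results))  # a gap here points at the root cause
```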
Refine in Playground
1. Export low-scoring examples
2. Load into Playground
3. Test prompt variations
4. Run experiments with evaluators
5. Find better prompt
6. Measure improvement
Update Flow
1. Update prompt in Flow
2. Test with sample data
3. Deploy new version
4. Monitor quality metrics
5. Verify improvement
A/B Testing
Run two flow versions:
Version A: Original prompt (50% traffic)
Version B: New prompt (50% traffic)
Compare results:
Version A: 3.8/5 average
Version B: 4.3/5 average ✓
Roll out: Deploy Version B to 100%
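A sketch of the split and comparison: route half the traffic to each version, collect quality scores, and compare averages before rolling out the winner. `run_flow` is a simulated stand-in that bakes in the example averages above; a real test would execute the flow and evaluate its output.

```python
import random

def run_flow(item, version):
    """Simulated run: returns a quality score. Replace with flow execution + evaluation."""
    base = 3.8 if version == "A" else 4.3   # averages from the example above
    return max(1.0, min(5.0, random.gauss(base, 0.5)))

scores = {"A": [], "B": []}
for item in range(200):
    version = "A" if item % 2 == 0 else "B"   # deterministic 50/50 split for the sketch
    scores[version].append(run_flow(item, version))

for version, values in scores.items():
    print(version, round(sum(values) / len(values), 2))
# Roll out whichever version scores higher to 100% of traffic.
```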
Automated Quality Checks
Self-Evaluation Flow
Prompt 1: "Generate product description"
out: description
↓
Prompt 2: "Rate this description 1-5"
input: description
out: score, feedback
↓
If score >= 4:
→ Dataset Sink (approved)
If score < 4:
→ Dataset Sink (needs review)
Multi-Stage Quality Gates
Stage 1: Generate content
Stage 2: Check grammar/spelling
Stage 3: Verify factual accuracy
Stage 4: Assess tone/style
Only outputs passing all gates are saved
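In code, the gates are a sequence of checks where the first failure stops the pipeline; only outputs that pass every stage get saved. The check functions here are trivial placeholders for the real prompt-based stages:

```python
def check_grammar(text):
    return "  " not in text                  # placeholder for the grammar/spelling stage

def check_facts(text):
    return "[citation needed]" not in text   # placeholder for the factual-accuracy stage

def check_tone(text):
    return not text.isupper()                # placeholder for the tone/style stage

QUALITY_GATES = [("grammar", check_grammar), ("facts", check_facts), ("tone", check_tone)]

def passes_all_gates(output):
    for name, check in QUALITY_GATES:
        if not check(output):
            return False, name               # stop at the first failing gate
    return True, None

approved, failed_gate = passes_all_gates("A clear, factual product description.")
print(approved, failed_gate)                 # only approved outputs get saved
```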
Human-in-the-Loop
Flow outputs
→ Auto-evaluate
→ If score < threshold:
→ Flag for human review
→ Save to review queue
→ If score >= threshold:
→ Auto-approve
→ Save to production dataset
Warning
Don't rely solely on automated evaluation for critical use cases. Combine AI evaluation with periodic human review.
Cost-Quality Tradeoffs
Optimization Strategies
Option 1: Highest Quality
Model: GPT-4
Temperature: 0.3
Evaluation: Comprehensive
Cost: $1.50 per run
Quality: 4.5/5
Option 2: Balanced
Model: GPT-4
Temperature: 0.7
Evaluation: Key metrics only
Cost: $0.75 per run
Quality: 4.2/5
Option 3: Cost-Optimized
Model: GPT-3.5
Temperature: 0.7
Evaluation: Sampling
Cost: $0.15 per run
Quality: 3.8/5
Finding the Sweet Spot
- Measure quality vs cost
- Identify minimum acceptable quality
- Optimize prompt to reduce tokens
- Use cheaper models where possible
- Evaluate only what matters
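One way to frame the sweet-spot decision: among configurations that clear your minimum acceptable quality, pick the cheapest (or compare cost per quality point). A sketch using the three options above:

```python
options = [
    {"name": "Highest Quality", "cost": 1.50, "quality": 4.5},
    {"name": "Balanced",        "cost": 0.75, "quality": 4.2},
    {"name": "Cost-Optimized",  "cost": 0.15, "quality": 3.8},
]

def pick_option(options, min_quality=4.0):
    acceptable = [o for o in options if o["quality"] >= min_quality]
    return min(acceptable, key=lambda o: o["cost"]) if acceptable else None

best = pick_option(options, min_quality=4.0)
print(best["name"], "at", best["cost"], "per run,",
      round(best["cost"] / best["quality"], 3), "per quality point")
```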
Real-World Example
Customer Support Automation
Flow: Ticket Response Generator
Input: Customer question
Output: Suggested response
Initial Performance:
Quality: 3.5/5
Approval Rate: 60%
Issue: Too generic, misses details
Improvement Cycle:
Week 1: Add context to prompt
→ Quality: 3.9/5
Week 2: Use examples in prompt
→ Quality: 4.2/5
Week 3: Add validation step
→ Quality: 4.4/5
Week 4: Refine edge cases
→ Quality: 4.5/5
Final Result:
Approval Rate: 85%
Response Time: 2s (was 2 hours)
Cost: $0.08 per ticket
Customer Satisfaction: +15%
Best Practices
1. Evaluate Early and Often
- Don't wait until production
- Test with evaluators during development
- Catch issues before deployment
- Iterate quickly
2. Use Multiple Evaluators
- Accuracy evaluator
- Completeness evaluator
- Tone/style evaluator
- Format validator
3. Monitor Production Quality
Set up alerts:
- Quality drops below 4.0
- Pass rate < 80%
- Error rate > 5%
- Cost increase > 20%
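A sketch of those alert rules as simple threshold checks over a metrics snapshot; the thresholds mirror the list above, and the notification call is left as a comment since the alerting channel is up to you:

```python
ALERT_RULES = [
    ("quality below 4.0",     lambda m: m["average_score"] < 4.0),
    ("pass rate below 80%",   lambda m: m["pass_rate"] < 0.80),
    ("error rate above 5%",   lambda m: m["error_rate"] > 0.05),
    ("cost up more than 20%", lambda m: m["cost_per_run"] > 1.20 * m["baseline_cost_per_run"]),
]

def check_alerts(metrics):
    triggered = [name for name, rule in ALERT_RULES if rule(metrics)]
    # for name in triggered: send_notification(name)   # plug in your alerting channel
    return triggered

snapshot = {"average_score": 3.9, "pass_rate": 0.83, "error_rate": 0.04,
            "cost_per_run": 0.50, "baseline_cost_per_run": 0.42}
print(check_alerts(snapshot))
```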
4. Version Everything
- Track prompt versions
- Link evaluations to flow versions
- Enable rollback if needed
- Document changes and improvements
The best AI systems are continuously evaluated and improved. Make evaluation a core part of your workflow, not an afterthought.