Custom Evaluations
Create evaluators tailored to your specific domain, business rules, and quality standards. While built-in evaluators handle common cases, custom evaluations let you encode the nuanced requirements that make your AI application successful.
Custom evaluations bridge the gap between generic quality metrics and real-world product requirements. They capture domain expertise, regulatory constraints, brand guidelines, and user experience principles that can't be measured with off-the-shelf evaluators.
This guide shows you how to build, test, and deploy custom evaluators that integrate seamlessly with Evaligo's evaluation framework. You'll learn to create evaluators that scale from individual experiments to enterprise-wide quality gates.
Whether you're checking for industry compliance, enforcing brand voice, validating business logic, or measuring user satisfaction, custom evaluators ensure your AI meets the standards that matter most to your organization.
When to Build Custom Evaluators
Custom evaluators are essential when your quality requirements go beyond standard metrics like accuracy, toxicity, or hallucinations. They're particularly valuable for specialized domains, regulated industries, and unique user experiences.
Build custom evaluators when you need to enforce specific formatting requirements, correct domain terminology, compliance with business rules, or adherence to brand guidelines, or when you need to plug into existing quality assurance processes.


Types of Custom Evaluators
Evaligo supports multiple approaches to custom evaluation, each suited to different types of quality requirements and technical constraints.
Rule-Based Evaluators
Use code-based logic for deterministic checks like format validation, length constraints, keyword presence, or compliance with structured requirements. These are fast, consistent, and easy to debug.
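For instance, a minimal rule-based sketch might enforce a length limit and a set of required keywords. The class below follows the BaseEvaluator interface used throughout this guide; the specific limit and keywords are placeholders you would replace with your own requirements.

from evaligo import BaseEvaluator, EvaluationResult

class FormatComplianceEvaluator(BaseEvaluator):
    def __init__(self, max_length=500, required_keywords=None):
        super().__init__(
            name="format-compliance",
            description="Checks length limits and required keywords",
            version="1.0.0"
        )
        self.max_length = max_length
        self.required_keywords = required_keywords or []

    def evaluate(self, input_text, output_text, context=None):
        # Deterministic checks: length constraint plus required keyword presence
        within_length = len(output_text) <= self.max_length
        missing = [kw for kw in self.required_keywords
                   if kw.lower() not in output_text.lower()]
        passed = within_length and not missing
        return EvaluationResult(
            score=1.0 if passed else 0.0,
            passed=passed,
            details={'within_length': within_length, 'missing_keywords': missing}
        )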
LLM-Based Evaluators
Leverage large language models to make nuanced judgment calls about quality, appropriateness, helpfulness, or adherence to complex guidelines. These handle subjective criteria that are difficult to encode as rules.
Hybrid Evaluators
Combine rule-based and LLM-based approaches for comprehensive evaluation. Use rules for clear constraints and LLMs for nuanced judgment, getting the benefits of both approaches.
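One common pattern, sketched below, is to run the cheap rule checks first and only invoke the LLM judge when they pass. The sketch assumes two evaluators like the ones built later in this guide, plus a simple weighted average; tune the weights and pass threshold for your own use case.

from evaligo import BaseEvaluator, EvaluationResult

class HybridQualityEvaluator(BaseEvaluator):
    def __init__(self, rule_evaluator, llm_evaluator):
        super().__init__(
            name="hybrid-quality",
            description="Rule checks gated in front of an LLM judge",
            version="1.0.0"
        )
        self.rule_evaluator = rule_evaluator
        self.llm_evaluator = llm_evaluator

    def evaluate(self, input_text, output_text, context=None):
        rule_result = self.rule_evaluator.evaluate(input_text, output_text, context)
        if not rule_result.passed:
            # Hard constraints failed, so skip the LLM call entirely
            return rule_result
        llm_result = self.llm_evaluator.evaluate(input_text, output_text, context)
        # Illustrative weighting: nuanced LLM judgment counts slightly more than rules
        combined = 0.4 * rule_result.score + 0.6 * llm_result.score
        return EvaluationResult(
            score=combined,
            passed=combined >= 0.7,
            details={'rule': rule_result.details, 'llm': llm_result.details}
        )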
Building Your First Custom Evaluator
Let's build a custom evaluator step-by-step. This example creates a customer service response evaluator that checks for empathy, solution-oriented language, and appropriate escalation handling.
1. Define Requirements: Specify what good looks like for your domain with clear, measurable criteria.
2. Choose Evaluation Method: Select a rule-based, LLM-based, or hybrid approach based on your requirements.
3. Implement and Test: Code your evaluator with comprehensive test cases covering edge cases.
4. Deploy and Monitor: Integrate with Evaligo and monitor performance across experiments.
Example: Customer Service Response Evaluator
This evaluator scores customer service responses on two dimensions, empathy and solution focus, and combines them into an overall pass/fail result; the same pattern extends to professionalism and escalation handling. The helper checks use simple phrase-matching heuristics that you would tune on your own data.
from evaligo import BaseEvaluator, EvaluationResult

class CustomerServiceEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(
            name="customer-service-quality",
            description="Evaluates customer service responses",
            version="1.0.0"
        )
        self.empathy_phrases = [
            "I understand", "I apologize", "I'm sorry",
            "That must be frustrating", "Thank you for"
        ]

    def evaluate(self, input_text, output_text, context=None):
        scores = {}

        # Check for empathy
        empathy_score = self._check_empathy(output_text)
        scores['empathy'] = empathy_score

        # Check for solution orientation
        solution_score = self._check_solution_focus(output_text)
        scores['solution_focus'] = solution_score

        overall_score = sum(scores.values()) / len(scores)
        return EvaluationResult(
            score=overall_score,
            passed=overall_score >= 0.7,
            details={'scores': scores}
        )

    def _check_empathy(self, text):
        # Simple heuristic: full credit if any empathy phrase appears
        text_lower = text.lower()
        return 1.0 if any(p.lower() in text_lower for p in self.empathy_phrases) else 0.0

    def _check_solution_focus(self, text):
        # Simple heuristic: full credit if any solution-oriented phrase appears
        solution_markers = ["you can", "we will", "here's how", "to resolve", "next step"]
        text_lower = text.lower()
        return 1.0 if any(m in text_lower for m in solution_markers) else 0.0
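Before wiring the evaluator into an experiment, you can exercise it directly. The call below is illustrative and assumes EvaluationResult exposes score, passed, and details as attributes, matching the constructor arguments above.

# Quick local smoke test of the evaluator
evaluator = CustomerServiceEvaluator()
result = evaluator.evaluate(
    input_text="My order arrived damaged.",
    output_text="I'm sorry to hear that, and I understand how frustrating it is. "
                "You can request a replacement and we will ship it today."
)
print(result.score, result.passed, result.details)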


LLM-Based Custom Evaluators
For more nuanced evaluation that requires understanding context, tone, and subjective quality, LLM-based evaluators provide sophisticated judgment capabilities that complement rule-based approaches.
LLM evaluators are particularly powerful for assessing creativity, helpfulness, coherence, brand alignment, and other qualities that require human-like judgment but need to scale beyond manual review.
import os
import re

from openai import OpenAI

from evaligo import BaseEvaluator, EvaluationResult

class BrandVoiceEvaluator(BaseEvaluator):
    def __init__(self, brand_guidelines):
        super().__init__(
            name="brand-voice-consistency",
            description="Evaluates content for brand voice consistency",
            version="1.0.0"
        )
        self.brand_guidelines = brand_guidelines
        self.llm_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def evaluate(self, input_text, output_text, context=None):
        evaluation_prompt = f"""
Evaluate this response for brand voice consistency.

Brand Guidelines: {self.brand_guidelines}
Response: {output_text}

Rate 1-10 for:
- Tone consistency
- Vocabulary alignment
- Style consistency
- Brand personality

Format: Score: [1-10], Feedback: [details]
"""
        response = self.llm_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.1
        )
        return self._parse_llm_response(response.choices[0].message.content)

    def _parse_llm_response(self, content):
        # Minimal parser for the "Score: [1-10], Feedback: [details]" format
        # requested in the prompt; adapt it to whatever format you ask for.
        match = re.search(r"Score:\s*(\d+)", content)
        score = int(match.group(1)) / 10 if match else 0.0
        return EvaluationResult(
            score=score,
            passed=score >= 0.7,
            details={'feedback': content}
        )
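Usage mirrors the rule-based example; only the constructor differs. The guidelines string below is illustrative, and the feedback field comes from the minimal parser shown above.

guidelines = "Friendly and concise. Avoid jargon. Always address the reader as 'you'."
evaluator = BrandVoiceEvaluator(brand_guidelines=guidelines)
result = evaluator.evaluate(
    input_text="Explain our refund policy.",
    output_text="You can return any item within 30 days and we'll refund you right away."
)
print(result.score, result.details['feedback'])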
Testing and Deployment
Rigorous testing ensures your custom evaluators behave correctly across edge cases and maintain consistency over time. Build comprehensive test suites that validate both the logic and the outputs.
Create golden datasets with known good and bad examples, test edge cases, validate scoring consistency, and monitor performance in production to ensure your evaluators remain accurate and useful.
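One lightweight way to do this is a pytest suite driven by a small golden set. The sketch below assumes the CustomerServiceEvaluator from earlier lives in a module of your own (here called my_evaluators, a hypothetical name); the cases are illustrative.

import pytest

from my_evaluators import CustomerServiceEvaluator  # hypothetical module holding your evaluator

GOLDEN_CASES = [
    # (output_text, expected_pass)
    ("I'm sorry for the trouble. You can reset your password from the settings page.", True),
    ("Not our problem.", False),
]

@pytest.mark.parametrize("output_text,expected_pass", GOLDEN_CASES)
def test_customer_service_quality(output_text, expected_pass):
    evaluator = CustomerServiceEvaluator()
    result = evaluator.evaluate(input_text="I can't log in.", output_text=output_text)
    assert result.passed == expected_pass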
Best Practices
Follow these practices to create evaluators that are reliable, maintainable, and valuable for your organization.
Design for Clarity
Make evaluation criteria explicit and measurable. Document what each score means and provide examples of different score levels to ensure consistent interpretation.
Handle Edge Cases
Test with empty inputs, very long inputs, unusual formatting, and other edge cases your AI might encounter in production.
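For example, an empty model output should fail cleanly rather than raise. A quick check against the evaluator built earlier:

# Edge case: empty output should score zero and fail, not crash
result = CustomerServiceEvaluator().evaluate(input_text="Where is my order?", output_text="")
assert not result.passed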
Version Control
Treat evaluators like code with proper versioning, testing, and deployment processes. This maintains reproducibility and enables rollbacks when needed.
Next Steps
With custom evaluators in your toolkit, you can build comprehensive quality assurance that reflects your unique requirements and scales across your organization.