
Custom Evaluations

Create evaluators tailored to your specific domain, business rules, and quality standards. While built-in evaluators handle common cases, custom evaluations let you encode the nuanced requirements that make your AI application successful.

Custom evaluations bridge the gap between generic quality metrics and real-world product requirements. They capture domain expertise, regulatory constraints, brand guidelines, and user experience principles that can't be measured with off-the-shelf evaluators.

This guide shows you how to build, test, and deploy custom evaluators that integrate seamlessly with Evaligo's evaluation framework. You'll learn to create evaluators that scale from individual experiments to enterprise-wide quality gates.

Whether you're checking for industry compliance, enforcing brand voice, validating business logic, or measuring user satisfaction, custom evaluators ensure your AI meets the standards that matter most to your organization.

When to Build Custom Evaluators

Custom evaluators are essential when your quality requirements go beyond standard metrics like accuracy, toxicity, or hallucinations. They're particularly valuable for specialized domains, regulated industries, and unique user experiences.

Build custom evaluators when you need to check for specific formatting requirements, domain terminology usage, compliance with business rules, adherence to brand guidelines, or integration with existing quality assurance processes.

Info
Begin with rule-based evaluators for clear, deterministic checks before moving to LLM-based evaluators for more complex judgment calls. This progression helps you understand the evaluation framework and build confidence in your approach.

Types of Custom Evaluators

Evaligo supports multiple approaches to custom evaluation, each suited to different types of quality requirements and technical constraints.

Rule-Based Evaluators

Use code-based logic for deterministic checks like format validation, length constraints, keyword presence, or compliance with structured requirements. These are fast, consistent, and easy to debug.
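For example, a minimal rule-based evaluator might enforce a length limit and require that the output parses as JSON. The sketch below is illustrative only: the JsonFormatEvaluator name and max_length parameter are invented here, while BaseEvaluator and EvaluationResult follow the interface used in the examples later in this guide.

import json

from evaligo import BaseEvaluator, EvaluationResult

class JsonFormatEvaluator(BaseEvaluator):
    """Illustrative rule-based check: output must be valid JSON and under a length limit."""

    def __init__(self, max_length=2000):
        super().__init__(
            name="json-format-check",
            description="Validates JSON formatting and response length",
            version="1.0.0"
        )
        self.max_length = max_length

    def evaluate(self, input_text, output_text, context=None):
        checks = {'within_length': len(output_text) <= self.max_length}

        # Deterministic format check: does the output parse as JSON?
        try:
            json.loads(output_text)
            checks['valid_json'] = True
        except ValueError:
            checks['valid_json'] = False

        score = sum(checks.values()) / len(checks)
        return EvaluationResult(
            score=score,
            passed=all(checks.values()),
            details={'checks': checks}
        )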

LLM-Based Evaluators

Leverage large language models to make nuanced judgment calls about quality, appropriateness, helpfulness, or adherence to complex guidelines. These handle subjective criteria that are difficult to encode as rules.

Hybrid Evaluators

Combine rule-based and LLM-based approaches for comprehensive evaluation. Use rules for clear constraints and LLMs for nuanced judgment, getting the benefits of both approaches.
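One common pattern is to run the cheap deterministic checks first and only invoke the LLM judge when they pass. The sketch below assumes you already have a rule-based and an LLM-based evaluator (such as the ones defined elsewhere in this guide) and that EvaluationResult exposes the score, passed, and details fields used in its constructor.

from evaligo import BaseEvaluator, EvaluationResult

class HybridSupportEvaluator(BaseEvaluator):
    """Hypothetical hybrid evaluator: deterministic rules gate a more expensive LLM judgment."""

    def __init__(self, rule_evaluator, llm_evaluator):
        super().__init__(
            name="hybrid-support-quality",
            description="Runs rule checks first, then LLM judgment",
            version="1.0.0"
        )
        self.rule_evaluator = rule_evaluator
        self.llm_evaluator = llm_evaluator

    def evaluate(self, input_text, output_text, context=None):
        # Fail fast on deterministic constraints before spending an LLM call
        rule_result = self.rule_evaluator.evaluate(input_text, output_text, context)
        if not rule_result.passed:
            return rule_result

        llm_result = self.llm_evaluator.evaluate(input_text, output_text, context)
        return EvaluationResult(
            score=(rule_result.score + llm_result.score) / 2,
            passed=llm_result.passed,
            details={'rule': rule_result.details, 'llm': llm_result.details}
        )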

Building Your First Custom Evaluator

Let's build a custom evaluator step-by-step. This example creates a customer service response evaluator that checks for empathy, solution-oriented language, and appropriate escalation handling.

1. Define Requirements: Specify what good looks like for your domain with clear, measurable criteria.

2. Choose Evaluation Method: Select a rule-based, LLM-based, or hybrid approach based on your requirements.

3. Implement and Test: Code your evaluator with comprehensive test cases covering edge cases.

4. Deploy and Monitor: Integrate with Evaligo and monitor performance across experiments.

Example: Customer Service Response Evaluator

This evaluator assesses customer service responses across multiple dimensions: empathy, solution focus, professionalism, and escalation appropriateness.

from evaligo import BaseEvaluator, EvaluationResult

class CustomerServiceEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(
            name="customer-service-quality",
            description="Evaluates customer service responses",
            version="1.0.0"
        )

        # Phrase lists used by the keyword heuristics below; tune these for your domain
        self.empathy_phrases = [
            "I understand", "I apologize", "I'm sorry",
            "That must be frustrating", "Thank you for"
        ]
        self.solution_phrases = [
            "here's how", "I can help", "you can", "let's", "next step"
        ]

    def evaluate(self, input_text, output_text, context=None):
        scores = {}

        # Check for empathy
        empathy_score = self._check_empathy(output_text)
        scores['empathy'] = empathy_score

        # Check for solution orientation
        solution_score = self._check_solution_focus(output_text)
        scores['solution_focus'] = solution_score

        overall_score = sum(scores.values()) / len(scores)

        return EvaluationResult(
            score=overall_score,
            passed=overall_score >= 0.7,
            details={'scores': scores}
        )

    def _check_empathy(self, text):
        # Simple keyword heuristic (illustrative): score rises with each phrase found, capped at 1.0
        matches = sum(1 for phrase in self.empathy_phrases if phrase.lower() in text.lower())
        return min(matches / 2, 1.0)

    def _check_solution_focus(self, text):
        # Same heuristic applied to solution-oriented language
        matches = sum(1 for phrase in self.solution_phrases if phrase.lower() in text.lower())
        return min(matches / 2, 1.0)
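To sanity-check the evaluator before wiring it into an experiment, you can call evaluate() directly. The sample texts below are invented for illustration, and accessing score, passed, and details as attributes assumes EvaluationResult exposes the fields passed to its constructor.

evaluator = CustomerServiceEvaluator()

# Invented sample interaction used only to exercise the scoring logic
result = evaluator.evaluate(
    input_text="My order arrived damaged and nobody is responding to my emails.",
    output_text="I'm sorry about the damaged order, that must be frustrating. "
                "I can help you with a replacement; here's how to start the process."
)

print(result.score, result.passed, result.details)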

LLM-Based Custom Evaluators

For more nuanced evaluation that requires understanding context, tone, and subjective quality, LLM-based evaluators provide sophisticated judgment capabilities that complement rule-based approaches.

LLM evaluators are particularly powerful for assessing creativity, helpfulness, coherence, brand alignment, and other qualities that require human-like judgment but need to scale beyond manual review.

import os
import re

from openai import OpenAI

from evaligo import BaseEvaluator, EvaluationResult

class BrandVoiceEvaluator(BaseEvaluator):
    def __init__(self, brand_guidelines):
        super().__init__(
            name="brand-voice-consistency",
            description="Evaluates content for brand voice consistency",
            version="1.0.0"
        )
        self.brand_guidelines = brand_guidelines
        self.llm_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def evaluate(self, input_text, output_text, context=None):
        evaluation_prompt = f"""
        Evaluate this response for brand voice consistency:

        Brand Guidelines: {self.brand_guidelines}
        Response: {output_text}

        Rate 1-10 for:
        - Tone consistency
        - Vocabulary alignment
        - Style consistency
        - Brand personality

        Format: Score: [1-10], Feedback: [details]
        """

        response = self.llm_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.1
        )

        return self._parse_llm_response(response.choices[0].message.content)

    def _parse_llm_response(self, response_text):
        # Parse the "Score: [1-10], Feedback: [details]" format requested in the prompt
        score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response_text)
        score = float(score_match.group(1)) / 10 if score_match else 0.0

        feedback_match = re.search(r"Feedback:\s*(.*)", response_text, re.DOTALL)
        feedback = feedback_match.group(1).strip() if feedback_match else response_text

        return EvaluationResult(
            score=score,
            passed=score >= 0.7,  # illustrative threshold; align with your quality bar
            details={'feedback': feedback}
        )
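A minimal usage sketch, assuming you supply your own guidelines text and have OPENAI_API_KEY set in the environment (the guideline and sample texts here are invented):

guidelines = "Friendly but concise; avoid jargon; always address the reader directly."

evaluator = BrandVoiceEvaluator(brand_guidelines=guidelines)
result = evaluator.evaluate(
    input_text="Write a welcome email for new users.",
    output_text="Welcome aboard! Here's how to get started in three quick steps."
)

print(result.score, result.details.get('feedback'))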
Warning
LLM-based evaluators can be slower and more expensive than rule-based ones. Consider cost, latency, and consistency requirements when choosing your approach. Hybrid evaluators often provide the best balance.

Testing and Deployment

Rigorous testing ensures your custom evaluators behave correctly across edge cases and maintain consistency over time. Build comprehensive test suites that validate both the logic and the outputs.

Create golden datasets with known good and bad examples, test edge cases, validate scoring consistency, and monitor performance in production to ensure your evaluators remain accurate and useful.
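One lightweight way to do this is a pytest suite over a small golden set with expected pass/fail outcomes. The cases below are invented placeholders, and the import path for CustomerServiceEvaluator is hypothetical; point it at wherever your evaluator lives.

import pytest

from customer_service_evaluator import CustomerServiceEvaluator  # hypothetical module path

GOLDEN_CASES = [
    # (customer message, model response, expected to pass)
    ("My package is late.",
     "I'm sorry for the delay, that must be frustrating. Here's how I can help track it.",
     True),
    ("My package is late.",
     "Not our problem.",
     False),
]

@pytest.mark.parametrize("input_text,output_text,should_pass", GOLDEN_CASES)
def test_customer_service_evaluator(input_text, output_text, should_pass):
    evaluator = CustomerServiceEvaluator()
    result = evaluator.evaluate(input_text, output_text)
    assert result.passed == should_pass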

Video: Testing custom evaluators and handling edge cases

Best Practices

Follow these practices to create evaluators that are reliable, maintainable, and valuable for your organization.

Design for Clarity

Make evaluation criteria explicit and measurable. Document what each score means and provide examples of different score levels to ensure consistent interpretation.
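For instance, you might keep an explicit rubric alongside the evaluator so each score band has a documented meaning. The bands and descriptions below are placeholders to adapt to your own criteria.

# Hypothetical rubric documenting what each score band means for reviewers
SCORE_RUBRIC = [
    (0.9, "Excellent: empathetic, actionable, fully on-brand"),
    (0.7, "Acceptable: minor tone or completeness issues"),
    (0.0, "Failing: missing empathy or no concrete resolution"),
]

def describe_score(score):
    # Return the description of the highest band the score reaches
    for threshold, meaning in SCORE_RUBRIC:
        if score >= threshold:
            return meaning
    return "Invalid score"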

Handle Edge Cases

Test with empty inputs, very long inputs, unusual formatting, and other edge cases your AI might encounter in production.
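A few parametrized edge cases (empty output, whitespace-only output, non-ASCII repetition) can catch crashes and out-of-range scores early. This is a sketch, not an exhaustive list, and again assumes a hypothetical import path for the evaluator.

import pytest

from customer_service_evaluator import CustomerServiceEvaluator  # hypothetical module path

EDGE_CASES = [
    "",                # empty response
    " " * 10_000,      # whitespace-only, very long
    "🙂" * 500,        # non-ASCII repetition
]

@pytest.mark.parametrize("output_text", EDGE_CASES)
def test_evaluator_handles_edge_cases(output_text):
    evaluator = CustomerServiceEvaluator()
    # The evaluator should return a bounded score without raising, even on degenerate inputs
    result = evaluator.evaluate("Where is my refund?", output_text)
    assert 0.0 <= result.score <= 1.0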

Version Control

Treat evaluators like code with proper versioning, testing, and deployment processes. This maintains reproducibility and enables rollbacks when needed.

Tip
Involve domain experts in evaluator design and validation. Their expertise ensures your evaluators capture the nuances that matter most for quality in your specific domain.

Next Steps

With custom evaluators in your toolkit, you can build comprehensive quality assurance that reflects your unique requirements and scales across your organization.

Evaluation Templates
Pre-built evaluators for common use cases
Run Evaluations in UI
Execute evaluations through the interface
LLM as a Judge
Using LLMs for sophisticated evaluation
CI/CD Integration
Automated quality gates in your pipeline