Datasets & Experiments

Get hands-on with Evaligo in under 10 minutes. You'll create your first project, import a dataset, configure prompt variants, and run an experiment to see immediate, actionable results.

This guide walks you through the complete workflow from setup to results analysis. By the end, you'll understand how Evaligo helps you systematically improve AI quality and catch regressions before they reach production.

Whether you're evaluating LLM responses, comparing prompt variants, or testing model performance across different use cases, this quickstart establishes the foundation for all your future AI evaluation workflows.

Prerequisites

Before starting, ensure you have the following ready to maximize your success with this tutorial.

1. Evaligo Account: Sign up for a free account at evaligo.com if you haven't already.

2. API Credentials: Have your OpenAI, Azure, or AWS Bedrock API keys ready (optional for this tutorial).

3. Sample Data: Prepare a small CSV or JSONL file with 10-20 test cases, or use our provided sample dataset. A minimal example file is sketched below.
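
If you are assembling your own file, a handful of JSONL records is enough to get started. The sketch below writes a minimal sample file with Node.js; the input and reference field names are only illustrative, since the import wizard in Step 2 lets you map whatever column names you use to Evaligo's standard fields.

// create-sample-dataset.ts -- writes a tiny JSONL test file (field names are illustrative)
import { writeFileSync } from 'node:fs'

const testCases = [
  { input: 'How do I reset my password?', reference: 'Point the user to Account Settings > Security > Reset password.' },
  { input: 'Can I change my billing date?', reference: 'Explain that the billing date can be changed once per cycle from the Billing page.' },
  { input: 'My CSV export keeps timing out.', reference: 'Ask for the export size and suggest splitting it into smaller batches.' }
]

// JSONL means one JSON object per line
writeFileSync('sample-dataset.jsonl', testCases.map(tc => JSON.stringify(tc)).join('\n') + '\n')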

Info
Start with 10-20 test cases for faster iteration. You can always expand your dataset later as you refine your evaluation criteria.

Generate Test Data with Evaly Co-Pilot

Automatically generate diverse test data using Evaly, your AI co-pilot, to accelerate dataset creation and cover more edge cases.

AI-Guided Prompt Generation

Leverage Evaly to generate optimized prompts tailored to your task, boosting evaluation scores and model performance with AI-guided suggestions.

Step 1: Create Your First Project

Projects in Evaligo organize all your experiments, datasets, and evaluations around a specific use case or application. Think of it as a workspace for a particular AI feature you're building.

Navigate to the dashboard and click "New Project". Name it something descriptive like "Customer Support Bot Evaluation" or "Content Generation Testing". Add a brief description of what you're planning to evaluate.

The project serves as the central hub where you'll track prompt performance, dataset quality, and evaluation results over time. This organization becomes invaluable as your evaluation workflows grow in complexity.
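
If you prefer to script project setup instead of clicking through the dashboard, the SDK introduced later in this guide can likely do the same thing. This is only a sketch: the projects.create call and its return shape are assumptions modeled on the experiments API shown at the end of this guide, so check the SDK reference for the exact method.

// Hypothetical sketch: creating a project from code (projects.create is assumed, not confirmed)
import { Evaligo } from '@evaligo/sdk'

const client = new Evaligo({
  apiKey: process.env.EVALIGO_API_KEY
})

const project = await client.projects.create({
  name: 'Support-Bot-Q4',
  description: 'Prompt evaluation for the customer support bot'
})

console.log(`Created project ${project.id}`)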

Tip
Use descriptive project names that include the application type and team. Examples: "Support-Bot-Q4", "Content-Gen-Marketing", "SQL-Assistant-Data-Team".

Step 2: Import and Configure Your Dataset

Quality datasets are the foundation of reliable AI evaluation. Your dataset should represent real-world inputs your AI system will encounter, with expected outputs or reference answers when possible.

Upload your CSV or JSONL file using the dataset import wizard. The system will automatically detect columns and suggest mappings for input fields, expected outputs, and any additional metadata you want to track.

During import, you'll map your data columns to Evaligo's standard fields: inputs (what gets sent to your AI), references (expected outputs), and any custom metadata. This mapping ensures consistent evaluation across all your experiments.

If you don't have a dataset ready, you can start with our sample "Customer Inquiry" dataset that includes varied support questions and reference responses across different complexity levels.
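
You can also script the import. The sketch below assumes a datasets.create method that accepts records with input, reference, and metadata fields mirroring the mapping described above; the method name and payload shape are assumptions, not the confirmed SDK surface.

// Hypothetical sketch: uploading a small dataset via the SDK (datasets.create is assumed)
import { Evaligo } from '@evaligo/sdk'

const client = new Evaligo({ apiKey: process.env.EVALIGO_API_KEY })

const dataset = await client.datasets.create({
  projectId: 'your-project-id',
  name: 'customer-inquiries',
  records: [
    {
      input: 'How do I reset my password?',              // what gets sent to your AI
      reference: 'Point the user to Account Settings.',  // expected output
      metadata: { complexity: 'low' }                    // optional custom fields
    }
  ]
})

console.log(`Created dataset ${dataset.id}`)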

Evaluator Scores Breakdown

Visualize how each evaluator scores your AI outputs across multiple criteria, enabling granular quality analysis and targeted improvements.

Step 3: Configure Prompt Variants

Now you'll create different versions of your prompt to test which approach works best for your use case. Even small changes in wording, examples, or structure can significantly impact AI performance.

In the Experiments section, create at least two prompt variants. For example, a "Direct" version that asks questions straightforwardly, and a "Chain-of-Thought" version that encourages step-by-step reasoning.

Select your preferred model provider and model (for example, OpenAI's GPT-4, Anthropic's Claude, or any other supported LLM) and configure parameters like temperature, max tokens, and any system messages. Keep these settings consistent across variants to ensure a fair comparison.
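
As a rough sketch, the variant and model configuration might look like the snippet below. The name, prompt, provider, and model fields follow the SDK example at the end of this guide; the temperature and maxTokens parameter names are assumptions, so confirm them against the SDK reference.

// Sketch: two prompt variants sharing identical model settings for a fair comparison
const variants = [
  { name: 'direct', prompt: 'Answer this customer question: {{input}}' },
  { name: 'chain-of-thought', prompt: 'Think through the problem step by step, then answer: {{input}}' }
]

// Keep these identical across variants; parameter names beyond provider/name are assumptions
const model = {
  provider: 'openai',
  name: 'gpt-4',
  temperature: 0.2,
  maxTokens: 512
}

// Pass these into client.experiments.create({ ..., variants, model }) as shown later in this guide
export { variants, model }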

Warning
Test with small batches first. Large experiments can consume significant API credits, especially when testing multiple variants across large datasets.

Step 4: Run Your First Experiment

With your project, dataset, and prompt variants configured, you're ready to run your first experiment. This is where Evaligo's power becomes apparent as it systematically tests each variant against your entire dataset.

Click "Run Experiment" and watch as Evaligo processes each test case through your prompt variants. The system tracks response quality, latency, token usage, and any errors that occur during processing.

Depending on your dataset size and model speed, experiments typically complete within 2-5 minutes. You'll see real-time progress updates and can monitor individual responses as they're generated.
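
If you launch experiments from code instead of the dashboard, you can poll until the run finishes before fetching results. In this sketch, the experiments.get call and the status field are assumptions; the released SDK may expose progress differently.

// Hypothetical sketch: waiting for an experiment to finish (experiments.get and `status` are assumed)
import { Evaligo } from '@evaligo/sdk'

const client = new Evaligo({ apiKey: process.env.EVALIGO_API_KEY })

async function waitForExperiment(experimentId: string, intervalMs = 5000) {
  for (;;) {
    const experiment = await client.experiments.get(experimentId)
    if (experiment.status === 'completed' || experiment.status === 'failed') {
      return experiment
    }
    console.log(`Experiment ${experimentId} is ${experiment.status}...`)
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}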

Video

Complete experiment walkthrough - from setup to results analysis

Step 5: Analyze Results and Next Steps

Once your experiment completes, you'll land on the results dashboard where you can compare prompt variants side-by-side. Look for patterns in response quality, consistency, and any edge cases where certain variants excel.

The summary view shows aggregate metrics like average response length, token consumption, and success rates. Drill down into individual responses to understand why certain variants perform better for specific types of inputs.

Use the insights from this first experiment to refine your prompts, adjust model parameters, or expand your dataset with additional edge cases you discovered during testing.
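
From code, the same comparison can be done with the getResults call shown in the SDK example below. The accuracy field follows that example; the exact set of per-variant metrics in the response is an assumption.

// Sketch: comparing aggregate metrics across variants (based on the getResults example later in this guide)
import { Evaligo } from '@evaligo/sdk'

const client = new Evaligo({ apiKey: process.env.EVALIGO_API_KEY })
const results = await client.experiments.getResults('your-experiment-id')

// `results.variants` is keyed by variant name, as in the baseline/enhanced example below
for (const [name, metrics] of Object.entries(results.variants)) {
  console.log(`${name}: accuracy=${metrics.accuracy}`)
}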

Analyze Individual Responses

Individual response analysis

Tip
Your first experiment is just the beginning. Most teams iterate through 5-10 experiment cycles before deploying to production, refining prompts and datasets based on each round of results.

Sample Code Integration

Once you're satisfied with your prompt performance, integrate Evaligo into your development workflow using our SDK. This enables continuous evaluation as your application evolves.

// Install the Evaligo SDK
npm install @evaligo/sdk

// Initialize with your API key
import { Evaligo } from '@evaligo/sdk'
const client = new Evaligo({ 
  apiKey: process.env.EVALIGO_API_KEY 
})

// Run evaluations programmatically
const experiment = await client.experiments.create({
  projectId: 'your-project-id',
  datasetId: 'your-dataset-id',
  variants: [
    { name: 'baseline', prompt: 'Answer this question: {{input}}' },
    { name: 'enhanced', prompt: 'Analyze and answer: {{input}}' }
  ],
  model: { provider: 'openai', name: 'gpt-4' }
})

// Monitor results
const results = await client.experiments.getResults(experiment.id)
console.log(`Baseline accuracy: ${results.variants.baseline.accuracy}`)
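
Keep EVALIGO_API_KEY in an environment variable or secrets manager rather than hard-coding it, so the same script can run locally and in your CI pipeline.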

Next Steps

Congratulations! You've completed your first end-to-end evaluation workflow. You're now ready to explore Evaligo's advanced features and integrate evaluation into your AI development process.

Advanced Evaluations: Custom metrics and LLM judges
Production Tracing: Monitor live AI applications
CI/CD Integration: Automated testing pipelines
Team Collaboration: Share datasets and results