
Criterion Sets

Organize evaluation criteria into reusable sets that enable consistent assessment

Criterion sets are collections of evaluation criteria that define how AI responses should be assessed. They enable systematic evaluation by grouping related criteria together, so they can be used easily in your experiments.

What are Criterion Sets?

A criterion set contains one or more criteria - binary evaluation questions that rate AI responses as "pass" or "fail." Each criterion asks a specific yes/no question about response quality, such as "Does the response answer the question accurately?" or "Is the response free from harmful content?"

Criterion sets provide consistency by applying the same evaluation standards across your experiments, enable reusability through shared criteria, and improve efficiency by reducing duplicate work when evaluating similar prompt variations.
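
Conceptually, a criterion set is just a named group of yes/no questions. A minimal sketch in plain Python makes the structure concrete (this illustrates the concept only, not the SDK's actual data model):

from dataclasses import dataclass, field

@dataclass
class Criterion:
    label: str
    question: str  # must be answerable with yes/no

@dataclass
class CriterionSet:
    name: str
    description: str
    criteria: list[Criterion] = field(default_factory=list)

content_quality = CriterionSet(
    name="Content Quality",
    description="Evaluate response accuracy and completeness",
    criteria=[
        Criterion("Accuracy", "Does the response answer the question accurately?"),
        Criterion("Safety", "Is the response free from harmful content?"),
    ],
)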

Key Concepts

Criteria: Individual evaluation questions that assess specific aspects of response quality. Each criterion must be answerable with yes/no and should target one particular aspect of response quality.

Template Linking: Criterion sets can be linked to prompt templates. This association determines which criterion set is selected by default for experiments that use that template.

Collection Compatibility: Criterion sets must be compatible with collections to be used in experiments. Compatibility means that the criteria's placeholders (e.g., {{user_question}}) match the collection's column names exactly. When creating experiments, only compatible criterion sets are available for selection; a quick way to check compatibility yourself is sketched after this list.

Version Control: Automatic versioning tracks changes to criteria for reproducibility, ensuring experiments remain consistent even when criteria are updated.
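
As a rough illustration of the placeholder/column check described under Collection Compatibility, here is a sketch in plain Python (not an SDK call); the criteria and column names are examples:

import re

criteria = [
    "Does the response answer {{user_question}} accurately?",
    "Is the response free from harmful content?",
]
collection_columns = {"user_question", "expected_answer"}

# Placeholders use double curly braces, e.g. {{user_question}}
placeholders = {
    name
    for criterion in criteria
    for name in re.findall(r"\{\{\s*(\w+)\s*\}\}", criterion)
}

missing = placeholders - collection_columns
if missing:
    print(f"Incompatible: collection is missing columns {sorted(missing)}")
else:
    print("Criterion set is compatible with this collection")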

Getting Started

Creating Your First Criterion Set

Navigate to Criteria Library in your project sidebar to manage all criterion sets. Click "New Criterion Set" to create your first set.

[Screenshot: Criteria Library overview]

Required Information:

  • Name: Descriptive identifier for the criterion set (e.g., "Content Safety", "Technical Accuracy")
  • Description: Brief explanation of the set's purpose and scope

[Screenshot: Create Criterion Set dialog]

Adding Criteria

Once you've created a criterion set, click on it to add individual criteria. Each criterion should:

  • Start with "Does" or "Is" for binary evaluation
  • Focus on observable response characteristics
  • Use clear, unambiguous language
  • Target one specific aspect of response quality

Example Criteria:

  • "Does the response provide accurate information?"
  • "Is the response free from harmful content?"
  • "Does the response follow the requested format?"

[Screenshot: Criterion set detail view]

Linking to Templates

Criterion sets can be linked to prompt templates. You can link sets when creating templates or use the "Link a Prompt Template" feature to connect existing sets.

When you create experiments, they automatically use the criterion sets linked to your selected prompt template. The evaluation system generates responses, applies linked criteria to evaluate each response, and produces pass/fail ratings for each criterion.

Advanced Features

Template-Set Associations

Criterion sets can be linked to multiple templates, and the association is not fixed: you can adjust it per experiment as needed. This flexibility allows you to:

  • Apply universal criteria (safety, basic quality) across all templates
  • Use specific criteria for particular use cases (customer service, technical documentation)
  • Combine different evaluation focuses for comprehensive assessment

Managing Associations:

  • View all templates linked to a specific criterion set
  • Link a different criterion set to existing templates
  • Unlink sets while preserving historical experiment data
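
As a sketch of relinking an existing template to a different set, using only the SDK calls shown under SDK Integration below (the template and set names are examples; an explicit unlink call is not covered in this guide):

from elluminate import Client

client = Client()

# Fetch the existing template and create an alternative criterion set
template = client.prompt_templates.get("Customer Support Bot")
safety_set = client.criterion_sets.create(
    name="Content Safety",
    description="Harmful content, bias detection, compliance requirements",
)

# Point the existing template at the new set
client.criterion_sets.link_template(safety_set, template)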

Results and Analysis

Experiment results break performance down by individual criterion: pass rates for each criterion, response-level ratings showing which criteria passed or failed, and aggregate scores that combine all criteria into overall template performance.

When comparing experiments over time, you can analyze trends in criterion performance to understand which aspects of your prompts are improving or need attention.
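
As a rough illustration of how per-criterion pass rates and an aggregate score relate, here is a plain-Python sketch over a made-up ratings structure (this is not the SDK's actual result format):

# Each entry is one response's pass/fail ratings per criterion (illustrative data)
ratings = [
    {"Accuracy": True, "Safety": True, "Format": False},
    {"Accuracy": True, "Safety": True, "Format": True},
    {"Accuracy": False, "Safety": True, "Format": True},
]

# Pass rate per criterion across all responses
pass_rates = {
    criterion: round(sum(r[criterion] for r in ratings) / len(ratings), 2)
    for criterion in ratings[0]
}

# Simple aggregate: mean of the per-criterion pass rates
overall = sum(pass_rates.values()) / len(pass_rates)

print(pass_rates)                      # {'Accuracy': 0.67, 'Safety': 1.0, 'Format': 0.67}
print(f"Aggregate score: {overall:.2f}")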

Best Practices

Set Organization Strategy

Organize criterion sets by purpose to maintain clarity and reusability. Common groupings include:

Accuracy Sets: Content correctness, factual accuracy, completeness

Safety Sets: Harmful content, bias detection, compliance requirements

Quality Sets: Clarity, coherence, professional tone

Functional Sets: Task completion, format adherence, instruction following
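
If you prefer to bootstrap these categories programmatically, a sketch using the create call from the SDK Integration section below (the names and descriptions are just examples):

from elluminate import Client

client = Client()

# One criterion set per evaluation category
categories = {
    "Accuracy": "Content correctness, factual accuracy, completeness",
    "Safety": "Harmful content, bias detection, compliance requirements",
    "Quality": "Clarity, coherence, professional tone",
    "Functional": "Task completion, format adherence, instruction following",
}

for name, description in categories.items():
    client.criterion_sets.create(name=name, description=description)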

Criteria Design Guidelines

Design criteria to be comprehensive yet focused. Ensure your criteria cover all important evaluation dimensions while maintaining balanced expectations that are neither too lenient nor impossibly strict.

Focus on actionable feedback - results should indicate specific improvement areas rather than just pass/fail status. Use clear, unambiguous language that minimizes subjective interpretation.

Workflow Integration

Define criteria before creating prompt templates so that evaluation standards are established early. Create criterion sets for each major evaluation category and link the most commonly used sets to your prompt templates.

Validate criteria through initial experiment runs and iterate based on evaluation results. For team collaboration, use consistent criterion sets across team members and maintain clear documentation of what each criterion tests.

SDK Integration

For programmatic criterion set management, use the elluminate SDK:

from elluminate import Client

client = Client()  # Uses ELLUMINATE_API_KEY env var

# Create criterion set
criterion_set = client.criterion_sets.create(
    name="Content Quality",
    description="Evaluate response accuracy and completeness"
)

# Add criteria to the set
criterion = client.criteria.create(
    criterion_set=criterion_set,
    label="Accuracy",
    criterion_str="Does the response provide accurate information?"
)

# Link to prompt template
template = client.prompt_templates.get("Customer Support Bot")
client.criterion_sets.link_template(criterion_set, template)

# Use in experiment
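# test_collection and model_config are assumed to be an existing collection
# and LLM configuration created earlier in your project.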
experiment = client.experiments.create(
    name="Quality Evaluation",
    prompt_template=template,  # Uses linked criterion sets automatically
    collection=test_collection,
    llm_config=model_config
)

For complete SDK documentation, see the API Reference.

Troubleshooting

Common Issues

Criteria Not Applied: Ensure criterion sets are linked to your prompt template before running experiments. Check the template's linked criterion sets in the template detail view.

Inconsistent Results: Verify that criteria are written as binary yes/no questions. Compound questions that test multiple aspects can lead to inconsistent evaluations.

Missing Evaluations: Confirm that your LLM responses are in a format compatible with your criteria. Some criteria may require specific response structures or content types.

Criterion Set Not Available for Experiment: If your criterion set doesn't appear when creating an experiment, check that the criteria placeholders match your collection's column names exactly. Only compatible criterion sets are shown in the experiment creation form.

Getting Help

When criterion sets don't behave as expected, check the experiment logs for specific error messages, validate that criteria are properly linked to templates, and review criterion wording for clarity and objectivity.

Understanding criterion sets enables systematic, reproducible evaluation of AI responses while maintaining consistency across your evaluation workflows.