Ratings¶
A rating evaluates the quality of a response to a prompt by comparing the response against a criterion. The dashboard displays rated responses with detailed metrics and allows interactive analysis of the evaluation results. A typical rating entry looks like this:
Here you can see:
- Overall score aggregated from individual criteria ratings
- Individual ratings for each evaluation criterion
- LLM configuration details used to generate the response
- Word count statistics for both the template and generated response
If the rating of an individual criterion is incorrect, you can manually correct it here.
Rate with generated criteria and responses¶
The Quick Start shows how to run a simple rating of multiple criteria for one response. In this example, both the criteria and the responses are generated automatically by Elluminate.
Useful for:
- Quick Evaluation: Evaluate responses without defining criteria manually
- Quality Generation: Generate high quality dynamic criteria based on the prompt context
- Framework Development: Explore or prototype evaluation frameworks
- AI-Powered Assessment: Leverage AI-generated criteria for comprehensive assessment
1. Initializes the Elluminate client using your configured environment variables from the setup phase.
2. Creates a prompt template using mustache syntax, incorporating template variables (like `concept` in this example). If the template already exists, it is simply returned.
3. Generates evaluation criteria automatically for your prompt template, or fetches the existing criteria.
4. Creates a template variables collection. This will be used to collect the template variables for a prompt template.
5. Adds a template variable to the collection. This will be used to fill in the template variable (replacing `concept` with `recursion`).
6. Creates a response by using your prompt template and filling in the template variable.
7. Evaluates the response against the generated criteria, returning detailed ratings for each criterion. The full flow is sketched below.
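Putting these steps together, a minimal sketch of the flow could look like the following. The method names used here (`get_or_create_prompt_template`, `get_or_generate_criteria`, `generate_response`, `rate_response`, and so on) are assumptions for illustration only; see the Quick Start and the SDK reference for the exact calls.

```python
from elluminate import Client

# 1. The client picks up its API key and endpoint from your environment variables.
client = Client()

# 2. Mustache-style template with a {{concept}} placeholder;
#    returned unchanged if it already exists.
template = client.get_or_create_prompt_template(
    name="Explain Concept",
    user_prompt_template="Explain {{concept}} in simple terms.",
)

# 3. Auto-generate evaluation criteria for the template (or fetch existing ones).
criteria = client.get_or_generate_criteria(template)

# 4. + 5. Create a collection and add one set of template variables to it.
collection = client.get_or_create_collection(name="Concept Examples")
variables = client.add_template_variables({"concept": "recursion"}, collection)

# 6. Fill the template with the variables and generate a response.
response = client.generate_response(template, variables)

# 7. Rate the response against every generated criterion.
ratings = client.rate_response(response)
for rating in ratings:
    print(rating.criterion, rating.rating)
```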
The generated criteria are associated with the prompt template and can be inspected one by one in the Template Details. There you can also edit or delete entries, or start a completely new generation process.
The generated response can be viewed by clicking View Details in the Dashboard View. This opens the Prompt Response Detail View, which shows the corresponding response and the associated prompt.
Rate custom criteria and responses¶
For more control over the evaluation process, you can manually specify criteria and responses. This is particularly useful for:
- Custom Quality Standards: Define and enforce specific quality criteria tailored to your use case
- External Response Evaluation: Assess responses from any source, not limited to those generated through Elluminate
- Standardized Testing: Apply uniform evaluation criteria to maintain consistency across test suites
1. Adding custom criteria to a given prompt template. Using `delete_existing=True` ensures that any existing criteria are removed before adding the new ones. This gives you full control over what aspects of the response will be evaluated.
2. Each prompt template has a default template variables collection, which is used here to collect the template variables.
3. Adding a custom response with metadata, as shown in the sketch after this list. The response is associated with:
- The prompt template
- Template variables (to know which prompt generated it)
- LLM configuration metadata (to track which model and settings were used)
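A sketch of this flow, with the caveat that the method and attribute names and the metadata shape are assumptions (only `delete_existing=True` is taken from the step above), might look like this:

```python
from elluminate import Client

client = Client()
template = client.get_or_create_prompt_template(
    name="Explain Concept",
    user_prompt_template="Explain {{concept}} in simple terms.",
)

# 1. Replace any existing criteria with your own custom ones.
criteria = client.add_criteria(
    [
        "Is the explanation factually accurate?",
        "Does the response avoid unnecessary jargon?",
    ],
    prompt_template=template,
    delete_existing=True,  # remove existing criteria before adding the new ones
)

# 2. Use the template's default template variables collection.
variables = client.add_template_variables(
    {"concept": "recursion"},
    template.default_template_variables_collection,  # assumed attribute name
)

# 3. Register a response produced outside Elluminate, with LLM metadata for tracking.
response = client.add_response(
    "Recursion means a function solves a problem by calling itself on smaller inputs.",
    prompt_template=template,
    template_variables=variables,
    metadata={"model": "gpt-4o", "temperature": 0.0},  # assumed metadata fields
)

ratings = client.rate_response(response)
```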
Rate a set of prompts¶
When you have multiple variations of template variables to test, you can use a `TemplateVariablesCollection` (or simply collection) to organize and evaluate them together. This is essential for systematic evaluation of prompt performance across different inputs.
This approach is particularly powerful for:
- Input Testing: Evaluate how well your prompt performs with diverse test cases and edge cases
- Performance Analysis: Uncover trends and patterns in how your prompt responds to different inputs
- Iterative Optimization: Use insights from testing to refine and enhance your prompt template
- Define your test cases as template variables
- Create a collection to store your test cases
- Add each set of variables to the collection
- Ensure the prompt template is compatible with your variables (see the sketch below)
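Under the same assumptions as in the sketches above (all method names are illustrative, not confirmed SDK calls), evaluating a whole collection could look like this:

```python
from elluminate import Client

client = Client()
template = client.get_or_create_prompt_template(
    name="Explain Concept",
    user_prompt_template="Explain {{concept}} in simple terms.",
)

# Define test cases; each dict must match the template's {{...}} variables.
test_cases = [
    {"concept": "recursion"},
    {"concept": "polymorphism"},
    {"concept": "memoization"},
]

# Create a collection and add each set of variables to it.
collection = client.get_or_create_collection(name="CS Concepts")
for case in test_cases:
    client.add_template_variables(case, collection)

# Generate and rate one response per entry in the collection
# (a batch helper like this is hypothetical; a plain loop over entries works too).
responses = client.generate_responses(template, collection)
ratings = [client.rate_response(response) for response in responses]
```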
The Collections view provides an organized overview of your template variable sets. Each collection is identified by its name and contains multiple entries, with timestamps showing when they were added.
In the Dashboard View, you can choose the prompt template and see the ratings for each response to all prompts generated from the collection. Here you can easily inspect how the responses perform for each entry in the collection.