Reference Answers

Learn to evaluate LLM responses against known correct answers for accuracy testing and quality assurance

Reference answers provide a way to automatically evaluate generated responses against known correct answers. This is particularly useful for:

  • Accuracy Testing - Verify if LLM responses match expected answers
  • Quality Assurance - Ensure consistent performance across different prompts

Key Steps

  • Create a prompt template with template variables. Make sure that {{reference_answer}} is not used as a template variable in the prompt template itself, otherwise the prompt would give away the answer.
  • Add criteria that use the template variable {{reference_answer}}
  • Create a Collection whose entries provide values for all template variables used in the prompt template, plus the {{reference_answer}} template variable.
  • That's it! You can now rate your responses against the reference answers.

The following example demonstrates how to use reference answers with elluminate:

"""v1.0 API version of example_sdk_usage_reference_answer.py

Demonstrates using reference answers in evaluation criteria.
The criterion can include template variables like {{reference_answer}} to
compare the LLM's response against a known correct answer.
"""

from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import RatingMode

load_dotenv(override=True)

client = Client()

# v1.0: get_or_create_prompt_template - template is part of lookup
template, _ = client.get_or_create_prompt_template(
    name="Geography Expert - Capitals",
    messages="What is the capital city of the following country: {{country}}. Give only the name of the capital city in your response, nothing else.",
)

# v1.0: Create criterion set with reference answer criterion
criterion_set, _ = client.get_or_create_criterion_set(
    name="Capital City Verification",
)

# The criterion uses {{reference_answer}} from the template variables
# This allows comparing LLM output against known correct answers
criterion_set.add_criteria(
    [
        "The correct capital city is: {{reference_answer}}. Is the given answer correct?",
    ]
)

# v1.0: Link criterion set to template
criterion_set.link_template(template)

# v1.0: get_or_create_collection
collection, _ = client.get_or_create_collection(
    name="Country Capitals",
    defaults={"description": "A collection of countries and their capital cities"},
)

# Template variables include both prompt inputs AND reference answers
# The reference_answer is used by the criterion, not the prompt
values = [
    {"country": "France", "reference_answer": "Paris"},
    {"country": "Japan", "reference_answer": "Tokyo"},
    {"country": "Argentina", "reference_answer": "Buenos Aires"},
]

# v1.0: collection.add_many() - single call for all variables
collection.add_many(variables=values)

# v1.0: run_experiment() handles everything
experiment = client.run_experiment(
    name="Capital Cities Experiment",
    prompt_template=template,
    collection=collection,
    criterion_set=criterion_set,
    rating_mode=RatingMode.FAST,
)

# Display results
for response in experiment.responses():
    print(f"Prompt: {response.prompt.messages[-1]['content']}")
    print(f"Response: {response.response_str}")
    print(f"Rating: {response.ratings[0].rating}")
    print("-" * 80)

  1. First, we create the prompt template. Note that we only use the template variable {{country}} here.

  2. Here we add a criterion to the criterion set and link it to the prompt template. Note that the criterion uses the template variable {{reference_answer}}, which will be filled with the reference answer from the collection's template variables (see the sketch after this list).

  3. We define the template variables. Each entry contains a value for country and a reference_answer, which is the correct answer for that country.
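
The following minimal sketch illustrates how the {{reference_answer}} placeholder in a criterion is conceptually filled from a collection entry. It uses plain string substitution for illustration only; elluminate performs this rendering internally, and this is not its API.

# Illustration only: conceptual placeholder substitution for one collection entry
criterion_text = "The correct capital city is: {{reference_answer}}. Is the given answer correct?"
entry = {"country": "France", "reference_answer": "Paris"}
rendered = criterion_text.replace("{{reference_answer}}", entry["reference_answer"])
print(rendered)
# The correct capital city is: Paris. Is the given answer correct?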

Evaluation Criteria

Define criteria that use the reference answers to evaluate responses. Common patterns include the following; example phrasings are sketched after the list:

  • Exact match comparison
  • Semantic similarity checking
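
The sketch below shows how such criteria could be phrased, reusing the criterion_set.add_criteria call from the example above. The exact wording is illustrative, not part of the elluminate API, and should be adapted to your task.

# Illustrative criterion phrasings; adapt the wording to your task
criterion_set.add_criteria(
    [
        # Exact match comparison
        "The expected answer is: {{reference_answer}}. Does the response give exactly this answer?",
        # Semantic similarity checking
        "The expected answer is: {{reference_answer}}. Does the response express the same meaning as this answer, even if worded differently?",
    ]
)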