Reference Answers

Learn to evaluate LLM responses against known correct answers for accuracy testing and quality assurance

Reference answers provide a way to automatically evaluate generated responses against known correct answers. This is particularly useful for:

  • Accuracy Testing - Verify if LLM responses match expected answers
  • Quality Assurance - Ensure consistent performance across different prompts

Key Steps

  • Create a prompt template with template variables. Make sure that {{reference_answer}} is not used as a template variable in the prompt template itself, otherwise the prompt would give away the answer.
  • Add criteria that use the template variable {{reference_answer}}
  • Create a Collection whose entries provide values for all template variables used in the prompt template, plus the {{reference_answer}} template variable.
  • That's it! You can now rate your responses against the reference answers.

The following example demonstrates how to use reference answers with elluminate:

"""v1.0 API version of example_sdk_usage_reference_answer.py

Demonstrates using reference answers in evaluation criteria.
The criterion can include template variables like {{reference_answer}} to
compare the LLM's response against a known correct answer.
"""

from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import RatingMode

load_dotenv(override=True)

client = Client()

# v1.0: get_or_create_prompt_template - template is part of lookup
template, _ = client.get_or_create_prompt_template(
    name="Geography Expert - Capitals",
    messages="What is the capital city of the following country: {{country}}. Give only the name of the capital city in your response, nothing else.",
)

# v1.0: Create criterion set with reference answer criterion
criterion_set, _ = client.get_or_create_criterion_set(
    name="Capital City Verification",
)

# The criterion uses {{reference_answer}} from the template variables
# This allows comparing LLM output against known correct answers
criterion_set.add_criteria(
    [
        "The correct capital city is: {{reference_answer}}. Is the given answer correct?",
    ]
)

# v1.0: Link criterion set to template
criterion_set.link_template(template)

# v1.0: get_or_create_collection
collection, _ = client.get_or_create_collection(
    name="Country Capitals",
    defaults={"description": "A collection of countries and their capital cities"},
)

# Template variables include both prompt inputs AND reference answers
# The reference_answer is used by the criterion, not the prompt
values = [
    {"country": "France", "reference_answer": "Paris"},
    {"country": "Japan", "reference_answer": "Tokyo"},
    {"country": "Argentina", "reference_answer": "Buenos Aires"},
]

# v1.0: collection.add_many() - single call for all variables
collection.add_many(variables=values)

# v1.0: run_experiment() handles everything
experiment = client.run_experiment(
    name="Capital Cities Experiment",
    prompt_template=template,
    collection=collection,
    criterion_set=criterion_set,
    rating_mode=RatingMode.FAST,
)

# Display results
for response in experiment.responses():
    print(f"Prompt: {response.prompt.messages[-1]['content']}")
    print(f"Response: {response.response_str}")
    print(f"Rating: {response.ratings[0].rating}")
    print("-" * 80)

  1. First, we create the prompt template. Note that we only use the template variable {{country}} here.

  2. Here we add a criterion to the criterion set and link it to the prompt template. Note that the criterion uses the template variable {{reference_answer}}, which will be filled with the reference answer from the collection's template variables (see the sketch after this list).

  3. We define the template variables. Each entry contains a value for country and a reference_answer, which is the correct answer for that country.
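
The following minimal sketch illustrates how the {{reference_answer}} placeholder in a criterion is conceptually filled from a collection entry. It uses plain string substitution for illustration only; elluminate performs this rendering internally, and this is not its API.

# Illustration only: conceptual placeholder substitution for one collection entry
criterion_text = "The correct capital city is: {{reference_answer}}. Is the given answer correct?"
entry = {"country": "France", "reference_answer": "Paris"}
rendered = criterion_text.replace("{{reference_answer}}", entry["reference_answer"])
print(rendered)
# The correct capital city is: Paris. Is the given answer correct?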

Evaluation Criteria

Define criteria that use the reference answers to evaluate responses. Common patterns include the following; example phrasings are sketched after the list:

  • Exact match comparison
  • Semantic similarity checking
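
The sketch below shows how such criteria could be phrased, reusing the criterion_set.add_criteria call from the example above. The exact wording is illustrative, not part of the elluminate API, and should be adapted to your task.

# Illustrative criterion phrasings; adapt the wording to your task
criterion_set.add_criteria(
    [
        # Exact match comparison
        "The expected answer is: {{reference_answer}}. Does the response give exactly this answer?",
        # Semantic similarity checking
        "The expected answer is: {{reference_answer}}. Does the response express the same meaning as this answer, even if worded differently?",
    ]
)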