Reference Answers
Learn to evaluate LLM responses against known correct answers for accuracy testing and quality assurance
Reference answers provide a way to automatically evaluate generated responses against known correct answers. This is particularly useful for:
- Accuracy Testing - Verify if LLM responses match expected answers
- Quality Assurance - Ensure consistent performance across different prompts
Key Steps
- Create a prompt template with template variables. Make sure that {{reference_answer}} is not used as a template variable in the prompt template itself, otherwise you would be giving the answer away to the model.
- Add criteria that use the {{reference_answer}} template variable.
- Create a collection whose entries contain all template variables used in the prompt template plus the {{reference_answer}} template variable.
- That's it! You can now rate your responses against the reference answers.
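The essential point in these steps is the separation of concerns: the prompt template never mentions the reference answer, while the criterion does. A minimal sketch of that split (illustrative strings only, not SDK calls; the full runnable example follows below):

```python
# Illustrative sketch only - plain strings, not elluminate SDK calls.
# The prompt template references {{country}} but never {{reference_answer}} ...
prompt_message = (
    "What is the capital city of the following country: {{country}}. "
    "Give only the name of the capital city in your response, nothing else."
)

# ... while the criterion is the only place {{reference_answer}} appears.
criterion = "The correct capital city is: {{reference_answer}}. Is the given answer correct?"
```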
The following example demonstrates how to use reference answers with elluminate:
| """v1.0 API version of example_sdk_usage_reference_answer.py
Demonstrates using reference answers in evaluation criteria.
The criterion can include template variables like {{reference_answer}} to
compare the LLM's response against a known correct answer.
"""
from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import RatingMode
load_dotenv(override=True)
client = Client()
# v1.0: get_or_create_prompt_template - template is part of lookup
template, _ = client.get_or_create_prompt_template(
name="Geography Expert - Capitals",
messages="What is the capital city of the following country: {{country}}. Give only the name of the capital city in your response, nothing else.",
)
# v1.0: Create criterion set with reference answer criterion
criterion_set, _ = client.get_or_create_criterion_set(
name="Capital City Verification",
)
# The criterion uses {{reference_answer}} from the template variables
# This allows comparing LLM output against known correct answers
criterion_set.add_criteria(
[
"The correct capital city is: {{reference_answer}}. Is the given answer correct?",
]
)
# v1.0: Link criterion set to template
criterion_set.link_template(template)
# v1.0: get_or_create_collection
collection, _ = client.get_or_create_collection(
name="Country Capitals",
defaults={"description": "A collection of countries and their capital cities"},
)
# Template variables include both prompt inputs AND reference answers
# The reference_answer is used by the criterion, not the prompt
values = [
{"country": "France", "reference_answer": "Paris"},
{"country": "Japan", "reference_answer": "Tokyo"},
{"country": "Argentina", "reference_answer": "Buenos Aires"},
]
# v1.0: collection.add_many() - single call for all variables
collection.add_many(variables=values)
# v1.0: run_experiment() handles everything
experiment = client.run_experiment(
name="Capital Cities Experiment",
prompt_template=template,
collection=collection,
criterion_set=criterion_set,
rating_mode=RatingMode.FAST,
)
# Display results
for response in experiment.responses():
print(f"Prompt: {response.prompt.messages[-1]['content']}")
print(f"Response: {response.response_str}")
print(f"Rating: {response.ratings[0].rating}")
print("-" * 80)
|
- First, we create the prompt template. Note that we only use the template variable {{country}} here.
- Next, we add a criterion and link the criterion set to the prompt template. The criterion uses the template variable {{reference_answer}}, which will be filled with the reference answer from the template variables.
- Finally, we define the template variables. Each entry contains a value for country and the reference_answer, which is the correct answer for the given country.
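To make the mechanics concrete, here is a plain-Python illustration (not SDK code) of how the criterion is rendered for one collection entry at rating time:

```python
# Plain-Python illustration of the templating, not part of the elluminate SDK.
entry = {"country": "France", "reference_answer": "Paris"}
criterion = "The correct capital city is: {{reference_answer}}. Is the given answer correct?"

# The {{reference_answer}} placeholder is filled from the entry, so the rater
# sees the known correct answer alongside the model's response.
rendered = criterion.replace("{{reference_answer}}", entry["reference_answer"])
print(rendered)
# -> The correct capital city is: Paris. Is the given answer correct?
```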
Evaluation Criteria
Define criteria that use the reference answers to evaluate responses. Common patterns include:
- Exact match comparison
- Semantic similarity checking
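Both patterns can be expressed as additional criteria on the criterion set from the example above; the phrasings below are suggestions, not required strings:

```python
# Suggested phrasings only - any wording that references {{reference_answer}} works.
# Assumes the criterion_set created in the example above.
criterion_set.add_criteria(
    [
        # Exact match: the response must be exactly the reference answer.
        "The response must be exactly: {{reference_answer}}. Does the response match it exactly?",
        # Semantic similarity: the response must convey the same meaning.
        "The reference answer is: {{reference_answer}}. Does the response convey the same meaning, even if worded differently?",
    ]
)
```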