Reference Answers
Reference answers provide a way to automatically evaluate generated responses against known correct answers. This is particularly useful for:
- Accuracy Testing: Verify if LLM responses match expected answers
- Quality Assurance: Ensure consistent performance across different prompts
Key Steps
- Create a prompt template with template variables. Make sure that `{{reference_answer}}` isn't used as a template variable in your prompt template; otherwise you'd be giving away the answer.
- Add criteria that use the `{{reference_answer}}` template variable.
- Create a Collection with all template variables that are used in the prompt template, plus the `{{reference_answer}}` template variable.
- That's it! You can now rate your responses against the reference answers.
The following example demonstrates how to use reference answers with elluminate:
```python
from dotenv import load_dotenv

from elluminate import Client

load_dotenv(override=True)
client = Client()

# Create a prompt template for geography questions about capitals
prompt_template, _ = client.prompt_templates.get_or_create(
    "What is the capital city of the following country: {{country}}. Give only the name of the capital city in your response, nothing else.",
    name="Geography Expert - Capitals",
)  # (1)!

# Add criterion to check if the provided answer matches the reference answer
client.criteria.add_many(
    ["The correct capital city is: {{reference_answer}}. Is the given answer correct?"],
    prompt_template,
)  # (2)!

collection, _ = client.collections.get_or_create(
    name="Country Capitals",
    description="A collection of countries and their capital cities",
)

# Create template variables with countries and their correct capital cities
template_variables = [
    {
        "country": "France",
        "reference_answer": "Paris",
    },
    {
        "country": "Japan",
        "reference_answer": "Tokyo",
    },
    {
        "country": "Argentina",
        "reference_answer": "Buenos Aires",
    },
]  # (3)!

for tmp_vars in template_variables:
    client.template_variables.add_to_collection(tmp_vars, collection)

experiment = client.experiments.create_and_run(
    name="Capital Cities Experiment",
    prompt_template=prompt_template,
    collection=collection,
)

experiment = client.experiments.get(name=experiment.name)

for response in experiment.rated_responses:
    print(f"Prompt: {response.prompt.prompt_str}")
    print(f"Response: {response.response}")
    print(f"Ratings: {response.ratings[0].rating}")
    print("-" * 100)
```
1. First, we create the prompt template. Note that we only use the template variable `{{country}}` here.
2. Here we add a criterion to the prompt template. Note that the criterion uses the template variable `{{reference_answer}}`, which will be filled with the reference answer from the template variables (see the illustration after this list).
3. We define the template variables. Each entry contains a value for `country` and the `reference_answer`, which is the correct answer for the given country.
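To make the substitution concrete, here is a plain-Python illustration (not the elluminate API) of how the criterion from step 2 reads once the `France` row's reference answer is filled in:

```python
# Illustration only: how the criterion string looks after {{reference_answer}} is filled in.
criterion_template = "The correct capital city is: {{reference_answer}}. Is the given answer correct?"
rendered = criterion_template.replace("{{reference_answer}}", "Paris")
print(rendered)
# -> The correct capital city is: Paris. Is the given answer correct?
```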
Evaluation Criteria
Define criteria that use the reference answer to evaluate responses. Common patterns, sketched in the example after this list, include:
- Exact match comparison
- Semantic similarity checking
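The following is a sketch of how these two patterns could be phrased as criteria, reusing the `client`, `prompt_template`, and `client.criteria.add_many` call from the example above; the criterion wording itself is illustrative, not a prescribed format:

```python
# Sketch: two criterion styles built around {{reference_answer}}.
# Assumes `client` and `prompt_template` from the example above.
client.criteria.add_many(
    [
        # Exact match comparison: the response must be exactly the reference answer.
        "Does the response exactly match this reference answer: {{reference_answer}}?",
        # Semantic similarity checking: wording may differ, but the meaning must agree.
        "Does the response convey the same meaning as this reference answer: {{reference_answer}}?",
    ],
    prompt_template,
)
```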