Experiments

Experiments help you systematically compare different prompts and evaluate their effectiveness. They are especially useful when you want to test multiple prompt variations and analyze which performs best for your use case. Experiments also let you analyze which evaluation criteria are generally met across prompts and which consistently fail, helping you identify systematic strengths and weaknesses in your prompt designs.

An experiment combines several components:

  • A prompt template as the foundation
  • A set of inputs (template variables or collection)
  • Evaluation criteria and ratings for the responses

Feel free to check out the Key Concepts section for a detailed overview of the concepts. But before we dive too deep into the details, let's look at an example.

from dotenv import load_dotenv
from elluminate import Client

load_dotenv(override=True)

client = Client()

values = [
    {"person": "Richard Feynman", "field": "Quantum Electrodynamics"},
    {"person": "Albert Einstein", "field": "Relativity"},
]

collection, _ = client.collections.get_or_create(
    name="Famous Physicists", description="A collection of renowned theoretical physicists"
)

for template_variables in values:
    client.template_variables.add_to_collection(template_variables, collection=collection)

prompt_template, _ = client.prompt_templates.get_or_create(
    "Please write a brief biography highlighting {{person}}'s contributions to {{field}} "
    "and their impact on modern physics.",
    name="Physicist Biography",
)

# Generate evaluation criteria for the prompt template
client.criteria.get_or_generate_many(prompt_template)

# Generate a response for each set of template variables in the collection
responses = client.responses.generate_many(prompt_template, collection=collection)

experiment, _ = client.experiments.get_or_create(
    "Physics Pioneers Analysis",
    prompt_template=prompt_template,
    collection=collection,
    description="Evaluating biographies of influential physicists and their contributions",
)  # (1)!

for response in responses:
    client.ratings.rate(response, experiment=experiment)  # (2)!

# Re-fetch the experiment so it includes the newly rated responses
experiment = client.experiments.get(experiment.name)

# Display individual responses and ratings
print("\n===== Individual Responses =====")
for response in experiment.rated_responses:  # (3)!
    print(f"\nResponse for {response.prompt.template_variables.input_values.get('person')}:")
    print(f"{response.response}\n")
    print("Ratings:")
    for rating in response.ratings:
        print(f"Criteria: {rating.criterion.criterion_str}")
        print(f"Rating: {rating.rating}\n")

# Display aggregated results using the summary method
print("\n===== Aggregated Results =====")
experiment.print_results_summary()
1. Create an `Experiment`. If the `Experiment` already exists, it will be returned. Note that the same `PromptTemplate` and `Collection` that were used to generate the `Response`s are being used here.

2. Here the `Response`s are being rated. By passing the previously created `Experiment`, the rated `Response`s are linked to the `Experiment`.

3. The rated `Response`s are accessible via the `rated_responses` property of the `Experiment`.

View Experiments

You can view the results of an experiment in the UI. The Experiment View displays statistics that help you recognize which criteria are generally met and which are not. It looks like this and shows, among others, the following statistics:

  • Overall Score: The mean and standard deviation of the scores across all responses
  • Average Tokens: The mean and standard deviation of the tokens used per response
  • Weakest Criteria: The least fulfilled criterion and the percentage of responses that fulfill it

The Criteria Performance graph shows which criteria are generally met and which are particularly challenging.

Experiment View

There is also a detailed analysis tab that looks like this. Here you can see the Score Distribution and the Criteria Performance Analysis of the responses to understand the overall performance and identify outliers. Furthermore, you can examine response examples that are categorized and sorted by performance.

Experiment View Detailed Analysis

In the individual response tab you can inspect each response in detail.

Experiment View Detailed Response

Compare Experiments

If you have multiple experiments, you can start a comparison from the UI's dashboard view. This is especially useful for comparing responses generated from different prompt templates.
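
The comparison itself happens in the UI, but you can prepare a second experiment for it in code. The following is a minimal sketch that reuses the client, collection, and criteria generation from the example above; the alternative prompt wording and the names used here are purely illustrative:

alt_template, _ = client.prompt_templates.get_or_create(
    "Summarize {{person}}'s key contributions to {{field}} in three sentences, "
    "focusing on their influence on modern physics.",
    name="Physicist Biography (Concise)",  # illustrative name
)

client.criteria.get_or_generate_many(alt_template)
alt_responses = client.responses.generate_many(alt_template, collection=collection)

alt_experiment, _ = client.experiments.get_or_create(
    "Physics Pioneers Analysis (Concise)",  # illustrative name
    prompt_template=alt_template,
    collection=collection,
    description="Concise biography variant for comparison",
)

for response in alt_responses:
    client.ratings.rate(response, experiment=alt_experiment)

Once both experiments contain rated responses, they can be selected for comparison in the dashboard.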

Experiment Comparison

You then get detailed comparison statistics for the experiments, showing how well the criteria were met and how the prompts performed relative to each other.

Experiment Comparison Details

You can also view a detailed comparison report, which shows:

  • How many responses improved, regressed, or kept the same score but received different ratings
  • A detailed listing of the ratings for those responses
  • A summary of which criteria failed the most

Experiment Comparison Report

Experiments are great for:

  • Comparative Analysis: Compare different prompt templates or approaches side by side. For example, testing different versions of the same prompt to find the most effective formulation.

  • Quality Evaluation: Track and analyze response quality through systematic rating collection. This helps identify patterns in response quality and areas for improvement.

  • Performance Tracking: Calculate and monitor key metrics across responses within an experiment. This is especially useful when testing different configurations or approaches, as sketched below.
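
For the Performance Tracking use case, even a simple per-criterion pass rate can be computed from the rated responses. This is a minimal sketch that reuses the client, experiment name, and response attributes from the example above; it assumes that `rating.rating` is truthy when a criterion is fulfilled, so adapt it to the rating format you actually use:

from collections import defaultdict

# Assumes `client` from the example above and that `rating.rating` is truthy
# when a criterion is fulfilled.
experiment = client.experiments.get("Physics Pioneers Analysis")

passed = defaultdict(int)
total = defaultdict(int)
for response in experiment.rated_responses:
    for rating in response.ratings:
        total[rating.criterion.criterion_str] += 1
        if rating.rating:
            passed[rating.criterion.criterion_str] += 1

for criterion, count in total.items():
    print(f"{criterion}: {passed[criterion] / count:.0%} of responses fulfilled")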

Monitoring Experiments

To monitor the performance of experiments over time, click the View Timeline button in the Experiment Dashboard.

Experiment Dashboard

This will take you to the Experiment Timeline, where you can see the scores of all experiments over time. You can filter the experiments by date, prompt template, collection, and model configuration. Hover over any data point to view the corresponding experiment information in a tooltip. Clicking a data point takes you to the detailed experiment view described in the View Experiments section.
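
Besides the scheduled runs described below, one way to populate the timeline from code is to repeat the evaluation under a date-stamped experiment name, so each run shows up as its own experiment. A minimal sketch, assuming the `client`, `prompt_template`, and `collection` from the example above are still in scope (the naming scheme is illustrative):

from datetime import date

run_name = f"Physics Pioneers Analysis {date.today().isoformat()}"  # illustrative naming scheme

run_experiment, _ = client.experiments.get_or_create(
    run_name,
    prompt_template=prompt_template,
    collection=collection,
    description="Recurring re-run of the physicist biography evaluation",
)

for response in client.responses.generate_many(prompt_template, collection=collection):
    client.ratings.rate(response, experiment=run_experiment)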

Experiment Timeline

Schedule Experiments

You can also schedule experiments to run on a regular basis. This is particularly useful when you want to run evaluations on a schedule without manual intervention. To schedule an experiment, click the Schedule Run button in the Experiment Timeline view. There you can configure the schedule.

Create Scheduled Run

You can configure:

Frequency:

  • Daily: Run once per day at a specific time
  • Weekly: Run on specific days of the week at a specific time
  • Monthly: Run on specific days of the month at a specific time

Notification Settings:

  • Email addresses that should receive notifications
  • A custom notification threshold: if the average score of an experiment run falls below this threshold, a notification is sent to the configured email addresses

Experiment Parameters:

  • Prompt Template
  • Template Variables Collection
  • Model for generating responses
  • Rating mode (fast vs. detailed)

Your scheduled runs will be displayed in the Scheduled Runs section. Here, you have a clear overview of your scheduled runs and the most important settings, such as the schedule time.

List Scheduled Runs

Selecting the filter icon on a schedule displays only the runs from that schedule, along with the configured threshold for the average score of those runs. This lets you focus on specific experiment runs, making it easier to track and analyze their performance.

Filter Scheduled Runs

You can also modify an existing scheduled run by clicking the edit icon. You can adjust:

  • Schedule frequency
  • Notification threshold
  • Notification emails
  • Whether the schedule is enabled or disabled

Edit Scheduled Run