
Quick Start (GUI)

Start evaluating prompts with Elluminate using the web interface in just a few minutes. No coding required! For a code-based approach, see our SDK Quick Start.

Prerequisites

You'll need:

  • An Elluminate user account

Step 1: Log In and Access Your Project

Navigate to elluminate.de or your custom deployment and log in with your credentials.

Project Home

Step 2: Create a New Test Project

For this tutorial, we'll create a fresh project to experiment with. This keeps your test data separate from any existing work.

1. Click on the project name in the upper left corner (e.g., Demo Project)

2. In the dropdown, click Add project

Add Project

3. Fill in the Project Name: "Test Project" (or any name you prefer)

4. Optionally, fill in the Description: "A test project"

5. Click Create Project

Create New Project

6. You'll now see your new project with the empty Prompt Templates page displayed

Step 3: Create a Prompt Template

Create your first prompt template. You should already be on the Prompt Templates page from the previous step, but if not, click on Prompt Templates in the sidebar.

1. Click the New Template button in the upper right corner

Prompt Templates: New Template

2. Enter the Template Name: "Support Bot"

3. Enter the User Message: "You are a customer service agent. Provide helpful and friendly advice to the user query: {{user_query}}"

4. Click Create Template to create your template

Create New Template

5. You'll now see the Template Details page displayed

Templates Details
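The {{user_query}} part of the User Message is a template variable: Elluminate fills it in with a concrete value whenever a response is generated. No coding is needed for this, but if you're curious what the substitution amounts to, here is a purely illustrative Python sketch of double-brace rendering (the render_template helper is hypothetical and not Elluminate's internal implementation):

```python
import re

def render_template(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with the matching variable value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        if name not in variables:
            raise KeyError(f"Missing value for template variable: {name}")
        return variables[name]

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

template = (
    "You are a customer service agent. Provide helpful and friendly "
    "advice to the user query: {{user_query}}"
)
print(render_template(template, {"user_query": "Hi!"}))
```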

Step 4: Test Your Template

Before creating evaluation criteria, let's quickly test our template using the Response Generator at the bottom of the Template Details page.

1. Scroll down to the Response Generator section at the bottom of the Template Details page. Click to expand the section (it's collapsed by default), then click Generate Response to open the response generation form

Generate Response

2. Select an LLM Configuration from the dropdown (e.g., "Default GPT-4o-mini")

3. Enter a test value for the user_query variable; a simple greeting is fine to begin with: "Hi!"

4. Click Generate to see how your template performs

Generate Response

5. Review the generated response to ensure your template works as expected

Important: When you use the Response Generator, every value you enter for the variables will be automatically saved to your collection. This means you're not only testing your template but also building your test case collection at the same time!

Step 5: Add Evaluation Criteria

The Criteria section offers three ways to add evaluation criteria to your template:

  • Manually Add Criteria: Full control when you need specific evaluation requirements
  • Generate Criteria: Quick start with AI-generated criteria tailored to your specific prompt
  • Link Existing Criterion Sets...: Save time by reusing criteria from other templates in your project

1. On the Template Details page, locate the Criteria section below the Response Generator

2. Click on Manually Add Criteria to enter your own evaluation criteria

Criteria

3. Input a criterion for evaluating responses, such as: "Is the response helpful?"

4. Click the checkmark to save your criterion

Added Criteria
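Each criterion is effectively a yes/no question that an evaluator model answers about every generated response (the "detailed" rating mode in Step 8 also records the reasoning). As a rough mental model only, and not Elluminate's actual judging prompt, the evaluation step amounts to asking an LLM something like this:

```python
def build_judge_prompt(criterion: str, user_query: str, response: str) -> str:
    """Assemble a yes/no judging prompt for one criterion and one response."""
    return (
        "You are evaluating a customer-support response.\n"
        f"User query: {user_query}\n"
        f"Candidate response: {response}\n"
        f"Criterion: {criterion}\n"
        "Answer 'yes' or 'no', then briefly explain your reasoning."
    )

print(build_judge_prompt(
    criterion="Is the response helpful?",
    user_query="Hi!",
    response="Hello! How can I help you today?",
))
```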

Step 6: Check Your Template Variables Collection

Remember how we mentioned that the Response Generator automatically saves your variables? Let's check the default collection that was created for your template.

1. Navigate to Collections in the sidebar

2. You'll see that a collection called "Support Bot" has been automatically created as the default collection for your template

3. This collection already contains the test value we entered earlier (user_query: "Hi!") when we used the Response Generator

Collections

Step 7: Add More Template Variable Values

Let's add more test cases to thoroughly evaluate our prompt template. At the bottom of the collection, there are two ways to add values for variables:

  • + button: Manually add values for variables one by one with full control over each value
  • ✨ (sparkles) button: Generate values for variables automatically using your prompt template and AI

For this tutorial, let's add a few more test cases manually:

1. Click the + button to manually add a value

Add Variables Buttons

2. Add a new "user_query" value: "Hey, what's up?"

3. Click Save to add this value to your collection

Add Variable

Repeat this process to add more user queries to build a comprehensive test set:

  • What do you do?
  • Can you tell me more about a product?
  • How's the weather in Bremen?

Variables Added
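Conceptually, the collection is now just a list of variable assignments, one per test case. If it helps to picture what the experiment will do with it, here is an illustrative sketch of the same test set as plain data, reusing the simple double-brace substitution shown earlier (again, not Elluminate's internal representation):

```python
# One test case per dict; the keys match the template's variable names.
test_cases = [
    {"user_query": "Hi!"},
    {"user_query": "Hey, what's up?"},
    {"user_query": "What do you do?"},
    {"user_query": "Can you tell me more about a product?"},
    {"user_query": "How's the weather in Bremen?"},
]

template = (
    "You are a customer service agent. Provide helpful and friendly "
    "advice to the user query: {{user_query}}"
)

# Render one concrete prompt per test case.
for case in test_cases:
    prompt = template
    for name, value in case.items():
        prompt = prompt.replace("{{" + name + "}}", value)
    print(prompt)
```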

Step 8: Run Your First Experiment

Now that we have all the components in place (prompt template, evaluation criteria, and test cases), let's run a full experiment to evaluate our template systematically.

1. Navigate to Experiments in the sidebar

2. Click New Experiment

Experiments: New Experiment

3. Optionally, enter a Name: "Support Bot Evaluation" (a name will be auto-generated if left blank)

4. Add a Description (Optional): "Evaluating our support bot"

5. Select your Prompt Template: "Support Bot" (v1)

6. Choose your Template Variables Collection: "Support Bot"

7. Select a Model (e.g., "Default GPT-4o-mini")

8. Review the experiment settings (keep the defaults for this tutorial):

  • Generate automatically: ✓ (checked) - The experiment will immediately generate responses and ratings when created
  • Rating Mode: "detailed" - Includes reasoning for each evaluation criterion (recommended for better interpretability)
  • Number of Epochs: "1" - How many times to run the evaluation (higher numbers provide more reliable statistics)

9. Click Create and Run Experiment to start the experiment

Create New Experiment

10. The experiment will automatically generate responses for all your test cases and evaluate them against your criteria

Experiment Running
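A note on the Number of Epochs setting from step 8: LLM responses and ratings vary from run to run, so a single epoch gives you one noisy measurement, while several epochs let you average that noise out. The sketch below uses made-up per-epoch pass rates purely to illustrate the idea:

```python
from statistics import mean, stdev

# Hypothetical pass rates from repeating the same experiment several times;
# these numbers are invented purely to illustrate run-to-run variation.
epoch_pass_rates = [0.90, 0.80, 0.85, 0.95, 0.85]

print(f"Mean pass rate:     {mean(epoch_pass_rates):.2f}")
print(f"Spread across runs: {stdev(epoch_pass_rates):.2f}")
# With a single epoch you would have seen just one of these numbers;
# averaging several epochs gives a more stable estimate of performance.
```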

Step 9: View Your Evaluation Results

Once your experiment completes, you'll see comprehensive results showing how your prompt template performed across all test cases.

Evaluation Results

Understanding Your Results

Key Metrics at the Top:

  • Overall Score: The percentage of evaluation criteria that passed across all responses (in this example: 100%)
  • Average Tokens: Shows input tokens (↑) sent to the LLM and output tokens (↓) generated in responses
  • Response Time: Average time the LLM took to generate each response
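The Overall Score is a straightforward ratio: of all the criterion checks performed across all responses, how many passed. A tiny sketch with made-up ratings, assuming each rating is a simple pass/fail per response-criterion pair:

```python
# Made-up ratings: one pass/fail result per (response, criterion) pair.
ratings = [
    {"response": "r1", "criterion": "Is the response helpful?", "passed": True},
    {"response": "r2", "criterion": "Is the response helpful?", "passed": True},
    {"response": "r3", "criterion": "Is the response helpful?", "passed": False},
]

overall_score = 100 * sum(r["passed"] for r in ratings) / len(ratings)
print(f"Overall Score: {overall_score:.0f}%")  # 67% for this invented data
```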

Criteria Performance (Left Side):

The green bars show how well each evaluation criterion performed:

  • 100% means all responses passed that specific criterion
  • Lower percentages indicate areas where your prompt could be improved
  • Different criteria may have different success rates - this helps you identify which aspects of your prompt work well and which need refinement

Distribution Charts (Right Side):

  • Output Tokens Distribution: Shows the spread of response lengths - helps identify consistency
  • Response Duration Distribution: Shows timing patterns - useful for performance optimization
  • Interactive Filtering: Click on any bar in these charts to filter the Sample Navigator to show only responses within that range
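Conceptually, each distribution chart buckets responses into ranges, and clicking a bar keeps only the responses in that bucket. A small illustrative sketch of that binning-and-filtering idea, with invented token counts rather than real experiment data:

```python
# Invented output-token counts for a handful of responses.
output_tokens = {"r1": 42, "r2": 55, "r3": 48, "r4": 120, "r5": 51}

# Bucket responses into 25-token-wide bins, as a distribution chart would.
bin_width = 25
bins: dict[int, list[str]] = {}
for response_id, tokens in output_tokens.items():
    bin_start = (tokens // bin_width) * bin_width
    bins.setdefault(bin_start, []).append(response_id)

for bin_start in sorted(bins):
    print(f"{bin_start}-{bin_start + bin_width - 1} tokens: {bins[bin_start]}")

# Clicking a bar corresponds to keeping only the responses in that bucket,
# e.g. the 50-74 token range:
print("Filtered responses:", bins.get(50, []))
```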

Analyzing Individual Responses

Sample Navigator (Bottom Section):

This powerful tool lets you examine each test case in detail:

  • Navigation: Use the arrow buttons or keyboard (←→) to browse through your test cases (the counter shows your current position)
  • Markdown Toggle: Switch between formatted and plain text views of responses
  • Sort By: Order results by rating, token count, or response time
  • Filter: Focus on specific criteria or hide perfect responses

Two Analysis Modes:

  • Detailed Analysis: In-depth view of individual responses with full context and criterion-by-criterion breakdown
  • Individual Responses: Table view showing all responses at once for quick comparison

Interpreting Your Results

Use these results to understand your prompt template's performance:

If you see high scores (80%+ across criteria):

  • Your prompt template is working well
  • Consider testing with more challenging or diverse examples
  • You can confidently use this template for similar tasks

If you see mixed or lower scores:

  • Look at the criteria with lower percentages to identify improvement areas
  • Use the Sample Navigator to examine specific failing cases
  • Refine your prompt template based on common failure patterns
  • Run additional experiments to test your improvements

Important for all results:

  • Spot-check passing samples - Even when responses pass your criteria, examine a random sample to verify your evaluation criteria align with what you actually consider successful
  • This helps ensure your criteria aren't too lenient or missing important quality aspects
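An easy way to spot-check is to pull a small random sample of the passing responses and read them yourself. Outside the GUI, that could be as simple as the following illustrative snippet (the response IDs are invented):

```python
import random

# Illustrative IDs of responses that passed every criterion.
passing_responses = ["r1", "r2", "r3", "r5", "r8", "r9", "r12"]

# Read a small random sample by hand to confirm that "passed" really
# matches what you would call a successful answer.
random.seed(0)  # remove the seed to get a fresh sample each time
spot_check = random.sample(passing_responses, k=min(3, len(passing_responses)))
print("Review these by hand:", spot_check)
```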

Expanding Your Evaluation:

Regardless of your initial results, here are ways to improve your evaluation process:

  • Try different LLM configurations to compare performance
  • Add more diverse test cases to stress-test your template
  • Experiment with different evaluation criteria to capture other quality aspects

Next Steps

Now that you've completed your first evaluation using the web interface:

  • Set up Experiment Schedules to automatically run evaluations on a regular basis and get notified if performance drops
  • Follow our SDK Quick Start to learn how to run the same workflows using the Elluminate CLI and Python SDK
  • Explore Experiments to run systematic evaluations
  • Learn about Criterion Sets to create custom evaluation criteria
  • Try Batch Processing to evaluate multiple responses at once
  • Understand Key Concepts for a deeper dive into Elluminate's features

What You've Accomplished

  • ✅ Created a prompt template with placeholders
  • ✅ Tested your template using the Response Generator
  • ✅ Added evaluation criteria manually
  • ✅ Built a template variables collection with test cases
  • ✅ Added additional test variables manually
  • ✅ Ran a complete experiment with systematic evaluation
  • ✅ Analyzed comprehensive results using advanced tools
  • ✅ Learned to interpret performance metrics and distributions

You're now ready to scale up your evaluation workflows with Elluminate!