
Response Analysis

Master the art of analyzing and evaluating your LLM experiment results

After running an experiment, elluminate provides comprehensive tools to analyze your results. This guide covers how to understand experiment outcomes, inspect individual responses, analyze criterion performance, and work with detailed ratings and reasonings.

All analysis happens through the Experiments page in your project. Navigate to your experiment and use the two-tab interface to analyze results from different angles.

Understanding Experiment Results

Two Views of Experiment Data

Every experiment in elluminate provides two complementary perspectives on your results:

[Screenshot: single-experiment detailed analysis view]

  1. Detailed Analysis: Aggregate view showing overall performance metrics and criterion-based breakdowns
  2. Individual Responses: Sample-by-sample inspection with full response content and detailed ratings

[Screenshot: Individual Responses tab]

Key Metrics Explained

Pass Rates: Percentage of responses that met each criterion

  • Pass (≥80%): Green badge, criterion consistently met
  • Warning (50-79%): Yellow badge, inconsistent performance
  • Fail (<50%): Red badge, criterion frequently unmet

Overall Performance: Aggregate scores showing experiment health (derived as in the sketch below)

  • Success Rate: Percentage of responses meeting all criteria
  • Token Usage: Input and output token consumption
  • Response Times: Average generation times per response
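
These aggregates follow directly from the per-criterion ratings. The sketch below shows one way to derive them; the data shape (a pass/fail map per response) and the criterion names are illustrative assumptions for explanation, not elluminate's API:

```python
# Illustrative data shape, not elluminate's API: one criterion -> pass/fail
# map per response. Criterion names are made up for the example.
ratings = [
    {"is_concise": True, "cites_sources": True},
    {"is_concise": True, "cites_sources": False},
    {"is_concise": False, "cites_sources": True},
]

# Pass rate per criterion, bucketed with the badge thresholds above.
for criterion in ratings[0]:
    rate = sum(r[criterion] for r in ratings) / len(ratings)
    badge = "pass" if rate >= 0.8 else "warning" if rate >= 0.5 else "fail"
    print(f"{criterion}: {rate:.0%} -> {badge}")

# Success rate: the share of responses that meet *all* criteria.
success_rate = sum(all(r.values()) for r in ratings) / len(ratings)
print(f"success rate: {success_rate:.0%}")
```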

When you navigate to an experiment's detail page, you'll find the main sections for analyzing your results:

Experiment Overview

The overview section displays:

  • Overall Performance Metrics: Success rates, token usage, response times
  • Criteria Performance: Individual criterion pass/fail rates
  • Generation Progress: Real-time status during experiment execution
  • Error Summary: Failed responses and internal errors
  • Charts: Interactive charts that allow quick filtering of responses

Detailed Analysis Tab

The "Detailed Analysis" tab provides sample-by-sample inspection of your experiment results.

Sample Navigation

  • Navigate Responses: Browse individual prompt-response pairs
  • Filter by Criteria: Focus on specific evaluation criteria
  • Sort Options: Order by score, criteria performance, or chronologically
  • Comparison Mode: When comparing experiments, view results side-by-side

Rating Details

Each response shows detailed evaluation information:

Criterion Performance

  • Pass/Fail Indicators: Clear visual status for each criterion
  • Performance Metrics: Success rates when viewing multiple responses
  • Criterion Text: Full evaluation questions with variable substitution

Reasoning Analysis

When experiments use RatingMode.DETAILED, each rating includes:

  • Explanation Text: Why the response passed or failed each criterion
  • Manual Override: Edit both ratings and reasoning if needed
  • Rating Confidence: System confidence in the evaluation
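
Conceptually, each detailed rating can be pictured as a small record per criterion. The field names below are assumptions chosen for illustration, not elluminate's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DetailedRating:
    """Illustrative shape of one criterion's rating in DETAILED mode.

    Field names are assumptions for explanation, not elluminate's schema.
    """
    criterion: str     # the evaluation question, with variables substituted
    passed: bool       # pass/fail verdict
    reasoning: str     # why the response passed or failed
    confidence: float  # system confidence in the evaluation, 0..1

rating = DetailedRating(
    criterion="Does the response cite at least one source?",
    passed=False,
    reasoning="The response makes factual claims but cites no sources.",
    confidence=0.92,
)
```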

Manual Rating Adjustments

You can manually adjust ratings when needed. To do so, open the Sample Navigator and click View next to a response to see its ratings:

[Screenshot: Sample Navigator view]

Each rating has an Edit button that opens the rating editor:

[Screenshot: Edit Rating dialog]

There you can change the pass/fail status for individual criteria and modify the explanation text.

Individual Responses Tab

The "Individual Responses" tab provides a tabular view of all experiment results.

Response Table Features

  • Sortable Columns: Order by score, tokens, response time, or other metrics
  • Filtering Options: Show only specific score ranges or criterion results
  • Bulk Operations: Export or analyze multiple responses simultaneously
  • Response Preview: View full LLM responses without leaving the table

Export and Analysis

Export your results in various formats:

  • CSV Export: Complete data including ratings, reasonings, and metadata
  • Filtered Exports: Export only responses matching your current filters
  • Response Content: Include full prompt and response text in exports
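
CSV exports are handy for ad-hoc analysis outside the UI. As a hedged sketch, the snippet below recomputes per-criterion pass rates from an export with pandas; the file name and the criterion/passed/response_id column names are assumptions about the export layout, so adjust them to match your actual columns:

```python
import pandas as pd

# Column names below are assumptions about the export layout.
df = pd.read_csv("experiment_export.csv")

# Pass rate per criterion, assuming one row per (response, criterion) pair.
pass_rates = df.groupby("criterion")["passed"].mean().sort_values()
print(pass_rates)

# Responses that failed any criterion, for targeted review.
failing = df.loc[~df["passed"], "response_id"].unique()
print(f"{len(failing)} responses failed at least one criterion")
```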

Advanced Analysis Workflows

Experiment Comparison

elluminate supports comparing multiple experiments to identify improvements and understand performance differences:

Two-Experiment Comparison

When comparing two experiments, you get detailed side-by-side analysis:

[Screenshot: two-experiment comparison view]

  • Side-by-Side View: Compare responses to identical prompts
  • Performance Deltas: See which experiment performed better per criterion
  • Statistical Significance: Understand confidence in performance differences
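
Because both experiments answer the same prompts, their pass/fail outcomes are paired per prompt. If you want a quick significance check of your own, McNemar's exact test on the discordant pairs is a standard choice; the sketch below is a general statistics example with made-up data, not a description of how elluminate computes its significance figures:

```python
from scipy.stats import binomtest

# Paired pass/fail outcomes per prompt (illustrative data).
exp_a = [True, True, False, True, False, True, True, False]
exp_b = [True, False, False, True, True, True, True, True]

# Discordant pairs: prompts where exactly one experiment passed.
a_only = sum(a and not b for a, b in zip(exp_a, exp_b))
b_only = sum(b and not a for a, b in zip(exp_a, exp_b))

# Under the null hypothesis (no difference between experiments) the
# discordant pairs split 50/50, so an exact binomial test on them is
# McNemar's exact test.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(f"A-only wins: {a_only}, B-only wins: {b_only}, p = {result.pvalue:.3f}")
```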

Multi-Experiment Comparison

For comparing three or more experiments, you get a high-level overview:

[Screenshot: multi-experiment comparison overview]

  • Performance Trending: See how experiments perform relative to each other
  • Criterion Overview: Compare pass rates across multiple experiments
  • Quick Selection: Easily select specific experiments for detailed comparison

Best Practices

Systematic Analysis Approach

  1. Start with Overview: Understand overall experiment performance
  2. Identify Patterns: Use charts to spot performance trends
  3. Drill Down: Investigate individual responses that represent patterns
  4. Document Insights: Use experiment comments to capture findings
  5. Iterate: Apply insights to improve prompts and evaluation criteria

Manual Override Guidelines

  • Document Changes: Always explain why manual overrides were necessary
  • Maintain Consistency: Apply similar standards across all manual reviews
  • Regular Calibration: Periodically review override patterns for consistency
  • Team Alignment: Ensure multiple reviewers apply consistent standards
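
One lightweight way to follow these guidelines is to attach a structured note to every override, for example in the experiment comments. The record below is a suggested convention, not an elluminate feature:

```python
# A suggested convention for documenting overrides, not an elluminate feature.
override_note = {
    "experiment_id": "exp-123",  # illustrative IDs
    "response_id": "resp-456",
    "criterion": "cites_sources",
    "old_rating": False,
    "new_rating": True,
    "reviewer": "alice",
    "reason": "Judge missed the inline citation in the second paragraph.",
    "date": "2024-05-01",
}
```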

Performance Optimization

  • Balance Speed vs. Detail: Choose appropriate rating modes for your use case
  • Strategic Sampling: Focus detailed analysis on representative samples
  • Regular Reviews: Schedule periodic analysis sessions to maintain quality

Troubleshooting

Common Analysis Challenges

  • Inconsistent Ratings: Check for prompt ambiguity or criterion clarity issues
  • Low Performance: Analyze whether issues are prompt-related or model-related
  • High Variance: Look for template variable values causing performance differences (see the sketch after this list)
  • Export Issues: Verify filter settings and permissions before exporting
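
For the high-variance case above, slicing pass rates by template variable value often pinpoints the culprits. Another hedged pandas sketch; as before, the file and column names are assumptions about the export layout:

```python
import pandas as pd

# Column names are assumptions about the export layout.
df = pd.read_csv("experiment_export.csv")

# Pass rate broken down by a template variable's value, to surface
# inputs that drag performance down (high-variance slices).
by_value = (
    df.groupby("template_variable_value")["passed"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(by_value.head(10))  # worst-performing variable values first
```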

Getting Help

When analysis results seem unexpected:

  • Review Criteria: Ensure evaluation criteria match your actual requirements
  • Check Variables: Verify template variables are being substituted correctly
  • Validate Prompts: Confirm prompts are generating expected response types
  • Contact Support: Reach out with specific experiment IDs for detailed assistance at [email protected]