Response Analysis¶
Master the art of analyzing and evaluating your LLM experiment results
After running an experiment, elluminate provides comprehensive tools to analyze your results. This guide covers how to understand experiment outcomes, inspect individual responses, analyze criterion performance, and work with detailed ratings and reasonings.
All analysis happens through the Experiments page in your project. Navigate to your experiment and use the two-tab interface to analyze results from different angles.
Understanding Experiment Results¶
Two Views of Experiment Data¶
Every experiment in elluminate provides two complementary perspectives on your results:
- Detailed Analysis: Sample-by-sample inspection with full response content, criterion ratings, and reasonings
- Individual Responses: Tabular view of all results with sorting, filtering, and export options
Key Metrics Explained¶
Pass Rates: Percentage of responses that met each criterion
- Pass (≥80%): Green badge, criterion consistently met
- Warning (50-79%): Yellow badge, inconsistent performance
- Fail (<50%): Red badge, criterion frequently unmet
Overall Performance: Aggregate scores showing experiment health
- Success Rate: Percentage of responses meeting all criteria
- Token Usage: Input and output token consumption
- Response Times: Average generation time per response
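To make these numbers concrete, here is a minimal sketch in plain Python, using made-up example data rather than elluminate's actual data model, of how per-criterion pass rates map to badges and how the overall success rate relates to them:

```python
# Illustrative only: `responses` is made-up data, not elluminate's internal model.
# Each entry maps a criterion to whether the response passed it.
responses = [
    {"Is the answer polite?": True,  "Does it cite a source?": True},
    {"Is the answer polite?": True,  "Does it cite a source?": False},
    {"Is the answer polite?": False, "Does it cite a source?": True},
]

def badge(pass_rate: float) -> str:
    """Map a criterion pass rate to the badge shown in the UI."""
    if pass_rate >= 0.80:
        return "Pass (green)"
    if pass_rate >= 0.50:
        return "Warning (yellow)"
    return "Fail (red)"

for criterion in responses[0].keys():
    rate = sum(r[criterion] for r in responses) / len(responses)
    print(f"{criterion}: {rate:.0%} -> {badge(rate)}")

# Success rate: share of responses that met *all* criteria.
success_rate = sum(all(r.values()) for r in responses) / len(responses)
print(f"Success rate: {success_rate:.0%}")
```

Because a single failed criterion fails the whole response, the success rate can never exceed the lowest per-criterion pass rate.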
When you navigate to an experiment's detail page, you'll find the main sections for analyzing your results:
Experiment Overview¶
The overview section displays:
- Overall Performance Metrics: Success rates, token usage, response times
- Criteria Performance: Individual criterion pass/fail rates
- Generation Progress: Real-time status during experiment execution
- Error Summary: Failed responses and internal errors
- Charts: Interactive charts that let you quickly filter responses
Detailed Analysis Tab¶
The "Detailed Analysis" tab provides sample-by-sample inspection of your experiment results.
Sample Navigation¶
- Navigate Responses: Browse individual prompt-response pairs
- Filter by Criteria: Focus on specific evaluation criteria
- Sort Options: Order by score, criteria performance, or chronologically
- Comparison Mode: When comparing experiments, view results side-by-side
Rating Details¶
Each response shows detailed evaluation information:
Criterion Performance¶
- Pass/Fail Indicators: Clear visual status for each criterion
- Performance Metrics: Success rates when viewing multiple responses
- Criterion Text: Full evaluation questions with variable substitution
Reasoning Analysis¶
When experiments use RatingMode.DETAILED, each rating includes:
- Explanation Text: Why the response passed or failed each criterion
- Manual Override: Edit both ratings and reasoning if needed
- Rating Confidence: System confidence in the evaluation
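Conceptually, each detailed rating bundles the verdict with its explanation. The sketch below shows one possible shape for such a record; the field names are illustrative only, not the SDK's actual types:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative shape only; field names are hypothetical, not elluminate's SDK types.
@dataclass
class DetailedRating:
    criterion: str             # full evaluation question, with variables substituted
    passed: bool               # pass/fail verdict for this criterion
    reasoning: str             # explanation of why the response passed or failed
    confidence: Optional[float] = None  # system confidence in the evaluation, if reported
    manually_overridden: bool = False   # set when a reviewer edits the rating

rating = DetailedRating(
    criterion="Does the response answer the user's question directly?",
    passed=False,
    reasoning="The response restates the question but never gives a concrete answer.",
    confidence=0.82,
)
```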
Manual Rating Adjustments¶
You can manually adjust ratings when needed. Navigate to the Sample Navigator and click View next to a response to see its ratings, then use the Edit button next to each individual response to change the pass/fail status for individual criteria and modify the explanation text.
Individual Responses Tab¶
The "Individual Responses" tab provides a tabular view of all experiment results.
Response Table Features¶
- Sortable Columns: Order by score, tokens, response time, or other metrics
- Filtering Options: Show only specific score ranges or criterion results
- Bulk Operations: Export or analyze multiple responses simultaneously
- Response Preview: View full LLM responses without leaving the table
Export and Analysis¶
Export your results in various formats:
- CSV Export: Complete data including ratings, reasonings, and metadata
- Filtered Exports: Export only responses matching your current filters
- Response Content: Include full prompt and response text in exports
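For offline analysis, a CSV export pairs well with pandas. The sketch below assumes the export contains response_id, criterion, passed, and reasoning columns; adjust the names to match the headers in your actual file:

```python
import pandas as pd

# Placeholder file and column names; adjust to your actual export.
df = pd.read_csv("experiment_export.csv")

# Assumes `passed` is stored as a boolean or 0/1 value.
df["passed"] = df["passed"].astype(bool)

# Per-criterion pass rate across all exported responses.
pass_rates = df.groupby("criterion")["passed"].mean().sort_values()
print(pass_rates)

# Responses that failed at least one criterion, for closer inspection.
failing = df.loc[~df["passed"], ["response_id", "criterion", "reasoning"]]
print(failing.head())
```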
Advanced Analysis Workflows¶
Experiment Comparison¶
elluminate supports comparing multiple experiments to identify improvements and understand performance differences:
Two-Experiment Comparison¶
When comparing two experiments, you get detailed side-by-side analysis:
- Side-by-Side View: Compare responses to identical prompts
- Performance Deltas: See which experiment performed better per criterion
- Statistical Significance: Understand confidence in performance differences
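Whether a pass-rate difference between two experiments is meaningful depends on how many samples each one ran. As a rough sanity check outside the UI, you can apply a two-proportion z-test; the helper below is a standalone sketch using only the Python standard library, not an elluminate API:

```python
from math import sqrt, erfc

def two_proportion_pvalue(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates
    (normal-approximation z-test; reasonable for larger sample counts)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return erfc(abs(z) / sqrt(2))  # two-sided tail probability

# Example: a criterion passed 41/50 times in experiment A vs. 30/50 in experiment B.
print(two_proportion_pvalue(41, 50, 30, 50))  # ~0.015 -> unlikely to be chance
```

For very small sample counts, prefer an exact test such as Fisher's over this normal approximation.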
Multi-Experiment Comparison¶
For comparing three or more experiments, you get a high-level overview:
- Performance Trending: See how experiments perform relative to each other
- Criterion Overview: Compare pass rates across multiple experiments (see the sketch after this list)
- Quick Selection: Easily select specific experiments for detailed comparison
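If you want this criterion overview outside the UI, a pivot over a combined export works well. A minimal sketch, assuming hypothetical experiment, criterion, and passed columns:

```python
import pandas as pd

# Hypothetical combined export covering several experiments; adjust column names to your data.
df = pd.read_csv("all_experiments_export.csv")

# Rows: criteria, columns: experiments, values: pass rate per criterion.
overview = df.pivot_table(index="criterion", columns="experiment",
                          values="passed", aggfunc="mean")
print(overview.round(2))
```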
Best Practices¶
Systematic Analysis Approach¶
- Start with Overview: Understand overall experiment performance
- Identify Patterns: Use charts to spot performance trends
- Drill Down: Investigate individual responses that represent patterns
- Document Insights: Use experiment comments to capture findings
- Iterate: Apply insights to improve prompts and evaluation criteria
Manual Override Guidelines¶
- Document Changes: Always explain why manual overrides were necessary
- Maintain Consistency: Apply similar standards across all manual reviews
- Regular Calibration: Periodically review override patterns for consistency
- Team Alignment: Ensure multiple reviewers apply consistent standards
Performance Optimization¶
- Balance Speed vs. Detail: Choose appropriate rating modes for your use case
- Strategic Sampling: Focus detailed analysis on representative samples
- Regular Reviews: Schedule periodic analysis sessions to maintain quality
Troubleshooting¶
Common Analysis Challenges¶
- Inconsistent Ratings: Check for prompt ambiguity or criterion clarity issues
- Low Performance: Analyze whether issues are prompt-related or model-related
- High Variance: Look for template variable values causing performance differences (see the sketch after this list)
- Export Issues: Verify filter settings and permissions before exporting
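To chase down high variance, slice pass rates by the template variable values used in each prompt. A minimal sketch over a hypothetical export, where topic stands in for whichever variable you suspect:

```python
import pandas as pd

# Hypothetical export; "topic" stands in for whichever template variable you suspect.
# Assumes `passed` is a boolean or 0/1 column.
df = pd.read_csv("experiment_export.csv")

by_variable = (
    df.groupby("topic")["passed"]
      .agg(pass_rate="mean", n="count")
      .sort_values("pass_rate")
)
print(by_variable)  # low pass-rate groups point to variable values that hurt performance
```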
Getting Help¶
When analysis results seem unexpected:
- Review Criteria: Ensure evaluation criteria match your actual requirements
- Check Variables: Verify template variables are being substituted correctly
- Validate Prompts: Confirm prompts are generating expected response types
- Contact Support: Reach out with specific experiment IDs for detailed assistance at [email protected]