Response Analysis¶
Master the art of analyzing and evaluating your LLM experiment results
After running an experiment, elluminate provides comprehensive tools to analyze your results. This guide covers how to understand experiment outcomes, inspect individual responses, analyze criterion performance, and work with detailed ratings and reasonings.
All analysis happens through the Experiments page in your project. Navigate to your experiment and use the two-tab interface to analyze results from different angles.
Understanding Experiment Results¶
Two Views of Experiment Data¶
Every experiment in elluminate provides two complementary perspectives on your results:
- Detailed Analysis: Sample-by-sample inspection with full response content, criterion ratings, and reasonings
- Individual Responses: Tabular view of all results with sorting, filtering, and export options
Key Metrics Explained¶
Pass Rates: Percentage of responses that met each criterion
- Pass (≥80%): Green badge, criterion consistently met
- Warning (50-79%): Yellow badge, inconsistent performance
- Fail (<50%): Red badge, criterion frequently unmet
Overall Performance: Aggregate scores showing experiment health
- Success Rate: Percentage of responses meeting all criteria
- Token Usage: Input and output token consumption
- Response Times: Average generation time per response
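To make these numbers concrete, here is a minimal sketch in plain Python, using made-up example data rather than elluminate's actual data model, of how per-criterion pass rates map to badges and how the overall success rate relates to them:

```python
# Illustrative only: `responses` is made-up data, not elluminate's internal model.
# Each entry maps a criterion to whether the response passed it.
responses = [
    {"Is the answer polite?": True,  "Does it cite a source?": True},
    {"Is the answer polite?": True,  "Does it cite a source?": False},
    {"Is the answer polite?": False, "Does it cite a source?": True},
]

def badge(pass_rate: float) -> str:
    """Map a criterion pass rate to the badge shown in the UI."""
    if pass_rate >= 0.80:
        return "Pass (green)"
    if pass_rate >= 0.50:
        return "Warning (yellow)"
    return "Fail (red)"

for criterion in responses[0].keys():
    rate = sum(r[criterion] for r in responses) / len(responses)
    print(f"{criterion}: {rate:.0%} -> {badge(rate)}")

# Success rate: share of responses that met *all* criteria.
success_rate = sum(all(r.values()) for r in responses) / len(responses)
print(f"Success rate: {success_rate:.0%}")
```

Because a single failed criterion fails the whole response, the success rate can never exceed the lowest per-criterion pass rate.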
When you navigate to an experiment's detail page, you'll find the main sections for analyzing your results:
Experiment Overview¶
The overview section displays:
- Overall Performance Metrics: Success rates, token usage, response times
- Criteria Performance: Individual criterion pass/fail rates
- Generation Progress: Real-time status during experiment execution
- Error Summary: Failed responses and internal errors
- Charts: Interactive charts that let you quickly filter responses
Detailed Analysis Tab¶
The "Detailed Analysis" tab provides sample-by-sample inspection of your experiment results.
Sample Navigation¶
- Navigate Responses: Browse individual prompt-response pairs
- Filter by Criteria: Focus on specific evaluation criteria
- Sort Options: Order by score, criteria performance, or chronologically
- Comparison Mode: When comparing experiments, view results side-by-side
Rating Details¶
Each response shows detailed evaluation information:
Criterion Performance¶
- Pass/Fail Indicators: Clear visual status for each criterion
- Performance Metrics: Success rates when viewing multiple responses
- Criterion Text: Full evaluation questions with variable substitution
Reasoning Analysis¶
When experiments use RatingMode.DETAILED, each rating includes:
- Explanation Text: Why the response passed or failed each criterion
- Manual Override: Edit both ratings and reasoning if needed
- Rating Confidence: System confidence in the evaluation
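Conceptually, each detailed rating bundles the verdict with its explanation. The sketch below shows one possible shape for such a record; the field names are illustrative only, not the SDK's actual types:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative shape only; field names are hypothetical, not elluminate's SDK types.
@dataclass
class DetailedRating:
    criterion: str             # full evaluation question, with variables substituted
    passed: bool               # pass/fail verdict for this criterion
    reasoning: str             # explanation of why the response passed or failed
    confidence: Optional[float] = None  # system confidence in the evaluation, if reported
    manually_overridden: bool = False   # set when a reviewer edits the rating

rating = DetailedRating(
    criterion="Does the response answer the user's question directly?",
    passed=False,
    reasoning="The response restates the question but never gives a concrete answer.",
    confidence=0.82,
)
```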
Manual Rating Adjustments¶
You can manually adjust ratings when needed. Navigate to the Sample Navigator and click View next to a response to see its ratings, then use the Edit button next to each individual response to change the pass/fail status for individual criteria and modify the explanation text.
Individual Responses Tab¶
The "Individual Responses" tab provides a tabular view of all experiment results.
Response Table Features¶
- Sortable Columns: Order by score, tokens, response time, or other metrics
- Filtering Options: Show only specific score ranges or criterion results
- Bulk Operations: Export or analyze multiple responses simultaneously
- Response Preview: View full LLM responses without leaving the table
Export and Analysis¶
Export your results in various formats:
- CSV Export: Complete data including ratings, reasonings, and metadata
- Filtered Exports: Export only responses matching your current filters
- Response Content: Include full prompt and response text in exports
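For offline analysis, a CSV export pairs well with pandas. The sketch below assumes the export contains response_id, criterion, passed, and reasoning columns; adjust the names to match the headers in your actual file:

```python
import pandas as pd

# Placeholder file and column names; adjust to your actual export.
df = pd.read_csv("experiment_export.csv")

# Assumes `passed` is stored as a boolean or 0/1 value.
df["passed"] = df["passed"].astype(bool)

# Per-criterion pass rate across all exported responses.
pass_rates = df.groupby("criterion")["passed"].mean().sort_values()
print(pass_rates)

# Responses that failed at least one criterion, for closer inspection.
failing = df.loc[~df["passed"], ["response_id", "criterion", "reasoning"]]
print(failing.head())
```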
Advanced Analysis Workflows¶
Experiment Comparison¶
elluminate supports comparing multiple experiments to identify improvements and understand performance differences:
Two-Experiment Comparison¶
When comparing two experiments, you get detailed side-by-side analysis:
- Side-by-Side View: Compare responses to identical prompts
- Performance Deltas: See which experiment performed better per criterion
- Statistical Significance: Understand confidence in performance differences
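Whether a pass-rate difference between two experiments is meaningful depends on how many samples each one ran. As a rough sanity check outside the UI, you can apply a two-proportion z-test; the helper below is a standalone sketch using only the Python standard library, not an elluminate API:

```python
from math import sqrt, erfc

def two_proportion_pvalue(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates
    (normal-approximation z-test; reasonable for larger sample counts)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return erfc(abs(z) / sqrt(2))  # two-sided tail probability

# Example: a criterion passed 41/50 times in experiment A vs. 30/50 in experiment B.
print(two_proportion_pvalue(41, 50, 30, 50))  # ~0.015 -> unlikely to be chance
```

For very small sample counts, prefer an exact test such as Fisher's over this normal approximation.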
Multi-Experiment Comparison¶
For comparing three or more experiments, you get a high-level overview:
- Performance Trending: See how experiments perform relative to each other
- Criterion Overview: Compare pass rates across multiple experiments (see the sketch after this list)
- Quick Selection: Easily select specific experiments for detailed comparison
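If you want this criterion overview outside the UI, a pivot over a combined export works well. A minimal sketch, assuming hypothetical experiment, criterion, and passed columns:

```python
import pandas as pd

# Hypothetical combined export covering several experiments; adjust column names to your data.
df = pd.read_csv("all_experiments_export.csv")

# Rows: criteria, columns: experiments, values: pass rate per criterion.
overview = df.pivot_table(index="criterion", columns="experiment",
                          values="passed", aggfunc="mean")
print(overview.round(2))
```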
Best Practices¶
Systematic Analysis Approach¶
- Start with Overview: Understand overall experiment performance
- Identify Patterns: Use charts to spot performance trends
- Drill Down: Investigate individual responses that represent patterns
- Document Insights: Use experiment comments to capture findings
- Iterate: Apply insights to improve prompts and evaluation criteria
Manual Override Guidelines¶
- Document Changes: Always explain why manual overrides were necessary
- Maintain Consistency: Apply similar standards across all manual reviews
- Regular Calibration: Periodically review override patterns for consistency
- Team Alignment: Ensure multiple reviewers apply consistent standards
Performance Optimization¶
- Balance Speed vs. Detail: Choose appropriate rating modes for your use case
- Strategic Sampling: Focus detailed analysis on representative samples
- Regular Reviews: Schedule periodic analysis sessions to maintain quality
Troubleshooting¶
Common Analysis Challenges¶
- Inconsistent Ratings: Check for prompt ambiguity or criterion clarity issues
- Low Performance: Analyze whether issues are prompt-related or model-related
- High Variance: Look for template variable values causing performance differences (see the sketch after this list)
- Export Issues: Verify filter settings and permissions before exporting
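To chase down high variance, slice pass rates by the template variable values used in each prompt. A minimal sketch over a hypothetical export, where topic stands in for whichever variable you suspect:

```python
import pandas as pd

# Hypothetical export; "topic" stands in for whichever template variable you suspect.
# Assumes `passed` is a boolean or 0/1 column.
df = pd.read_csv("experiment_export.csv")

by_variable = (
    df.groupby("topic")["passed"]
      .agg(pass_rate="mean", n="count")
      .sort_values("pass_rate")
)
print(by_variable)  # low pass-rate groups point to variable values that hurt performance
```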
Getting Help¶
When analysis results seem unexpected:
- Review Criteria: Ensure evaluation criteria match your actual requirements
- Check Variables: Verify template variables are being substituted correctly
- Validate Prompts: Confirm prompts are generating expected response types
- Contact Support: Reach out with specific experiment IDs for detailed assistance at [email protected]