Structured Outputs

Master the evaluation of programmatically formatted LLM responses, essential for agentic applications

Structured outputs let the developer enforce that the LLM produces responses in a programmatically deterministic format. Agentic programs rely heavily on this feature to enable interoperability between code paths and LLM responses, which makes evaluating structured outputs an essential part of evaluating agents.

Basic Usage

An example showing how to use Pydantic models for structured output generation and evaluation:

"""v1.0 API: Structured Outputs Example

Demonstrates how to use Pydantic models as response_format for structured
LLM outputs. The model analyzes product reviews and extracts structured
sentiment analysis data.

v1.0 API changes:
- client.prompt_templates.aget_or_create() -> client.get_or_create_prompt_template()
- client.collections.aget_or_create() -> client.get_or_create_collection()
- client.criteria.aadd_many() -> criterion_set.add_criteria()
- client.experiments.aget_or_create() -> client.run_experiment()
- Sync-first approach (no asyncio.run needed)

INSIGHT: Structured outputs use Pydantic models to enforce response schemas.
The LLM is instructed to return JSON matching the schema, and the response
is validated against it. This is powerful for data extraction tasks.
"""

from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import RatingMode
from pydantic import BaseModel, Field

load_dotenv(override=True)


class ProductReviewSentimentAnalysis(BaseModel):
    """Schema for structured sentiment analysis of reviews."""

    stars: int = Field(description="Number of stars of the review", ge=1, le=5)
    sentiment: str = Field(
        description="Overall sentiment: positive, negative, or neutral",
        pattern="^(positive|negative|neutral)$",
    )
    confidence: float = Field(
        description="Confidence score of the sentiment analysis between 0 and 1",
        ge=0,
        le=1,
    )


def main():
    client = Client()

    print("v1.0: Structured Outputs Example")
    print("=" * 50)

    # v1.0: get_or_create_prompt_template with response_format
    template, created = client.get_or_create_prompt_template(
        name="Product Review Analysis v1",
        messages="""Analyze this product review and extract key information:

Review: {{review_text}}

Provide review stars, sentiment and confidence score.""",
        response_format=ProductReviewSentimentAnalysis,
    )
    print(f"Template: {'Created' if created else 'Found existing'}")

    # v1.0: get_or_create_collection with variables at creation time
    collection, created = client.get_or_create_collection(
        name="Product Review Data v1",
    )
    if created:
        collection.add_many(
            variables=[
                {
                    "review_text": "Stars: **** Great wireless headphones! Audio quality is fantastic and noise cancellation works perfectly. Battery could be better but overall very satisfied."
                },
                {
                    "review_text": "Stars: ** Poor laptop experience. Screen flickered after 2 weeks, customer service was helpful, but would not recommend this product."
                },
            ]
        )
    print(f"Collection: {'Created' if created else 'Found existing'}")

    # v1.0: get_or_create_criterion_set with add_criteria
    criterion_set, created = client.get_or_create_criterion_set(name="Review Analysis Criteria v1")
    if created:
        criterion_set.add_criteria(
            [
                "In the 'stars' field, is the counted number of stars correct?",
                "Does the 'sentiment' field accurately reflect the review's tone?",
                "Is the 'confidence' score appropriate for the certainty of the sentiment?",
            ]
        )
    print(f"Criterion set: {'Created' if created else 'Found existing'}")

    # v1.0: run_experiment - creates and runs in one call
    print("\nRunning experiment...")
    experiment = client.run_experiment(
        name="Review Analysis Experiment v1",
        prompt_template=template,
        collection=collection,
        criterion_set=criterion_set,
        rating_mode=RatingMode.FAST,
        n_epochs=1,
    )

    print(f"\nExperiment completed: {experiment.name}")
    print(f"Total responses: {len(experiment.rated_responses)}")

    # Display results
    for i, response in enumerate(experiment.responses(), 1):
        print(f"\n--- Example {i} ---")
        review_text = response.prompt.template_variables.input_values["review_text"]
        print(f"Review: {review_text[:80]}...")
        print("Analysis:")
        for message in response.messages:
            if message.role == "assistant":
                print(f"  {message.content}")

    # Show aggregated results
    if experiment.result:
        print("\n--- Results ---")
        print(f"Pass rate: {experiment.result.mean_all_ratings.yes:.2%}")


# =========================================================================
# Migration Insights
# =========================================================================
#
# 1. RESPONSE FORMAT
#    Both versions: Pass Pydantic model class as response_format
#    v1.0: Pass as keyword argument (part of template identity)
#    The SDK automatically converts to JSON schema for the API
#
# 2. CRITERIA SETUP
#    v0.x: client.criteria.aadd_many([...], template, delete_existing=True)
#    v1.0: criterion_set.add_criteria([...])
#    Note: v1.0 links criteria to criterion_set, not directly to template
#    The criterion_set is then linked to the template via experiment
#
# 3. DELETE_EXISTING PATTERN
#    v0.x: delete_existing=True to replace criteria
#    v1.0: Create new criterion_set or check if created before adding
#    This is a cleaner pattern that doesn't destroy existing data
#
# 4. PYDANTIC FIELD CONSTRAINTS
#    Both versions support Field() with:
#    - ge/le for numeric ranges
#    - pattern for string regex validation
#    - description for LLM guidance
#    These constraints help the LLM generate valid responses
#
# 5. SDK ENHANCEMENT OPPORTUNITY
#    The v0.x delete_existing=True pattern is useful for iteration.
#    Consider adding criterion_set.clear() or criterion_set.replace_criteria()
#    for easier iteration during development.
#


if __name__ == "__main__":
    main()

  1. Define Schema: Create a Pydantic model with field descriptions and basic constraints to define the exact JSON structure you want the LLM to return
  2. Create Template: Use the response_format parameter when creating a prompt template to specify that responses should follow your Pydantic model structure
  3. Add Criteria: Define evaluation criteria that reference specific schema fields - criteria can also be auto-generated as usual
  4. Run Experiment: Create and run experiments normally - the structured output format will be enforced automatically for all response generations
  5. Access Responses: The structured outputs can be found in the assistant message's content key as a JSON string, as shown in the sketch below
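
A minimal sketch of step 5, reusing the experiment and ProductReviewSentimentAnalysis model from the example above (the message layout follows the display loop shown there):

for response in experiment.responses():
    for message in response.messages:
        if message.role == "assistant":
            # The content is a JSON string matching the schema; validate it
            # back into the Pydantic model for typed access
            analysis = ProductReviewSentimentAnalysis.model_validate_json(message.content)
            print(analysis.stars, analysis.sentiment, analysis.confidence)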

Schema Definition Methods

Pydantic Models

Pydantic models provide the most intuitive and recommended way to define structured output schemas. Simply set the response_format to the Pydantic class definition, and elluminate handles the rest.
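
As a compact reminder of the pattern from the full example above (the template name and message text here are illustrative):

template, created = client.get_or_create_prompt_template(
    name="Product Review Analysis v1",
    messages="Analyze this product review: {{review_text}}",
    # Pass the Pydantic class itself (not an instance); the SDK converts it
    # to a JSON schema for the API
    response_format=ProductReviewSentimentAnalysis,
)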

OpenAI JSON Schema Format

In addition to Pydantic models, you may also set the response_format directly with an OpenAI JSON Schema definition:

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "sentiment",
        "schema": {
            "type": "object",
            "properties": {
                "stars": {
                    "type": "integer",
                    "description": "Number of stars of the review",
                    "minimum": 1,
                    "maximum": 5
                },
                "sentiment": {
                    "type": "string",
                    "description": "The sentiment output, could be positive, negative, or neutral.",
                    "enum": [
                        "positive",
                        "negative",
                        "neutral"
                    ]
                },
                "confidence": {
                    "type": "number",
                    "description": "Confidence score of the sentiment analysis between 0 and 1",
                    "minimum": 0,
                    "maximum": 1
                }
            },
            "required": [
                "stars",
                "sentiment",
                "confidence"
            ],
            "additionalProperties": False
        }
    }
}
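
The resulting dictionary is then passed as the response_format in the same way as a Pydantic class (a sketch; the template name and message text are illustrative):

template, created = client.get_or_create_prompt_template(
    name="Product Review Analysis (JSON Schema)",
    messages="Analyze this product review: {{review_text}}",
    # Raw OpenAI JSON Schema dict instead of a Pydantic class
    response_format=schema,
)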

AI-Powered Schema Generation

Schema Generation Dialog

The frontend provides an AI-powered schema generator that creates JSON schemas from natural language descriptions. Simply describe what you want to extract, and elluminate will generate an appropriate schema.

Evaluating Structured Outputs

The rating model has access to all field descriptions from your structured output schema, which provides valuable context about what each field should contain and how it should be interpreted. To evaluate structured outputs, simply create criteria and run an experiment as usual.

Using Field Names in Criteria

It may be beneficial to use field names from your schema in the criteria. This helps the rating model understand exactly which part of the JSON structure to focus on. For example, "Does the 'sentiment' field..." is more precise than "Is the sentiment correct?"
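
As a sketch, using the criterion_set from the example above:

criterion_set.add_criteria(
    [
        # Field-level criteria point the rating model at specific JSON keys
        "Does the 'sentiment' field accurately reflect the review's tone?",
        "Is the 'confidence' score appropriate for the certainty of the sentiment?",
    ]
)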