Structured Outputs

Master the evaluation of programmatically formatted LLM responses, essential for agentic applications

Structured outputs let the developer enforce that the LLM produces responses in a programmatically deterministic format. Agentic programs rely heavily on this feature to enable interoperability between code paths and LLM responses, which makes evaluating structured outputs an essential part of evaluating agents.

Basic Usage

An example showing how to use Pydantic models for structured output generation and evaluation:

"""v1.0 API: Structured Outputs Example

Demonstrates how to use Pydantic models as response_format for structured
LLM outputs. The model analyzes product reviews and extracts structured
sentiment analysis data.

v1.0 API changes:
- client.prompt_templates.aget_or_create() -> client.get_or_create_prompt_template()
- client.collections.aget_or_create() -> client.get_or_create_collection()
- client.criteria.aadd_many() -> criterion_set.add_criteria()
- client.experiments.aget_or_create() -> client.run_experiment()
- Sync-first approach (no asyncio.run needed)

INSIGHT: Structured outputs use Pydantic models to enforce response schemas.
The LLM is instructed to return JSON matching the schema, and the response
is validated against it. This is powerful for data extraction tasks.
"""

from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import RatingMode
from pydantic import BaseModel, Field

load_dotenv(override=True)


class ProductReviewSentimentAnalysis(BaseModel):
    """Schema for structured sentiment analysis of reviews."""

    stars: int = Field(description="Number of stars of the review", ge=1, le=5)
    sentiment: str = Field(
        description="Overall sentiment: positive, negative, or neutral",
        pattern="^(positive|negative|neutral)$",
    )
    confidence: float = Field(
        description="Confidence score of the sentiment analysis between 0 and 1",
        ge=0,
        le=1,
    )


def main():
    client = Client()

    print("v1.0: Structured Outputs Example")
    print("=" * 50)

    # v1.0: get_or_create_prompt_template with response_format
    template, created = client.get_or_create_prompt_template(
        name="Product Review Analysis v1",
        messages="""Analyze this product review and extract key information:

Review: {{review_text}}

Provide review stars, sentiment and confidence score.""",
        response_format=ProductReviewSentimentAnalysis,
    )
    print(f"Template: {'Created' if created else 'Found existing'}")

    # v1.0: get_or_create_collection with variables at creation time
    collection, created = client.get_or_create_collection(
        name="Product Review Data v1",
    )
    if created:
        collection.add_many(
            variables=[
                {
                    "review_text": "Stars: **** Great wireless headphones! Audio quality is fantastic and noise cancellation works perfectly. Battery could be better but overall very satisfied."
                },
                {
                    "review_text": "Stars: ** Poor laptop experience. Screen flickered after 2 weeks, customer service was helpful, but would not recommend this product."
                },
            ]
        )
    print(f"Collection: {'Created' if created else 'Found existing'}")

    # v1.0: get_or_create_criterion_set with add_criteria
    criterion_set, created = client.get_or_create_criterion_set(name="Review Analysis Criteria v1")
    if created:
        criterion_set.add_criteria(
            [
                "In the 'stars' field, is the counted number of stars correct?",
                "Does the 'sentiment' field accurately reflect the review's tone?",
                "Is the 'confidence' score appropriate for the certainty of the sentiment?",
            ]
        )
    print(f"Criterion set: {'Created' if created else 'Found existing'}")

    # v1.0: run_experiment - creates and runs in one call
    print("\nRunning experiment...")
    experiment = client.run_experiment(
        name="Review Analysis Experiment v1",
        prompt_template=template,
        collection=collection,
        criterion_set=criterion_set,
        rating_mode=RatingMode.FAST,
        n_epochs=1,
    )

    print(f"\nExperiment completed: {experiment.name}")
    print(f"Total responses: {len(experiment.rated_responses)}")

    # Display results
    for i, response in enumerate(experiment.responses(), 1):
        print(f"\n--- Example {i} ---")
        review_text = response.prompt.template_variables.input_values["review_text"]
        print(f"Review: {review_text[:80]}...")
        print("Analysis:")
        for message in response.messages:
            if message.role == "assistant":
                print(f"  {message.content}")

    # Show aggregated results
    if experiment.result:
        print("\n--- Results ---")
        print(f"Pass rate: {experiment.result.mean_all_ratings.yes:.2%}")


# =========================================================================
# Migration Insights
# =========================================================================
#
# 1. RESPONSE FORMAT
#    Both versions: Pass Pydantic model class as response_format
#    v1.0: Pass as keyword argument (part of template identity)
#    The SDK automatically converts to JSON schema for the API
#
# 2. CRITERIA SETUP
#    v0.x: client.criteria.aadd_many([...], template, delete_existing=True)
#    v1.0: criterion_set.add_criteria([...])
#    Note: v1.0 links criteria to criterion_set, not directly to template
#    The criterion_set is then linked to the template via experiment
#
# 3. DELETE_EXISTING PATTERN
#    v0.x: delete_existing=True to replace criteria
#    v1.0: Create new criterion_set or check if created before adding
#    This is a cleaner pattern that doesn't destroy existing data
#
# 4. PYDANTIC FIELD CONSTRAINTS
#    Both versions support Field() with:
#    - ge/le for numeric ranges
#    - pattern for string regex validation
#    - description for LLM guidance
#    These constraints help the LLM generate valid responses
#
# 5. SDK ENHANCEMENT OPPORTUNITY
#    The v0.x delete_existing=True pattern is useful for iteration.
#    Consider adding criterion_set.clear() or criterion_set.replace_criteria()
#    for easier iteration during development.
#


if __name__ == "__main__":
    main()

  1. Define Schema: Create a Pydantic model with field descriptions and basic constraints to define the exact JSON structure you want the LLM to return
  2. Create Template: Use the response_format parameter when creating a prompt template to specify that responses should follow your Pydantic model structure
  3. Add Criteria: Define evaluation criteria that reference specific schema fields - criteria can also be auto-generated as usual
  4. Run Experiment: Create and run experiments normally - the structured output format will be enforced automatically for all response generations
  5. Access Responses: The structured outputs can be found in the assistant message's content key as a JSON string, as shown in the sketch below
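
A minimal sketch of step 5, reusing the experiment and ProductReviewSentimentAnalysis model from the example above (the message layout follows the display loop shown there):

for response in experiment.responses():
    for message in response.messages:
        if message.role == "assistant":
            # The content is a JSON string matching the schema; validate it
            # back into the Pydantic model for typed access
            analysis = ProductReviewSentimentAnalysis.model_validate_json(message.content)
            print(analysis.stars, analysis.sentiment, analysis.confidence)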

Schema Definition Methods

Pydantic Models

Pydantic models provide the most intuitive and recommended way to define structured output schemas. Simply set the response_format to the Pydantic class definition, and elluminate handles the rest.
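
As a compact reminder of the pattern from the full example above (the template name and message text here are illustrative):

template, created = client.get_or_create_prompt_template(
    name="Product Review Analysis v1",
    messages="Analyze this product review: {{review_text}}",
    # Pass the Pydantic class itself (not an instance); the SDK converts it
    # to a JSON schema for the API
    response_format=ProductReviewSentimentAnalysis,
)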

OpenAI JSON Schema Format

In addition to Pydantic models, you may also set the response_format directly with an OpenAI JSON Schema definition:

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "sentiment",
        "schema": {
            "type": "object",
            "properties": {
                "stars": {
                    "type": "integer",
                    "description": "Number of stars of the review",
                    "minimum": 1,
                    "maximum": 5
                },
                "sentiment": {
                    "type": "string",
                    "description": "The sentiment output, could be positive, negative, or neutral.",
                    "enum": [
                        "positive",
                        "negative",
                        "neutral"
                    ]
                },
                "confidence": {
                    "type": "number",
                    "description": "Confidence score of the sentiment analysis between 0 and 1",
                    "minimum": 0,
                    "maximum": 1
                }
            },
            "required": [
                "stars",
                "sentiment",
                "confidence"
            ],
            "additionalProperties": False
        }
    }
}
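
The resulting dictionary is then passed as the response_format in the same way as a Pydantic class (a sketch; the template name and message text are illustrative):

template, created = client.get_or_create_prompt_template(
    name="Product Review Analysis (JSON Schema)",
    messages="Analyze this product review: {{review_text}}",
    # Raw OpenAI JSON Schema dict instead of a Pydantic class
    response_format=schema,
)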

AI-Powered Schema Generation

Schema Generation Dialog

The frontend provides an AI-powered schema generator that creates JSON schemas from natural language descriptions. Simply describe what you want to extract, and elluminate will generate an appropriate schema.

Evaluating Structured Outputs

The rating model has access to all field descriptions from your structured output schema, which provides valuable context about what each field should contain and how it should be interpreted. To evaluate structured outputs, simply create criteria and run an experiment as usual.

Using Field Names in Criteria

It may be beneficial to use field names from your schema in the criteria. This helps the rating model understand exactly which part of the JSON structure to focus on. For example, "Does the 'sentiment' field..." is more precise than "Is the sentiment correct?"
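
As a sketch, using the criterion_set from the example above:

criterion_set.add_criteria(
    [
        # Field-level criteria point the rating model at specific JSON keys
        "Does the 'sentiment' field accurately reflect the review's tone?",
        "Is the 'confidence' score appropriate for the certainty of the sentiment?",
    ]
)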