
Agentic Evaluations

Evaluate autonomous agents end-to-end: tasks, trajectories, trace-based criteria, and aggregate agent metrics.

Agentic evaluations extend elluminate beyond single-turn LLM outputs to cover agents that plan, call tools, and work against a task description over many steps. You run the agent externally (with Harbor, a custom harness, or any framework whose output you can translate to the ATIF trajectory format) and upload the trial results, including full trajectories, to elluminate. elluminate then rates each criterion against the trajectory, and the UI surfaces the trace, per-criterion ratings, and aggregate metrics.

When to use agentic evaluations

Use this workflow when:

  • Your system makes multiple LLM calls per task (tool use, plan/act loops, sub-agents).
  • Evaluation needs to look at what the agent did, not only at its final message.
  • You already have (or want to keep) an external runner, e.g. Harbor, LangChain, CrewAI, AutoGen, or your own code.

For single-turn outputs, or tool-calling patterns where elluminate generates the responses itself, see the Tool Calling guide instead.

elluminate does not execute your agent

Agentic evaluations cover uploading and rating external agent runs. You run the agent yourself with Harbor, a custom harness, or another framework; elluminate stores the trial results, renders the trajectory viewer, and (optionally) rates each criterion against the trajectory.

Trajectory format

elluminate accepts trajectories in its own ATIF format (v1.*) only. No native importer exists for LangChain, CrewAI, AutoGen, or any other framework — you translate your runner's output to ATIF and upload via the SDK. The example in the SDK Reference section below shows the translation for a Harbor-shaped per-task output; the same pattern applies to any other framework.

UI workflow

(Screenshot: agentic experiment overview)

  1. Create an AGENTIC collection. Each row is one task description. The collection's collection_type is set to AGENTIC, which unlocks the trajectory viewer and agentic metrics downstream. Because AGENTIC experiments never auto-generate responses, the collection uses a single RAW_INPUT column for the task text; no prompt template is required.
  2. Create a criterion set. Each criterion is a binary YES/NO question elluminate will answer against the trajectory (e.g. "Did the agent edit the correct file?").
  3. Create an AGENTIC experiment linking the collection and criterion set (no prompt template is needed). The experiment's evaluation_mode is set to AGENTIC, which disables auto-generation; responses come from your external runner. A condensed SDK sketch of steps 1–3 follows this list.
  4. Run the agent externally (Harbor, your own script) and collect one result per task, including an ATIF trajectory.
  5. Upload the results via the SDK. When trajectories are present, elluminate automatically rates each criterion against the trajectory.
  6. Inspect the results in the UI: trajectory viewer, per-criterion ratings with reasoning, and aggregate metrics (mean reward, mean execution time, mean cost).
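
A condensed sketch of steps 1–3, using the same SDK calls as the full example in the SDK Reference section below. The collection, criterion-set, and experiment names here are placeholders, and the LLM config is whichever config exists in your project (metadata only for AGENTIC experiments):

from elluminate import Client
from elluminate.schemas import CollectionColumn, ColumnTypeEnum
from elluminate.schemas.criterion import CriterionIn

client = Client()

# Step 1: AGENTIC collection with a single RAW_INPUT "task" column, one row per task.
collection, _ = client.get_or_create_collection(
    name="My Agent Tasks",  # placeholder name
    defaults={
        "collection_type": "AGENTIC",
        "columns": [CollectionColumn(name="task", column_type=ColumnTypeEnum.RAW_INPUT)],
        "variables": [{"task": "Write a Python hello world script to hello.py"}],
    },
)

# Step 2: criterion set with binary YES/NO criteria.
criterion_set, _ = client.get_or_create_criterion_set(
    name="My Agent Criteria",  # placeholder name
    defaults={
        "criteria": [
            CriterionIn(
                criterion_str="Did the agent correctly complete the requested task?",
                label="task-complete",
            ),
        ],
    },
)

# Step 3: AGENTIC experiment (no prompt template, no auto-generation).
experiment = client.create_experiment(
    name="My Agent Run",  # placeholder name
    collection=collection,
    prompt_template=None,
    criterion_set=criterion_set,
    evaluation_mode="AGENTIC",
    llm_config=client.get_llm_config(name="Claude Sonnet 4.6"),
)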

(Screenshot: trajectory viewer)

Concept mapping: Harbor to elluminate

If you are coming from Harbor, the concepts roughly map as follows:

| Harbor | elluminate |
| --- | --- |
| Dataset | Collection (collection_type: AGENTIC) |
| Task | Row in the collection (one TemplateVariables with a RAW_INPUT task column) |
| instruction.md | The task text stored in the row's RAW_INPUT task column |
| task.toml env config | Collection environment_config (optional) |
| Reward | AgentTrialResult.reward |
| Trajectory | AgentTrialResult.trajectory (ATIF) |
| Test / judge scripts | Criterion set (evaluated against the trajectory) |
| Agent | Experiment evaluation_mode: AGENTIC |

AgentTrialResult payload

Each trial your runner produces maps to one AgentTrialResult. Required and optional fields:

| Field | Required | Description |
| --- | --- | --- |
| task_name | yes | Must exactly match a value in the collection column given as task_name_column. |
| messages | no | Final OpenAI-format message list (shown on the response page). |
| reward | no | Primary reward score (0.0–1.0). |
| steps | no | Number of agent steps / LLM calls. |
| cost_usd | no | Total USD cost for the trial. |
| duration_seconds | no | Wall-clock duration. |
| input_tokens | no | Aggregate input tokens. |
| output_tokens | no | Aggregate output tokens. |
| cached_tokens | no | Aggregate cached input tokens. |
| error | no | Error message if the trial failed. |
| metadata | no | Free-form dict surfaced on the response page. |
| trajectory | no | Raw ATIF trajectory (validated by the backend; see ATIF trajectory format). |
| criterion_ratings | no | Pre-computed ratings. Skip elluminate's evaluation by providing your own (see Evaluation modes). |
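
For illustration, a single trial with the optional metrics filled in might be built like this (the values are placeholders; only task_name is required, and the ATIF trajectory is described in the next section):

from elluminate import AgentTrialResult

trial = AgentTrialResult(
    task_name="Write a Python hello world script to hello.py",  # must equal the row's task column value
    messages=[
        {"role": "user", "content": "Write a Python hello world script to hello.py"},
        {"role": "assistant", "content": "Wrote hello.py: print('Hello, World!')"},
    ],
    reward=1.0,
    steps=2,
    cost_usd=0.0042,
    duration_seconds=3.2,
    input_tokens=420,
    output_tokens=61,
    metadata={"run_name": "harbor-run-001"},
    # trajectory=...  # attach the ATIF dict described in the next section
)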

ATIF trajectory format

Trajectories use the Agent Trajectory Interchange Format (ATIF). The backend validates the payload on upload but stores it raw, so extra keys are preserved verbatim for the trajectory viewer.

A minimal ATIF v1 trajectory:

{
  "schema_version": "ATIF-v1.0",
  "session_id": "harbor-run-001/write-hello-world",
  "agent": {
    "name": "harbor-demo-agent",
    "version": "0.1.0",
    "model_name": "claude-sonnet-4-6"
  },
  "steps": [
    {
      "step_id": 1,
      "source": "user",
      "message": "Write a Python hello world script to hello.py"
    },
    {
      "step_id": 2,
      "source": "agent",
      "message": "Writing hello.py.",
      "tool_calls": [
        {
          "tool_call_id": "tc_1",
          "function_name": "write_file",
          "arguments": {"path": "hello.py", "content": "print('Hello, World!')"}
        }
      ],
      "observation": {
        "results": [{"source_call_id": "tc_1", "content": "wrote 22 bytes"}]
      },
      "metrics": {"prompt_tokens": 420, "completion_tokens": 61, "cost_usd": 0.0042}
    }
  ],
  "final_metrics": {
    "total_steps": 2,
    "total_cost_usd": 0.0042,
    "total_prompt_tokens": 420,
    "total_completion_tokens": 61
  }
}
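
If your runner records steps in its own shape, a small helper can assemble this structure. Below is a hedged sketch, assuming a hypothetical runner whose per-step records carry role, text, and optional tool-call fields; those input field names are made up for illustration, while the output keys are the ATIF keys shown above:

from typing import Any


def to_atif(session_id: str, agent_name: str, model_name: str, events: list[dict[str, Any]]) -> dict[str, Any]:
    """Assemble an ATIF-v1.0 trajectory dict from a runner's per-step records.

    Each event is assumed to look like (hypothetical runner format):
        {"role": "user" | "agent", "text": "...",
         "tool_calls": [{"id": "...", "name": "...", "args": {...}}],   # optional
         "tool_results": [{"call_id": "...", "output": "..."}]}         # optional
    """
    steps: list[dict[str, Any]] = []
    for i, event in enumerate(events, start=1):
        step: dict[str, Any] = {
            "step_id": i,
            "source": event["role"],
            "message": event.get("text", ""),
        }
        if event.get("tool_calls"):
            step["tool_calls"] = [
                {"tool_call_id": c["id"], "function_name": c["name"], "arguments": c["args"]}
                for c in event["tool_calls"]
            ]
        if event.get("tool_results"):
            step["observation"] = {
                "results": [
                    {"source_call_id": r["call_id"], "content": r["output"]}
                    for r in event["tool_results"]
                ]
            }
        steps.append(step)
    return {
        "schema_version": "ATIF-v1.0",
        "session_id": session_id,
        "agent": {"name": agent_name, "version": "0.1.0", "model_name": model_name},  # placeholder version
        "steps": steps,
    }

Per-step metrics and final_metrics can be attached the same way, following the keys in the example above.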

Evaluation modes

Agentic evaluations support two complementary paths for producing ratings.

Automatic evaluation (default)

Upload trajectories with evaluate=True (the default). For each criterion in the criterion set, elluminate reads the trajectory and emits a YES/NO rating with reasoning. Use this path when you want elluminate to do the judging.
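
In SDK terms this is the default shape of the upload call (mirroring the full example in the SDK Reference below; results is a list of AgentTrialResult objects with trajectories attached):

upload = experiment.upload_agent_results(
    results=results,
    task_name_column="task",
    evaluate=True,  # default: queue trace-based evaluation for every criterion
)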

Pre-computed ratings

Attach your own ratings to each trial via criterion_ratings and set evaluate=False. elluminate stores them as-is. Use this path when you already run an external judge, or when you want to import historical runs without re-evaluating. Labels that don't yet exist on the experiment's criterion set are created on upload, so this path can also seed new criteria (useful for backfilling runs whose criteria weren't declared up front).
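
A minimal sketch of this path, continuing from an AGENTIC experiment set up as above (the label, rating, and reasoning are illustrative):

from elluminate import AgentTrialResult, CriterionRatingIn

result = AgentTrialResult(
    task_name="Write a Python hello world script to hello.py",
    criterion_ratings=[
        CriterionRatingIn(
            label="task-complete",
            rating="YES",
            reasoning="External judge verified the agent produced the requested file.",
        ),
    ],
)
experiment.upload_agent_results(results=[result], task_name_column="task", evaluate=False)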

Mixing modes

You can also leave evaluate=True while supplying criterion_ratings. Pre-computed ratings are stored immediately and elluminate evaluates any criteria that were not pre-rated. This is useful for partial imports.
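
For example, a sketch of a partial import, assuming trajectory holds an ATIF dict like the one above: one criterion is pre-rated, and the trajectory is attached so elluminate can evaluate the rest.

mixed = AgentTrialResult(
    task_name="Write a Python hello world script to hello.py",
    trajectory=trajectory,  # needed so the un-rated criteria can be trace-evaluated
    criterion_ratings=[
        CriterionRatingIn(label="task-complete", rating="YES", reasoning="Verified externally."),
    ],
)
# evaluate=True (the default): stored ratings are kept; remaining criteria are evaluated by elluminate.
experiment.upload_agent_results(results=[mixed], task_name_column="task", evaluate=True)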

Limitations

  • No execution: elluminate does not run your agent. Any runner operates outside elluminate; this guide covers uploading the results.
  • Schema version: trajectories must set schema_version to ATIF-v1.*. Unknown versions are rejected.

SDK Reference

The following script covers the full end-to-end flow: AGENTIC collection, AGENTIC experiment, conversion of runner output, and upload with elluminate's evaluation queued. The script is idempotent — collections, criterion sets, and experiments are reused across runs, and uploads are skipped when an experiment already contains responses.

Running the example

Set your API key (created in the elluminate UI under Project → Keys) either as an environment variable or in a .env file next to the script; the example calls load_dotenv():

# option 1: shell
export ELLUMINATE_API_KEY=<your-key>
# optionally, if you run elluminate on a non-default host:
# export ELLUMINATE_BASE_URL=https://your-instance.example.com

# option 2: .env in elluminate_sdk/examples/
echo "ELLUMINATE_API_KEY=<your-key>" > elluminate_sdk/examples/.env

# run
uv run --directory elluminate_sdk python examples/example_harbor_agentic_upload.py

The complete example script, examples/example_harbor_agentic_upload.py:
"""Harbor-based Agentic Evaluation: end-to-end upload example.

This example shows the full workflow for evaluating an agent that was run
externally with Harbor (or any other agent framework), and uploading the
results, including ATIF trajectories, to elluminate for inspection and
automatic per-criterion evaluation.

Workflow:

1. Create an AGENTIC collection whose rows are the agent's tasks. The task
   description lives in a single RAW_INPUT column; no prompt template is
   needed because AGENTIC experiments do not auto-generate responses.
2. Create a criterion set describing what counts as success.
3. Create an AGENTIC experiment (no auto-generation; results are uploaded).
4. Run the agent externally (Harbor CLI, LangChain, CrewAI, custom code).
5. Read Harbor's per-task output and convert it into `AgentTrialResult` objects.
6. Upload via `experiment.upload_agent_results(...)`.
7. The backend stores the trajectories and, when `evaluate=True` and
   trajectories are present, elluminate automatically rates each criterion
   against the trajectory.

The script is idempotent: collections, criterion sets, and experiments are
reused across runs; uploads are skipped when an experiment already contains
responses.

For a self-contained demo this script uses a small in-memory stand-in for
Harbor's output. In a real run you would point `load_harbor_run()` at the
directory Harbor writes to (`~/.harbor/runs/<run_name>/tasks/<task>/...`).
"""

from typing import Any

from dotenv import load_dotenv
from elluminate import AgentTrialResult, Client, CriterionRatingIn
from elluminate.schemas import CollectionColumn, ColumnTypeEnum
from elluminate.schemas.criterion import CriterionIn
from elluminate.schemas.experiments import Experiment

load_dotenv(override=True)

client = Client()  # (1)!
llm_config = client.get_llm_config(name="Claude Sonnet 4.6")

# Mock "Harbor output"; in a real integration this is read from disk.  # (2)!
# Each entry is what a Harbor run produces per task: a short task identifier,
# the instruction text, final messages, aggregate metrics, and an ATIF
# trajectory describing every step the agent took.
HARBOR_RUN: list[dict[str, Any]] = [
    {
        "task_name": "write-hello-world",
        "instruction": "Write a Python hello world script to hello.py",
        "reward": 1.0,
        "steps": 2,
        "cost_usd": 0.0042,
        "input_tokens": 420,
        "output_tokens": 61,
        "duration_seconds": 3.2,
        "messages": [
            {"role": "user", "content": "Write a Python hello world script to hello.py"},
            {"role": "assistant", "content": "Wrote hello.py: print('Hello, World!')"},
        ],
        "trajectory": {
            "schema_version": "ATIF-v1.0",
            "session_id": "harbor-run-001/write-hello-world",
            "agent": {
                "name": "harbor-demo-agent",
                "version": "0.1.0",
                "model_name": "claude-sonnet-4-6",
            },
            "steps": [
                {
                    "step_id": 1,
                    "source": "user",
                    "message": "Write a Python hello world script to hello.py",
                },
                {
                    "step_id": 2,
                    "source": "agent",
                    "message": "Writing hello.py.",
                    "tool_calls": [
                        {
                            "tool_call_id": "tc_1",
                            "function_name": "write_file",
                            "arguments": {"path": "hello.py", "content": "print('Hello, World!')"},
                        }
                    ],
                    "observation": {
                        "results": [{"source_call_id": "tc_1", "content": "wrote 22 bytes"}],
                    },
                    "metrics": {"prompt_tokens": 420, "completion_tokens": 61, "cost_usd": 0.0042},
                },
            ],
            "final_metrics": {
                "total_steps": 2,
                "total_cost_usd": 0.0042,
                "total_prompt_tokens": 420,
                "total_completion_tokens": 61,
            },
        },
    },
    {
        "task_name": "reverse-string-function",
        "instruction": "Create a Python function that reverses a string in reverse.py",
        "reward": 0.5,
        "steps": 2,
        "cost_usd": 0.0031,
        "input_tokens": 310,
        "output_tokens": 42,
        "duration_seconds": 2.1,
        "messages": [
            {"role": "user", "content": "Create a Python function that reverses a string in reverse.py"},
            {"role": "assistant", "content": "Wrote reverse.py with a one-line slice-based reverse."},
        ],
        "trajectory": {
            "schema_version": "ATIF-v1.0",
            "session_id": "harbor-run-001/reverse-string-function",
            "agent": {
                "name": "harbor-demo-agent",
                "version": "0.1.0",
                "model_name": "claude-sonnet-4-6",
            },
            "steps": [
                {
                    "step_id": 1,
                    "source": "user",
                    "message": "Create a Python function that reverses a string in reverse.py",
                },
                {
                    "step_id": 2,
                    "source": "agent",
                    "message": "Writing reverse.py.",
                    "tool_calls": [
                        {
                            "tool_call_id": "tc_1",
                            "function_name": "write_file",
                            "arguments": {
                                "path": "reverse.py",
                                "content": "def reverse(s: str) -> str:\n    return s[::-1]\n",
                            },
                        }
                    ],
                    "observation": {
                        "results": [{"source_call_id": "tc_1", "content": "wrote 42 bytes"}],
                    },
                    "metrics": {"prompt_tokens": 310, "completion_tokens": 42, "cost_usd": 0.0031},
                },
            ],
            "final_metrics": {
                "total_steps": 2,
                "total_cost_usd": 0.0031,
                "total_prompt_tokens": 310,
                "total_completion_tokens": 42,
            },
        },
    },
]


def harbor_to_agent_trial(task_output: dict[str, Any]) -> AgentTrialResult:  # (3)!
    """Map one Harbor per-task output dict to an `AgentTrialResult`.

    `task_name` on `AgentTrialResult` is what elluminate matches against the
    collection's `task_name_column` value, so here we set it to the full
    instruction text (which is also what the `task` column row holds).
    """
    return AgentTrialResult(
        task_name=task_output["instruction"],
        messages=task_output["messages"],
        reward=task_output["reward"],
        steps=task_output["steps"],
        cost_usd=task_output["cost_usd"],
        input_tokens=task_output["input_tokens"],
        output_tokens=task_output["output_tokens"],
        duration_seconds=task_output["duration_seconds"],
        trajectory=task_output["trajectory"],
        metadata={"run_name": "harbor-run-001", "task_id": task_output["task_name"]},
    )


# Step 1: AGENTIC collection with a single RAW_INPUT `task` column.  # (4)!
# No prompt template is required because AGENTIC experiments never
# auto-generate; responses are supplied by `upload_agent_results`.
collection, _ = client.get_or_create_collection(
    name="Harbor Demo Tasks",
    defaults={
        "collection_type": "AGENTIC",
        "columns": [CollectionColumn(name="task", column_type=ColumnTypeEnum.RAW_INPUT)],
        "variables": [{"task": h["instruction"]} for h in HARBOR_RUN],
    },
)

# Step 2: criterion set defining what success looks like for these tasks.  # (5)!
# Labels let pre-computed ratings (Option B) refer to criteria unambiguously.
criterion_set, _ = client.get_or_create_criterion_set(
    name="Harbor Demo Criteria",
    defaults={
        "criteria": [
            CriterionIn(
                criterion_str="Did the agent correctly complete the requested task?",
                label="task-complete",
            ),
            CriterionIn(
                criterion_str="Did the agent use tools appropriately?",
                label="uses-tools",
            ),
            CriterionIn(
                criterion_str="Is the agent's final output correct?",
                label="output-correct",
            ),
        ],
    },
)


def get_or_create_agentic_experiment(name: str, description: str) -> tuple[Experiment, bool]:  # (6)!
    """Return an AGENTIC experiment, creating it if missing.

    Also reports whether the experiment already has uploaded responses so the
    caller can skip a redundant upload on re-runs (avoids epoch conflicts).
    """
    try:
        experiment = client.get_experiment(name=name, fetch_responses=False)
        populated = experiment.results is not None and experiment.results.completed_epochs > 0
        return experiment, populated
    except ValueError:
        experiment = client.create_experiment(
            name=name,
            collection=collection,
            prompt_template=None,
            criterion_set=criterion_set,
            description=description,
            evaluation_mode="AGENTIC",
            llm_config=llm_config,
        )
        return experiment, False


# Step 3: AGENTIC experiment. No auto-generation; results come from Harbor.  # (7)!
experiment, experiment_populated = get_or_create_agentic_experiment(
    "Harbor Demo — Agent Run",
    "Harbor-run coding agent with ATIF trajectories.",
)
print(f"Experiment: {experiment.name} (id={experiment.id})")

# Step 4: Convert Harbor output to `AgentTrialResult` objects.  # (8)!
results = [harbor_to_agent_trial(task_output) for task_output in HARBOR_RUN]

# Step 5 (option A): upload with `evaluate=True`. elluminate rates every  # (9)!
# criterion against the trajectory and fills in per-criterion ratings.
if experiment_populated:
    print("Experiment already has responses; skipping Option A upload.")
else:
    upload = experiment.upload_agent_results(
        results=results,
        task_name_column="task",
        evaluate=True,
    )
    print(
        f"Uploaded: {upload.created_responses} responses, "
        f"{upload.created_ratings} ratings, "
        f"{upload.pending_evaluations} pending trace evaluations"
    )
    if upload.errors:
        print(f"Errors: {upload.errors}")

# Step 6 (option B): skip elluminate's evaluation and upload pre-computed  # (10)!
# ratings you already have (e.g. from your own judge). `evaluate=False`
# prevents the backend from queuing its own evaluation. A fresh experiment
# keeps the two demo paths independent.
precomputed_experiment, precomputed_populated = get_or_create_agentic_experiment(
    "Harbor Demo — Pre-computed Ratings",
    "Harbor-run coding agent with externally computed ratings.",
)
precomputed_results = [
    AgentTrialResult(
        task_name=HARBOR_RUN[0]["instruction"],
        messages=HARBOR_RUN[0]["messages"],
        reward=1.0,
        trajectory=HARBOR_RUN[0]["trajectory"],
        criterion_ratings=[
            CriterionRatingIn(
                label="task-complete",
                rating="YES",
                reasoning="External judge verified the agent produced the requested file.",
            ),
        ],
    ),
]
if precomputed_populated:
    print("Pre-computed experiment already has responses; skipping Option B upload.")
else:
    precomputed_upload = precomputed_experiment.upload_agent_results(
        results=precomputed_results,
        task_name_column="task",
        evaluate=False,
    )
    print(
        f"Pre-computed: {precomputed_upload.created_responses} responses, "
        f"{precomputed_upload.created_ratings} ratings, "
        f"{precomputed_upload.pending_evaluations} pending trace evaluations"
    )

# Step 7: Verify the trajectories are queryable from the SDK.  # (11)!
experiment.fetch_responses()
for resp in experiment.responses():
    task = resp.prompt.template_variables.input_values.get("task", "?")
    steps = len(resp.trajectory["steps"]) if resp.trajectory else 0
    print(f"  [{task[:50]}] trajectory_steps={steps}")
  1. Initialize the SDK client (uses ELLUMINATE_API_KEY) and pick an LLMConfig to associate with the experiment (metadata only; AGENTIC experiments never invoke it).
  2. Stand-in for your runner's on-disk output. Replace with code that reads Harbor's task_result.json + trajectory.json per task.
  3. Translate one runner output into an AgentTrialResult. This is the only integration-specific code you need; task_name must match the value in the collection row's task column.
  4. Create an AGENTIC collection with a single task column (RAW_INPUT), one row per task. No prompt template is required.
  5. Create the criterion set that defines success. Labels are explicit so pre-computed ratings (Option B) can reference them.
  6. Idempotent helper that returns an AGENTIC experiment and whether it already holds uploaded responses.
  7. Get-or-create the main experiment. evaluation_mode="AGENTIC" disables auto-generation; responses are supplied via upload.
  8. Convert every runner output into an AgentTrialResult.
  9. Option A, automatic evaluation: upload with evaluate=True so elluminate rates every criterion against the trajectory. Skipped on re-runs when the experiment is already populated.
  10. Option B, pre-computed ratings: supply criterion_ratings and set evaluate=False to store your own judge's results as-is. Runs against a separate experiment to keep the two paths independent.
  11. Re-fetch the experiment and confirm trajectories are queryable from the SDK.