Agentic Evaluations¶
Evaluate autonomous agents end-to-end: tasks, trajectories, trace-based criteria, and aggregate agent metrics.
Agentic evaluations extend elluminate beyond single-turn LLM outputs to cover agents that plan, call tools, and work against a task description over many steps. You run the agent externally (with Harbor, a custom harness, or any framework whose output you can translate to the ATIF trajectory format) and upload the trial results, including full trajectories, to elluminate. elluminate then rates each criterion against the trajectory, and the UI surfaces the trace, per-criterion ratings, and aggregate metrics.
When to use agentic evaluations¶
Use this workflow when:
- Your system makes multiple LLM calls per task (tool use, plan/act loops, sub-agents).
- Evaluation needs to look at what the agent did, not only at its final message.
- You already have (or want to keep) an external runner, e.g. Harbor, LangChain, CrewAI, AutoGen, or your own code.
For single-turn outputs, or tool-calling patterns where elluminate generates the responses itself, see the Tool Calling guide instead.
elluminate does not execute your agent
Agentic evaluations cover uploading and rating external agent runs. You run the agent yourself with Harbor, a custom harness, or another framework; elluminate stores the trial results, renders the trajectory viewer, and (optionally) rates each criterion against the trajectory.
Trajectory format
elluminate accepts trajectories in its own ATIF format (v1.*) only. No native importer exists for LangChain, CrewAI, AutoGen, or any other framework — you translate your runner's output to ATIF and upload via the SDK. The example in the SDK Reference section below shows the translation for a Harbor-shaped per-task output; the same pattern applies to any other framework.
UI workflow¶

- Create an AGENTIC collection. Each row is one task description. The collection's collection_type is set to AGENTIC, which unlocks the trajectory viewer and agentic metrics downstream. Because AGENTIC experiments never auto-generate responses, the collection uses a single RAW_INPUT column for the task text; no prompt template is required.
- Create a criterion set. Each criterion is a binary YES/NO question elluminate will answer against the trajectory (e.g. "Did the agent edit the correct file?").
- Create an AGENTIC experiment linking the template, collection, and criterion set. The experiment's evaluation_mode is set to AGENTIC, which disables auto-generation; responses come from your external runner.
- Run the agent externally (Harbor, your own script) and collect one result per task, including an ATIF trajectory.
- Upload the results via the SDK. When trajectories are present, elluminate automatically rates each criterion against the trajectory.
- Inspect the results in the UI: trajectory viewer, per-criterion ratings with reasoning, and aggregate metrics (mean reward, mean execution time, mean cost).

Concept mapping: Harbor to elluminate¶
If you are coming from Harbor, the concepts roughly map as follows:
| Harbor | elluminate |
|---|---|
| Dataset | Collection (collection_type: AGENTIC) |
| Task | Row in the collection (one TemplateVariables with a RAW_INPUT task column) |
| instruction.md | The task text stored in the row's RAW_INPUT task column |
| task.toml env config | Collection environment_config (optional) |
| Reward | AgentTrialResult.reward |
| Trajectory | AgentTrialResult.trajectory (ATIF) |
| Test / judge scripts | Criterion set (evaluated against the trajectory) |
| Agent | Experiment evaluation_mode: AGENTIC |
AgentTrialResult payload¶
Each trial your runner produces maps to one AgentTrialResult. Required and optional fields:
| Field | Required | Description |
|---|---|---|
| task_name | yes | Must exactly match a value in the collection column given as task_name_column. |
| messages | no | Final OpenAI-format message list (shown on the response page). |
| reward | no | Primary reward score (0.0–1.0). |
| steps | no | Number of agent steps / LLM calls. |
| cost_usd | no | Total USD cost for the trial. |
| duration_seconds | no | Wall-clock duration in seconds. |
| input_tokens | no | Aggregate input tokens. |
| output_tokens | no | Aggregate output tokens. |
| cached_tokens | no | Aggregate cached input tokens. |
| error | no | Error message if the trial failed. |
| metadata | no | Free-form dict surfaced on the response page. |
| trajectory | no | Raw ATIF trajectory (validated by the backend; see ATIF format). |
| criterion_ratings | no | Pre-computed ratings. Skip elluminate's evaluation by providing your own (see Evaluation modes). |
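To make the field list concrete, here is a minimal sketch of building one trial result in Python. Only the field names come from the table above; the import path, the keyword-style constructor, and every value are illustrative assumptions, so check the SDK example script for the exact types.

```python
# Import path and constructor are assumptions; adjust to your elluminate SDK version.
from elluminate.schemas import AgentTrialResult

# Minimal stand-in for an ATIF trajectory; see the ATIF section below for the full shape.
atif_trajectory = {
    "schema_version": "ATIF-v1.0",
    "session_id": "harbor-run-001/write-hello-world",
    "agent": {"name": "harbor-demo-agent", "version": "0.1.0", "model_name": "claude-sonnet-4-6"},
    "steps": [],
}

trial = AgentTrialResult(
    task_name="write-hello-world",    # must match a value in the collection's task_name_column
    messages=[
        {"role": "user", "content": "Write a Python hello world script to hello.py"},
        {"role": "assistant", "content": "Done - wrote hello.py."},
    ],
    reward=1.0,                        # primary score in [0.0, 1.0]
    steps=2,
    cost_usd=0.0042,
    duration_seconds=14.8,
    input_tokens=420,
    output_tokens=61,
    metadata={"runner": "harbor", "run_id": "harbor-run-001"},
    trajectory=atif_trajectory,
)
```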
ATIF trajectory format¶
Trajectories use the Agent Trajectory Interchange Format (ATIF). The backend validates the payload on upload but stores it raw, so extra keys are preserved verbatim for the trajectory viewer.
A minimal ATIF v1 trajectory:
{
"schema_version": "ATIF-v1.0",
"session_id": "harbor-run-001/write-hello-world",
"agent": {
"name": "harbor-demo-agent",
"version": "0.1.0",
"model_name": "claude-sonnet-4-6"
},
"steps": [
{
"step_id": 1,
"source": "user",
"message": "Write a Python hello world script to hello.py"
},
{
"step_id": 2,
"source": "agent",
"message": "Writing hello.py.",
"tool_calls": [
{
"tool_call_id": "tc_1",
"function_name": "write_file",
"arguments": {"path": "hello.py", "content": "print('Hello, World!')"}
}
],
"observation": {
"results": [{"source_call_id": "tc_1", "content": "wrote 22 bytes"}]
},
"metrics": {"prompt_tokens": 420, "completion_tokens": 61, "cost_usd": 0.0042}
}
],
"final_metrics": {
"total_steps": 2,
"total_cost_usd": 0.0042,
"total_prompt_tokens": 420,
"total_completion_tokens": 61
}
}
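If your runner does not emit ATIF natively, you translate its trace into this shape yourself. The sketch below converts a flat list of per-turn records into the ATIF keys shown above; the input record shape ("role", "text", "tool_calls", ...) is a hypothetical intermediate format, not Harbor's actual schema, so adapt the mapping to whatever your runner produces.

```python
def to_atif(session_id: str, agent_name: str, model_name: str, records: list[dict]) -> dict:
    """Translate a flat runner trace into an ATIF v1 trajectory dict.

    `records` is a hypothetical intermediate format: one dict per turn with
    "role", "text", and optional "tool_calls" / "tool_results" / "usage" keys.
    """
    steps = []
    for i, rec in enumerate(records, start=1):
        step = {
            "step_id": i,
            "source": "user" if rec["role"] == "user" else "agent",
            "message": rec.get("text", ""),
        }
        if rec.get("tool_calls"):
            step["tool_calls"] = [
                {
                    "tool_call_id": call["id"],
                    "function_name": call["name"],
                    "arguments": call.get("arguments", {}),
                }
                for call in rec["tool_calls"]
            ]
        if rec.get("tool_results"):
            step["observation"] = {
                "results": [
                    {"source_call_id": res["call_id"], "content": res["output"]}
                    for res in rec["tool_results"]
                ]
            }
        if rec.get("usage"):
            step["metrics"] = {
                "prompt_tokens": rec["usage"].get("prompt_tokens", 0),
                "completion_tokens": rec["usage"].get("completion_tokens", 0),
            }
        steps.append(step)

    return {
        "schema_version": "ATIF-v1.0",
        "session_id": session_id,
        "agent": {"name": agent_name, "version": "0.1.0", "model_name": model_name},
        "steps": steps,
        "final_metrics": {"total_steps": len(steps)},
    }
```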
Evaluation modes¶
Agentic evaluations support two complementary paths for producing ratings.
Automatic evaluation (default)¶
Upload trajectories with evaluate=True (the default). For each criterion in the criterion set, elluminate reads the trajectory and emits a YES/NO rating with reasoning. Use this path when you want elluminate to do the judging.
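In SDK terms this is a single upload call per experiment. A rough sketch follows; the resource name and method signature are assumptions rather than a verified API, so refer to the example script in the SDK Reference below for the real call.

```python
# Hypothetical upload call: resource name and signature are assumptions.
client.experiments.upload_agent_results(
    experiment,               # the AGENTIC experiment created earlier
    results=trial_results,    # list[AgentTrialResult] built from your runner's output
    evaluate=True,            # default: elluminate rates every criterion per trajectory
)
```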
Pre-computed ratings¶
Attach your own ratings to each trial via criterion_ratings and set evaluate=False. elluminate stores them as-is. Use this path when you already run an external judge, or when you want to import historical runs without re-evaluating. Labels that don't yet exist on the experiment's criterion set are created on upload, so this path can also seed new criteria (useful for backfilling runs whose criteria weren't declared up front).
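Schematically, this path attaches the ratings to each trial and disables elluminate's own evaluation on upload. Both the rating shape (shown here as label-to-verdict pairs) and the upload call are illustrative assumptions; the example script's Option B shows the exact API.

```python
# The exact criterion_ratings shape is an assumption (label -> verdict shown for illustration).
trial.criterion_ratings = {"agent_edited_correct_file": True, "tests_pass": False}

# Hypothetical upload call with elluminate's evaluation disabled; ratings are stored as-is.
client.experiments.upload_agent_results(experiment, results=[trial], evaluate=False)
```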
Mixing modes
You can also leave evaluate=True while supplying criterion_ratings. Pre-computed ratings are stored immediately and elluminate evaluates any criteria that were not pre-rated. This is useful for partial imports.
Limitations¶
- No execution: elluminate does not run your agent. Any runner operates outside elluminate; this guide covers uploading the results.
- Schema version: trajectories must set schema_version to ATIF-v1.*. Unknown versions are rejected.
SDK Reference¶
The following script covers the full end-to-end flow: AGENTIC collection, AGENTIC experiment, conversion of runner output, and upload with elluminate's evaluation queued. The script is idempotent — collections, criterion sets, and experiments are reused across runs, and uploads are skipped when an experiment already contains responses.
Running the example¶
Set your API key (created in the elluminate UI under Project → Keys) either as an environment variable or in a .env file next to the script; the example calls load_dotenv():
# option 1: shell
export ELLUMINATE_API_KEY=<your-key>
# optionally, if you run elluminate on a non-default host:
# export ELLUMINATE_BASE_URL=https://your-instance.example.com
# option 2: .env in elluminate_sdk/examples/
echo "ELLUMINATE_API_KEY=<your-key>" > elluminate_sdk/examples/.env
# run
uv run --directory elluminate_sdk python examples/example_harbor_agentic_upload.py
(The full annotated script lives at elluminate_sdk/examples/example_harbor_agentic_upload.py; the notes below walk through its numbered callouts.)
- Initialize the SDK client (uses ELLUMINATE_API_KEY) and pick an LLMConfig to associate with the experiment (metadata only; AGENTIC experiments never invoke it).
- Stand-in for your runner's on-disk output. Replace with code that reads Harbor's task_result.json + trajectory.json per task (a minimal reading loop is sketched after this list).
- Translate one runner output into an AgentTrialResult. This is the only integration-specific code you need; task_name must match the value in the collection row's task column.
- Create an AGENTIC collection with a single task column (RAW_INPUT), one row per task. No prompt template is required.
- Create the criterion set that defines success. Labels are explicit so pre-computed ratings (Option B) can reference them.
- Idempotent helper that returns an AGENTIC experiment and whether it already holds uploaded responses.
- Get-or-create the main experiment. evaluation_mode="AGENTIC" disables auto-generation; responses are supplied via upload.
- Convert every runner output into an AgentTrialResult.
- Option A, automatic evaluation: upload with evaluate=True so elluminate rates every criterion against the trajectory. Skipped on re-runs when the experiment is already populated.
- Option B, pre-computed ratings: supply criterion_ratings and set evaluate=False to store your own judge's results as-is. Runs against a separate experiment to keep the two paths independent.
- Re-fetch the experiment and confirm trajectories are queryable from the SDK.
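For reference, a minimal sketch of the reading loop behind the runner-output callouts above, assuming one sub-directory per task that contains Harbor's task_result.json and trajectory.json. The keys read from task_result.json ("reward", "n_steps", ...) are illustrative guesses; map them onto whatever your runner actually writes before constructing AgentTrialResult objects.

```python
import json
from pathlib import Path


def load_trial_payloads(runs_dir: Path) -> list[dict]:
    """Read one output directory per task and build AgentTrialResult-shaped dicts.

    Assumes runs_dir/<task_name>/task_result.json and trajectory.json exist;
    the keys read from task_result.json are illustrative, not Harbor's schema.
    """
    payloads = []
    for task_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        task_result = json.loads((task_dir / "task_result.json").read_text())
        trajectory = json.loads((task_dir / "trajectory.json").read_text())
        payloads.append(
            {
                "task_name": task_dir.name,  # must match the collection's task column value
                "reward": task_result.get("reward"),
                "steps": task_result.get("n_steps"),
                "cost_usd": task_result.get("cost_usd"),
                "duration_seconds": task_result.get("duration_seconds"),
                "error": task_result.get("error"),
                "trajectory": trajectory,  # already ATIF, or run it through a translator first
            }
        )
    return payloads
```

Each dict maps one-to-one onto the AgentTrialResult fields listed earlier and can then be uploaded via Option A (automatic evaluation) or Option B (pre-computed ratings).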