Multi-Turn Persona Evaluations (Beta)¶
Evaluate your chatbot end-to-end by letting a simulated user pursue a goal across a real, multi-turn conversation.
Multi-Turn Persona Evaluations let you measure how your bot behaves over a full dialogue, not just a single reply. A simulator LLM role-plays a persona with a concrete goal, your bot under test responds, and the resulting transcript is rated as one unit by your criterion set.
Quick Start¶
- Go to Collections → New Collection and pick the Persona preset tile.
- Add a row and fill in
persona_name,persona_description, andgoal_summary. - Go to Prompt Templates and create a system-only template — this is the system prompt for your bot under test (exactly one system message, no user or assistant turns).
- Go to Experiments → New Experiment and select the Persona collection.
- Confirm the Multi-turn evaluation banner appears, then pick the system-only template, an LLM config for the bot, and a criterion set.
- Run the experiment. The full transcript is rated by the criterion set.
What It Is¶
A simulator LLM role-plays a persona who is trying to accomplish a goal, your bot under test answers each turn, and the resulting conversation is stored and rated as a whole. Both sides talk to each other until the simulator decides the conversation should end, or a turn limit is hit.
When to use which¶
- Multi-Turn Persona Evaluations: a live, simulated user drives a real multi-turn dialogue with your bot. Use this to evaluate end-to-end behavior under realistic, goal-directed conversations.
- Conversations: a static, pre-recorded message history is replayed and your bot generates one next reply. Use this to test specific turns with controlled context.
- Agentic Evaluations: a tool-using agent executes a task with tool calls. Use this when the system under test is an agent rather than a chatbot.
Key Concepts¶
- Persona Collection: a new collection type where each row describes one persona. Auto-created columns:
persona_name,persona_description,goal_summary. An optionalenvironment_configcolumn can hold per-row configuration for the bot's environment. - Simulator: the LLM that plays the user. The simulator is a deployment-level setting and is not user-configurable.
- Bot under test: the LLM that plays your assistant. It is defined by the experiment's LLM config plus a system-only prompt template.
- MULTI_TURN evaluation mode: automatically selected when the experiment uses a Persona collection. You don't pick it manually.
- Termination: the simulator can call an
end_conversationtool with one of three reasons —goal_met,goal_failed, orstuck. If it never calls the tool, the loop stops at a hard cap ofMAX_TURNS = 16.
How a Run Works¶
For each persona row in the collection:
- Pick the next persona row.
- The simulator generates the next user message. It sees the full conversation history with roles flipped (your bot's replies appear as
usermessages to the simulator, and vice versa). - The bot under test replies, using your system-only prompt template plus the experiment's LLM config.
- Steps 2 and 3 loop until the simulator calls
end_conversationorMAX_TURNSis reached. - The final transcript is stored as the response and rated by the criterion set.
Stateless vs. stateful providers
Most providers (OpenAI, Custom API, Mock) are stateless: the full conversation history is replayed on every bot turn. Botario keeps server-side session state keyed by sessionId, so only the current user turn is sent on each call. Botario also silently drops the bot's system prompt because state lives on the Botario side — the system prompt is still kept in the stored transcript, but it is not sent to the bot.
Creating a Persona Collection (UI)¶
- Open your project's Collections page and click New Collection.
-
On the New Collection page, pick the Persona preset tile. The collection is created with type
PERSONAand three pre-defined columns:persona_namepersona_descriptiongoal_summary

-
Add a row per persona you want to evaluate.
- Optionally add an
environment_configcolumn for per-row bot environment configuration. This column is not required.
The resulting collection is tagged as a Persona collection and exposes the pre-configured columns:

A Persona collection cannot mix with Raw Input or Conversation columns — it is its own collection type.
The Bot's Prompt Template¶
The bot under test is configured via a system-only prompt template:
- Exactly one
systemmessage. - No
userorassistantmessages. - No placeholders on the user side — the simulator drives the user turns, so user-side placeholders don't apply.
When you choose a Persona collection on the New Experiment page, the template picker is filtered to only allow system-only templates.
Running a Multi-Turn Experiment (UI)¶
- Go to Experiments → New Experiment.
-
Select your Persona collection. The experiment automatically switches to
MULTI_TURNmode and the Multi-turn evaluation banner appears, confirming the setup.
-
Pick:
- the system-only prompt template (your bot's system prompt). The template picker is filtered to only show templates without user messages, because only a system message is passed to the bot's LLM endpoint — the user turns come from the simulator.
- the LLM config for the bot under test,
- the criterion set that will rate the resulting transcripts.
- Start the experiment. One conversation is run per persona row.
Supported bot providers: OpenAI, Botario, Custom API, Mock.
Reading the Results¶
- The full dialogue between simulator and bot is stored on the response and visible in the response viewer.
- Criterion ratings apply to the entire transcript, not to a single turn.
- The termination reason (
goal_met,goal_failed,stuck, orMAX_TURNS) is surfaced on the response, so you can filter and analyze how conversations ended.
Related¶
- Conversations — static, pre-recorded conversation histories
- Agentic Evaluations — tool-using agent evaluations
- Criterion Sets — rule collections used to rate transcripts
- Experiments — running evaluations end-to-end