Tool Calling¶
Tool calling enables LLMs to interact with external systems by providing access to predefined functions during response generation. This documentation first covers the fundamentals of evaluating LLM tool calling with Elluminate, then gives practical guidance on adapting your existing agentic system for evaluation with Elluminate, and lastly applies this guidance in an advanced example that evaluates an agent with read access to a filesystem.
Basic Usage¶
An example showcasing weather tool integration for real-time data access:
- Define Tools: Set `tools` to a list of function tool definitions using OpenAI's `FunctionTool` type. These definitions describe each function and specify exactly what input data the function expects and what output data it returns.
- Create Template: Optionally, set `tool_choice` to control when the model should use tools. If omitted, it defaults to `"auto"`.
Once the prompt template has been defined with the tool definitions, the rest of the evaluation proceeds as usual. Experiments run normally - the model produces (but does not execute) tool calls - and the criteria evaluate the chosen tools.
Tool calls appear in the assistant message's `tool_calls` field as a list of the tools that were called:
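Since the exact shape of Elluminate's response objects is not shown here, the snippet below illustrates the equivalent structure using OpenAI's chat completions API directly. It is a sketch of the message format rather than Elluminate-specific API; the model name is an arbitrary choice.

```python
import json

from openai import OpenAI

openai_client = OpenAI()

# A minimal weather tool in OpenAI's chat completions tool format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # any tool-capable model works here
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[weather_tool],
    tool_choice="auto",
)

assistant_message = completion.choices[0].message
for tool_call in assistant_message.tool_calls or []:
    # Each entry names the selected function and carries its JSON-encoded arguments.
    print(tool_call.function.name, json.loads(tool_call.function.arguments))
```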
Tool Execution
Elluminate currently does not support running your tools. When running experiments, you will see which tool call was selected, but not the actual output or results from that tool. To evaluate the execution of tools, refer to the Advanced Example section below, since some special care is necessary.
Complete Basic Example
Tool Definition Methods¶
OpenAI Function Tool Format¶
Tools are defined using OpenAI's standard `FunctionTool` format:
```python
from openai.types.beta import FunctionTool

weather_tool = FunctionTool(
    type="function",
    function={
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state/country, e.g. Berlin, DE",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use",
                },
            },
            "required": ["location", "unit"],
            "additionalProperties": False,
        },
        "strict": True,
    },
)
```
Tool Choice Configuration¶
The `tool_choice` parameter controls when and how the model uses available tools:

- Auto Selection (`tool_choice="auto"`): Lets the model decide when and which tools to use.
- Required Usage (`tool_choice="required"`): Forces the model to use at least one tool.
- Disabled Tools (`tool_choice="none"`): Disables all tools for this response.
- Specific Function: Forces the model to call a specific tool.
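For the specific-function case, the OpenAI format expects a structured `tool_choice` value naming the function. A minimal sketch, reusing the weather tool from the example above:

```python
# Force the model to call get_current_weather on every response.
tool_choice = {
    "type": "function",
    "function": {"name": "get_current_weather"},
}
```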
Evaluating Your Agentic System¶
Evaluating an agentic system with Elluminate follows a similar process to evaluating a single prompt. The only difference is that instead of inferencing the prompt to get the LLM response directly, your agent can perform an arbitrary number of tool calls before arriving at its final answer. The evaluation therefore covers the whole tool chain in addition to the final response. This section outlines the general approach for evaluating your agentic system; the following section walks through a complete example applying this approach in practice.
The first step is almost identical to evaluating a single prompt: create your prompt template in Elluminate, add or generate criteria, and create a collection with representative inputs to your system.
Nuanced Differences
- Tool Definitions as Rating Context: In order for the rating model to have access to all of the tools and their parameter descriptions, you must add the tool definitions to the prompt template. From the SDK, this can be done via the `tools` and `tool_choice` parameters on all create methods (see the sketch after this list). The frontend also has a dedicated field for adding tool definitions. This provides the rating model with valuable context about what each tool and its parameters do.
- Tailored Criteria for Tool Calls: It may be beneficial to explicitly reference tools and their parameters by name in the criteria. This helps the rating model know precisely which part of the tool call to focus on during rating. For example, "Was `get_current_weather` called with the temperature `unit` most customarily used in the given `location`?" is more precise than "Are the units correct for the city?".
With that in place, since Elluminate cannot execute tool calls, your existing agentic code must run the tool chain and produce the final response on its own. You then manually add the whole chain of tool calls and tool responses, as well as the final output, as a single response in Elluminate. Including the whole chain of calls is important if you want to evaluate the tool calling process as well as the final output of your agent.
Adding Tool Calls as a Response Manually
The SDK provides the method `client.responses.add` to manually add a response. This method accepts either a string or a list of OpenAI completion messages as the response. When you provide a list of completion messages, they together constitute a single response. This enables the rating model to rate not only the final output, but also any intermediate tool calls and tool results.
The method `client.responses.add_many` works in exactly the same manner but is used to add responses in bulk.
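A minimal sketch of adding a tool calling conversation as one response. Only `client.responses.add` and its acceptance of a message list come from this documentation; the keyword arguments linking the response to a prompt template and collection entry, as well as the `client`, `template`, and `entry` objects, are assumed placeholders.

```python
messages = [
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": '{"location": "Berlin, DE", "unit": "celsius"}',
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temperature": 21, "unit": "celsius"}'},
    {"role": "assistant", "content": "It is currently 21 °C in Berlin."},
]

# Hypothetical keyword arguments: how the response is associated with a prompt
# template and collection entry may be named differently in your SDK version.
response = client.responses.add(
    messages,
    prompt_template=template,
    template_variables=entry,
)
```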
Advanced Example¶
This advanced example demonstrates how to evaluate a minimal, agentic LLM system using Elluminate. The agent has access to basic filesystem tools that enable it to navigate directories, read files and analyze their metadata. It is tasked with basic questions such as "What is the largest file in the system?" and is evaluated on whether it responded correctly as well as the methods it used during the process.
Manual Tool Execution Required
Elluminate does not support tool execution. Therefore, tool execution must still be handled manually in your code. You must then provide the entire chain of tool calls, tool outputs and final response back to Elluminate as a manually added response.
Tool Function Implementation¶
This example first implements several filesystem operations as Python functions: one returns the current working directory, one changes directory, one lists files, and so on (a rough sketch of two such functions follows the list below).
Tool Function Implementation
- Current Directory: Returns the current working directory path
- Directory Navigation: Changes the current working directory
- Directory Listing: Lists all files and directories with their types
- File Statistics: Retrieves detailed file metadata including size and timestamps
- File Reading: Reads and returns the contents of text files
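As an illustration, two of these operations might look like the functions below. The function names and return format are illustrative, not necessarily those used in the complete example.

```python
import json
import os


def get_current_directory() -> str:
    """Return the current working directory path."""
    return os.getcwd()


def list_directory(path: str = ".") -> str:
    """List all files and directories at `path` with their types, as JSON."""
    entries = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        entries.append({
            "name": name,
            "type": "directory" if os.path.isdir(full_path) else "file",
        })
    return json.dumps(entries)
```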
Tool Definition Setup¶
Tools are defined using OpenAI's `FunctionTool` format, mapping each Python function to a structured tool definition. Each of the Python functions defined above gets its own `FunctionTool` specifying its name, description, and the arguments it accepts (a sketch of one such mapping follows below).
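For instance, the `list_directory` function sketched above could map to a tool definition like this. The `TOOL_FUNCTIONS` dispatch table is an illustrative helper, not part of the Elluminate API; it simply lets the execution loop look up the Python function by tool name.

```python
from openai.types.beta import FunctionTool

list_directory_tool = FunctionTool(
    type="function",
    function={
        "name": "list_directory",
        "description": "List all files and directories at the given path with their types",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "The directory to list, relative to the current working directory",
                }
            },
            "required": ["path"],
            "additionalProperties": False,
        },
        "strict": True,
    },
)

# Map tool names to the Python functions that implement them, so the
# execution loop can dispatch calls by name.
TOOL_FUNCTIONS = {"list_directory": list_directory}
```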
Tool Definition Setup
Inferencing with Tools¶
As mentioned previously, tool execution must be handled outside of Elluminate. The script runs a loop that inferences the prompt messages and checks whether a tool was invoked. If so, it executes the tool and sends the result back to the LLM. If not, it exits the loop and returns the full message conversation (a sketch follows the list below).
Multi-Turn Conversation Handling
- OpenAI Integration: Uses OpenAI's chat completion API with the tool definitions and `tool_choice` set to `"auto"`.
- Tool Execution Loop: If tools are invoked, executes them automatically and provides the results back to the model as a continuation of the message conversation.
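A minimal sketch of such a loop, assuming the `TOOL_FUNCTIONS` dispatch table from the previous sketch; the model name and turn limit are arbitrary choices.

```python
import json

from openai import OpenAI

openai_client = OpenAI()


def run_agent(messages: list, tools: list) -> list:
    """Run the tool calling loop until the model answers without requesting a tool."""
    for _ in range(10):  # safety limit on the number of turns
        completion = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        message = completion.choices[0].message
        messages.append(message)

        if not message.tool_calls:
            # No tool requested: the conversation is complete.
            return messages

        for tool_call in message.tool_calls:
            # Execute the requested tool and feed its result back to the model.
            arguments = json.loads(tool_call.function.arguments)
            result = TOOL_FUNCTIONS[tool_call.function.name](**arguments)
            messages.append(
                {"role": "tool", "tool_call_id": tool_call.id, "content": result}
            )

    raise RuntimeError("Agent did not produce a final answer within the turn limit")
```

Calling `run_agent` with the rendered prompt messages and serialized tool definitions (for example `[t.model_dump() for t in tool_definitions]`) returns the full conversation, which can then be added to Elluminate as a single response.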
Experiment Setup and Evaluation¶
Putting everything together, this workflow integrates the tool execution code with Elluminate's experiment system. A prompt template with the `FunctionTool` definitions and a collection of representative user queries are created. Then criteria are manually assigned to the prompt template. Once set up, the tool execution code is run for every input in the collection. The full message conversations are saved back into Elluminate and rated as part of an experiment.
Complete Experiment Workflow
- Prompt Template: Defines a basic system prompt and a user prompt to be filled in with a `user_query` from a collection.
- Message Rendering: In order to inference manually, the `user_query` placeholder in the prompt template needs to be filled in. This returns the fully rendered messages, which can be passed directly to OpenAI's completions client.
- Manual Tool Execution: Handles the complete tool calling conversation manually outside Elluminate.
- Response Recording: Manually adds the final tool calling conversation as a response to Elluminate.
- Experiment Creation: Creates an experiment and rates the responses against the prompt template's criteria.
Complete Advanced Example