Tool Calling

Tool calling enables LLMs to interact with external systems by providing access to predefined functions during response generation. This documentation first covers the fundamentals of evaluating LLM tool calling with Elluminate, then gives practical guidance on adapting your existing agentic system for evaluation with Elluminate, and finally applies that guidance in an advanced example that evaluates an agent with read access to a filesystem.

Basic Usage

An example showcasing weather tool integration for real-time data access:

    template, _ = await client.prompt_templates.aget_or_create(
        user_prompt_template="""\
You are a helpful weather assistant. The user is asking: {{user_query}}.
Use the available weather tools to provide accurate information.
Respond in the units most customary of the location being queried.""",
        name="Weather Assistant with Tools",
        tools=[
            FunctionTool(  # (1)!
                type="function",
                function={
                    "name": "get_current_weather",
                    "description": "Get the current weather in a given location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state/country, e.g. Berlin, DE",
                            },
                            "unit": {
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                                "description": "The temperature unit to use",
                            },
                        },
                        "required": ["location", "unit"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            ),
        ],
        tool_choice="auto",  # (2)!
    )
  1. Define Tools: Set tools to a list of function tool definitions using OpenAI's FunctionTool type. Each definition describes a function and specifies exactly what input parameters it accepts.

  2. Set Tool Choice: Optionally, set tool_choice to control when the model should use tools. When omitted, it defaults to "auto".

Once the prompt template has been defined with its tool definitions, the rest of the evaluation process proceeds as usual. Experiments run normally - the model produces (but does not execute) tool calls - and the criteria evaluate the tools it chose.

Tool calls can be found in the assistant message's 'tool_calls' field as a list of the tools that were called:

for message in response.messages:
    if message["role"] == "assistant":
        print(f"{message['tool_calls']}")

Tool Execution

Elluminate currently does not support running your tools. When running experiments, you will see which tool call was selected, but not the actual output or results from that tool. To evaluate tool executions, refer to the Advanced Example section below, since some special care is necessary.

Complete Basic Example
import asyncio

from elluminate import Client
from elluminate.schemas import RatingMode
from openai.types.beta import FunctionTool


async def main():
    client = Client()

    template, _ = await client.prompt_templates.aget_or_create(
        user_prompt_template="""\
You are a helpful weather assistant. The user is asking: {{user_query}}.
Use the available weather tools to provide accurate information.
Respond in the units most customary of the location being queried.""",
        name="Weather Assistant with Tools",
        tools=[
            FunctionTool(  # (1)!
                type="function",
                function={
                    "name": "get_current_weather",
                    "description": "Get the current weather in a given location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state/country, e.g. Berlin, DE",
                            },
                            "unit": {
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                                "description": "The temperature unit to use",
                            },
                        },
                        "required": ["location", "unit"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            ),
        ],
        tool_choice="auto",  # (2)!
    )

    collection, _ = await client.collections.aget_or_create(
        name="Weather Query Test Data",
        variables=[
            {"user_query": "What's the weather like in London, UK and should I bring an umbrella?"},
            {"user_query": "Compare the current weather in Boston and San Francisco."},
        ],
    )

    await client.criteria.aget_or_generate_many(template)

    experiment, _ = await client.experiments.aget_or_create(
        name="Weather Tool Calling Experiment",
        prompt_template=template,
        collection=collection,
        description="Testing tool calling capabilities for weather queries",
        generate=True,
        rating_mode=RatingMode.FAST,
        block=True,
        n_epochs=1,
    )

    for i, response in enumerate(experiment.rated_responses, 1):
        print(f"Example {i}:")
        print(f"Query: {response.prompt.template_variables.input_values['user_query']}")
        print("Response:")

        for message in response.messages:
            if message["role"] == "assistant":
                print(f"{message['tool_calls']}")
        print()


if __name__ == "__main__":
    asyncio.run(main())

Tool Definition Methods

OpenAI Function Tool Format

Tools are defined using OpenAI's standard FunctionTool format:

from openai.types.beta import FunctionTool

weather_tool = FunctionTool(
    type="function",
    function={
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string", 
                    "description": "The city and state/country, e.g. Berlin, DE"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use"
                }
            },
            "required": ["location", "unit"],
            "additionalProperties": False
        },
        "strict": True
    }
)

Tool Choice Configuration

The tool_choice parameter controls when and how the model uses available tools:

  • Auto Selection (tool_choice="auto"): Lets the model decide when and which tools to use.

  • Required Usage (tool_choice="required"): Forces the model to use at least one tool (see the sketch below).

  • Disabled Tools (tool_choice="none"): Disables all tools for this response.

  • Specific Function: Forces the model to call a specific tool.

    tool_choice={
        "type": "function", 
        "function": {"name": "get_current_weather"}
    }
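
These values are passed to the same tool_choice parameter used when creating the prompt template. A minimal sketch, assuming the client from the basic example and the weather_tool defined above, that forces the model to call at least one tool:

template, _ = await client.prompt_templates.aget_or_create(
    user_prompt_template="You are a helpful weather assistant. The user is asking: {{user_query}}.",
    name="Weather Assistant (Required Tools)",
    tools=[weather_tool],  # FunctionTool defined in the previous section
    tool_choice="required",  # the model must call at least one tool
)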
    

Evaluating Your Agentic System

Evaluating an agentic system with Elluminate follows a similar process to evaluating a singular prompt. The only difference is that instead of inferencing the prompt to get the LLM response directly, your agent can perform an arbitrary number of tool calls before arriving at its final answer. Therefore, the evaluation is performed on the whole tool chain in addition to the final response. This section outlines the general approach for evaluating your agentic system. The following section walks through a complete example applying this approach in practice.

The first step is almost identical to evaluating a singular prompt. You must create your prompt template in Elluminate, add or generate criteria, and create a collection with representative inputs to your system.
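
These setup steps map directly onto SDK calls. A condensed sketch, reusing the methods from the basic example above (the prompt text, name, and variables are placeholders for your own agent):

template, _ = await client.prompt_templates.aget_or_create(
    user_prompt_template="...",  # your agent's prompt, with {{user_query}} style placeholders
    name="My Agent",
    tools=tools,  # FunctionTool definitions for your agent's tools
    tool_choice="auto",
)
await client.criteria.aget_or_generate_many(template)
collection, _ = await client.collections.aget_or_create(
    name="Agent Test Data",
    variables=[{"user_query": "..."}],
)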

Nuanced Differences

  1. Tool Definitions as Rating Context: In order for the rating model to have access to all of the tools and their parameter descriptions, you must add the tool definitions to the prompt template. From the SDK, this is done via the tools and tool_choice parameters on all create methods; the frontend also has a dedicated field for tool definitions. This gives the rating model valuable context about what each tool and its parameters do.
  2. Tailored Criteria for Tool Calls: It can be beneficial to explicitly reference tools and their parameters by name in the criteria. This helps the rating model know precisely which part of the tool call to focus on during rating. For example, "Was get_current_weather called with the temperature units most customarily used in the given location?" is more precise than "Are the units correct for the city?". See the sketch after this list.
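
As a minimal sketch, such tool-specific criteria could be added to the weather template from the basic example like this (the criterion wording is illustrative):

await client.criteria.aadd_many(
    [
        "Was get_current_weather called with the temperature unit most customary for the queried location?",
        "Was get_current_weather called at most once per location mentioned in the query?",
    ],
    template,
)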

With that in place, since Elluminate cannot execute tool calls, your existing agentic code must run the tool chain and produce the final response on its own. You then manually add the whole chain of tool calls and tool results, together with the final output, as a single response in Elluminate. Including the whole chain is important if you want to evaluate the tool calling process as well as the final output of your agent.

Adding Tool Calls as a Response Manually

The SDK provides the method client.responses.add to manually add a response. This method accepts either a string or a list of OpenAI completion messages as the response. When you provide a list of completion messages, together they constitute a single response. This enables the rating model to rate not only the final output, but also any intermediate tool calls and tool results.

The method client.responses.add_many works in exactly the same manner, but is used for bulk adding responses.
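
A minimal sketch of adding a tool calling conversation as a single response, assuming a prompt_template, its template_vars, and a list of OpenAI completion message dicts named conversation_messages produced by your own agent code:

response = await client.responses.aadd(
    response=conversation_messages,  # assistant messages with tool_calls, tool results, and the final answer
    prompt_template=prompt_template,
    template_variables=template_vars,
)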

Advanced Example

This advanced example demonstrates how to evaluate a minimal agentic LLM system using Elluminate. The agent has access to basic filesystem tools that enable it to navigate directories, read files and analyze their metadata. It is tasked with basic questions such as "What is the largest file in the system?" and is evaluated both on whether it answered correctly and on the methods it used along the way.

Manual Tool Execution Required

Elluminate does not support tool execution. Therefore, tool execution must still be handled manually in your code. You must then provide the entire chain of tool calls, tool outputs and final response back to Elluminate as a manually added response.

Tool Function Implementation

This example first implements several filesystem operations as Python functions. There are functions to return the current working directory, change directories, and list files, among other operations.

Tool Function Implementation
def pwd() -> str:  # (1)!
    return os.getcwd()


def chdir(directory: str) -> None:  # (2)!
    os.chdir(directory)


def list_dir() -> list[tuple[str, str]]:  # (3)!
    ret = []
    items = os.listdir(".")
    for item in items:
        if os.path.isfile(item):
            ret.append((item, "FILE"))
        elif os.path.isdir(item):
            ret.append((item, "DIRECTORY"))

    return ret


def file_stats(file_name: str) -> dict[str, Any]:  # (4)!
    stat_info = os.stat(file_name)
    return {
        "Size (bytes)": stat_info.st_size,
        "Last Modified": datetime.datetime.fromtimestamp(stat_info.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
        "Created At": datetime.datetime.fromtimestamp(stat_info.st_ctime).strftime("%Y-%m-%d %H:%M:%S"),
    }


def read_file(file_name: str) -> str:  # (5)!
    with open(file_name, "r") as file:
        return file.read()
  1. Current Directory: Returns the current working directory path
  2. Directory Navigation: Changes the current working directory
  3. Directory Listing: Lists all files and directories with their types
  4. File Statistics: Retrieves detailed file metadata including size and timestamps
  5. File Reading: Reads and returns the contents of text files

Tool Definition Setup

Tools are defined using OpenAI's FunctionTool format, mapping each Python function to a structured tool definition. Each of the functions defined above gets its own FunctionTool specifying its name, description and the arguments it accepts.

Tool Definition Setup
tools = [
    FunctionTool(
        type="function",
        function={
            "name": "pwd",
            "description": "Get the current working directory path",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "chdir",
            "description": "Change the current working directory",
            "parameters": {
                "type": "object",
                "properties": {"directory": {"type": "string", "description": "The directory path to change to"}},
                "required": ["directory"],
            },
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "list_dir",
            "description": "List all files and directories in the current directory",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "file_stats",
            "description": "Get detailed statistics about a specific file including size, creation time, and modification time",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_name": {"type": "string", "description": "The name of the file to get statistics for"}
                },
                "required": ["file_name"],
            },
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "read_file",
            "description": "Read and return the contents of a text file",
            "parameters": {
                "type": "object",
                "properties": {"file_name": {"type": "string", "description": "The name of the file to read"}},
                "required": ["file_name"],
            },
        },
    ),
]

Inferencing with Tools

As mentioned previously, tool execution must be handled outside of Elluminate. The script runs a loop that inferences the prompt messages and checks whether a tool was invoked. If so, it executes the tool and sends the result back to the LLM; if not, it exits the loop and returns the full message conversation.

Multi-Turn Conversation Handling
def chat_with_tools(openai_client, messages, max_iterations=30):
    # Reset the current working directory to root since the tool calls from
    # previous chats may have changed directories
    chdir("/")

    responded_messages = []

    for iteration in range(1, max_iterations + 1):
        print(f"\n--- Iteration {iteration} ---")

        # Get response from OpenAI gpt-4o-mini with tool calling enabled
        response = openai_client.chat.completions.create(  # (1)!
            model="gpt-4o-mini",
            messages=messages + responded_messages,
            tools=[tool.model_dump() for tool in tools],
            tool_choice="auto",
        )

        assistant_message = response.choices[0].message
        responded_messages.append(assistant_message.model_dump())

        # Check if the LLM wants to call any tools and execute each tool
        if assistant_message.tool_calls:  # (2)!
            print("AI is calling tools...")

            for tool_call in assistant_message.tool_calls:
                print(f"Calling function: {tool_call.function.name}")
                print(f"Arguments: {tool_call.function.arguments}")

                function_result = execute_function_call(tool_call.function)
                print(f"Result: {function_result}")

                responded_messages.append(
                    {
                        "tool_call_id": tool_call.id,
                        "role": "tool",
                        "name": tool_call.function.name,
                        "content": function_result,
                    }
                )
        else:
            print("Final response:")
            print(assistant_message.content)
            return responded_messages

    print("Max iterations reached!")
    return responded_messages
  1. OpenAI Integration: Uses OpenAI's chat completion API with the tool definitions and tool choice set to "auto" mode.
  2. Tool Execution Loop: If tools are invoked, automatically executes them and provides the results back to the model as a continuation of the message conversation.

Experiment Setup and Evaluation

Putting everything together, this workflow integrates the tool execution code with Elluminate's experiment system. A prompt template with the FunctionTool definitions and a collection of representative user queries are created, and criteria are manually assigned to the prompt template. Once set up, the tool execution code is run for every input in the collection. The full message conversations are saved back into Elluminate and rated as part of an experiment.

Complete Experiment Workflow
prompt_template, _ = await client.prompt_templates.aget_or_create(  # (1)!
    user_prompt_template=[
        ChatCompletionSystemMessageParam(
            role="system",
            content="You are a helpful file system assistant. Use the provided tools to explore and analyze the file system. Always start by checking the current directory and exploring the structure before answering questions. Ignore the /tmp and /var directories.",
        ),
        ChatCompletionUserMessageParam(role="user", content="{{user_query}}"),
    ],
    name="File System Explorer with Tools",
    tools=tools,
    tool_choice="auto",
)

collection, _ = await client.collections.aget_or_create(
    name="File System Query Test Data",
    variables=[
        {"user_query": "What is the largest file in the system?", "answer": "app.log"},
        {"user_query": "Which file was created most recently?", "answer": "settings.json"},
        {
            "user_query": "Can you find all the text files (.txt) and tell me which one has the longest content?",
            "answer": "report.txt",
        },
        {"user_query": "Which files has configuration information?", "answer": "settings.json"},
    ],
)

await client.criteria.aadd_many(
    [
        "Did the response say the correct file was: {{answer}}?",
        "Were at most 7 tool calls used?",
        "Was 'list_dir' called on the /files directory by calling 'chdir' in to it?",
        "Does the tool call trace NOT show any sign of confusion or misguidance to solve the task at hand?",
    ],
    prompt_template,
    delete_existing=True,
)

responses = []
for template_vars in collection.variables:
    rendered_messages = prompt_template.render_messages(  # (2)!
        user_query=template_vars.input_values["user_query"]
    )

    # Handle tool calling conversation manually from this script
    response_messages = chat_with_tools(openai_client, rendered_messages)  # (3)!

    response = await client.responses.aadd(  # (4)!
        response=response_messages,
        prompt_template=prompt_template,
        template_variables=template_vars,
    )
    responses.append(response)

experiment, _ = await client.experiments.aget_or_create(  # (5)!
    name="File System Tool Calling Experiment",
    prompt_template=prompt_template,
    collection=collection,
    description="Testing tool calling capabilities for file system operations",
)

# Rate responses and attribute their ratings to the `experiment`
await client.ratings.arate_many(responses, experiment=experiment)
  1. Prompt Template: Defines a basic system prompt and a user prompt to be filled in with a user_query from a collection.
  2. Message Rendering: To run inference manually, the user_query placeholder in the prompt template must be filled in. render_messages returns the fully rendered messages, which can be passed directly to OpenAI's completions client.
  3. Manual Tool Execution: Handles the complete tool calling conversation manually, outside of Elluminate.
  4. Response Recording: Manually adds the final tool calling conversation as a response to Elluminate.
  5. Experiment Creation: Creates an experiment and rates the responses against the prompt template's criteria.
Complete Advanced Example
import asyncio
import datetime
import json
import os
import time
from typing import Any

from elluminate import Client
from openai import AzureOpenAI, OpenAI
from openai.types.beta import FunctionTool
from openai.types.chat import ChatCompletionSystemMessageParam, ChatCompletionUserMessageParam
from pyfakefs.fake_filesystem_unittest import Patcher


def get_openai_client() -> AzureOpenAI | OpenAI:
    if "AZURE_OPENAI_ENDPOINT" in os.environ:
        return AzureOpenAI(
            azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
            api_version=os.environ.get("OPENAI_API_VERSION"),
            api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
        )
    else:
        return OpenAI()


def pwd() -> str:  # (1)!
    return os.getcwd()


def chdir(directory: str) -> None:  # (2)!
    os.chdir(directory)


def list_dir() -> list[tuple[str, str]]:  # (3)!
    ret = []
    items = os.listdir(".")
    for item in items:
        if os.path.isfile(item):
            ret.append((item, "FILE"))
        elif os.path.isdir(item):
            ret.append((item, "DIRECTORY"))

    return ret


def file_stats(file_name: str) -> dict[str, Any]:  # (4)!
    stat_info = os.stat(file_name)
    return {
        "Size (bytes)": stat_info.st_size,
        "Last Modified": datetime.datetime.fromtimestamp(stat_info.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
        "Created At": datetime.datetime.fromtimestamp(stat_info.st_ctime).strftime("%Y-%m-%d %H:%M:%S"),
    }


def read_file(file_name: str) -> str:  # (5)!
    with open(file_name, "r") as file:
        return file.read()



tools = [
    FunctionTool(
        type="function",
        function={
            "name": "pwd",
            "description": "Get the current working directory path",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "chdir",
            "description": "Change the current working directory",
            "parameters": {
                "type": "object",
                "properties": {"directory": {"type": "string", "description": "The directory path to change to"}},
                "required": ["directory"],
            },
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "list_dir",
            "description": "List all files and directories in the current directory",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "file_stats",
            "description": "Get detailed statistics about a specific file including size, creation time, and modification time",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_name": {"type": "string", "description": "The name of the file to get statistics for"}
                },
                "required": ["file_name"],
            },
        },
    ),
    FunctionTool(
        type="function",
        function={
            "name": "read_file",
            "description": "Read and return the contents of a text file",
            "parameters": {
                "type": "object",
                "properties": {"file_name": {"type": "string", "description": "The name of the file to read"}},
                "required": ["file_name"],
            },
        },
    ),
]

# Function mapping for tool execution
function_map = {
    "pwd": pwd,
    "chdir": chdir,
    "list_dir": list_dir,
    "file_stats": file_stats,
    "read_file": read_file,
}


def execute_function_call(function_call) -> str:
    """Execute a function call and return the result"""
    function_name = function_call.name
    arguments = json.loads(function_call.arguments)

    try:
        return str(function_map[function_name](**arguments))
    except Exception as e:
        return f"Error executing {function_name}: {str(e)}"


def chat_with_tools(openai_client, messages, max_iterations=30):
    # Reset the current working directory to root since the tool calls from
    # previous chats may have changed directories
    chdir("/")

    responded_messages = []

    for iteration in range(1, max_iterations + 1):
        print(f"\n--- Iteration {iteration} ---")

        # Get response from OpenAI gpt-4o-mini with tool calling enabled
        response = openai_client.chat.completions.create(  # (1)!
            model="gpt-4o-mini",
            messages=messages + responded_messages,
            tools=[tool.model_dump() for tool in tools],
            tool_choice="auto",
        )

        assistant_message = response.choices[0].message
        responded_messages.append(assistant_message.model_dump())

        # Check if the LLM wants to call any tools and execute each tool
        if assistant_message.tool_calls:  # (2)!
            print("AI is calling tools...")

            for tool_call in assistant_message.tool_calls:
                print(f"Calling function: {tool_call.function.name}")
                print(f"Arguments: {tool_call.function.arguments}")

                function_result = execute_function_call(tool_call.function)
                print(f"Result: {function_result}")

                responded_messages.append(
                    {
                        "tool_call_id": tool_call.id,
                        "role": "tool",
                        "name": tool_call.function.name,
                        "content": function_result,
                    }
                )
        else:
            print("Final response:")
            print(assistant_message.content)
            return responded_messages

    print("Max iterations reached!")
    return responded_messages




def setup_filesystem(patcher):
    """Set up the fake filesystem with sample files"""
    # Create a more interesting fake file system
    patcher.fs.create_file(
        "/files/report.txt",
        contents="Annual sales report for 2024. Revenue increased by 15% compared to last year.",
    )
    patcher.fs.create_file("/files/notes.txt", contents="Meeting notes from project kickoff.")
    patcher.fs.create_file("/files/photo1.jpg", contents="[Binary image data would be here]")
    patcher.fs.create_file(
        "/logs/app.log",
        contents="2024-01-15 10:30:22 - Application started\n2024-01-15 10:31:45 - User logged in\n2024-01-15 11:22:10 - Error: Connection timeout",
    )

    # Sleep to create time differences
    time.sleep(2)
    patcher.fs.create_file("/files/cache.tmp", contents="Temporary cache data")
    time.sleep(1)
    patcher.fs.create_file(
        "/files/settings.json", contents='{"theme": "dark", "auto_save": true, "max_files": 100}'
    )


async def main():
    client = Client()
    openai_client = get_openai_client()

    # Set up a "virtual" filesystem. All `os` commands within this context manager will be faked
    with Patcher() as patcher:
        setup_filesystem(patcher)

        prompt_template, _ = await client.prompt_templates.aget_or_create(  # (1)!
            user_prompt_template=[
                ChatCompletionSystemMessageParam(
                    role="system",
                    content="You are a helpful file system assistant. Use the provided tools to explore and analyze the file system. Always start by checking the current directory and exploring the structure before answering questions. Ignore the /tmp and /var directories.",
                ),
                ChatCompletionUserMessageParam(role="user", content="{{user_query}}"),
            ],
            name="File System Explorer with Tools",
            tools=tools,
            tool_choice="auto",
        )

        collection, _ = await client.collections.aget_or_create(
            name="File System Query Test Data",
            variables=[
                {"user_query": "What is the largest file in the system?", "answer": "app.log"},
                {"user_query": "Which file was created most recently?", "answer": "settings.json"},
                {
                    "user_query": "Can you find all the text files (.txt) and tell me which one has the longest content?",
                    "answer": "report.txt",
                },
                {"user_query": "Which files has configuration information?", "answer": "settings.json"},
            ],
        )

        await client.criteria.aadd_many(
            [
                "Did the response say the correct file was: {{answer}}?",
                "Were at most 7 tool calls used?",
                "Was 'list_dir' called on the /files directory by calling 'chdir' in to it?",
                "Does the tool call trace NOT show any sign of confusion or misguidance to solve the task at hand?",
            ],
            prompt_template,
            delete_existing=True,
        )

        responses = []
        for template_vars in collection.variables:
            rendered_messages = prompt_template.render_messages(  # (2)!
                user_query=template_vars.input_values["user_query"]
            )

            # Handle tool calling conversation manually from this script
            response_messages = chat_with_tools(openai_client, rendered_messages)  # (3)!

            response = await client.responses.aadd(  # (4)!
                response=response_messages,
                prompt_template=prompt_template,
                template_variables=template_vars,
            )
            responses.append(response)

        experiment, _ = await client.experiments.aget_or_create(  # (5)!
            name="File System Tool Calling Experiment",
            prompt_template=prompt_template,
            collection=collection,
            description="Testing tool calling capabilities for file system operations",
        )

        # Rate responses and attribute their ratings to the `experiment`
        await client.ratings.arate_many(responses, experiment=experiment)

        print("Finished all ratings. View the results from the frontend.")


if __name__ == "__main__":
    asyncio.run(main())