Tool Calling

Note

This tutorial requires a Unitxt installation.

Introduction

This tutorial explores tool calling with Unitxt, focusing on handling tool-based datasets and creating an evaluation and inference pipeline. By the end, you’ll be equipped to process complex tool calling tasks efficiently.

Part 1: Understanding Tool Calling Tasks

Tool calling tasks involve providing a model with instructions or prompts that require the use of specific tools to generate the correct response. These tasks are increasingly important in modern AI applications that need to interact with external systems.

Tool Calling Schema

Unitxt uses specific typed structures for tool calling:

class Tool(TypedDict):
    name: str
    description: str
    parameters: JsonSchema  # a JSON Schema describing the tool's arguments

class ToolCall(TypedDict):
    name: str
    arguments: Dict[str, Any]
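
For example, a hypothetical unit-conversion tool and a matching call are just plain dictionaries conforming to the Tool and ToolCall types above (the tool name, description, and parameters here are illustrative, not taken from any Unitxt catalog or dataset):

# Illustrative Tool definition (hypothetical, conforms to the Tool TypedDict above)
convert_length_tool = {
    "name": "convert_length",
    "description": "Convert a length from one unit to another.",
    "parameters": {
        "type": "object",
        "properties": {
            "value": {"type": "number"},
            "from_unit": {"type": "string"},
            "to_unit": {"type": "string"},
        },
        "required": ["value", "from_unit", "to_unit"],
    },
}

# A ToolCall the model would be expected to produce for that tool
convert_length_call = {
    "name": "convert_length",
    "arguments": {"value": 3.0, "from_unit": "cm", "to_unit": "m"},
}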

The task schema for supervised tool calling is defined as:

Task(
    __description__="""Task to test tool calling capabilities.""",
    input_fields={"query": str, "tools": List[Tool]},
    reference_fields={"reference_calls": List[ToolCall]},
    prediction_type=ToolCall,
    metrics=["metrics.tool_calling"],
    default_template="templates.tool_calling.base",
)

This schema appears in the catalog as tasks.tool_calling.supervised and is the foundation for our tool calling evaluation pipeline.

Tutorial Overview

We’ll create a tool calling evaluation pipeline using Unitxt, concentrating on tasks where models need to select the right tool and provide appropriate arguments for the tool’s parameters. We’ll use the Berkeley Function Calling Leaderboard as our example dataset.

Part 2: Data Preparation

Creating a Unitxt DataCard

Our first step is to prepare the data using a Unitxt DataCard. If it’s your first time adding a DataCard, we recommend reading the Adding Datasets Tutorial.

Dataset Selection

We’ll use the Berkeley Function Calling Leaderboard dataset, which is designed to evaluate LLMs’ ability to call functions correctly across diverse categories and use cases.

DataCard Implementation

Create a Python file named bfcl.py and implement the DataCard as follows:

import unitxt
from unitxt.card import TaskCard
from unitxt.catalog import add_to_catalog
from unitxt.loaders import LoadCSV
from unitxt.operators import Copy, ExecuteExpression, RecursiveReplace
from unitxt.stream_operators import JoinStreams
from unitxt.test_utils.card import test_card

# Base path to the Berkeley Function Calling Leaderboard data
base_path = "https://raw.githubusercontent.com/ShishirPatil/gorilla/70b6a4a2144597b1f99d1f4d3185d35d7ee532a4/berkeley-function-call-leaderboard/data/"

with unitxt.settings.context(allow_unverified_code=True):
    card = TaskCard(
        loader=LoadCSV(
            files={"questions": base_path + "BFCL_v3_simple.json", "answers": base_path + "possible_answer/BFCL_v3_simple.json"},
            file_type="json",
            lines=True,
            data_classification_policy=["public"],
        ),
        preprocess_steps=[
            # Join the questions and answers streams
            JoinStreams(left_stream="questions", right_stream="answers", how="inner", on="id", new_stream_name="test"),
            # Extract the query from the question content
            Copy(field="question/0/0/content", to_field="query"),
            # Starting to build the tools field as List[Tool]
            Copy(field="function", to_field="tools"),
            # Make sure the JSON schema of the parameters is well defined
            RecursiveReplace(key="type", map_values={"dict": "object", "float": "number", "tuple": "array"}, remove_values=["any"]),
            # The ground truth in this dataset is provided as a list of accepted options per argument;
            # convert it into a list of explicit tool calls. For example:
            # [{"geometry.circumference": {"radius": [3], "units": ["cm", "m"]}}]
            # becomes:
            # [{"name": "geometry.circumference", "arguments": {"radius": 3, "units": "cm"}},
            #  {"name": "geometry.circumference", "arguments": {"radius": 3, "units": "m"}}]
            ExecuteExpression(
                expression='[{"name": k, "arguments": dict(zip(v.keys(), vals))} for d in ground_truth for k, v in d.items() for vals in itertools.product(*v.values())]',
                to_field="reference_calls",
                imports_list=["itertools"],
            ),
        ],
        task="tasks.tool_calling.supervised",
        templates=["templates.tool_calling.base"],
        __description__=(
            """The Berkeley function calling leaderboard is a live leaderboard to evaluate the ability of different LLMs to call functions (also referred to as tools). We built this dataset from our learnings to be representative of most users' function calling use-cases, for example, in agents, as a part of enterprise workflows, etc. To this end, our evaluation dataset spans diverse categories, and across multiple languages."""
        ),
    )

    # Test and add the card to the catalog
    test_card(card, strict=False)
    add_to_catalog(card, "cards.bfcl.simple_v3", overwrite=True)
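
To see exactly what the ExecuteExpression step produces, here is the same expansion written as standalone Python, using the ground-truth example from the comments in the card above:

import itertools

# Ground truth as provided by BFCL: one dict per call, each argument mapped to a list of accepted values
ground_truth = [{"geometry.circumference": {"radius": [3], "units": ["cm", "m"]}}]

# Expand every combination of accepted argument values into an explicit tool call
reference_calls = [
    {"name": name, "arguments": dict(zip(options.keys(), values))}
    for entry in ground_truth
    for name, options in entry.items()
    for values in itertools.product(*options.values())
]

print(reference_calls)
# [{'name': 'geometry.circumference', 'arguments': {'radius': 3, 'units': 'cm'}},
#  {'name': 'geometry.circumference', 'arguments': {'radius': 3, 'units': 'm'}}]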

Preprocessing for Task Schema

Each preprocessing step serves a specific purpose in transforming the raw data into the required task schema:

  1. JoinStreams: Combines question and answer data based on ID

  2. Copy(field="question/0/0/content", to_field="query"): Creates the query input field

  3. Copy(field="function", to_field="tools"): Creates the tools list input field

  4. RecursiveReplace(key="type", map_values={"dict": "object", "float": "number", "tuple": "array"}, remove_values=["any"]): Converts the parameter type names into valid JSON Schema types

  5. ExecuteExpression: Expands the ground-truth answer options into an explicit list of ToolCall dictionaries in the reference_calls field (the same expansion is shown as standalone Python right after the card code above)

After preprocessing, each example will have:

  - A query that the model should respond to

  - Available tools that the model can choose from

  - Reference calls showing which tool should be called and with which arguments
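
Building on the geometry.circumference example from the card's comments, a preprocessed instance would look roughly like this (the query text, tool description, and schema details are illustrative):

example = {
    "query": "What is the circumference of a circle with a radius of 3 cm?",  # illustrative
    "tools": [
        {
            "name": "geometry.circumference",
            "description": "Compute the circumference of a circle.",  # illustrative
            "parameters": {
                "type": "object",
                "properties": {
                    "radius": {"type": "number"},
                    "units": {"type": "string"},
                },
                "required": ["radius"],
            },
        }
    ],
    "reference_calls": [
        {"name": "geometry.circumference", "arguments": {"radius": 3, "units": "cm"}},
        {"name": "geometry.circumference", "arguments": {"radius": 3, "units": "m"}},
    ],
}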

Part 3: Inference and Evaluation

With our data prepared, we can now test model performance on tool calling tasks.

Pipeline Setup

Set up the inference and evaluation pipeline:

from unitxt import get_logger
from unitxt.api import evaluate, load_dataset
from unitxt.inference import CrossProviderInferenceEngine

logger = get_logger()

# Load and prepare the dataset
dataset = load_dataset(
    card="cards.bfcl.simple_v3",
    split="test",
    format="formats.chat_api",  # Format suitable for tool calling
)

# Initialize the inference model with a compatible provider
model = CrossProviderInferenceEngine(
    model="granite-3-3-8b-instruct",  # Or other models supporting tool calling
    provider="watsonx"
)

Executing Inference and Evaluation

Run the model and evaluate the results:

# Perform inference
predictions = model(dataset)

# Evaluate the predictions
results = evaluate(predictions=predictions, data=dataset)

print("Instance Results:")
print(results.instance_scores)

# Print the results
print("Global Results:")
print(results.global_scores.summary)

Part 4: Understanding the Tool Calling Metrics

The ToolCallingMetric in Unitxt provides several useful scores:

class ToolCallingMetric(ReductionInstanceMetric[str, Dict[str, float]]):
    main_score = "exact_match"
    reduction = MeanReduction()
    prediction_type = ToolCall

    def map(
        self, prediction: ToolCall, references: List[ToolCall], task_data: Dict[str, Any]
    ) -> Dict[str, float]:
        # Implementation details...
        return {
            self.main_score: exact_match,
            "tool_name_accuracy": tool_choice,
            "argument_name_recall": parameter_recall,
            "argument_name_precision": parameter_precision,
            "argument_value_precision": parameter_value_precision,
            "argument_schema_validation": parameter_schema_validation,
        }

The metrics evaluate different aspects of tool calling accuracy:

  1. exact_match: Measures whether the tool call exactly matches a reference

  2. tool_name_accuracy: Evaluates whether the correct tool was selected

  3. argument_name_recall: Assesses whether all expected arguments were provided

  4. argument_name_precision: Assesses whether the provided argument names are correct

  5. argument_value_precision: Assesses whether the provided argument values are correct

  6. argument_schema_validation: Verifies that the arguments conform to the tool's parameter schema
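
As a rough illustration of how the tool- and argument-name scores relate to a single prediction/reference pair, consider the sketch below. This is only an illustration of the definitions above, not unitxt's actual implementation, which also handles multiple references, argument values, and schema validation:

# Illustrative computation for one prediction/reference pair (not unitxt's exact code)
prediction = {"name": "geometry.circumference", "arguments": {"radius": 3, "unit": "cm"}}
reference = {"name": "geometry.circumference", "arguments": {"radius": 3, "units": "cm"}}

tool_name_accuracy = float(prediction["name"] == reference["name"])  # 1.0

pred_args = set(prediction["arguments"])
ref_args = set(reference["arguments"])
# "unit" vs "units": only "radius" overlaps, so both scores are 0.5
argument_name_recall = len(pred_args & ref_args) / len(ref_args) if ref_args else 1.0
argument_name_precision = len(pred_args & ref_args) / len(pred_args) if pred_args else 1.0

exact_match = float(prediction == reference)  # 0.0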

Custom Evaluation

For more specialized evaluation, you can define custom metrics:

from unitxt.metrics import ToolCallingMetric

# Evaluate with a specialized tool calling metric
custom_results = evaluate(
    predictions=predictions,
    data=dataset,
    metrics=[ToolCallingMetric()]
)

print("Custom Metric Results:")
print(custom_results.global_scores.summary)

Example Analysis

To better understand your model’s performance, analyze individual instances:

import json

# Display detailed results for the first few instances
for i, instance in enumerate(results.instance_scores[:3]):
    # The original task fields (query, tools, reference_calls) are kept in the serialized task_data
    task_data = json.loads(dataset[i]["task_data"])
    print(f"\nInstance {i + 1}:")
    print(f"Query: {task_data['query']}")
    print(f"Available tools: {task_data['tools']}")
    print(f"Expected tool calls: {task_data['reference_calls']}")
    print(f"Model prediction: {predictions[i]}")
    print(f"Scores: {instance}")

Testing with Different Models

You can easily compare different models’ performance:

# Test with a different model
alternative_model = CrossProviderInferenceEngine(
    model="gpt-3.5-turbo",
    provider="openai"
)

alt_predictions = alternative_model(dataset)
alt_results = evaluate(predictions=alt_predictions, data=dataset)

print("Alternative Model Results:")
print(alt_results.global_scores.summary)
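
If you want to compare several providers in one go, you can loop over inference engines. The model/provider pairs below simply reuse the ones from the examples above; substitute whatever is available in your environment:

# Model/provider pairs are illustrative; use ones available to you
candidates = [
    ("granite-3-3-8b-instruct", "watsonx"),
    ("gpt-3.5-turbo", "openai"),
]

for model_name, provider in candidates:
    engine = CrossProviderInferenceEngine(model=model_name, provider=provider)
    preds = engine(dataset)
    res = evaluate(predictions=preds, data=dataset)
    print(f"{provider}/{model_name}: {res.global_scores.summary}")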

Conclusion

You have now successfully implemented a tool calling evaluation pipeline with Unitxt using the Berkeley Function Calling Leaderboard dataset. This capability enables the assessment of models’ ability to use tools correctly, opening up new possibilities for AI applications that interact with external systems.

The structured approach using typed definitions (Tool and ToolCall) provides a standardized way to evaluate tool calling capabilities across different models and providers.

We encourage you to explore further by experimenting with different datasets, models, and evaluation metrics to fully leverage Unitxt’s capabilities in tool calling assessment.