Accuracy
Metric that evaluates tool call predictions against reference calls. It first generates unique key-value pairs for the tool name and all the parameters (including nested parameters). It then reports the average accuracy for each key, as well as micro and macro averages across all keys.
Supports only a single reference call per prediction.
metrics.tool_calling.key_value.accuracy
ToolCallKeyValueExtraction(
metric="metrics.accuracy",
)
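The per-key accuracy and the micro and macro averages described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `key_value_accuracy` is hypothetical, and the sketch assumes exact-match scoring over the reference keys, with a key missing from the prediction counted as wrong. Under these assumptions, the micro average pools every scored key instance, while the macro average is the unweighted mean of the per-key accuracies.

```python
from collections import defaultdict


def key_value_accuracy(references, predictions):
    """Hypothetical helper: per-key exact-match accuracy plus micro/macro averages.

    Both arguments are lists of flat {key: value} dicts, one per instance.
    Assumption: only keys present in the reference are scored.
    """
    scores = defaultdict(list)
    for ref, pred in zip(references, predictions):
        for key, expected in ref.items():
            # A missing or different predicted value scores 0.
            scores[key].append(1.0 if pred.get(key) == expected else 0.0)

    per_key = {key: sum(vals) / len(vals) for key, vals in scores.items()}
    pooled = [s for vals in scores.values() for s in vals]
    micro = sum(pooled) / len(pooled)          # average over all key instances
    macro = sum(per_key.values()) / len(per_key)  # average of per-key accuracies
    return per_key, micro, macro


references = [
    {"name": "get_weather", "argument.city": "Paris", "argument.unit": "C"},
    {"name": "get_weather", "argument.city": "Rome"},
]
predictions = [
    {"name": "get_weather", "argument.city": "Paris", "argument.unit": "F"},
    {"name": "get_weather", "argument.city": "Milan"},
]
per_key, micro, macro = key_value_accuracy(references, predictions)
print(per_key, micro, macro)
```

Here the tool name is always correct, the city is correct in one of two calls, and the unit is wrong in its only occurrence, so the two averages differ because the keys appear with different frequencies.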
Explanation about ToolCallKeyValueExtraction
Metrics that formulate ToolCall evaluation as a Key Value Extraction task.
Each argument and each nested value is first flattened into a key-value pair.
{ arguments : { "name" : "John", "address" : { "street" : "Main St", "city" : "Smallville" } } }
becomes
argument.name = "John"
argument.address.street = "Main St"
argument.address.city = "Smallville"
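The flattening of nested arguments can be sketched in a few lines. This is an illustrative sketch, not the library's code: the function name `flatten` is hypothetical, and the `argument` prefix is taken from the example above.

```python
def flatten(value, prefix):
    """Hypothetical helper: turn nested dicts into dotted key -> leaf value pairs."""
    if isinstance(value, dict):
        flat = {}
        for key, item in value.items():
            # Extend the dotted path with the current key and recurse.
            flat.update(flatten(item, f"{prefix}.{key}"))
        return flat
    # Leaf value: the accumulated path becomes the key.
    return {prefix: value}


arguments = {
    "name": "John",
    "address": {"street": "Main St", "city": "Smallville"},
}
flat = flatten(arguments, "argument")
for key, value in flat.items():
    print(f"{key} = {value!r}")
```

Once every call is reduced to such a flat dictionary, comparing a prediction with its reference becomes a per-key comparison.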
Note that by default, if a parameter is a list of dictionaries, the items are flattened using their list indexes:
{ arguments : { "addresses" : [ { "street" : "Main St", "city" : "Smallville" },
{ "street" : "Log St", "city" : "BigCity" } ] } }
becomes
argument.addresses.0.street = "Main St"
argument.addresses.0.city = "Smallville"
argument.addresses.1.street = "Log St"
argument.addresses.1.city = "BigCity"
But if each dictionary in the list has a single unique key, that key is used instead of the index:
{ arguments : { "addresses" : [ { "home" : { "street" : "Main St", "city" : "Smallville" } },
{ "work" : { "street" : "Log St", "city" : "BigCity" } } ] } }
becomes
argument.addresses.home.street = "Main St"
argument.addresses.home.city = "Smallville"
argument.addresses.work.street = "Log St"
argument.addresses.work.city = "BigCity"
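The two list rules above can be sketched as follows. This is an assumption-laden illustration rather than the library's implementation: the function name `flatten_with_lists` is hypothetical, and "a single unique key" is read here as "every dictionary in the list has exactly one key, and no two dictionaries share that key".

```python
def flatten_with_lists(value, prefix):
    """Hypothetical helper: flatten nested dicts and lists into dotted keys."""
    if isinstance(value, dict):
        flat = {}
        for key, item in value.items():
            flat.update(flatten_with_lists(item, f"{prefix}.{key}"))
        return flat
    if isinstance(value, list):
        # Assumed reading of the rule: one key per dict, all keys distinct.
        single_key = all(isinstance(item, dict) and len(item) == 1 for item in value)
        keys = [next(iter(item)) for item in value] if single_key else []
        flat = {}
        if value and single_key and len(set(keys)) == len(keys):
            # The dict's own key replaces the list index in the path.
            for item in value:
                flat.update(flatten_with_lists(item, prefix))
        else:
            # Default: use the position in the list as a path component.
            for index, item in enumerate(value):
                flat.update(flatten_with_lists(item, f"{prefix}.{index}"))
        return flat
    return {prefix: value}


indexed = flatten_with_lists(
    {"addresses": [{"street": "Main St", "city": "Smallville"},
                   {"street": "Log St", "city": "BigCity"}]},
    "argument",
)
keyed = flatten_with_lists(
    {"addresses": [{"home": {"street": "Main St", "city": "Smallville"}},
                   {"work": {"street": "Log St", "city": "BigCity"}}]},
    "argument",
)
print(indexed)
print(keyed)
```

Using the dictionary's own key instead of the index keeps the comparison independent of the order in which the list items are emitted, which is presumably why that form is preferred when it is available.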
References: metrics.accuracy
Read more about catalog usage here.