📄 Tool Calling Correctness

metrics.llm_as_judge.direct.criteria.tool_calling_correctness

CriteriaWithOptions(
    name="tool_calling_correctness",
    description="The response correctly uses tool calls as expected, including the right tool names and parameters, in line with the reference or user query and instructions.",
    prediction_field=None,
    context_fields=None,
    options=[
        CriteriaOption(
            name="Excellent",
            description="All tool calls are correct, including names and parameters, matching the reference or user expectations precisely.",
        ),
        CriteriaOption(
            name="Good",
            description="Tool calls are mostly correct with minor errors that do not affect the functionality or intent.",
        ),
        CriteriaOption(
            name="Mediocre",
            description="The response attempts tool calls with partial correctness, but has notable issues in tool names, structure, or parameters.",
        ),
        CriteriaOption(
            name="Bad",
            description="The tool calling logic is largely incorrect, with significant mistakes in tool usage or missing key calls.",
        ),
        CriteriaOption(
            name="Very Bad",
            description="The tool calls are completely incorrect, irrelevant, or missing when clearly required.",
        ),
    ],
    option_map={
        "Excellent": 1.0,
        "Good": 0.75,
        "Mediocre": 0.5,
        "Bad": 0.25,
        "Very Bad": 0.0,
    },
)

from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions
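
The option_map above pairs each option name with a numeric score. As a minimal sketch of how such a mapping could be applied, the plain-Python snippet below copies the mapping into a dictionary and looks up the score for the option a judge selected. It does not import unitxt; the helper names score_for_option and mean_score are hypothetical, and averaging across several judgments is an assumption made for illustration, not something this page states about how unitxt aggregates results.

```python
# Illustrative only: plain Python, no unitxt import.
# The values are copied from the option_map of tool_calling_correctness.
OPTION_MAP = {
    "Excellent": 1.0,
    "Good": 0.75,
    "Mediocre": 0.5,
    "Bad": 0.25,
    "Very Bad": 0.0,
}


def score_for_option(selected_option: str) -> float:
    """Return the numeric score for the option name a judge selected.

    Hypothetical helper: raises ValueError for names outside the option list.
    """
    if selected_option not in OPTION_MAP:
        raise ValueError(f"Unknown option: {selected_option!r}")
    return OPTION_MAP[selected_option]


def mean_score(selected_options: list[str]) -> float:
    """Average the scores of several selected options.

    Hypothetical helper: the averaging step is an assumption for illustration.
    """
    if not selected_options:
        raise ValueError("At least one selected option is required")
    scores = [score_for_option(option) for option in selected_options]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(score_for_option("Good"))
    print(mean_score(["Excellent", "Mediocre", "Very Bad"]))
```

Because the scores are evenly spaced between 0.0 and 1.0, each step down the scale costs 0.25, so "Good" and "Mediocre" differ by as much as "Bad" and "Very Bad".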

Explanation about CriteriaWithOptions

Criteria used by DirectLLMJudge to run evaluations.

Explanation about CriteriaOption

A single option of a criterion, defined by a name and a description of when that option applies.

Read more about catalog usage here.