📄 Reflection

A metric that assesses tool call predictions for both syntactic correctness and semantic validity, using predefined checks combined with LLM-based evaluations. For each instance, it returns a score reflecting overall validity, along with a breakdown of the specific checks that passed or failed, including the hallucination check, value format alignment, function selection, and agentic constraints satisfaction. Each metric also includes evidence from the input, an explanation of the reflection decision, a confidence value, and a validity score from 1 to 5 (higher means more valid).

metrics.tool_calling.reflection

Explanation about ReflectionToolCallingMetric

Measures syntactic and semantic validity of tool calls.

The final output contains two main fields: “semantic” and “static” (i.e., syntactic). Under “semantic” we define two types of metrics: general and function selection.

General metrics evaluate the overall quality and correctness of the tool call. These metrics include:

  1. General hallucination check: Evaluate whether each parameter value in the function call is correct, directly supported by the provided conversation history, and adheres to the tool specifications.

  2. Value format alignment: Check if the format of the parameter values aligns with the expected formats defined in the tool specifications.

Function selection metrics evaluate the appropriateness of the selected function for the given context. These metrics include:

  1. Function selection appropriateness: Assess whether the chosen function is suitable for the task at hand.

  2. Agentic constraints satisfaction: Assess whether the proposed tool call satisfies all agentic constraints required for execution.

Static metrics evaluate the syntactic validity of the tool call. They include:

  - non_existent_function: tool name not found.
  - non_existent_parameter: argument name not in tool spec.
  - incorrect_parameter_type: argument type mismatch.
  - missing_required_parameter: required argument missing.
  - allowed_values_violation: argument value outside allowed set.
  - json_schema_violation: call violates JSON schema.
  - empty_api_spec: no tool spec provided.
  - invalid_api_spec: tool spec is invalid.
  - invalid_tool_call: call is not a valid tool invocation.
  - overall_valid: validity of the call (main score).
  - score: alias of overall_valid.
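To make the static checks more concrete, here is a minimal sketch of how a few of them could be computed against a JSON-schema-style tool spec. This is not the metric's actual implementation: the function name `static_checks`, the tool-call shape (`name`, `arguments`), and the tool-spec shape (`name`, `parameters` with `properties`, `required`, `enum`) are all assumptions made for illustration. The sketch also assumes that `True` means the check passed, so that a higher value means fewer errors.

```python
from typing import Any

# Assumed mapping from JSON-schema type names to Python types (a small subset).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def static_checks(tool_call: dict[str, Any], tools: list[dict[str, Any]]) -> dict[str, bool]:
    """Hypothetical helper: return one flag per check, True meaning the check passed."""
    passed = {
        "non_existent_function": True,
        "non_existent_parameter": True,
        "incorrect_parameter_type": True,
        "missing_required_parameter": True,
        "allowed_values_violation": True,
        "empty_api_spec": bool(tools),  # fails when no tool spec is provided
    }
    spec = next((t for t in tools if t.get("name") == tool_call.get("name")), None)
    if spec is None:
        # The called tool name does not appear in the provided specs.
        passed["non_existent_function"] = False
    else:
        params = spec.get("parameters", {})
        properties = params.get("properties", {})
        args = tool_call.get("arguments", {})
        for name, value in args.items():
            if name not in properties:
                passed["non_existent_parameter"] = False
                continue
            schema = properties[name]
            type_name = schema.get("type")
            expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
            if expected is not None:
                # bool is a subclass of int in Python, so reject it for numeric types.
                bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
                if not isinstance(value, expected) or bool_as_number:
                    passed["incorrect_parameter_type"] = False
            if "enum" in schema and value not in schema["enum"]:
                passed["allowed_values_violation"] = False
        if any(req not in args for req in params.get("required", [])):
            passed["missing_required_parameter"] = False
    # The call is considered valid only if every individual check passed.
    passed["overall_valid"] = all(passed.values())
    return passed
```

The remaining checks (json_schema_violation, invalid_api_spec, invalid_tool_call) are omitted here; in practice they would likely be delegated to a full JSON Schema validator rather than hand-written rules.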

Here is an example of an aggregated reflection output after calling reduce. Each score is in the range [0, 1], where higher indicates fewer errors.

{
  "static_non_existent_function": 1.0,
  "static_non_existent_parameter": 1.0,
  "static_incorrect_parameter_type": 1.0,
  "static_missing_required_parameter": 1.0,
  "static_allowed_values_violation": 1.0,
  "static_json_schema_violation": 1.0,
  "static_empty_api_spec": 1.0,
  "static_invalid_api_spec": 1.0,
  "static_invalid_tool_call": 1.0,
  "semantic_general_hallucination_check": 0.0,
  "semantic_general_value_format_alignment": 0.0,
  "semantic_avg_score_general": 1.0,
  "semantic_function_selection_appropriateness": 0.0,
  "semantic_agentic_constraints_satisfaction": 0.0,
  "semantic_avg_score_function_selection": 1.0,
  "overall_valid": 1.0
}

Here, overall_valid is the final decision made by the reflection pipeline, indicating whether the tool call is valid or not.
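Because the aggregated output is a flat dictionary whose keys carry a `static_` or `semantic_` prefix, it can be convenient to regroup it before inspection. The sketch below does that for the example values shown in this section; `group_scores` is a hypothetical helper, not part of the metric's API, and it assumes only the key-naming pattern visible in the example.

```python
# Aggregated scores copied from the example output in this section.
aggregated = {
    "static_non_existent_function": 1.0,
    "static_non_existent_parameter": 1.0,
    "static_incorrect_parameter_type": 1.0,
    "static_missing_required_parameter": 1.0,
    "static_allowed_values_violation": 1.0,
    "static_json_schema_violation": 1.0,
    "static_empty_api_spec": 1.0,
    "static_invalid_api_spec": 1.0,
    "static_invalid_tool_call": 1.0,
    "semantic_general_hallucination_check": 0.0,
    "semantic_general_value_format_alignment": 0.0,
    "semantic_avg_score_general": 1.0,
    "semantic_function_selection_appropriateness": 0.0,
    "semantic_agentic_constraints_satisfaction": 0.0,
    "semantic_avg_score_function_selection": 1.0,
    "overall_valid": 1.0,
}


def group_scores(scores: dict[str, float]) -> dict[str, dict[str, float]]:
    """Hypothetical helper: split flat keys into static, semantic, and other groups."""
    groups: dict[str, dict[str, float]] = {"static": {}, "semantic": {}, "other": {}}
    for key, value in scores.items():
        prefix, _, rest = key.partition("_")
        if prefix in ("static", "semantic") and rest:
            # Drop the prefix so the inner key matches the check name.
            groups[prefix][rest] = value
        else:
            # Keys without a known prefix (e.g. overall_valid) are kept as-is.
            groups["other"][key] = value
    return groups


groups = group_scores(aggregated)
for family, values in groups.items():
    print(family, values)
```

Grouping by prefix keeps the main decision (overall_valid) separate from the per-check scores, which makes it easier to see which family of checks a low score belongs to.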

Before aggregation, each metric also contains evidence, an explanation, a more fine-grained score, etc.

Reference: https://github.ibm.com/MLT/LLMEvalKit

Read more about catalog usage here.