📄 Reflection

A metric that assesses tool call predictions for both syntactic correctness and semantic validity, using predefined checks combined with LLM-based evaluations. For each instance, it returns a score reflecting the overall validity of the tool call, along with a breakdown of the specific checks that passed or failed, including the hallucination check, value format alignment, function selection, and agentic constraints satisfaction. Each metric also includes evidence from the input, an explanation of the reflection decision, a confidence, and a validity score from 1 to 5 (a higher score means more valid).

metrics.tool_calling.reflection

ReflectionToolCallingMetric()

Explanation about ReflectionToolCallingMetric

Measures syntactic and semantic validity of tool calls.

The final output contains two main fields: “semantic” and “static” (i.e., syntactic). Under “semantic” we define two types of metrics: general and function selection.

General metrics evaluate the overall quality and correctness of the tool call. These metrics include:

General hallucination check: Evaluate whether each parameter value in the function call is correct, directly supported by the provided conversation history, and adheres to the tool specifications.

Value format alignment: Check if the format of the parameter values aligns with the expected formats defined in the tool specifications.

Function selection metrics evaluate the appropriateness of the selected function for the given context. These metrics include:

Function selection appropriateness: Assess whether the chosen function is suitable for the task at hand.

Agentic constraints satisfaction: Assess whether the proposed tool call satisfies all agentic constraints required for execution.

Static metrics evaluate the syntactic validity of the tool call. They include the following metrics:

- non_existent_function: tool name not found.
- non_existent_parameter: argument name not in tool spec.
- incorrect_parameter_type: argument type mismatch.
- missing_required_parameter: required argument missing.
- allowed_values_violation: argument value outside allowed set.
- json_schema_violation: call violates JSON schema.
- empty_api_spec: no tool spec provided.
- invalid_api_spec: tool spec is invalid.
- invalid_tool_call: call is not a valid tool invocation.
- overall_valid: validity of the call (main score).
- score: alias of overall_valid.
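To make the static checks above more concrete, here is a minimal sketch of how a few of them could be computed by hand. It is not the metric's actual implementation: the function name static_violations, the tool call shape ({"name": ..., "arguments": {...}}), and the tool spec shape (a JSON-schema-style "parameters" object) are all assumptions made for illustration.

```python
# Hypothetical helper, NOT the library's implementation: it only illustrates
# what some of the static checks look for. Input shapes are assumed.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def static_violations(tool_call, tools):
    """Return the names of the static checks that the tool call violates."""
    if not tools:
        return ["empty_api_spec"]
    spec = next((t for t in tools if t.get("name") == tool_call.get("name")), None)
    if spec is None:
        return ["non_existent_function"]

    schema = spec.get("parameters", {})
    properties = schema.get("properties", {})
    arguments = tool_call.get("arguments", {})
    violations = []

    for name, value in arguments.items():
        if name not in properties:
            violations.append("non_existent_parameter")
            continue
        declared = properties[name].get("type")
        expected = JSON_TYPES.get(declared)
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and declared in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            violations.append("incorrect_parameter_type")
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            violations.append("allowed_values_violation")

    for name in schema.get("required", []):
        if name not in arguments:
            violations.append("missing_required_parameter")
    return violations


# Made-up tool spec and tool call, for illustration only.
tools = [{
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}]
call = {"name": "get_weather", "arguments": {"unit": "kelvin", "days": 3}}
print(static_violations(call, tools))
```

In this sketch a call counts as statically valid only when the returned list is empty, which loosely mirrors the role of overall_valid for the static part.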

Here is an example of an aggregated reflection output after calling reduce. Each score is in the range [0, 1], where higher indicates fewer errors.

{
    "static_non_existent_function": 1.0,
    "static_non_existent_parameter": 1.0,
    "static_incorrect_parameter_type": 1.0,
    "static_missing_required_parameter": 1.0,
    "static_allowed_values_violation": 1.0,
    "static_json_schema_violation": 1.0,
    "static_empty_api_spec": 1.0,
    "static_invalid_api_spec": 1.0,
    "static_invalid_tool_call": 1.0,
    "semantic_general_hallucination_check": 0.0,
    "semantic_general_value_format_alignment": 0.0,
    "semantic_avg_score_general": 1.0,
    "semantic_function_selection_appropriateness": 0.0,
    "semantic_agentic_constraints_satisfaction": 0.0,
    "semantic_avg_score_function_selection": 1.0,
    "overall_valid": 1.0
}

Here, overall_valid is the final decision made by the reflection pipeline, indicating whether the tool call is valid.
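Because the aggregated output is a flat dictionary whose keys carry a static_ or semantic_ prefix, it can be convenient to regroup it before inspecting it. The following is a small sketch of one way to do that; the helper name summarize_reflection and the threshold parameter are hypothetical, and the input is a subset of the example values shown above.

```python
def summarize_reflection(aggregated, threshold=1.0):
    """Group an aggregated reflection dict by key prefix and list low scores."""
    static = {k: v for k, v in aggregated.items() if k.startswith("static_")}
    semantic = {k: v for k, v in aggregated.items() if k.startswith("semantic_")}
    # Checks scoring below the threshold, sorted by name for stable output.
    flagged = sorted(k for k, v in {**static, **semantic}.items() if v < threshold)
    return {
        "overall_valid": aggregated["overall_valid"],
        "static": static,
        "semantic": semantic,
        "flagged": flagged,
    }


# A subset of the example aggregated output.
aggregated = {
    "static_non_existent_function": 1.0,
    "static_missing_required_parameter": 1.0,
    "semantic_general_hallucination_check": 0.0,
    "semantic_avg_score_general": 1.0,
    "semantic_function_selection_appropriateness": 0.0,
    "overall_valid": 1.0,
}
summary = summarize_reflection(aggregated)
print(summary["flagged"])
```

Grouping by prefix keeps the syntactic and semantic results separate, so a reader can see at a glance which side of the pipeline a low score comes from.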

Before aggregation, each metric also contains evidence, an explanation, a more fine-grained score, and so on.
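As a rough illustration of what such a pre-aggregation record might hold, the sketch below models one check result with the fields named in the description (evidence, explanation, confidence, and a 1-5 validity score). The class name CheckResult, the field names, and the sample values are assumptions made for illustration and do not reflect the library's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical per-check record; names are illustrative only."""
    name: str
    evidence: str
    explanation: str
    confidence: float
    score: int  # validity score, 1 (least valid) to 5 (most valid)

    def __post_init__(self):
        # Enforce the documented 1-5 range for the fine-grained score.
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be in 1..5, got {self.score}")


# Made-up values, for illustration only.
result = CheckResult(
    name="general_hallucination_check",
    evidence="User: book a table for two at 7pm",
    explanation="The party size and time both appear in the conversation.",
    confidence=0.9,
    score=5,
)
print(result.name, result.score)
```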

Reference: https://github.ibm.com/MLT/LLMEvalKit