Reflection
A metric that assesses tool call predictions for both syntactic correctness and semantic validity, using predefined checks combined with LLM-based evaluations. For each instance, it returns a score reflecting the overall validity of the tool call, along with a breakdown of the specific checks that passed or failed: hallucination check, value format alignment, function selection appropriateness, and agentic constraints satisfaction. Each metric also includes evidence from the input, an explanation of the reflection decision, a confidence value, and a validity score from 1 to 5 (a higher score means more valid).
metrics.tool_calling.reflection
Explanation about ReflectionToolCallingMetric
Measures syntactic and semantic validity of tool calls.
The final output contains two main fields: "semantic" and "static" (i.e., syntactic). The semantic field covers two types of metrics: general and function selection.
General metrics evaluate the overall quality and correctness of the tool call. These metrics are:
General hallucination check: Evaluate whether each parameter value in the function call is correct, is directly supported by the provided conversation history, and adheres to the tool specifications.
Value format alignment: Check if the format of the parameter values aligns with the expected formats defined in the tool specifications.
Function selection metrics evaluate the appropriateness of the selected function for the given context. These metrics include:
Function selection appropriateness: Assess whether the chosen function is suitable for the task at hand.
Agentic constraints satisfaction: Assess whether the proposed tool call satisfies all agentic constraints required for execution.
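The description above says that each metric result carries evidence, an explanation, a confidence, and a validity score from 1 to 5. A minimal sketch of such a record follows. The class name `MetricReflection` and its field names are hypothetical and chosen only for illustration; they are not taken from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricReflection:
    """Hypothetical container for one metric's result before aggregation."""

    evidence: str      # text from the input that supports the decision
    explanation: str   # why the reflection reached this decision
    confidence: float  # confidence in the decision (range not stated in the source)
    score: int         # validity score, 1 to 5, higher means more valid

    def __post_init__(self) -> None:
        # Only the 1-5 score range is documented, so only that is enforced.
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be in 1..5, got {self.score}")
```

A record with a score outside the documented range is rejected at construction time, which keeps any later aggregation from silently mixing scales.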
Static metrics evaluate the syntactic validity of the tool call. They include the following metrics:
- non_existent_function: tool name not found.
- non_existent_parameter: argument name not in tool spec.
- incorrect_parameter_type: argument type mismatch.
- missing_required_parameter: required argument missing.
- allowed_values_violation: argument value outside allowed set.
- json_schema_violation: call violates JSON schema.
- empty_api_spec: no tool spec provided.
- invalid_api_spec: tool spec is invalid.
- invalid_tool_call: call is not a valid tool invocation.
- overall_valid: validity of the call (main score).
- score: alias of overall_valid.
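To make several of the static checks concrete, here is a minimal sketch of how they might be computed against a JSON-schema-style tool spec. It is not the library's implementation: the function `static_checks`, the shape assumed for the call (`name`, `arguments`) and for the tool spec (`name`, `parameters` with `properties`, `required`, `enum`) are assumptions made for illustration. In this sketch a flag of `True` means the issue was detected.

```python
from typing import Any

# Assumed mapping from JSON-schema type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def _type_ok(value: Any, type_name: Any) -> bool:
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown or missing type: nothing to check
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def static_checks(call: dict[str, Any], tools: list[dict[str, Any]]) -> dict[str, bool]:
    """Return one flag per check; True means the issue was detected."""
    issues = {
        "non_existent_function": False,
        "non_existent_parameter": False,
        "incorrect_parameter_type": False,
        "missing_required_parameter": False,
        "allowed_values_violation": False,
    }
    spec = next((t for t in tools if t["name"] == call["name"]), None)
    if spec is None:
        issues["non_existent_function"] = True
        return issues  # no spec to compare the arguments against

    params = spec.get("parameters", {})
    properties = params.get("properties", {})
    args = call.get("arguments", {})

    for name, value in args.items():
        if name not in properties:
            issues["non_existent_parameter"] = True
            continue
        prop = properties[name]
        if not _type_ok(value, prop.get("type")):
            issues["incorrect_parameter_type"] = True
        if "enum" in prop and value not in prop["enum"]:
            issues["allowed_values_violation"] = True

    for name in params.get("required", []):
        if name not in args:
            issues["missing_required_parameter"] = True
    return issues
```

Under these assumptions, a call is statically valid when no flag is set, for example `not any(static_checks(call, tools).values())`.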
Here is an example of an aggregated reflection output after calling reduce. Each score is in the range [0, 1], where a higher value indicates fewer errors.
{
    "static_non_existent_function": 1.0,
    "static_non_existent_parameter": 1.0,
    "static_incorrect_parameter_type": 1.0,
    "static_missing_required_parameter": 1.0,
    "static_allowed_values_violation": 1.0,
    "static_json_schema_violation": 1.0,
    "static_empty_api_spec": 1.0,
    "static_invalid_api_spec": 1.0,
    "static_invalid_tool_call": 1.0,
    "semantic_general_hallucination_check": 0.0,
    "semantic_general_value_format_alignment": 0.0,
    "semantic_avg_score_general": 1.0,
    "semantic_function_selection_appropriateness": 0.0,
    "semantic_agentic_constraints_satisfaction": 0.0,
    "semantic_avg_score_function_selection": 1.0,
    "overall_valid": 1.0
}
Here, overall_valid is the final decision made by the reflection pipeline, indicating whether the tool call is valid.
Before aggregation, each metric also contains evidence, an explanation, a more fine-grained score, and other details.
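The source does not state how reduce combines per-instance results. One plausible reading, given that every aggregated score lies in [0, 1], is a per-key mean over instances. The sketch below illustrates that assumption only; `reduce_scores` is a hypothetical helper, not the library's reduce method.

```python
from statistics import fmean


def reduce_scores(instances: list[dict[str, float]]) -> dict[str, float]:
    """Average each score key across instances (assumed aggregation rule)."""
    if not instances:
        return {}
    # Assumes every instance reports the same set of keys.
    keys = instances[0].keys()
    return {key: fmean([inst[key] for inst in instances]) for key in keys}
```

With this rule, an aggregated overall_valid of 1.0 would mean that every evaluated tool call was judged valid, and a value of 0.5 that half of them were.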
Reference: https://github.ibm.com/MLT/LLMEvalKit
Read more about catalog usage here.