Syntactic
This metric evaluates whether a model's tool calls are structurally valid by checking them against the provided tool schema. For each instance, it returns a binary score (True for valid, False for invalid) and aggregates these into a global percentage across all instances. The evaluation covers nonexistent functions or parameters, incorrect parameter types, missing required parameters, values outside the allowed set, JSON schema violations, invalid or empty API specifications, and malformed tool calls. The main reported score, overall_valid (aliased as score), reflects the proportion of calls that are fully valid, so the metric measures syntactic and schema-level correctness rather than semantic accuracy. Each metric also carries an explanation describing the errors it detected; if no errors were found, the explanation is None.
metrics.tool_calling.reflection.syntactic
Explanation about ReflectionToolCallingMetricSyntactic
Measures syntactic and schema validity of tool calls.
Range: [0, 1] (higher indicates fewer errors). Each individual score is 1.0 if the tool call passes that check and 0.0 otherwise. overall_valid equals 1.0 if all checks pass, 0.0 otherwise. The global score is the percentage of valid instances across the dataset.
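The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the metric's actual implementation: the function names `overall_valid` and `global_score`, and the representation of an instance as a dictionary of per-check scores, are hypothetical.

```python
def overall_valid(check_scores: dict[str, float]) -> float:
    """Return 1.0 only if every per-check score is 1.0 (hypothetical helper)."""
    return 1.0 if all(v == 1.0 for v in check_scores.values()) else 0.0


def global_score(instances: list[dict[str, float]]) -> float:
    """Fraction of instances whose tool call is fully valid (hypothetical helper)."""
    if not instances:
        raise ValueError("at least one instance is required")
    flags = [overall_valid(scores) for scores in instances]
    return sum(flags) / len(flags)


# Two instances: the first passes every check, the second misses a required parameter.
instances = [
    {"non_existent_function": 1.0, "missing_required_parameter": 1.0},
    {"non_existent_function": 1.0, "missing_required_parameter": 0.0},
]
print(global_score(instances))
```

A single failed check is enough to make an instance invalid, so the global score can be noticeably lower than the average of any individual check.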
Scores:
- non_existent_function: tool name not found.
- non_existent_parameter: argument name not in tool spec.
- incorrect_parameter_type: argument type mismatch.
- missing_required_parameter: required argument missing.
- allowed_values_violation: argument value outside allowed set.
- json_schema_violation: call violates JSON schema.
- empty_api_spec: no tool spec provided.
- invalid_api_spec: tool spec is invalid.
- invalid_tool_call: call is not a valid tool invocation.
- overall_valid: validity of the call (main score).
- score: alias of overall_valid.
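To make a few of these checks concrete, the following sketch validates one tool call against a list of tool specs. It is an illustration only and does not reproduce the metric's real code: the function `check_tool_call`, the helper `matches_type`, and the assumed spec layout (a `name` plus a JSON-Schema-style `parameters` object with `properties`, `required`, and `enum`) are all assumptions. Only five of the listed checks are covered, and each score follows the convention of 1.0 for a passed check.

```python
# Mapping from JSON Schema type names to Python types (assumed spec format).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

CHECKS = (
    "non_existent_function",
    "non_existent_parameter",
    "incorrect_parameter_type",
    "missing_required_parameter",
    "allowed_values_violation",
)


def matches_type(value, type_name) -> bool:
    """Check a value against a JSON Schema type name; unknown names pass."""
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True
    # bool is a subclass of int in Python, so reject it for non-boolean types.
    if isinstance(value, bool) and type_name != "boolean":
        return False
    return isinstance(value, expected)


def check_tool_call(call: dict, tools: list[dict]) -> dict[str, float]:
    """Return 1.0 for each passed check and 0.0 for each failed one."""
    scores = {name: 1.0 for name in CHECKS}
    spec = next((t for t in tools if t.get("name") == call.get("name")), None)
    if spec is None:
        scores["non_existent_function"] = 0.0
    else:
        params = spec.get("parameters", {})
        props = params.get("properties", {})
        args = call.get("arguments", {})
        for key, value in args.items():
            if key not in props:
                scores["non_existent_parameter"] = 0.0
            elif not matches_type(value, props[key].get("type")):
                scores["incorrect_parameter_type"] = 0.0
            elif "enum" in props[key] and value not in props[key]["enum"]:
                scores["allowed_values_violation"] = 0.0
        if any(name not in args for name in params.get("required", [])):
            scores["missing_required_parameter"] = 0.0
    # The call is valid overall only if every individual check passed.
    scores["overall_valid"] = 1.0 if all(v == 1.0 for v in scores.values()) else 0.0
    return scores


tools = [{
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["c", "f"]},
        },
        "required": ["city"],
    },
}]
print(check_tool_call({"name": "get_weather", "arguments": {"unit": "kelvin"}}, tools))
```

In the example call, `unit` is outside its allowed set and the required `city` is missing, so both corresponding checks fail and overall_valid is 0.0. A fuller validator would typically delegate the type and enum logic to a JSON Schema library instead of hand-written checks.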
Reference: https://github.ibm.com/MLT/LLMEvalKit