unitxt.metrics module¶
- class unitxt.metrics.ANLS(data_classification_policy: List[str] = None, main_score: str = 'anls', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['anls']}, reference_field: str = 'references', prediction_field: str = 'prediction', threshold: float = 0.5)[source]¶
Bases: InstanceMetric
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['anls']}¶
- class unitxt.metrics.Accuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
- ci_scores: List[str] = ['accuracy']¶
- prediction_type: Any | str = typing.Any¶
- reduction_map: Dict[str, List[str]] = {'mean': ['accuracy']}¶
- class unitxt.metrics.AccuracyFast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None)[source]¶
Bases: ReductionInstanceMetric[str, Dict[str, float]]
- reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶
- class unitxt.metrics.AggregationReduction(data_classification_policy: List[str] = None)[source]¶
Bases: Artifact, Generic[IntermediateType]
- class unitxt.metrics.BertScore(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['bert_score'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.DictReduction = None, model_name: str = __required__, batch_size: int = 32, model_layer: int = None)[source]¶
Bases: MapReduceMetric[str, Dict[str, float]], TorchDeviceMixin
- reduction: DictReduction = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶
- class unitxt.metrics.BinaryAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy_binary'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['accuracy_binary']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
Calculate accuracy for a binary task, using 0.5 as the threshold in the case of float predictions.
- ci_scores: List[str] = ['accuracy_binary']¶
- prediction_type¶
alias of Union[float, int]
- reduction_map: Dict[str, List[str]] = {'mean': ['accuracy_binary']}¶
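The 0.5 thresholding described above can be illustrated with a short sketch; this is plain Python for illustration, not the metric's internal code, and the helper name is hypothetical:

    from typing import List, Union

    def binary_accuracy(predictions: List[Union[float, int]],
                        references: List[Union[float, int]],
                        threshold: float = 0.5) -> float:
        # Illustrative only: binarize float predictions at `threshold`,
        # then compare each prediction to its single reference.
        correct = 0
        for pred, ref in zip(predictions, references):
            pred_label = int(float(pred) > threshold)  # 0.5 threshold for float predictions
            ref_label = int(float(ref) > threshold)
            correct += int(pred_label == ref_label)
        return correct / len(predictions)

    # binary_accuracy([0.9, 0.2, 0.7], [1, 0, 0]) -> 2/3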
- class unitxt.metrics.BinaryMaxAccuracy(data_classification_policy: List[str] = None, main_score: str = 'max_accuracy_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: GlobalMetric
Calculate the maximal accuracy and the decision threshold that achieves it for a binary task with float predictions.
- prediction_type¶
alias of Union[float, int]
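A minimal sketch of the idea, for illustration only (not the metric's implementation): sweep candidate thresholds over the float predictions and keep the one that maximizes accuracy:

    from typing import List, Tuple

    def max_binary_accuracy(predictions: List[float], references: List[int]) -> Tuple[float, float]:
        # Illustrative only: try each prediction value (plus 0 and 1) as a threshold
        # and return (best_accuracy, best_threshold).
        best_acc, best_thr = 0.0, 0.5
        for thr in sorted(set(predictions) | {0.0, 1.0}):
            acc = sum(int((p >= thr) == bool(r)) for p, r in zip(predictions, references)) / len(predictions)
            if acc > best_acc:
                best_acc, best_thr = acc, thr
        return best_acc, best_thr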
- class unitxt.metrics.BinaryMaxF1(data_classification_policy: List[str] = None, main_score: str = 'max_f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = ['max_f1_binary', 'max_f1_binary_neg'], _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1Binary
Calculate the maximal F1 and the decision threshold that achieves it for a binary task with float predictions.
- ci_scores: List[str] = ['max_f1_binary', 'max_f1_binary_neg']¶
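One common way to compute the same quantity uses scikit-learn's precision-recall curve; this is a hedged sketch, and the metric's actual implementation may differ:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def max_f1_and_threshold(probs, labels):
        # Sweep thresholds via the precision-recall curve and return (max F1, threshold).
        precision, recall, thresholds = precision_recall_curve(labels, probs)
        # The last precision/recall pair has no corresponding threshold, so drop it.
        f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
        best = int(np.argmax(f1))
        return float(f1[best]), float(thresholds[best])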
- class unitxt.metrics.BulkInstanceMetric(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'])[source]¶
- class unitxt.metrics.CharEditDistance(data_classification_policy: List[str] = None, main_score: str = 'char_edit_distance', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['char_edit_distance'], _requirements_list: List[str] = ['editdistance'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_distance']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
- ci_scores: List[str] = ['char_edit_distance']¶
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_distance']}¶
- class unitxt.metrics.CharEditDistanceAccuracy(data_classification_policy: List[str] = None, main_score: str = 'char_edit_dist_accuracy', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['char_edit_dist_accuracy'], _requirements_list: List[str] = ['editdistance'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_dist_accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: CharEditDistance
- ci_scores: List[str] = ['char_edit_dist_accuracy']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_dist_accuracy']}¶
- class unitxt.metrics.ConfidenceIntervalMixin(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None)[source]¶
Bases: Artifact
- class unitxt.metrics.CustomF1(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases: GlobalMetric
- prediction_type: Any | str = typing.Any¶
- class unitxt.metrics.CustomF1Fuzzy(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases: CustomF1
- class unitxt.metrics.Detector(data_classification_policy: List[str] = None, main_score: str = 'detector_score', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['detector_score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], batch_size: int = 32, model_name: str = __required__)[source]¶
Bases: BulkInstanceMetric
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['detector_score']}¶
- class unitxt.metrics.DictReduction(data_classification_policy: List[str] = None)[source]¶
Bases: AggregationReduction[Dict[str, float]]
- class unitxt.metrics.EvaluationInput(prediction: PredictionType, references: List[PredictionType], task_data: Dict[str, Any])[source]¶
Bases: tuple, Generic[PredictionType]
- class unitxt.metrics.ExactMatchMM(data_classification_policy: List[str] = None, main_score: str = 'exact_match_mm', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['exact_match_mm']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
- prediction_type: Any | str = typing.Any¶
- reduction_map: Dict[str, List[str]] = {'mean': ['exact_match_mm']}¶
- class unitxt.metrics.F1(data_classification_policy: List[str] = None, main_score: str = 'f1_macro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn<=1.5.2'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: GlobalMetric
- prediction_type¶
alias of str
- class unitxt.metrics.F1Binary(data_classification_policy: List[str] = None, main_score: str = 'f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: GlobalMetric
Calculate f1 for a binary task, using 0.5 as the threshold in the case of float predictions.
- ci_scores: List[str] = ['f1_binary', 'f1_binary_neg']¶
- prediction_type¶
alias of Union[float, int]
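As an illustration of the two scores listed in ci_scores (f1_binary for the positive class and f1_binary_neg for the negative class), they can be reproduced with scikit-learn after applying the 0.5 threshold; this sketch is not the library's internal code:

    from sklearn.metrics import f1_score

    probs = [0.9, 0.4, 0.8, 0.1]           # float predictions
    refs = [1, 0, 0, 0]                    # binary references
    preds = [int(p > 0.5) for p in probs]  # 0.5 threshold, as described above

    f1_binary = f1_score(refs, preds, pos_label=1)      # F1 of the positive class
    f1_binary_neg = f1_score(refs, preds, pos_label=0)  # F1 of the negative class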
- class unitxt.metrics.F1BinaryPosOnly(data_classification_policy: List[str] = None, main_score: str = 'f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1Binary
- class unitxt.metrics.F1Fast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = ['scikit-learn', 'regex'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', averages: List[Literal['f1', 'macro', 'micro', 'per_class']] = ['f1', 'micro', 'macro', 'per_class'], ignore_punc: bool = True, ignore_case: bool = True)[source]¶
Bases: MapReduceMetric[str, Tuple[int, int]]
- averages: List[Literal['f1', 'macro', 'micro', 'per_class']] = ['f1', 'micro', 'macro', 'per_class']¶
- class unitxt.metrics.F1Macro(data_classification_policy: List[str] = None, main_score: str = 'f1_macro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn<=1.5.2'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1
- class unitxt.metrics.F1MacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1MultiLabel
- class unitxt.metrics.F1Micro(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn<=1.5.2'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1
- class unitxt.metrics.F1MicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1MultiLabel
- class unitxt.metrics.F1MultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: GlobalMetric, PackageRequirementsMixin
- prediction_type¶
alias of List[str]
- class unitxt.metrics.F1Strings(data_classification_policy: List[str] = None, main_score: str = 'f1_strings', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = {'spacy': 'Please pip install spacy'}, requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['f1_strings']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['f1_strings']}¶
- class unitxt.metrics.F1Weighted(data_classification_policy: List[str] = None, main_score: str = 'f1_weighted', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn<=1.5.2'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: F1
- class unitxt.metrics.FaithfulnessHHEM(data_classification_policy: List[str] = None, main_score: str = 'hhem_score', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['hhem_score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], batch_size: int = 2, model_name: str = 'vectara/hallucination_evaluation_model')[source]¶
Bases: BulkInstanceMetric
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['hhem_score']}¶
- class unitxt.metrics.FinQAEval(data_classification_policy: List[str] = None, main_score: str = 'program_accuracy', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['program_accuracy', 'execution_accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['program_accuracy', 'execution_accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: InstanceMetric
- ci_scores: List[str] = ['program_accuracy', 'execution_accuracy']¶
- prediction_type¶
alias of str
- reduction_map: Dict[str, List[str]] = {'mean': ['program_accuracy', 'execution_accuracy']}¶
- class unitxt.metrics.FixedGroupAbsvalNormCohensHParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseAccuracy.<lambda> at 0x7f022c8e70d0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupAbsvalNormCohensHParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseStringContainment.<lambda> at 0x7f022c8e7280>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupAbsvalNormHedgesGParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseAccuracy.<lambda> at 0x7f022c8e7430>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupAbsvalNormHedgesGParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseStringContainment.<lambda> at 0x7f022c8e75e0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupMeanAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f022c930a60>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, True]}}¶
- class unitxt.metrics.FixedGroupMeanBaselineAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineAccuracy.<lambda> at 0x7f022c8cce50>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupMeanBaselineStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineStringContainment.<lambda> at 0x7f022c8d91f0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupMeanParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseAccuracy.<lambda> at 0x7f022c8d9040>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupMeanParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseStringContainment.<lambda> at 0x7f022c8d93a0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupMeanStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f022c930a60>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, True]}}¶
- class unitxt.metrics.FixedGroupNormCohensHParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseAccuracy.<lambda> at 0x7f022c8d99d0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupNormCohensHParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseStringContainment.<lambda> at 0x7f022c8d9b80>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupNormHedgesGParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseAccuracy.<lambda> at 0x7f022c8d9d30>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupNormHedgesGParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseStringContainment.<lambda> at 0x7f022c8d9ee0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupPDRParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseAccuracy.<lambda> at 0x7f022c8d9550>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseAccuracy.<lambda>>, True]}}¶
- class unitxt.metrics.FixedGroupPDRParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseStringContainment.<lambda> at 0x7f022c8d9700>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseStringContainment.<lambda>>, True]}}¶
- class unitxt.metrics.FuzzyNer(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases: CustomF1Fuzzy
- prediction_type¶
alias of List[Tuple[str, str]]
- class unitxt.metrics.GlobalMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases: StreamOperator, MetricWithConfidenceInterval
A class for computing metrics that require joint calculations over all instances and are not just an aggregation of the scores of individual instances.
For example, macro_F1 requires calculating recall and precision per class, so all instances of the class need to be considered. Accuracy, on the other hand, is just an average of the per-instance accuracy scores.
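A small sketch of the distinction, using scikit-learn for illustration (the names below are not part of unitxt): accuracy can be computed per instance and then averaged, whereas macro-F1 needs the whole collection of predictions at once to form per-class precision and recall:

    from sklearn.metrics import f1_score

    references = ["cat", "dog", "dog", "bird"]
    predictions = ["cat", "dog", "bird", "bird"]

    # Instance-style score: each instance has its own 0/1 score; the global score is the mean.
    accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)

    # Global score: per-class precision/recall require all instances jointly,
    # so there is no per-instance macro-F1 to average.
    macro_f1 = f1_score(references, predictions, average="macro")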
- class unitxt.metrics.GraniteGuardianAgenticRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.AGENTIC: 'agentic_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: GraniteGuardianBase
- class unitxt.metrics.GraniteGuardianAssistantRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: GraniteGuardianBase
- class unitxt.metrics.GraniteGuardianBase(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = None, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: InstanceMetric
Return metric for different kinds of “risk” from the Granite-3.0 Guardian model.
- available_risks: Dict[RiskType, List[str]] = {RiskType.AGENTIC: ['function_call'], RiskType.ASSISTANT_MESSAGE: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], RiskType.RAG: ['context_relevance', 'groundedness', 'answer_relevance'], RiskType.USER_MESSAGE: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior']}¶
- prediction_type¶
alias of float
- reduction_map: Dict[str, List[str]] = {}¶
- wml_params = {'decoding_method': 'greedy', 'max_new_tokens': 20, 'return_options': {'input_text': True, 'input_tokens': False, 'top_n_tokens': 5}, 'temperature': 0}¶
- class unitxt.metrics.GraniteGuardianCustomRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.CUSTOM_RISK: 'custom_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: GraniteGuardianBase
- class unitxt.metrics.GraniteGuardianRagRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.RAG: 'rag_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: GraniteGuardianBase
- class unitxt.metrics.GraniteGuardianUserRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.USER_MESSAGE: 'user_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶
Bases: GraniteGuardianBase
- class unitxt.metrics.GroupMeanAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f022c930a60>, False]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: Accuracy
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False]}}¶
- class unitxt.metrics.GroupMeanStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f022c930a60>, False]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: StringContainment
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False]}}¶
- class unitxt.metrics.GroupMeanTokenOverlap(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['f1', 'precision', 'recall'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f022c930a60>, False], 'score_fields': ['f1', 'precision', 'recall']}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases: TokenOverlap
- reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False], 'score_fields': ['f1', 'precision', 'recall']}}¶
- class unitxt.metrics.HuggingfaceBulkMetric(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], hf_metric_name: str = __required__, hf_metric_fields: List[str] = __required__, hf_compute_args: dict = {}, hf_additional_input_fields: List = [])[source]¶
Bases: BulkInstanceMetric
- hf_compute_args: dict = {}¶
- class unitxt.metrics.HuggingfaceInstanceMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction', hf_metric_name: str = __required__, hf_metric_fields: List[str] = __required__, hf_compute_args: dict = {})[source]¶
Bases: InstanceMetric
- hf_compute_args: dict = {}¶
- class unitxt.metrics.HuggingfaceMetric(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = None, hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶
Bases: GlobalMetric
- class unitxt.metrics.InstanceMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
StreamOperator
,MetricWithConfidenceInterval
Class for metrics for which a global score can be calculated by aggregating the instance scores (possibly with additional instance inputs).
InstanceMetric currently allows two reductions:
‘mean’, which calculates the mean of instance scores,
‘group_mean’, which first applies an aggregation function specified in the reduction_map to instance scores grouped by the field grouping_field, and returns the mean of the group scores. If grouping_field is None, grouping is disabled, so it must be set for this reduction to be used. See _validate_group_mean_reduction for formatting instructions; a sketch of both reduction_map shapes follows.
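A minimal sketch of the two reduction_map shapes that appear in the signatures on this page; the 'group_mean' form is copied from the GroupMeanTokenOverlap entry above, and nan_mean is assumed to be importable from unitxt.metrics, as its signature suggests.
from unitxt.metrics import nan_mean

# 'mean' reduction: average the listed instance scores over the stream.
reduction_map_mean = {"mean": ["accuracy"]}

# 'group_mean' reduction: an aggregation spec plus the score fields it is
# applied to within each group of instances.
reduction_map_group_mean = {
    "group_mean": {
        "agg_func": ["mean", nan_mean, False],
        "score_fields": ["f1", "precision", "recall"],
    }
}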
- class unitxt.metrics.IsCodeMixed(data_classification_policy: List[str] = None, main_score: str = 'is_code_mixed', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['is_code_mixed']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_model: unitxt.inference.InferenceEngine = None)[source]¶
Bases:
BulkInstanceMetric
Uses a generative model to assess whether a given text is code-mixed.
Our goal is to identify whether a text is code-mixed, i.e., contains a mixture of different languages. The model is asked to identify the language of the text; if the model response begins with a number, we take this as an indication that the text is code-mixed. For example:
- Model response: “The text is written in 2 different languages” (taken as code-mixed)
- Model response: “The text is written in German” (taken as a single language)
Note that this metric is quite tailored to specific model-template combinations, as it relies on the assumption that the model will complete the answer prefix “The text is written in ___” in a particular way.
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['is_code_mixed']}¶
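An illustrative restatement of the heuristic described above, not the class's implementation; the helper name is hypothetical.
import re

def looks_code_mixed(model_response: str) -> bool:
    # Per the description above: a response that continues the prefix
    # "The text is written in ..." with a number is taken as code-mixed.
    continuation = model_response.removeprefix("The text is written in").strip()
    return bool(re.match(r"\d", continuation))

print(looks_code_mixed("The text is written in 2 different languages"))  # True
print(looks_code_mixed("The text is written in German"))                 # False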
- class unitxt.metrics.JaccardIndex(data_classification_policy: List[str] = None, main_score: str = 'jaccard_index', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['jaccard_index'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['jaccard_index']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['jaccard_index']¶
- prediction_type: Any | str = typing.Any¶
- reduction_map: Dict[str, List[str]] = {'mean': ['jaccard_index']}¶
- class unitxt.metrics.KPA(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases:
CustomF1
- prediction_type¶
alias of
str
- class unitxt.metrics.KendallTauMetric(data_classification_policy: List[str] = None, main_score: str = 'kendalltau_b', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scipy'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
- prediction_type¶
alias of
float
- class unitxt.metrics.KeyValueExtraction(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases:
CustomF1
F1 metric that receives as input a list of (Key, Value) pairs.
- prediction_type¶
alias of
List
[Tuple
[str
,str
]]
- class unitxt.metrics.LlamaIndexCorrectness(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = 'correctness_', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶
Bases:
LlamaIndexLLMMetric
LlamaIndex based metric class for evaluating correctness.
- class unitxt.metrics.LlamaIndexFaithfulness(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = 'faithfulness_', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶
Bases:
LlamaIndexLLMMetric
LlamaIndex based metric class for evaluating faithfulness.
- class unitxt.metrics.LlamaIndexLLMMetric(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶
Bases:
InstanceMetric
- anthropic_models: List[str] = []¶
- data_classification_policy: List[str] = ['public']¶
- external_api_models = ['gpt-3.5-turbo']¶
- mock_models: List[str] = ['mock']¶
- openai_models: List[str] = ['gpt-3.5-turbo']¶
- prediction_type¶
alias of
str
- class unitxt.metrics.MAP(data_classification_policy: List[str] = None, main_score: str = 'map', prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['map'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['map']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
RetrievalMetric
- ci_scores: List[str] = ['map']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['map']}¶
- class unitxt.metrics.MRR(data_classification_policy: List[str] = None, main_score: str = 'mrr', prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['mrr'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['mrr']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
RetrievalMetric
- ci_scores: List[str] = ['mrr']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['mrr']}¶
- class unitxt.metrics.MapReduceMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
StreamOperator
,Metric
,ConfidenceIntervalMixin
,Generic
[PredictionType
,IntermediateType
]
- class unitxt.metrics.MatthewsCorrelation(data_classification_policy: List[str] = None, main_score: str = 'matthews_correlation', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'matthews_correlation', hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [], str_to_id: dict = {})[source]¶
Bases:
HuggingfaceMetric
- prediction_type¶
alias of
str
- class unitxt.metrics.MaxAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'max': ['accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
Accuracy
Calculate the maximal accuracy over all instances as the global score.
- reduction_map: Dict[str, List[str]] = {'max': ['accuracy']}¶
- class unitxt.metrics.MaxReduction(data_classification_policy: List[str] = None)[source]¶
Bases:
DictReduction
- class unitxt.metrics.MeanReduction(data_classification_policy: List[str] = None)[source]¶
Bases:
DictReduction
- class unitxt.metrics.Meteor(data_classification_policy: List[str] = None, main_score: str = 'meteor', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['meteor'], _requirements_list: List[str] = ['nltk>=3.6.6'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['meteor']}, reference_field: str = 'references', prediction_field: str = 'prediction', alpha: float = 0.9, beta: int = 3, gamma: float = 0.5)[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['meteor']¶
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['meteor']}¶
- class unitxt.metrics.MeteorFast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'meteor', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['nltk>=3.6.6'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None, alpha: float = 0.9, beta: int = 3, gamma: float = 0.5)[source]¶
Bases:
ReductionInstanceMetric
[str
,Dict
[str
,float
]]- reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶
- class unitxt.metrics.Metric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '')[source]¶
Bases:
Artifact
- prediction_type: Any | str = typing.Any¶
- class unitxt.metrics.MetricPipeline(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, preprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = [], postprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = [], postpreprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = None, metric: unitxt.metrics.Metric = None)[source]¶
Bases:
MultiStreamOperator
,Metric
- class unitxt.metrics.MetricWithConfidenceInterval(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = None, confidence_level: float = 0.95, ci_scores: List[str] = None)[source]¶
Bases:
Metric
- class unitxt.metrics.MetricsEnsemble(data_classification_policy: List[str] = None, main_score: str = 'ensemble_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}, reference_field: str = 'references', prediction_field: str = 'prediction', metrics: List[unitxt.metrics.Metric | str] = __required__, weights: List[float] = None)[source]¶
Bases:
InstanceMetric
,ArtifactFetcherMixin
Metrics Ensemble class for creating an ensemble of given metrics.
- Parameters:
main_score (str) – The main score label used for evaluation.
metrics (List[Union[Metric, str]]) – List of metrics that will be ensembled.
weights (List[float]) – Weight of each of the metrics.
reduction_map (Dict[str, List[str]]) – Specifies the reduction method for the global score. InstanceMetric currently allows two reductions (see their definitions in the InstanceMetric class). This class defines its default value to reduce by the mean of the main score. A construction sketch follows this entry.
- reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}¶
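A hedged construction sketch based only on the parameters documented above; the metric names are illustrative catalog-style strings, not a prescription.
from unitxt.metrics import MetricsEnsemble

ensemble = MetricsEnsemble(
    main_score="ensemble_score",
    metrics=["metrics.accuracy", "metrics.f1_micro"],  # Metric objects or artifact names
    weights=[0.5, 0.5],                                # one weight per metric
)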
- class unitxt.metrics.MetricsList(data_classification_policy: List[str] = None, items: List[unitxt.artifact.Artifact] = [])[source]¶
Bases:
ListCollection
- class unitxt.metrics.NDCG(data_classification_policy: List[str] = None, main_score: str = 'nDCG', prediction_type: Any | str = typing.Union[float, NoneType], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
Normalized Discounted Cumulative Gain: measures the quality of ranking with respect to ground truth ranking scores.
As this measures ranking, it is a global metric that can only be calculated over groups of instances. In the common use case where the instances are grouped by different queries, i.e., where the task is to provide a relevance score for a search result w.r.t. a query, an nDCG score is calculated for each query (specified in the “query” input field of an instance) and the final score is the average across all queries. Note that the expected scores are relevance scores (i.e., higher is better) and not rank indices. The absolute value of the scores is only meaningful for the reference scores; for the predictions, only the ordering of the scores affects the outcome - for example, predicted scores of [80, 1, 2] and [0.8, 0.5, 0.6] will receive the same nDCG score w.r.t. a given set of reference scores.
See also https://en.wikipedia.org/wiki/Discounted_cumulative_gain
- prediction_type¶
alias of
Optional
[float
]
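A small check of the invariance noted above, using scikit-learn (already listed in this metric's requirements); it is illustrative only and does not go through the unitxt pipeline.
from sklearn.metrics import ndcg_score

references = [[3, 1, 2]]  # ground-truth relevance scores for one query
# [80, 1, 2] and [0.8, 0.5, 0.6] induce the same ordering, hence the same nDCG.
print(ndcg_score(references, [[80, 1, 2]]))
print(ndcg_score(references, [[0.8, 0.5, 0.6]]))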
- class unitxt.metrics.NER(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶
Bases:
CustomF1
F1 metric that receives as input a list of (Entity, EntityType) pairs.
- prediction_type¶
alias of
List
[Tuple
[str
,str
]]
- class unitxt.metrics.NLTKMixin(data_classification_policy: List[str] = None)[source]¶
Bases:
Artifact
- class unitxt.metrics.NormalizedSacrebleu(data_classification_policy: List[str] = None, main_score: str = 'sacrebleu', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = ['sacrebleu'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'sacrebleu', hf_main_score: str = 'score', scale: float = 100.0, scaled_fields: list = ['sacrebleu', 'precisions'], hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = ['tokenize'])[source]¶
Bases:
HuggingfaceMetric
- hf_additional_input_fields_pass_one_value: List = ['tokenize']¶
- prediction_type¶
alias of
str
- scaled_fields: list = ['sacrebleu', 'precisions']¶
- class unitxt.metrics.Perplexity(data_classification_policy: List[str] = None, main_score: str = 'perplexity', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['perplexity']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], source_template: str = __required__, target_template: str = __required__, batch_size: int = 32, model_name: str = __required__, single_token_mode: bool = False)[source]¶
Bases:
BulkInstanceMetric
Computes the likelihood of generating text Y after text X - P(Y|X).
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['perplexity']}¶
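A minimal sketch of the quantity described above, the conditional likelihood P(Y|X), computed with a causal LM from transformers (a listed requirement). The model name is only an example and this is not the class's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example model, not a default
model = AutoModelForCausalLM.from_pretrained("gpt2")

source, target = "Translate to French: Hello.", " Bonjour."
ids = tokenizer(source + target, return_tensors="pt").input_ids
n_source = tokenizer(source, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logits = model(ids).logits

# Log-probability of each token given its preceding tokens, restricted to the
# target tokens, i.e. log P(Y|X), then summarized as perplexity.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
next_tokens = ids[0, 1:]
target_log_probs = log_probs[torch.arange(next_tokens.shape[0]), next_tokens][n_source - 1:]
print(torch.exp(-target_log_probs.mean()).item())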
- class unitxt.metrics.PrecisionBinary(data_classification_policy: List[str] = None, main_score: str = 'precision_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1Binary
- class unitxt.metrics.PrecisionMacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'precision_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1MultiLabel
- class unitxt.metrics.PrecisionMicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'precision_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1MultiLabel
- class unitxt.metrics.PredictionLength(data_classification_policy: List[str] = None, main_score: str = 'prediction_length', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['prediction_length']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
Returns the length of the prediction.
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['prediction_length']}¶
- class unitxt.metrics.RandomForestMetricsEnsemble(data_classification_policy: List[str] = None, main_score: str = 'ensemble_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}, reference_field: str = 'references', prediction_field: str = 'prediction', metrics: List[unitxt.metrics.Metric | str] = __required__, weights: List[float] = None)[source]¶
Bases:
MetricsEnsemble
This class extends the MetricsEnsemble base class and leverages a pre-trained scikit-learn Random Forest classification model to combine and aggregate scores from multiple judges.
- load_weights method:
Loads model weights from a dictionary representation of a random forest classifier.
- ensemble method:
Decodes the RandomForestClassifier object and predicts a score based on the given instance.
- class unitxt.metrics.RecallBinary(data_classification_policy: List[str] = None, main_score: str = 'recall_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1Binary
- class unitxt.metrics.RecallMacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'recall_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1MultiLabel
- class unitxt.metrics.RecallMicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'recall_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
F1MultiLabel
- class unitxt.metrics.ReductionInstanceMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = __required__)[source]¶
Bases:
MapReduceMetric
[PredictionType
,IntermediateType
],Generic
[PredictionType
,IntermediateType
]
- class unitxt.metrics.RegardMetric(data_classification_policy: List[str] = None, main_score: str = 'regard', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['transformers', 'torch', 'tqdm'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, model_name: str = 'sasha/regardv3', batch_size: int = 32)[source]¶
Bases:
GlobalMetric
- prediction_type: Any | str = typing.Any¶
- class unitxt.metrics.RelaxedCorrectness(data_classification_policy: List[str] = None, main_score: str = 'relaxed_overall', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
- prediction_type¶
alias of
str
- class unitxt.metrics.RemoteMetric(data_classification_policy: List[str] = ['public', 'proprietary'], main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, endpoint: str = __required__, metric_name: str = __required__, api_key: str = None)[source]¶
Bases:
StreamOperator
,Metric
A metric that runs another metric remotely.
- Parameters:
main_score – the score updated by this metric.
endpoint – the remote host that supports the remote metric execution.
metric_name – the name of the metric that is executed remotely.
api_key – optional, passed to the remote metric with the input; allows secure authentication.
- data_classification_policy: List[str] = ['public', 'proprietary']¶
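A hedged instantiation sketch; only the field names come from the signature above, while the endpoint URL and remote metric name are placeholders.
from unitxt.metrics import RemoteMetric

remote_metric = RemoteMetric(
    main_score="f1",                                  # score updated by this metric
    endpoint="https://metrics.example.com/compute",   # hypothetical remote host
    metric_name="metrics.f1_micro",                   # hypothetical remote metric id
    api_key=None,                                     # optional authentication token
)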
- class unitxt.metrics.RerankRecall(data_classification_policy: List[str] = None, main_score: str = 'recall_at_5', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = None, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['pandas', 'pytrec_eval'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, query_id_field: str = 'query_id', passage_id_field: str = 'passage_id', at_k: List[int] = [1, 2, 5])[source]¶
Bases:
GlobalMetric
RerankRecall: measures the quality of reranking with respect to ground truth ranking scores.
This metric measures ranking performance across a dataset. The references for a query have a score of 1 for the gold passage and 0 for all other passages. The model returns scores in [0,1] for each (passage, query) pair. This metric measures recall at k by testing that the predicted score for the gold (passage, query) pair is at least the k’th highest among all passages for that query. A query receives 1 if so, and 0 if not. The 1’s and 0’s are averaged across the dataset.
query_id_field selects the field containing the query id for an instance. passage_id_field selects the field containing the passage id for an instance. at_k selects the value of k used to compute recall.
- at_k: List[int] = [1, 2, 5]¶
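A compact restatement of the recall@k rule described above for a single query; the real metric uses pytrec_eval and the query_id/passage_id fields rather than this toy helper.
def recall_at_k(passage_scores: dict, gold_passage_id: str, k: int) -> int:
    """passage_scores maps passage_id -> predicted score in [0, 1]."""
    top_k = sorted(passage_scores, key=passage_scores.get, reverse=True)[:k]
    return 1 if gold_passage_id in top_k else 0

scores = {"p1": 0.2, "p2": 0.9, "p3": 0.7}   # one query; "p3" is the gold passage
print(recall_at_k(scores, "p3", k=1))        # 0: gold passage is not ranked first
print(recall_at_k(scores, "p3", k=2))        # 1: gold passage is within the top 2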
- class unitxt.metrics.RetrievalAtK(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', k_list: List[int] = __required__)[source]¶
Bases:
RetrievalMetric
- class unitxt.metrics.RetrievalMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- prediction_type¶
alias of
Union
[List
[str
],List
[int
]]
- class unitxt.metrics.Reward(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'reward_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['transformers'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = __required__, batch_size: int = 32)[source]¶
Bases:
MapReduceMetric
[str
,float
],TorchDeviceMixin
- class unitxt.metrics.RiskType(value)[source]¶
Bases:
str
,Enum
Risk type for the Granite Guardian models.
- class unitxt.metrics.RocAuc(data_classification_policy: List[str] = None, main_score: str = 'roc_auc', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
- prediction_type¶
alias of
float
- class unitxt.metrics.Rouge(data_classification_policy: List[str] = None, main_score: str = 'rougeL', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], _requirements_list: List[str] = ['nltk', 'rouge_score'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}, reference_field: str = 'references', prediction_field: str = 'prediction', rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], sent_split_newline: bool = True)[source]¶
Bases:
InstanceMetric
,NLTKMixin
- ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}¶
- rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶
- class unitxt.metrics.RougeHF(data_classification_policy: List[str] = None, main_score: str = 'rougeL', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], _requirements_list: List[str] = ['nltk', 'rouge_score'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}, reference_field: str = 'references', prediction_field: str = 'prediction', hf_metric_name: str = 'rouge', hf_metric_fields: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], hf_compute_args: dict = {}, rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], sent_split_newline: bool = True)[source]¶
Bases:
NLTKMixin
,HuggingfaceInstanceMetric
- ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶
- hf_metric_fields: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}¶
- rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶
- class unitxt.metrics.SQLExecutionAccuracy(data_classification_policy: List[str] = None, main_score: str = 'non_empty_execution_accuracy', prediction_type: Any | str = 'Any', single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_result', 'gold_sql_runtime', 'predicted_sql_runtime'], _requirements_list: List[str] | Dict[str, str] = ['sqlglot', 'func_timeout'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_result', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_result', 'gold_sql_runtime', 'predicted_sql_runtime']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_result', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}¶
- class unitxt.metrics.SQLNonExecutionAccuracy(data_classification_policy: List[str] = None, main_score: str = 'sqlglot_equivalence', prediction_type: Any | str = 'Any', single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match'], _requirements_list: List[str] | Dict[str, str] = ['sqlglot', 'sqlparse'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match']}¶
- class unitxt.metrics.SafetyMetric(data_classification_policy: List[str] = None, device: Union[str, NoneType] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = ['safety'], main_score: str = 'safety', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reward_name: str = 'OpenAssistant/reward-model-deberta-v3-large-v2', batch_size: int = 10, critical_threshold: int = -5, high_threshold: int = -4, medium_threshold: int = -3)[source]¶
Bases:
MapReduceMetric
[str
,Tuple
[float
,str
]],TorchDeviceMixin
The Safety Metric from the paper Unveiling Safety Vulnerabilities of Large Language Models.
As detailed in the paper Unveiling Safety Vulnerabilities of Large Language Models, automatically evaluating the potential harm caused by LLMs requires a harmlessness metric: the model under test is prompted with each question in the dataset, and the corresponding responses are evaluated using a metric that considers both the input and the output. The paper uses the “OpenAssistant/reward-model-deberta-v3-large-v2” reward model, though other models, such as “sileod/deberta-v3-large-tasksource-rlhf-reward-model”, can also be employed.
- ci_score_names: List[str] = ['safety']¶
- prediction_type¶
alias of
str
- class unitxt.metrics.SentenceBert(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, main_score: str = 'sbert_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['sentence_transformers'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = __required__, batch_size: int = 32)[source]¶
Bases:
MapReduceMetric
[str
,float
],TorchDeviceMixin
- class unitxt.metrics.Spearmanr(data_classification_policy: List[str] = None, main_score: str = 'spearmanr', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'spearmanr', hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶
Bases:
HuggingfaceMetric
- prediction_type¶
alias of
float
- class unitxt.metrics.Squad(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Any | str = typing.Dict[str, typing.Any], single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'squad', hf_main_score: str = None, scale: float = 100.0, scaled_fields: list = ['f1', 'exact_match'], hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶
Bases:
HuggingfaceMetric
- prediction_type¶
alias of
Dict
[str
,Any
]
- scaled_fields: list = ['f1', 'exact_match']¶
- class unitxt.metrics.StringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['string_containment']¶
- prediction_type: Any | str = typing.Any¶
- reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}¶
- class unitxt.metrics.StringContainmentRatio(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}, reference_field: str = 'references', prediction_field: str = 'prediction', field: str = None)[source]¶
Bases:
InstanceMetric
Metric that returns the ratio of values from a specific field contained in the prediction.
- field¶
The field from the task_data that contains the values to be checked for containment.
- Type:
str
Example task that contains this metric:
Task(
    input_fields={"question": str},
    reference_fields={"entities": str},
    prediction_type=str,
    metrics=["string_containment_ratio[field=entities]"],
)
- ci_scores: List[str] = ['string_containment']¶
- prediction_type: Any | str = typing.Any¶
- reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}¶
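An illustrative computation of the ratio described above (not the metric's own code): the fraction of values from the chosen field that appear verbatim in the prediction.
entities = ["Paris", "France", "Eiffel Tower"]          # values of the chosen field
prediction = "The Eiffel Tower is located in Paris."
ratio = sum(value in prediction for value in entities) / len(entities)
print(ratio)  # 2 of the 3 entities are contained in the prediction -> 0.666...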
- class unitxt.metrics.TokenOverlap(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['f1', 'precision', 'recall'], _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['f1', 'precision', 'recall']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['f1', 'precision', 'recall']¶
- prediction_type¶
alias of
str
- reduction_map: Dict[str, List[str]] = {'mean': ['f1', 'precision', 'recall']}¶
- class unitxt.metrics.UnsortedListExactMatch(data_classification_policy: List[str] = None, main_score: str = 'unsorted_list_exact_match', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = ['unsorted_list_exact_match'], _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['unsorted_list_exact_match']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶
Bases:
InstanceMetric
- ci_scores: List[str] = ['unsorted_list_exact_match']¶
- reduction_map: Dict[str, List[str]] = {'mean': ['unsorted_list_exact_match']}¶
- class unitxt.metrics.UpdateStream(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, update: dict = __required__)[source]¶
Bases:
InstanceOperator
- class unitxt.metrics.WebsrcSquadF1(data_classification_policy: List[str] = None, main_score: str = 'websrc_squad_f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
- DOMAINS = ['auto', 'book', 'camera', 'game', 'jobs', 'movie', 'phone', 'restaurant', 'sports', 'university', 'hotel']¶
- prediction_type: Any | str = typing.Any¶
- class unitxt.metrics.WeightedWinRateCorrelation(data_classification_policy: List[str] = None, main_score: str = 'spearman_corr', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶
Bases:
GlobalMetric
- class unitxt.metrics.Wer(data_classification_policy: List[str] = None, main_score: str = 'wer', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_level: float = 0.95, ci_scores: List[str] = None, _requirements_list: List[str] = ['jiwer'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'wer', hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶
Bases:
HuggingfaceMetric
- prediction_type¶
alias of
str
- unitxt.metrics.interpret_effect_size(x: float)[source]¶
Return a string rule-of-thumb interpretation of an effect size value, as defined by Cohen/Sawilowsky.
See Effect size; Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences; and Sawilowsky, S. (2009). “New effect size rules of thumb”. Journal of Modern Applied Statistical Methods. 8 (2): 467-474. The value is interpreted as:
- essentially 0 if |x| < 0.01
- very small if 0.01 <= |x| < 0.2
- small difference if 0.2 <= |x| < 0.5
- a medium difference if 0.5 <= |x| < 0.8
- a large difference if 0.8 <= |x| < 1.2
- a very large difference if 1.2 <= |x| < 2.0
- a huge difference if 2.0 <= |x|
- Parameters:
x – float effect size value
- Returns:
string interpretation
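An illustrative reimplementation of the thresholds listed above; the library function returns a similar rule-of-thumb string.
def interpret_effect_size_sketch(x: float) -> str:
    thresholds = [
        (0.01, "essentially 0"),
        (0.2, "very small"),
        (0.5, "small difference"),
        (0.8, "a medium difference"),
        (1.2, "a large difference"),
        (2.0, "a very large difference"),
    ]
    for bound, label in thresholds:
        if abs(x) < bound:
            return label
    return "a huge difference"

print(interpret_effect_size_sketch(0.6))  # "a medium difference"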
- unitxt.metrics.mean_subgroup_score(subgroup_scores_dict: Dict[str, List], subgroup_types: List[str])[source]¶
Return the mean instance score for a subset (possibly a single type) of variants (not a comparison).
- Parameters:
subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
subgroup_types – the keys (subgroup types) for which the average will be computed.
- Returns:
float score
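An illustrative call matching the parameter description above.
from unitxt.metrics import mean_subgroup_score

subgroup_scores = {"original": [1.0, 0.0, 1.0], "paraphrase": [0.0, 0.0, 1.0]}
# Mean over the instance scores of the "original" subgroup only: (1 + 0 + 1) / 3
print(mean_subgroup_score(subgroup_scores, subgroup_types=["original"]))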
- unitxt.metrics.normalize_answer(s)[source]¶
Lower text and remove punctuation, articles and extra whitespace.
- unitxt.metrics.normalized_cohens_h(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str], interpret=False)[source]¶
Cohen’s h effect size between two proportions, normalized to interval [-1,1].
Allows for a change-type metric when the baseline is 0 (where percentage change, and thus PDR, is undefined). See Cohen’s h.
Cohen’s h effect size metric between two proportions p2 and p1 is 2 * (arcsin(sqrt(p2)) - arcsin(sqrt(p1))). h lies in [-pi, pi], with +/- pi representing the largest increase/decrease (p1=0, p2=1), or (p1=1, p2=0); h=0 is no change. Unlike percentage change, h is defined even if the baseline (p1) is 0. The scores are assumed to be in [0,1], either continuous or binary, hence taking the average of a group of scores yields a proportion. The function calculates the change in the average of the comparison scores relative to the average of the control (baseline) scores, and rescales this from [-pi, pi] to [-1, 1] for clarity, where +/- 1 are the most extreme changes and 0 is no change.
Interpretation: the original unscaled Cohen’s h can be interpreted according to the function interpret_effect_size.
Thus, the rule for interpreting the normalized value is to use the same thresholds divided by pi:
- essentially 0 if |norm h| < 0.0031831
- very small if 0.0031831 <= |norm h| < 0.06366198
- small difference if 0.06366198 <= |norm h| < 0.15915494
- a medium difference if 0.15915494 <= |norm h| < 0.25464791
- a large difference if 0.25464791 <= |norm h| < 0.38197186
- a very large difference if 0.38197186 <= |norm h| < 0.63661977
- a huge difference if 0.63661977 <= |norm h|
- Parameters:
subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
interpret – boolean, whether to interpret the significance of the score or not
- Returns:
float score between -1 and 1, and a string interpretation if interpret=True
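A worked restatement of the formula above (not the library function itself): compute h from the two proportions and divide by pi to land in [-1, 1].
import math

def normalized_h_sketch(control_scores, comparison_scores):
    p1 = sum(control_scores) / len(control_scores)        # baseline proportion
    p2 = sum(comparison_scores) / len(comparison_scores)  # comparison proportion
    h = 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))
    return h / math.pi

# A baseline of 0 is still defined, unlike percentage change / PDR:
print(normalized_h_sketch([0, 0, 0], [1, 1, 1]))  # 1.0, the largest possible increase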
- unitxt.metrics.normalized_hedges_g(subgroup_scores_dict: Dict[str, List[float]], control_subgroup_types: List[str], comparison_subgroup_types: List[str], interpret=False)[source]¶
Hedges’ g effect size between the means of two samples, normalized to the interval [-1,1]. Better than Cohen’s d for small sample sizes.
Takes into account the variances within the samples, not just the means.
- Parameters:
subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
interpret – boolean, whether to interpret the significance of the score or not
- Returns:
float score between -1 and 1, and a string interpretation if interpret=True
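A sketch of the standard (textbook) Hedges’ g with the small-sample correction; the library function additionally rescales the value to [-1, 1], which is not reproduced here.
import statistics

def hedges_g_sketch(control_scores, comparison_scores):
    n1, n2 = len(control_scores), len(comparison_scores)
    m1, m2 = statistics.mean(control_scores), statistics.mean(comparison_scores)
    v1, v2 = statistics.variance(control_scores), statistics.variance(comparison_scores)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample bias correction
    return correction * (m2 - m1) / pooled_sd

print(hedges_g_sketch([0.2, 0.4, 0.3], [0.6, 0.7, 0.5]))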
- unitxt.metrics.performance_drop_rate(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str])[source]¶
Percentage decrease of mean performance on test elements relative to that on a baseline (control).
from https://arxiv.org/pdf/2306.04528.pdf.
- Parameters:
subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
- Returns:
numeric PDR metric. If there is only one element (no test set), or the first (control) mean is 0 (so percentage change is undefined), return NaN; otherwise, calculate PDR.
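A minimal restatement of PDR as described above; the library function works on the subgroup_scores_dict structure and may handle edge cases differently.
def pdr_sketch(control_scores, comparison_scores):
    control_mean = sum(control_scores) / len(control_scores)
    if not comparison_scores or control_mean == 0:
        return float("nan")  # undefined, as noted in the Returns description
    comparison_mean = sum(comparison_scores) / len(comparison_scores)
    return 1 - comparison_mean / control_mean

print(pdr_sketch([1.0, 1.0, 0.5], [0.5, 0.5, 0.25]))  # 0.5: mean performance dropped by half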
- unitxt.metrics.validate_subgroup_types(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str])[source]¶
Validate a dict of subgroup type instance score lists, and subgroup type lists.
- Parameters:
subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
- Returns:
dict with all NaN scores removed; control_subgroup_types and comparison_subgroup_types will have non-unique elements removed