unitxt.fusion module

class unitxt.fusion.BaseFusion(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, origins: List[SourceOperator] | Dict[str, SourceOperator], include_splits: List[str] | None = None)

Bases: SourceOperator

BaseFusion operator that combines multiple multistreams into one.

Parameters:
  • origins – Dict of named SourceOperator objects, or a list of them. Each operator is specified together with its input, so it can generate a MultiStream on its own.

  • include_splits – List of splits to include from each input MultiStream. If None, all splits are included.
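The two parameters above can be illustrated with a minimal, library-free sketch. It does not import unitxt and is not the library's implementation: a MultiStream is modeled as a plain dict mapping split names to lists of instances, and the helper names `named_origins` and `select_splits` are hypothetical. Keying list origins by their position is an assumption made only for this illustration.

```python
from typing import Dict, List, Optional, Union

# Stand-in for a MultiStream: split name -> list of instances (assumption).
MultiStreamLike = Dict[str, List[dict]]


def named_origins(
    origins: Union[Dict[str, MultiStreamLike], List[MultiStreamLike]],
) -> Dict[str, MultiStreamLike]:
    """Give every origin a name; list origins are keyed by position (assumption)."""
    if isinstance(origins, dict):
        return dict(origins)
    return {str(index): origin for index, origin in enumerate(origins)}


def select_splits(
    multi_stream: MultiStreamLike,
    include_splits: Optional[List[str]] = None,
) -> MultiStreamLike:
    """Keep only the requested splits; None keeps every split."""
    if include_splits is None:
        return dict(multi_stream)
    return {
        split: instances
        for split, instances in multi_stream.items()
        if split in include_splits
    }


origin = {"train": [{"x": 1}], "test": [{"x": 2}]}
print(select_splits(origin, include_splits=["train"]))
print(named_origins([origin]).keys())
```

Passing `include_splits=None` leaves the origin untouched, which mirrors the documented default.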

class unitxt.fusion.FixedFusion(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, origins: List[SourceOperator] | Dict[str, SourceOperator], include_splits: List[str] | None = None, max_instances_per_origin_split: int | None = None)

Bases: BaseFusion

FixedFusion operator that combines multiple multistreams into one, limiting the number of instances taken from each split of each input multistream.

Parameters:
  • origins – Dict of named SourceOperator objects (each to yield a MultiStream), or a list thereof

  • include_splits – List of splits (stream names) to include, over all input multistreams. If None, all splits are included.

  • max_instances_per_origin_split – Number of instances to take from each input split of each input multistream. If None, all instances of each split (that is specified in include_splits) are included in the result.
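The per-origin, per-split cap described above can be sketched without the library. This is an illustration under stated assumptions, not unitxt's code: origins are plain dicts of split name to list, the function name `fixed_fuse` is hypothetical, and appending origins one after another within each split is an assumed ordering.

```python
from itertools import islice
from typing import Dict, List, Optional

MultiStreamLike = Dict[str, List[object]]  # stand-in for a MultiStream


def fixed_fuse(
    origins: Dict[str, MultiStreamLike],
    include_splits: Optional[List[str]] = None,
    max_instances_per_origin_split: Optional[int] = None,
) -> MultiStreamLike:
    """Concatenate origins split by split, capping what each origin contributes."""
    fused: MultiStreamLike = {}
    for origin in origins.values():
        for split, instances in origin.items():
            if include_splits is not None and split not in include_splits:
                continue
            # islice(..., None) takes everything, matching the "None" default.
            capped = islice(instances, max_instances_per_origin_split)
            fused.setdefault(split, []).extend(capped)
    return fused


origins = {
    "a": {"train": [1, 2, 3], "test": [4]},
    "b": {"train": [5, 6, 7]},
}
print(fixed_fuse(origins, include_splits=["train"], max_instances_per_origin_split=2))
# prints {'train': [1, 2, 5, 6]}
```

Note that the cap applies to each origin separately, so a fused split can hold up to the cap multiplied by the number of origins that provide that split.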

class unitxt.fusion.WeightedFusion(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, origins: Dict[str, SourceOperator] | List[SourceOperator] = None, include_splits: List[str] | None = None, weights: Dict[str, float | int] | List[int | float] = None, max_total_examples: int = None, ignore_origin_groups: List[str] = ['unitxt'])

Bases: BaseFusion

WeightedFusion operator that combines multiple MultiStreams into one, drawing instances from each origin according to its weight.

Parameters:
  • origins – Dict of named SourceOperator objects (each to yield a MultiStream), or a list thereof

  • weights – Dict of named weights for each origin, or a list thereof

  • max_total_examples – Total number of instances to return per returned split. If None, all instances are returned

ignore_origin_groups: List[str] = ['unitxt']
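One way to picture how weights and max_total_examples interact is the library-free sketch below for a single split. It is an assumption-laden illustration, not the WeightedFusion implementation: the function name `weighted_fuse_split`, the `seed` argument, and the rule of dropping an origin once it is exhausted are all hypothetical choices made for this example. Weights are assumed to be positive.

```python
import random
from typing import Dict, Iterable, List, Optional


def weighted_fuse_split(
    streams: Dict[str, Iterable[object]],
    weights: Dict[str, float],
    max_total_examples: Optional[int] = None,
    seed: int = 0,
) -> List[object]:
    """Interleave one split of several origins, picking the next origin by weight."""
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    fused: List[object] = []
    while iterators and (
        max_total_examples is None or len(fused) < max_total_examples
    ):
        names = list(iterators)
        # Choose which origin supplies the next instance, proportionally to weight.
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        try:
            fused.append(next(iterators[name]))
        except StopIteration:
            # Assumption: an exhausted origin simply stops being a candidate.
            del iterators[name]
    return fused


train = {"a": ["a1", "a2", "a3"], "b": ["b1"]}
print(weighted_fuse_split(train, {"a": 3, "b": 1}, max_total_examples=3))
```

With `max_total_examples=None` the loop runs until every origin is exhausted, so all instances are returned, as documented for that parameter.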