unitxt.splitters module

class unitxt.splitters.AssignDemosToInstance(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False, sampling_seed: int | NoneType = None)

Bases: InstanceOperator

class unitxt.splitters.CloseTextSampler(data_classification_policy: List[str] = None, field: str = __required__)

Bases: Sampler

Selects the instances whose text most closely matches the given instance.

The comparison is based on a given field of the instance.
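The selection described above can be sketched in plain Python. This is a minimal illustration, not the unitxt implementation: the helper closest_by_field, the flat-dictionary instances, and the use of difflib.SequenceMatcher as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List


def closest_by_field(
    instance: Dict[str, Any],
    pool: List[Dict[str, Any]],
    field: str,
    sample_size: int,
) -> List[Dict[str, Any]]:
    """Return the `sample_size` pool instances whose `field` text is most similar.

    Hypothetical helper: the similarity measure used by the library is not
    stated in this page, so difflib's ratio stands in for it here.
    """
    target = str(instance[field])

    def similarity(candidate: Dict[str, Any]) -> float:
        # ratio() is 1.0 for identical strings and lower for less similar ones.
        return SequenceMatcher(None, target, str(candidate[field])).ratio()

    # sorted() is stable, so candidates with equal scores keep their pool order.
    return sorted(pool, key=similarity, reverse=True)[:sample_size]


pool = [
    {"question": "What is the capital of France?"},
    {"question": "How many legs does a spider have?"},
    {"question": "What is the capital of Spain?"},
]
query = {"question": "What is the capital of Italy?"}
demos = closest_by_field(query, pool, field="question", sample_size=2)
```

With this pool, the two "capital of" questions are expected to rank above the unrelated one, since they share most of their characters with the query.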

class unitxt.splitters.ConstantSizeSample(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False, sampling_seed: int | NoneType = None, sample_size: int = __required__)

Bases: AssignDemosToInstance

class unitxt.splitters.DiverseLabelsSampler(data_classification_policy: List[str] = None, choices: str = 'choices', labels: str = 'labels', include_empty_label: bool = True)

Bases: Sampler

Selects a balanced sample of instances based on an output field.

(Used for selecting demonstrations for in-context learning.)

The field must contain a list of values, e.g. [‘dog’], [‘cat’], [‘dog’, ‘cat’, ‘cow’]. The balancing is done so that each value or combination of values appears as equally often as possible in the sample.

The choices param is required and determines which values should be considered.

Example

If choices is [‘dog’, ‘cat’], then the following combinations are considered: [‘’], [‘cat’], [‘dog’], [‘dog’, ‘cat’].

If the instance contains a value that is not in the ‘choices’ param, that value is ignored. For example, if choices is [‘dog’, ‘cat’] and the instance field is [‘dog’, ‘cat’, ‘cow’], then ‘cow’ is ignored and the instance is treated as [‘dog’, ‘cat’].

Parameters:

sample_size (int) – number of samples to extract
choices (str) – name of input field that contains the list of values to balance on
labels (str) – name of output field with labels that must be balanced
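The balancing idea can be illustrated with a small, self-contained sketch. It is not the library's code: label_key and balanced_sample are hypothetical helpers, instances are assumed to be flat dictionaries with a "labels" list, and a deterministic round-robin stands in for whatever random selection the sampler actually performs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def label_key(labels: Sequence[str], choices: Sequence[str]) -> Tuple[str, ...]:
    """Drop values that are not in `choices` and return an order-independent key."""
    return tuple(sorted(set(labels) & set(choices)))


def balanced_sample(
    instances: List[Dict], choices: Sequence[str], sample_size: int
) -> List[Dict]:
    """Pick instances round-robin across label combinations (assumed strategy)."""
    buckets: Dict[Tuple[str, ...], List[Dict]] = defaultdict(list)
    for inst in instances:
        buckets[label_key(inst["labels"], choices)].append(inst)

    picked: List[Dict] = []
    # Take one instance from each non-empty bucket in turn until enough are picked.
    while len(picked) < sample_size and any(buckets.values()):
        for key in sorted(buckets):
            if buckets[key] and len(picked) < sample_size:
                picked.append(buckets[key].pop(0))
    return picked


choices = ["dog", "cat"]
instances = [
    {"labels": ["dog"]},
    {"labels": ["dog"]},
    {"labels": ["dog"]},
    {"labels": ["cat"]},
    {"labels": ["dog", "cat", "cow"]},  # 'cow' is not a choice, so it is ignored
]
sample = balanced_sample(instances, choices, sample_size=3)
```

Although ‘dog’ alone dominates the input, the three sampled instances cover three different label combinations, which is the kind of balance the description above aims for.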

class unitxt.splitters.FixedIndicesSampler(data_classification_policy: List[str] = None, indices: List[int] = __required__)

Bases: Sampler

Selects a fixed set of samples based on a list of indices.

class unitxt.splitters.RandomSampler(data_classification_policy: List[str] = None)

Bases: Sampler

Selects a random sample of instances.

class unitxt.splitters.RandomSizeSample(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False, sampling_seed: int | NoneType = None, sample_sizes: List[int] = __required__)

Bases: AssignDemosToInstance

class unitxt.splitters.RenameSplits(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, mapper: Dict[str, str] = __required__)

Bases: Splitter

class unitxt.splitters.Sampler(data_classification_policy: List[str] = None)

Bases: Artifact

class unitxt.splitters.SeparateSplit(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, from_split: str = __required__, to_split_names: List[str] = __required__, to_split_sizes: List[int] = __required__, remove_targets_from_source_split: bool = True)

Bases: Splitter

Separates a split (e.g. train) into several splits (e.g. train1, train2).

to_split_sizes must give the size of every split except the last. If no size is given for the last split, it includes all the examples not allocated to any other split.
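The size rule can be sketched as follows. This is only an illustration under stated assumptions: separate is a hypothetical helper working on a plain list, it cuts consecutive chunks (the order in which the library allocates examples is not stated here), and it does not model remove_targets_from_source_split.

```python
from typing import Dict, List, Sequence


def separate(items: Sequence, names: List[str], sizes: List[int]) -> Dict[str, list]:
    """Cut `items` into consecutive named chunks; a missing last size means 'the rest'."""
    if len(sizes) not in (len(names), len(names) - 1):
        raise ValueError("give one size per split, optionally omitting the last")
    result: Dict[str, list] = {}
    start = 0
    for i, name in enumerate(names):
        # The last split takes everything that is left when its size is omitted.
        end = start + sizes[i] if i < len(sizes) else len(items)
        result[name] = list(items[start:end])
        start = end
    return result


# Ten examples, one explicit size: 'train1' gets 7 and 'train2' gets the remaining 3.
splits = separate(list(range(10)), ["train1", "train2"], [7])
```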

class unitxt.splitters.SliceSplit(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, slices: Dict[str, str] = __required__)

Bases: Splitter

class unitxt.splitters.SplitRandomMix(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, mix: Dict[str, str] = __required__)

Bases: Splitter

Splits a multistream into new streams (splits) whose names, source input streams, and numbers of instances are specified by arg ‘mix’.

The keys of arg ‘mix’ are the names of the new streams; the values are of the form ‘name-of-source-stream[percentage-of-source-stream]’. Each input instance, of any input stream, is selected exactly once for inclusion in any of the output streams.

Examples:

When processing a multistream made of two streams named ‘train’ and ‘test’ with

SplitRandomMix(mix={"train": "train[99%]", "validation": "train[1%]", "test": "test"})

the output is a multistream whose three streams are named ‘train’, ‘validation’, and ‘test’. Output stream ‘train’ is made of a randomly selected 99% of the instances of input stream ‘train’, output stream ‘validation’ is made of the remaining 1% of the instances of input ‘train’, and output stream ‘test’ is made of the whole of input stream ‘test’.

When processing the above input multistream with

SplitRandomMix(mix={"train": "train[50%]+test[0.1]", "validation": "train[50%]+test[0.2]", "test": "test[0.7]"})

the output is a multistream whose three streams are named ‘train’, ‘validation’, and ‘test’. Output stream ‘train’ is made of a randomly selected 50% of the instances of input stream ‘train’ plus a randomly selected 0.1 (i.e., 10%) of the instances of input stream ‘test’. Output stream ‘validation’ is made of the remaining 50% of the instances of input ‘train’ plus a randomly selected 0.2 (i.e., 20%) of the original instances of input ‘test’ that were not selected for output ‘train’. Output stream ‘test’ is made of the remaining instances of input ‘test’.
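The semantics of the second example can be reproduced with a short stand-alone sketch. It does not use unitxt and is not its implementation: parse_part and split_random_mix are hypothetical helpers, streams are plain lists, and the seeded shuffle, the rounding of counts, and the allocation of disjoint consecutive slices are assumptions chosen so that no input instance lands in two output streams.

```python
import random
import re
from typing import Dict, Tuple

# Matches 'test', 'train[99%]' or 'test[0.1]'.
PART = re.compile(r"^(\w+)(?:\[([\d.]+)(%?)\])?$")


def parse_part(part: str) -> Tuple[str, float]:
    """Turn one term of a mix value into (source_name, fraction_of_source)."""
    match = PART.match(part.strip())
    if match is None:
        raise ValueError(f"cannot parse {part!r}")
    name, number, percent = match.groups()
    if number is None:
        return name, 1.0  # no brackets: take the whole source stream
    return name, float(number) / 100 if percent else float(number)


def split_random_mix(
    streams: Dict[str, list], mix: Dict[str, str], seed: int = 0
) -> Dict[str, list]:
    rng = random.Random(seed)
    # Shuffle each source once; outputs then take disjoint consecutive slices of it.
    shuffled = {name: rng.sample(items, len(items)) for name, items in streams.items()}
    cursor = {name: 0 for name in streams}
    output: Dict[str, list] = {}
    for out_name, spec in mix.items():
        output[out_name] = []
        for part in spec.split("+"):
            source, fraction = parse_part(part)
            count = round(fraction * len(streams[source]))
            start = cursor[source]
            output[out_name].extend(shuffled[source][start:start + count])
            cursor[source] = start + count
    return output


streams = {"train": list(range(100)), "test": list(range(100, 200))}
mix = {
    "train": "train[50%]+test[0.1]",
    "validation": "train[50%]+test[0.2]",
    "test": "test[0.7]",
}
out = split_random_mix(streams, mix)
```

With 100 instances in each input stream, the output sizes are 60, 70, and 70, and every input instance appears in exactly one output stream.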

class unitxt.splitters.Splitter(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None)

Bases: MultiStreamOperator

unitxt.splitters.get_random_generator_based_on_instance(instance, local_seed=None)