unitxt.splitters module
- class unitxt.splitters.AssignDemosToInstance(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False)
Bases: InstanceOperator
- class unitxt.splitters.CloseTextSampler(data_classification_policy: List[str] = None, field: str = __required__)
Bases: Sampler
Selects the instances that are the closest textual match to the given instance.
The comparison is based on the instance field named by the field parameter.
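The selection described above can be illustrated with a small standalone sketch. It does not use unitxt and is not the library's implementation: the helper name closest_by_field, the sample_size argument, and the choice of difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List


def closest_by_field(
    pool: List[Dict[str, Any]],
    instance: Dict[str, Any],
    field: str,
    sample_size: int,
) -> List[Dict[str, Any]]:
    """Return the sample_size pool items whose `field` text is most similar to the instance's.

    Hypothetical helper: similarity is the difflib ratio, chosen only for illustration.
    """
    target = str(instance[field])

    def similarity(candidate: Dict[str, Any]) -> float:
        return SequenceMatcher(None, str(candidate[field]), target).ratio()

    # sorted() is stable, so candidates with equal similarity keep their pool order.
    return sorted(pool, key=similarity, reverse=True)[:sample_size]


pool = [
    {"question": "What is the capital of France?"},
    {"question": "How many legs does a spider have?"},
    {"question": "What is the capital of Italy?"},
]
query = {"question": "What is the capital of Spain?"}
demos = closest_by_field(pool, query, field="question", sample_size=2)
```

Here the two "capital" questions are picked as demonstrations because their text overlaps most with the query in the compared field.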
- class unitxt.splitters.ConstantSizeSample(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False, sample_size: int = __required__)
Bases: AssignDemosToInstance
- class unitxt.splitters.DiverseLabelsSampler(data_classification_policy: List[str] = None, choices: str = 'choices', labels: str = 'labels', include_empty_label: bool = True)
Bases: Sampler
Selects a balanced sample of instances based on an output field.
(Used for selecting demonstrations for in-context learning.)
The field must contain a list of values, e.g. ['dog'], ['cat'], or ['dog', 'cat', 'cow']. The balancing is done so that each value or combination of values appears as equally often as possible in the sample.
The choices param is required and determines which values should be considered.
Example
If choices is ['dog', 'cat'], then the following combinations are considered: [''], ['cat'], ['dog'], ['dog', 'cat'].
If the instance contains a value that is not in the choices param, that value is ignored. For example, if choices is ['dog', 'cat'] and the instance field is ['dog', 'cat', 'cow'], then 'cow' is ignored and the instance is treated as ['dog', 'cat'].
- Parameters:
sample_size (int) – number of samples to extract
choices (str) – name of input field that contains the list of values to balance on
labels (str) – name of output field with labels that must be balanced
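The balancing idea can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: the helper names label_key and balanced_sample are hypothetical, instances are assumed to be flat dicts with a "labels" list, the empty combination is represented as an empty tuple, and a deterministic round-robin stands in for whatever random selection the sampler actually performs. The include_empty_label option is not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def label_key(labels: List[str], choices: List[str]) -> Tuple[str, ...]:
    """Drop values that are not in choices; the sorted remainder identifies the combination."""
    return tuple(sorted(set(labels) & set(choices)))


def balanced_sample(pool: List[dict], choices: List[str], sample_size: int) -> List[dict]:
    # Group instances by label combination, e.g. (), ('cat',), ('cat', 'dog'), ('dog',).
    groups: Dict[Tuple[str, ...], List[dict]] = defaultdict(list)
    for instance in pool:
        groups[label_key(instance["labels"], choices)].append(instance)

    # Take one instance from each group in turn until the sample is full
    # or every group is exhausted.
    sample: List[dict] = []
    queues = [groups[key] for key in sorted(groups)]
    while len(sample) < sample_size and any(queues):
        for queue in queues:
            if queue and len(sample) < sample_size:
                sample.append(queue.pop(0))
    return sample


pool = [
    {"text": "a", "labels": ["dog"]},
    {"text": "b", "labels": ["dog"]},
    {"text": "c", "labels": ["dog"]},
    {"text": "d", "labels": ["cat"]},
    {"text": "e", "labels": ["dog", "cat", "cow"]},  # 'cow' is not a choice
    {"text": "f", "labels": []},
]
sample = balanced_sample(pool, choices=["dog", "cat"], sample_size=4)
```

Although 'dog' dominates the pool, the four selected instances cover four different label combinations, one each.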
- class unitxt.splitters.FixedIndicesSampler(data_classification_policy: List[str] = None, indices: List[int] = __required__)
Bases: Sampler
Selects a fixed set of samples based on a list of indices.
- class unitxt.splitters.RandomSampler(data_classification_policy: List[str] = None)
Bases: Sampler
Selects a random sample of instances.
- class unitxt.splitters.RandomSizeSample(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, from_field: str = __required__, to_field: str = __required__, sampler: unitxt.splitters.Sampler = __required__, skip_demoed_instances: bool = False, sample_sizes: List[int] = __required__)
Bases: AssignDemosToInstance
- class unitxt.splitters.RenameSplits(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, mapper: Dict[str, str] = __required__)
Bases: Splitter
- class unitxt.splitters.Sampler(data_classification_policy: List[str] = None)
Bases: Artifact
- class unitxt.splitters.SeparateSplit(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, from_split: str = __required__, to_split_names: List[str] = __required__, to_split_sizes: List[int] = __required__, remove_targets_from_source_split: bool = True)
Bases: Splitter
Separates a split (e.g. train) into several splits (e.g. train1, train2).
to_split_sizes must indicate the size of every split except the last. If no size is given for the last split, it includes all the examples not allocated to any other split.
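The sizing rule can be illustrated with a standalone sketch. It is not the library's implementation: the function separate_split is hypothetical, it reuses the parameter names to_split_names and to_split_sizes only for readability, and it assumes that the new splits are cut as consecutive slices of the source split. The remove_targets_from_source_split option is not modelled.

```python
from typing import Dict, List, Sequence


def separate_split(
    instances: Sequence[dict],
    to_split_names: List[str],
    to_split_sizes: List[int],
) -> Dict[str, List[dict]]:
    """Cut instances into consecutive named splits; a missing last size means "the rest"."""
    if len(to_split_sizes) not in (len(to_split_names), len(to_split_names) - 1):
        raise ValueError("give a size for every split, or for every split except the last")

    splits: Dict[str, List[dict]] = {}
    start = 0
    for i, name in enumerate(to_split_names):
        if i < len(to_split_sizes):
            end = start + to_split_sizes[i]
        else:
            end = len(instances)  # last split without a size takes the remainder
        splits[name] = list(instances[start:end])
        start = end
    return splits


train = [{"id": i} for i in range(10)]
parts = separate_split(train, ["train1", "train2"], [7])
```

With ten source instances and a single size of 7, train1 receives seven instances and train2 receives the remaining three.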
- class unitxt.splitters.SliceSplit(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, slices: Dict[str, str] = __required__)
Bases: Splitter
- class unitxt.splitters.SplitRandomMix(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, mix: Dict[str, str] = __required__)
Bases: Splitter
Splits a multistream into new streams (splits) whose names, source input streams, and numbers of instances are specified by the mix argument.
The keys of mix are the names of the new streams; the values have the form 'name-of-source-stream[percentage-of-source-stream]'. Each input instance, from any input stream, is selected exactly once for inclusion in one of the output streams.
Examples: When a multistream made of two streams named 'train' and 'test' is processed by SplitRandomMix(mix = { "train": "train[99%]", "validation": "train[1%]", "test": "test" }), the output is a multistream whose three streams are named 'train', 'validation', and 'test'. Output stream 'train' is made of a randomly selected 99% of the instances of input stream 'train', output stream 'validation' is made of the remaining 1% of the instances of input 'train', and output stream 'test' is made of the whole of input stream 'test'.
When the same input multistream is processed by SplitRandomMix(mix = { "train": "train[50%]+test[0.1]", "validation": "train[50%]+test[0.2]", "test": "test[0.7]" }), the output is again a multistream whose three streams are named 'train', 'validation', and 'test'. Output stream 'train' is made of a randomly selected 50% of the instances of input stream 'train' plus a randomly selected 0.1 (i.e., 10%) of the instances of input stream 'test'. Output stream 'validation' is made of the remaining 50% of the instances of input 'train' plus a randomly selected 0.2 (i.e., 20%) of the original instances of input 'test', drawn from those not selected for output 'train'. Output stream 'test' is made of the remaining instances of input 'test'.
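The mix syntax in the examples above can be mimicked with a short standalone sketch. It is an illustration under stated assumptions, not the library's implementation: parse_mix_value and split_random_mix are hypothetical helpers, streams are plain lists, each source is shuffled once with a seeded random.Random, and the number of instances taken is the rounded fraction of the source size.

```python
import random
import re
from typing import Dict, List, Tuple

# One piece of a mix value: a stream name, optionally followed by [N%] or [fraction].
PART = re.compile(r"^(?P<source>[^\[\]+]+)(\[(?P<amount>[0-9.]+)(?P<percent>%?)\])?$")


def parse_mix_value(value: str) -> List[Tuple[str, float]]:
    """Turn 'train[50%]+test[0.1]' into [('train', 0.5), ('test', 0.1)]."""
    parts = []
    for piece in value.split("+"):
        match = PART.match(piece.strip())
        if match is None:
            raise ValueError(f"cannot parse mix entry {piece!r}")
        if match["amount"] is None:
            fraction = 1.0  # bare stream name: take the whole stream
        else:
            fraction = float(match["amount"])
            if match["percent"]:
                fraction /= 100
        parts.append((match["source"], fraction))
    return parts


def split_random_mix(
    streams: Dict[str, list], mix: Dict[str, str], seed: int = 0
) -> Dict[str, list]:
    rng = random.Random(seed)
    # Shuffle each source once; output streams then take disjoint consecutive chunks,
    # so no input instance can land in two output streams.
    shuffled = {name: rng.sample(items, len(items)) for name, items in streams.items()}
    taken = {name: 0 for name in streams}
    output: Dict[str, list] = {}
    for new_name, value in mix.items():
        output[new_name] = []
        for source, fraction in parse_mix_value(value):
            total = len(shuffled[source])
            start = taken[source]
            end = min(start + round(fraction * total), total)
            output[new_name].extend(shuffled[source][start:end])
            taken[source] = end
    return output


streams = {"train": list(range(100)), "test": list(range(100, 110))}
mix = {
    "train": "train[50%]+test[0.1]",
    "validation": "train[50%]+test[0.2]",
    "test": "test[0.7]",
}
result = split_random_mix(streams, mix)
```

With 100 'train' and 10 'test' input instances, the outputs contain 51, 52, and 7 instances respectively, and no instance appears twice.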
- class unitxt.splitters.Splitter(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None)
Bases: MultiStreamOperator