unitxt.loaders module

This section describes unitxt loaders.

Loaders: Generators of Unitxt Multistreams from existing data sources

Unitxt is all about readily preparing any given data source for feeding into any given language model, and then post-processing the model's output to prepare it for any given evaluator.

Along that journey, the data advances in the form of a Unitxt Multistream, undergoing a sequential application of various off-the-shelf operators (i.e., operators picked from the Unitxt catalog) or operators easily implemented by inheriting. The journey starts with a Unitxt Loader producing a Multistream from the given data source. A loader, therefore, is the first item in any Unitxt Recipe.

The Unitxt catalog contains several loaders for the most popular data source formats. All these loaders inherit from Loader, so implementing a loader for a new type of data source is straightforward.
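The inheritance pattern described above can be sketched in plain Python. This is a conceptual illustration only: SketchLoader, its load method, and LoadTSVText are hypothetical names invented for this example and are not part of unitxt, whose real base class is unitxt.loaders.Loader. The sketch assumes that a loader's job is to return one lazy stream of instances per split name.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class SketchLoader(ABC):
    """Hypothetical stand-in for a loader base class; not unitxt.loaders.Loader."""

    @abstractmethod
    def load(self) -> Dict[str, Iterator[dict]]:
        """Return one lazy stream of instances per split name."""


class LoadTSVText(SketchLoader):
    """Hypothetical loader for tab-separated text held in memory."""

    def __init__(self, texts: Dict[str, str]):
        self.texts = texts  # split name -> raw TSV text with a header row

    def _rows(self, text: str) -> Iterator[dict]:
        # DictReader yields one mapping per data row, keyed by the header row.
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            yield dict(row)

    def load(self) -> Dict[str, Iterator[dict]]:
        # Generators keep each split lazy until something iterates over it.
        return {split: self._rows(text) for split, text in self.texts.items()}


loader = LoadTSVText({"train": "question\tanswer\n2+2\t4\n3+3\t6\n"})
streams = loader.load()
train_rows = list(streams["train"])
print(train_rows)
```

Supporting a new format in this model means writing only the subclass; the code that consumes the per-split streams stays unchanged.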

Loader operators in the Unitxt catalog:

LoadHF: loads from a Hugging Face dataset.
LoadCSV: loads from CSV (comma-separated values) files.
LoadFromKaggle: loads datasets from the kaggle.com community site.
LoadFromIBMCloud: loads a dataset from IBM Cloud.

class unitxt.loaders.LoadCSV(__tags__: typing.Dict[str, str] = {}, caching: bool = None, loader_limit: int | None = None, streaming: bool = True, files: typing.Dict[str, str], chunksize: int = 1000, sep: str = ',')

Bases: Loader

class unitxt.loaders.LoadFromIBMCloud(__tags__: typing.Dict[str, str] = {}, _requirements_list: typing.List[str] = ['ibm_boto3'], caching: bool = True, loader_limit: int = None, streaming: bool = False, endpoint_url_env: str, aws_access_key_id_env: str, aws_secret_access_key_env: str, bucket_name: str, data_dir: str = None, data_files: typing.Sequence[str] | typing.Mapping[str, str | typing.Sequence[str]])

Bases: Loader

class unitxt.loaders.LoadFromKaggle(__tags__: typing.Dict[str, str] = {}, _requirements_list: typing.List[str] = ['opendatasets'], caching: bool = None, loader_limit: int = None, streaming: bool = False, url: str)

Bases: Loader

class unitxt.loaders.LoadFromSklearn(__tags__: typing.Dict[str, str] = {}, _requirements_list: typing.List[str] = ['sklearn', 'pandas'], caching: bool = None, loader_limit: int = None, streaming: bool = False, dataset_name: str, splits: typing.List[str] = ['train', 'test'])

Bases: Loader

splits: List[str] = ['train', 'test']

class unitxt.loaders.LoadHF(__tags__: typing.Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = True, path: str, name: str | None = None, data_dir: str | None = None, split: str | None = None, data_files: str | typing.Sequence[str] | typing.Mapping[str, str | typing.Sequence[str]] | None = None, filtering_lambda: str | None = None, requirements_list: typing.List[str] = [])

Bases: Loader

class unitxt.loaders.Loader(__tags__: Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = False)

Bases: SourceOperator

exception unitxt.loaders.MissingKaggleCredentialsError

Bases: ValueError

class unitxt.loaders.MultipleSourceLoader(__tags__: typing.Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = False, sources: typing.List[unitxt.loaders.Loader])

Bases: Loader

Allow loading data from multiple sources.

Examples:

1) Loading the train split from the Hugging Face hub and the test set from a local file:

MultipleSourceLoader(sources=[LoadHF(path="public/data", split="train"), LoadCSV(files={"test": "mytest.csv"})])

2) Loading a test set combined from two files:

MultipleSourceLoader(sources=[LoadCSV(files={"test": "mytest1.csv"}), LoadCSV(files={"test": "mytest2.csv"})])
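One way to picture what combining sources means, especially when two sources supply the same split as in the second example, is to chain the streams that share a split name. The sketch below is a conceptual model in plain Python and not unitxt's implementation: combine_sources, the Source alias, and the in-memory data are hypothetical, and concatenating same-named splits in source order is an assumption about the intended behavior.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, List

# Each "source" is modeled as a mapping from split name to an iterable of
# instances. This is a conceptual stand-in, not a unitxt Loader.
Source = Dict[str, Iterable[dict]]


def combine_sources(sources: List[Source]) -> Dict[str, Iterator[dict]]:
    """Chain streams that share a split name, keeping source order."""
    grouped: Dict[str, List[Iterable[dict]]] = {}
    for source in sources:
        for split, instances in source.items():
            grouped.setdefault(split, []).append(instances)
    # chain.from_iterable stays lazy: nothing is read until iteration starts.
    return {split: chain.from_iterable(parts) for split, parts in grouped.items()}


# Hypothetical in-memory data standing in for a hub dataset and two files.
hub_like = {"train": [{"text": "a"}, {"text": "b"}]}
file_one = {"test": [{"text": "c"}]}
file_two = {"test": [{"text": "d"}]}

combined = combine_sources([hub_like, file_one, file_two])
result = {split: list(stream) for split, stream in combined.items()}
print(result)
```

In this model the train split comes from a single source, while the test split is the concatenation of the two file-like sources in the order they were listed.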