unitxt.loaders module
This section describes unitxt loaders.
Loaders: Generators of Unitxt Multistreams from existing data sources
Unitxt is all about readily preparing any given data source for feeding into any given language model and then post-processing the model's output to prepare it for any given evaluator.
Along that journey, the data advances in the form of a Unitxt Multistream, undergoing a sequential application of various off-the-shelf operators (i.e., picked from the Unitxt catalog) or operators easily implemented by inheritance. The journey starts with a Unitxt Loader producing a Multistream from the given data source. A loader is therefore the first item in any Unitxt Recipe.
The Unitxt catalog contains several loaders for the most popular data source formats. All these loaders inherit from Loader, so implementing a loader for a new type of data source is straightforward.
Loader operators in the Unitxt catalog:
- LoadHF: loads a dataset from Hugging Face.
- LoadCSV: loads from CSV (comma-separated values) files.
- LoadFromKaggle: loads datasets from the kaggle.com community site.
- LoadFromIBMCloud: loads a dataset from IBM Cloud.
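The flow described above can be illustrated with a minimal sketch in plain Python that does not use Unitxt at all. It assumes a multistream can be modeled as a mapping from split names to lazy streams of instance dicts; the names `toy_loader`, `uppercase_text`, and `apply_operators` are hypothetical and are not part of the Unitxt API.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Instance = Dict[str, object]
# Assumption for this sketch: a multistream maps split names to instance streams.
MultiStream = Dict[str, Iterable[Instance]]


def toy_loader() -> MultiStream:
    """Hypothetical loader: produces the initial multistream from an in-memory source."""
    source = {
        "train": [{"text": "hello"}, {"text": "world"}],
        "test": [{"text": "bye"}],
    }
    return {split: iter(rows) for split, rows in source.items()}


def uppercase_text(stream: Iterable[Instance]) -> Iterator[Instance]:
    """Hypothetical operator: lazily transforms each instance of one stream."""
    for instance in stream:
        yield {**instance, "text": str(instance["text"]).upper()}


def apply_operators(
    multistream: MultiStream,
    operators: List[Callable[[Iterable[Instance]], Iterator[Instance]]],
) -> MultiStream:
    """Apply each operator, in order, to every split of the multistream."""
    for op in operators:
        multistream = {split: op(stream) for split, stream in multistream.items()}
    return multistream


# The loader comes first; the operators follow sequentially.
result = apply_operators(toy_loader(), [uppercase_text])
materialized = {split: list(stream) for split, stream in result.items()}
print(materialized)
```

Nothing is computed until the streams are consumed, which is why the loader can sit at the head of a pipeline without reading the whole data source up front.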
- class unitxt.loaders.LoadCSV(__tags__: Dict[str, str] = {}, caching: bool = None, loader_limit: int | None = None, streaming: bool = True, files: Dict[str, str], chunksize: int = 1000, sep: str = ',')
Bases: Loader
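As a rough illustration of what the `files`, `sep`, and `loader_limit` parameters suggest, the sketch below reads CSV files into named splits using only the Python standard library. It is not the LoadCSV implementation: the helpers `stream_csv` and `load_csv_splits` are hypothetical, the `limit` argument stands in for `loader_limit` under the assumption that it caps the number of rows per split, and the chunked reading implied by `chunksize` is not modeled.

```python
import csv
import itertools
import os
import tempfile
from typing import Dict, Iterator, Optional


def stream_csv(
    path: str, sep: str = ",", limit: Optional[int] = None
) -> Iterator[Dict[str, str]]:
    """Lazily yield one dict per CSV row, stopping after `limit` rows if given."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=sep)
        # islice with stop=None reads all rows.
        yield from itertools.islice(reader, limit)


def load_csv_splits(
    files: Dict[str, str], sep: str = ",", limit: Optional[int] = None
) -> Dict[str, Iterator[Dict[str, str]]]:
    """Map each split name to a lazy stream of rows from its file."""
    return {split: stream_csv(path, sep, limit) for split, path in files.items()}


with tempfile.TemporaryDirectory() as tmp:
    # Write a small sample file so the sketch is self-contained.
    test_path = os.path.join(tmp, "mytest.csv")
    with open(test_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("question,answer\n1+1,2\n2+2,4\n3+3,6\n")

    splits = load_csv_splits({"test": test_path}, limit=2)
    rows = list(splits["test"])

print(rows)
```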
- class unitxt.loaders.LoadFromIBMCloud(__tags__: Dict[str, str] = {}, _requirements_list: List[str] = ['ibm_boto3'], caching: bool = True, loader_limit: int = None, streaming: bool = False, endpoint_url_env: str, aws_access_key_id_env: str, aws_secret_access_key_env: str, bucket_name: str, data_dir: str = None, data_files: Sequence[str] | Mapping[str, str | Sequence[str]])
Bases: Loader
- class unitxt.loaders.LoadFromKaggle(__tags__: Dict[str, str] = {}, _requirements_list: List[str] = ['opendatasets'], caching: bool = None, loader_limit: int = None, streaming: bool = False, url: str)
Bases: Loader
- class unitxt.loaders.LoadFromSklearn(__tags__: Dict[str, str] = {}, _requirements_list: List[str] = ['sklearn', 'pandas'], caching: bool = None, loader_limit: int = None, streaming: bool = False, dataset_name: str, splits: List[str] = ['train', 'test'])
Bases: Loader
- splits: List[str] = ['train', 'test']
- class unitxt.loaders.LoadHF(__tags__: Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = True, path: str, name: str | None = None, data_dir: str | None = None, split: str | None = None, data_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None, filtering_lambda: str | None = None, requirements_list: List[str] = [])
Bases: Loader
- class unitxt.loaders.Loader(__tags__: Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = False)
Bases: SourceOperator
- exception unitxt.loaders.MissingKaggleCredentialsError
Bases: ValueError
- class unitxt.loaders.MultipleSourceLoader(__tags__: Dict[str, str] = {}, caching: bool = None, loader_limit: int = None, streaming: bool = False, sources: List[Loader])
Bases: Loader
Allows loading data from multiple sources.
Examples:
1) Loading the train split from the Hugging Face hub and the test set from a local file:
MultipleSourceLoader(sources=[LoadHF(path="public/data", split="train"), LoadCSV(files={"test": "mytest.csv"})])
2) Loading a test set combined from two files:
MultipleSourceLoader(sources=[LoadCSV(files={"test": "mytest1.csv"}), LoadCSV(files={"test": "mytest2.csv"})])
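The idea of combining sources can be sketched in plain Python without Unitxt. The sketch assumes each source yields a mapping from split names to instance streams, and that same-named splits are concatenated in the order the sources are listed. That ordering is an assumption made here for illustration, not documented behavior, and `merge_sources` and the toy source functions are hypothetical names.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List

Instance = Dict[str, object]
Source = Callable[[], Dict[str, Iterable[Instance]]]


def merge_sources(sources: List[Source]) -> Dict[str, Iterator[Instance]]:
    """Collect splits from all sources, chaining streams that share a split name."""
    merged: Dict[str, List[Iterable[Instance]]] = {}
    for source in sources:
        for split, stream in source().items():
            merged.setdefault(split, []).append(stream)
    return {split: chain(*streams) for split, streams in merged.items()}


# Hypothetical stand-ins for a train source and two test files.
def train_source() -> Dict[str, Iterable[Instance]]:
    return {"train": iter([{"id": 1}, {"id": 2}])}


def test_source_a() -> Dict[str, Iterable[Instance]]:
    return {"test": iter([{"id": 10}])}


def test_source_b() -> Dict[str, Iterable[Instance]]:
    return {"test": iter([{"id": 11}])}


combined = merge_sources([train_source, test_source_a, test_source_b])
combined_rows = {split: list(stream) for split, stream in combined.items()}
print(combined_rows)
```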