unitxt.dataset module
- class unitxt.dataset.Dataset(cache_dir: str | None = None, dataset_name: str | None = None, config_name: str | None = None, hash: str | None = None, base_path: str | None = None, info: DatasetInfo | None = None, features: Features | None = None, token: bool | str | None = None, repo_id: str | None = None, data_files: str | list | dict | DataFilesDict | None = None, data_dir: str | None = None, storage_options: dict | None = None, writer_batch_size: int | None = None, **config_kwargs)
Bases: GeneratorBasedBuilder
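This builder is typically driven through datasets.load_dataset rather than instantiated directly. Below is a minimal sketch of that path, assuming the unitxt/data Hub entry point and a config string naming a card and a template; the specific card and template names are illustrative, not taken from this page:

    from datasets import load_dataset

    # Illustrative recipe string: a unitxt card plus a template, passed
    # as the builder's config name. Adjust card/template to your task.
    ds = load_dataset(
        "unitxt/data",
        "card=cards.wnli,template=templates.classification.multi_class.relation.default",
        split="train",
        trust_remote_code=True,
    )
    print(ds[0])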
- as_dataset(split: Split | None = None, run_post_process=True, verification_mode: VerificationMode | str | None = None, in_memory=False) → Dataset | DatasetDict
Return a Dataset for the specified split.
- Parameters:
split (datasets.Split) – Which subset of the data to return.
run_post_process (bool, defaults to True) – Whether to run post-processing dataset transforms and/or add indexes.
verification_mode (VerificationMode or str, defaults to BASIC_CHECKS) – Verification mode determining the checks to run on the downloaded/processed dataset information (checksums/size/splits/…).
in_memory (bool, defaults to False) – Whether to copy the data in-memory.
- Returns:
datasets.Dataset
- Example:

    from datasets import load_dataset_builder

    builder = load_dataset_builder('rotten_tomatoes')
    builder.download_and_prepare()
    ds = builder.as_dataset(split='train')
    print(ds)
    # prints:
    # Dataset({
    #     features: ['text', 'label'],
    #     num_rows: 8530
    # })
- as_streaming_dataset(split: str | None = None, base_path: str | None = None) → Dict[str, IterableDataset] | IterableDataset
Return an IterableDataset for the specified split, or a dict mapping split names to IterableDatasets when split is None. Examples are streamed lazily rather than prepared on disk first.
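A minimal usage sketch, mirroring the as_dataset example above with the same illustrative 'rotten_tomatoes' builder; unlike as_dataset, no download_and_prepare() call is needed because examples are yielded on the fly:

    from datasets import load_dataset_builder

    builder = load_dataset_builder('rotten_tomatoes')
    streamed = builder.as_streaming_dataset(split='train')
    # IterableDataset streams examples lazily instead of reading an
    # Arrow file that was prepared on disk.
    for example in streamed.take(2):
        print(example)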
- property generators