unitxt.dataset module

class unitxt.dataset.Dataset(cache_dir: str | None = None, dataset_name: str | None = None, config_name: str | None = None, hash: str | None = None, base_path: str | None = None, info: DatasetInfo | None = None, features: Features | None = None, token: bool | str | None = None, repo_id: str | None = None, data_files: str | list | dict | DataFilesDict | None = None, data_dir: str | None = None, storage_options: dict | None = None, writer_batch_size: int | None = None, **config_kwargs)[source]

Bases: GeneratorBasedBuilder
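
A minimal usage sketch: this builder is normally driven through datasets.load_dataset rather than instantiated directly. The recipe string below (card and template names) is an illustrative assumption; substitute entries from your own unitxt catalog.

from datasets import load_dataset

# Assumed recipe string: any unitxt card/template pair from the catalog works here.
dataset = load_dataset(
    'unitxt/data',
    'card=cards.wnli,template=templates.classification.multi_class.relation.default',
    trust_remote_code=True,
)
print(dataset['train'][0])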

as_dataset(split: Split | None = None, run_post_process=True, verification_mode: VerificationMode | str | None = None, in_memory=False) → Dataset | DatasetDict[source]

Return a Dataset for the specified split.

Parameters:
  • split (datasets.Split) – Which subset of the data to return.

  • run_post_process (bool, defaults to True) – Whether to run post-processing dataset transforms and/or add indexes.

  • verification_mode (VerificationMode or str, defaults to BASIC_CHECKS) – Verification mode determining the checks to run on the downloaded/processed dataset information (checksums/size/splits/…).

  • in_memory (bool, defaults to False) – Whether to copy the data in-memory.

Returns:

datasets.Dataset or datasets.DatasetDict – a single Dataset for the requested split, or a DatasetDict mapping split names to datasets when split is None.

Example:

from datasets import load_dataset_builder
builder = load_dataset_builder('rotten_tomatoes')
builder.download_and_prepare()
ds = builder.as_dataset(split='train')
print(ds)
# prints:
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 8530
# })
as_streaming_dataset(split: str | None = None, base_path: str | None = None) → Dict[str, IterableDataset] | IterableDataset[source]
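
Return an IterableDataset for the given split, or a dict mapping split names to IterableDatasets when split is None, yielding examples on the fly rather than materializing the dataset on disk.

Example:

A minimal sketch mirroring the as_dataset example above, assuming the 'rotten_tomatoes' builder's data source supports streaming; no download_and_prepare() call is needed.

from datasets import load_dataset_builder
builder = load_dataset_builder('rotten_tomatoes')
ds = builder.as_streaming_dataset(split='train')
# Streams the first example instead of loading the full split into memory.
print(next(iter(ds)))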
property generators