unitxt.api module

unitxt.api.create_dataset(task: str | Task, test_set: List[Dict[Any, Any]], train_set: List[Dict[Any, Any]] | None = None, validation_set: List[Dict[Any, Any]] | None = None, split: str | None = None, data_classification_policy: List[str] | None = None, **kwargs) → DatasetDict | IterableDatasetDict | Dataset | IterableDataset

Creates dataset from input data based on a specific task.

Parameters:
  • task – The name of the task from the Unitxt Catalog (https://www.unitxt.ai/en/latest/catalog/catalog.tasks.__dir__.html)

  • test_set – Required list of test instances.

  • train_set – Optional list of training instances.

  • validation_set – Optional list of validation instances.

  • split – Optional name of a single split to return.

  • data_classification_policy – Optional list of data classification policy labels for the provided data.

  • **kwargs – Additional arguments used to build the dataset from the provided data (see load_dataset()).

Returns:

DatasetDict

Example

template = Template(...)
dataset = create_dataset(task="tasks.qa.open", template=template, format="formats.chatapi")
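
How the optional sets and the split argument combine can be sketched in plain Python. This is an illustration only, not the library's code: collect_splits is a hypothetical helper, and the split names "train", "validation" and "test" and the question/answers fields are assumptions. The real function also applies the task and any template or format passed through **kwargs.

```python
from typing import Any, Dict, List, Optional, Union

Instance = Dict[str, Any]


def collect_splits(
    test_set: List[Instance],
    train_set: Optional[List[Instance]] = None,
    validation_set: Optional[List[Instance]] = None,
    split: Optional[str] = None,
) -> Union[Dict[str, List[Instance]], List[Instance]]:
    """Hypothetical helper: gather the provided splits, optionally keeping one."""
    # test_set is required, so the "test" split is always present.
    splits = {"test": test_set}
    if train_set is not None:
        splits["train"] = train_set
    if validation_set is not None:
        splits["validation"] = validation_set
    if split is None:
        return splits
    if split not in splits:
        raise ValueError(
            f"split {split!r} was not provided; available: {sorted(splits)}"
        )
    return splits[split]


# Assumed field names for an open question-answering task.
test_set = [{"question": "What is the capital of France?", "answers": ["Paris"]}]
train_set = [{"question": "What is 2 + 2?", "answers": ["4"]}]

all_splits = collect_splits(test_set, train_set=train_set)
only_test = collect_splits(test_set, train_set=train_set, split="test")
print(sorted(all_splits))  # ['test', 'train']
print(len(only_test))      # 1
```

Under these assumptions, omitting split yields a mapping of split names to instances, while naming a split yields that split alone. This mirrors the union of dictionary-like and single-dataset return types in the signature.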

unitxt.api.evaluate(predictions: List[str] | None = None, dataset: Dataset | IterableDataset | None = None, data=None, calc_confidence_intervals: bool = True) → EvaluationResults

unitxt.api.fill_metadata(**kwargs)

unitxt.api.infer(instance_or_instances, engine: InferenceEngine, dataset_query: str | None = None, return_data: bool = False, return_log_probs: bool = False, return_meta_data: bool = False, previous_messages: List[Dict[str, str]] | None = None, **kwargs)

unitxt.api.load_dataset(dataset_query: str | None = None, split: str | None = None, streaming: bool = False, use_cache: bool | None = None, **kwargs) → DatasetDict | IterableDatasetDict | Dataset | IterableDataset

Loads a dataset.

If the dataset_query argument is provided, the dataset is loaded from a card in the local catalog, based on the parameters specified in the query.

Alternatively, the dataset is loaded from a provided card, based on explicitly given parameters.

If both are given, the textual recipe is loaded and the keyword arguments override the arguments given in the textual recipe.
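
The override rule for the case where both a query and keyword arguments are given can be pictured with a small sketch. It is not Unitxt's own parser: parse_query and merge_recipe_args are hypothetical names, and the simple comma and equals-sign splitting is an assumption that ignores nested or list-valued arguments.

```python
from typing import Any, Dict


def parse_query(query: str) -> Dict[str, str]:
    """Illustrative parser: split 'key=value,key=value' into a dict of strings."""
    args: Dict[str, str] = {}
    for part in query.split(","):
        key, _, value = part.partition("=")
        args[key.strip()] = value.strip()
    return args


def merge_recipe_args(dataset_query: str, **kwargs: Any) -> Dict[str, Any]:
    """Start from the textual recipe, then let keyword arguments override it."""
    args: Dict[str, Any] = dict(parse_query(dataset_query))
    args.update(kwargs)
    return args


query = "card=cards.stsb,template=templates.regression.two_texts.simple,max_train_instances=5"
merged = merge_recipe_args(query, max_train_instances=10)
print(merged["card"])                 # cards.stsb
print(merged["max_train_instances"])  # 10
```

Here the card and template come from the query string, while max_train_instances takes the keyword value instead of the value written in the query.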

Parameters:
  • dataset_query (str, optional) – A query string that specifies a dataset to load from the local catalog, or the name of a specific recipe or benchmark in the catalog. For example, "card=cards.wnli,template=templates.classification.multi_class.relation.default".

  • streaming (bool, False) – When True, yields the data as a stream. This is useful when loading very large datasets. Loading a dataset as a stream avoids loading all the data into memory, but requires the dataset's loader to support streaming.

  • split (str, optional) – The split of the data to load.

  • use_cache (bool, optional) – If set to True, the returned Hugging Face dataset is cached on local disk, so that loading the same dataset again reads it from disk, resulting in faster runs. If set to False, the returned dataset is not cached. If set to None, the value is determined by setting.dataset_cache_default (default is False). Note that if caching is enabled and the dataset card definition is changed, the old version in the cache may be returned. Enable caching only if you are sure you are working with fixed Unitxt datasets and definitions (e.g. when using predefined datasets from the Unitxt catalog).

  • **kwargs – Arguments used to load the dataset from a provided card that is not present in the local catalog.
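
The three-state behaviour of use_cache (True, False, or None meaning "use the configured default") can be sketched as below. The names resolve_use_cache and DATASET_CACHE_DEFAULT are hypothetical stand-ins, not part of Unitxt. The constant plays the role of the dataset_cache_default setting mentioned above.

```python
from typing import Optional

# Stand-in for the configured default described above (documented default: False).
DATASET_CACHE_DEFAULT = False


def resolve_use_cache(
    use_cache: Optional[bool], default: bool = DATASET_CACHE_DEFAULT
) -> bool:
    """Return the effective caching flag: an explicit value wins, None falls back."""
    if use_cache is None:
        return default
    return use_cache


print(resolve_use_cache(None))                # False
print(resolve_use_cache(True))                # True
print(resolve_use_cache(None, default=True))  # True
```

An explicit True or False always takes effect, so the default matters only when the caller leaves use_cache unset.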

Returns:

DatasetDict

Example:
dataset = load_dataset(
    dataset_query="card=cards.stsb,template=templates.regression.two_texts.simple,max_train_instances=5"
)  # card and template must be present in local catalog

# or built programmatically
card = TaskCard(...)
template = Template(...)
loader_limit = 10
dataset = load_dataset(card=card, template=template, loader_limit=loader_limit)

unitxt.api.load_recipe(dataset_query: str | None = None, **kwargs) → DatasetRecipe

unitxt.api.object_to_str_without_addresses(obj)

Generates a string representation of a Python object while removing memory address references.

This function is useful for creating consistent and comparable string representations of objects that would otherwise include memory addresses (e.g., <object_name at 0x123abc>), which can vary between executions. By stripping the memory address, the function ensures that the representation is stable and independent of the object’s location in memory.

Parameters:

obj – Any Python object to be converted to a string representation.

Returns:

A string representation of the object with memory addresses removed if present.

Return type:

str

Example

```python
from unitxt.api import object_to_str_without_addresses

class MyClass:
    pass

obj = MyClass()
print(str(obj))  # e.g. "<__main__.MyClass object at 0x7f8b9d4d6e20>"
print(object_to_str_without_addresses(obj))  # "<__main__.MyClass object>"
```
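
One way such address stripping could work is a regular-expression substitution over the object's string form. The sketch below is an assumption about the approach, not the library's source. The name strip_addresses and the pattern are illustrative.

```python
import re

# Matches the " at 0x..." suffix that default object reprs include.
_ADDRESS = re.compile(r" at 0x[0-9a-fA-F]+")


def strip_addresses(obj) -> str:
    """Return str(obj) with any ' at 0x<hex>' memory-address fragments removed."""
    return _ADDRESS.sub("", str(obj))


class MyClass:
    pass


first = strip_addresses(MyClass())
second = strip_addresses(MyClass())
print(first == second)  # True: the representation no longer depends on the address
```

Because the substitution works on the whole string, it also removes addresses of nested objects that appear inside a container's representation.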

unitxt.api.post_process(predictions, data) → List[Dict[str, Any]]

unitxt.api.produce(instance_or_instances, dataset_query: str | None = None, **kwargs) → Dataset | Dict[str, Any]

unitxt.api.select(instance_or_instances, engine: OptionSelectingByLogProbsInferenceEngine, dataset_query: str | None = None, return_data: bool = False, previous_messages: List[Dict[str, str]] | None = None, **kwargs)

unitxt.api.short_hex_hash(value, length=8)