Evaluate CLI¶
This document describes the command-line interface (CLI) provided by the ``evaluate_cli.py`` script for running language model evaluations using the ``unitxt`` library.
Overview¶
The ``unitxt-evaluate`` CLI streamlines the process of evaluating language models against diverse tasks defined within the ``unitxt`` framework. It manages:
- Dataset Loading: Loading and processing datasets according to specified ``unitxt`` recipes (cards, templates, formats, etc.).
- Inference: Generating model predictions using different backends:
  - Local Hugging Face models via ``transformers`` (``HFAutoModelInferenceEngine``).
  - Remote models accessed through APIs like OpenAI, Anthropic, Cohere, etc., often via ``litellm`` (``CrossProviderInferenceEngine``).
- Evaluation: Calculating metrics based on predictions and references.
- Reporting: Saving detailed results, configuration, and environment information to JSON files for analysis and reproducibility.
Usage¶
The script is executed from the command line:
python path/to/evaluate_cli.py --tasks <task_definitions> --model <model_type> --model_args <model_arguments> [options]
If ``evaluate_cli.py`` has been installed as an executable script (e.g., via ``pip install .`` with a ``pyproject.toml`` entry point), you might be able to run it directly:
unitxt-evaluate --tasks <task_definitions> --model <model_type> --model_args <model_arguments> [options]
Example¶
Evaluating a remote Llama 3 model on two variations of the BIRD text-to-SQL task, applying a chat template, limiting to 300 validation examples, and saving detailed samples:
unitxt-evaluate \
--tasks "card=cards.text2sql.bird,template=templates.text2sql.you_are_given_no_system+card=cards.text2sql.bird,template=templates.text2sql.you_are_given_no_system_with_hint" \
--model cross_provider \
--model_args "model_name=llama-3-1-405b-instruct,max_tokens=256" \
--split validation \
--limit 300 \
--output_path ./results/bird_remote \
--log_samples \
--verbosity INFO \
--trust_remote_code \
--apply_chat_template \
--batch_size 8
Arguments¶
Task/Dataset Arguments¶
- --tasks <task_definitions>, -t <task_definitions>¶
Required. A plus-separated (``+``) list of one or more task definitions to evaluate. Each individual task definition is a comma-separated string of key-value pairs that specify the components of a ``unitxt`` recipe.
- Separator: Use ``+`` to separate different task definitions if evaluating multiple variations or datasets in one run.
- Format (Single Task): ``key1=value1,key2=value2,...``
- Format (Multiple Tasks): ``key1=value1,key2=value2+keyA=valueA,keyB=valueB,...``
- Common Keys: ``card``, ``template``, ``format``, ``num_demos``, ``max_train_instances``, ``max_validation_instances``, ``max_test_instances``, etc. Refer to the ``unitxt`` documentation for available recipe parameters.
- Example (Single): ``card=cards.mmlu,template=templates.mmlu.all,num_demos=5``
- Example (Multiple): ``card=cards.mmlu,t=t.mmlu.all+card=cards.hellaswag,t=t.hellaswag.no`` (using the shorthand ``t`` for ``template``)
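A complete command evaluating the two recipes from the multiple-task example above might look like this (a sketch; the model identifier is illustrative, the card and template names are taken from the examples above and assumed to exist in your catalog, and the whole ``--tasks`` value is quoted so the shell does not split it):
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all,num_demos=5+card=cards.hellaswag,template=templates.hellaswag.no" \
--model hf \
--model_args "pretrained=gpt2" \
--split test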
- --split <split_name>¶
The dataset split to load and evaluate (e.g., ``train``, ``validation``, ``test``). This should correspond to a split available in the specified card(s).
- Default: ``test``
- --num_fewshots <N>¶
Globally specifies the number of few-shot examples (demonstrations) to include in the prompt for all tasks defined in ``--tasks``. If set, this automatically adds/overrides the following parameters in each task's definition: ``num_demos=N``, ``demos_taken_from="train"``, ``demos_pool_size=-1``, ``demos_removed_from_data=True``. Using this will raise an error if ``num_demos`` is also specified directly within any task definition string in ``--tasks``, as it leads to ambiguity.
- Type: integer
- Default: ``None``
- --limit <N>, -L <N>¶
Globally limits the number of examples loaded and evaluated per task definition for the specified ``--split``. This sets/overrides the ``max_<split_name>_instances`` parameter (e.g., ``max_test_instances`` if ``--split test``) for each task. Using this will raise an error if ``max_<split_name>_instances`` is also specified directly within any task definition string in ``--tasks`` (see the sketch below).
- Type: integer
- Default: ``None`` (evaluate all available examples in the split)
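A sketch combining ``--num_fewshots``, ``--split``, and ``--limit`` (the model identifier is illustrative; note that ``num_demos`` is not repeated inside the task string):
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf \
--model_args "pretrained=gpt2" \
--num_fewshots 5 \
--split test \
--limit 100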
- --batch_size <N>, -b <N>¶
The batch size for model inference. This parameter is primarily used by the local Hugging Face engine (``--model hf``) via ``HFAutoModelInferenceEngine``. Remote providers might handle batching differently or ignore it.
- Type: integer
- Default: ``1``
Model Arguments¶
- --model <model_type>, -m <model_type>¶
Specifies the type of inference engine (and implicitly the model source) to use.
- Choices: ``hf``, ``cross_provider``
- ``hf``: Uses ``unitxt.inference.HFAutoModelInferenceEngine`` for models loadable via ``transformers.AutoModel``. Typically used for local models or those on the Hugging Face Hub. Requires ``pretrained=<model_id_or_path>`` in ``--model_args``.
- ``cross_provider``: Uses ``unitxt.inference.CrossProviderInferenceEngine``, which often leverages ``litellm`` to interact with various model APIs (OpenAI, Anthropic, Cohere, Vertex AI, self-hosted endpoints, etc.). Requires ``model_name=<provider/model_id>`` (e.g., ``openai/gpt-4o``, ``anthropic/claude-3-opus-20240229``) in ``--model_args``.
- Default: ``hf``
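Minimal sketches of the two engine types (model identifiers are illustrative):
# Local Hugging Face model
unitxt-evaluate --tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf --model_args "pretrained=gpt2"
# Remote model via an API provider (requires the provider's API key in the environment)
unitxt-evaluate --tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model cross_provider --model_args "model_name=openai/gpt-4o,max_tokens=256"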
- --model_args <arguments>, -a <arguments>¶
Arguments passed directly to the constructor of the selected inference engine (``HFAutoModelInferenceEngine`` or ``CrossProviderInferenceEngine``), after the required keys (``pretrained`` or ``model_name``) are extracted. Can be provided as a comma-separated string of key-value pairs or as a JSON string.
- Format (Key-Value): ``key1=value1,key2=value2,...`` (values are automatically typed as int, float, bool, or string). Example: ``torch_dtype=bfloat16,device=cuda,trust_remote_code=true``
- Format (JSON): ``'{"key1": "value1", "key2": 123, "key3": true}'`` (use double quotes for JSON keys and string values).
- Required Keys: For ``--model hf``, must include ``pretrained=<model_id_or_path>``; for ``--model cross_provider``, must include ``model_name=<provider/model_id>``.
- Engine-Specific Args: Refer to the documentation for ``HFAutoModelInferenceEngine`` and ``CrossProviderInferenceEngine`` (and potentially ``litellm`` for ``cross_provider``) for available arguments (e.g., ``torch_dtype``, ``device``, ``quantization_config`` for ``hf``; ``api_base``, ``api_key``, ``max_tokens``, ``temperature`` for ``cross_provider``). Note: sensitive keys like ``api_key`` are often better handled via environment variables.
- Merging with ``--gen_kwargs``: Arguments from ``--gen_kwargs`` are merged into this dictionary before initializing the inference engine. See the ``--gen_kwargs`` description.
- Default: ``{}``
- --gen_kwargs <arguments>¶
Additional key-value arguments intended specifically for the model's generation process (e.g., parameters for ``model.generate()`` in Transformers, or equivalent API call parameters). The format is the same as ``--model_args`` (key-value string or JSON). These arguments are merged into the arguments from ``--model_args`` before the inference engine is initialized; if a key exists in both ``--model_args`` and ``--gen_kwargs``, the value from ``--gen_kwargs`` takes precedence (see the sketch below).
- Example: ``temperature=0,top_p=0.9,max_new_tokens=100``
- Default: ``None``
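A sketch of the two flags together (values are illustrative): the engine is initialized with the ``--model_args`` keys plus the merged generation settings, and any key given in both places keeps the ``--gen_kwargs`` value:
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf \
--model_args "pretrained=gpt2,torch_dtype=bfloat16,device=cuda" \
--gen_kwargs "temperature=0,top_p=0.9,max_new_tokens=100"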
- --chat_template_kwargs <arguments>¶
Key-value arguments passed directly to the tokenizer's ``apply_chat_template`` method. This is only relevant if ``--apply_chat_template`` is also used. The format is the same as ``--model_args`` (key-value string or JSON). Refer to the Hugging Face Transformers documentation for available arguments.
- Example: ``thinking=True,add_generation_prompt=True``
- Default: ``None``
- --apply_chat_template¶
If specified, the script will automatically set the task format to ``formats.chat_api`` for all tasks defined in ``--tasks``. This format uses the tokenizer's ``apply_chat_template`` method to structure the input. Using this flag will raise an error if ``format`` is also specified directly within any task definition string in ``--tasks``. See the sketch below.
- Default: ``False`` (uses the format specified in the task definition or ``unitxt`` defaults).
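For example (a sketch; the model identifier is illustrative and must provide a chat template, and the ``--chat_template_kwargs`` value is one of the kwargs from the example above):
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf \
--model_args "pretrained=Qwen/Qwen2.5-0.5B-Instruct" \
--apply_chat_template \
--chat_template_kwargs "add_generation_prompt=True"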
Output and Logging Arguments¶
- --output_path <path>, -o <path>¶
Directory where the output JSON files will be saved. The directory will be created if it doesn't exist.
- Default: ``.`` (current directory)
- --output_file_prefix <prefix>¶
A prefix used for naming the output JSON files. A timestamp (``YYYY-MM-DDTHH:MM:SS``) is automatically prepended to ensure unique filenames.
- Example: With ``--output_file_prefix results_run1``, files might be named ``2025-04-14T10:05:14_results_run1.json`` and ``2025-04-14T10:05:14_results_run1_samples.json``.
- Default: ``evaluation_results``
- --log_samples, -s¶
If specified, a detailed file containing data for each individual evaluated instance will be saved alongside the summary results file (see the sketch below).
- Filename: ``<timestamp>_<prefix>_samples.json``
- Default: ``False`` (only the summary results file is saved).
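Putting the output options together (a sketch; the model identifier and prefix are illustrative):
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf \
--model_args "pretrained=gpt2" \
--output_path ./results/run1 \
--output_file_prefix mmlu_gpt2 \
--log_samples
# Produces files such as:
#   ./results/run1/2025-04-14T10:05:14_mmlu_gpt2.json
#   ./results/run1/2025-04-14T10:05:14_mmlu_gpt2_samples.json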
- --verbosity <level>, -v <level>¶
Controls the level of detail in log messages printed to the console.
- Choices: ``DEBUG``, ``INFO``, ``WARNING``, ``ERROR``, ``CRITICAL`` (case-insensitive)
- Default: ``INFO``
Unitxt Settings¶
These arguments configure underlying ``unitxt`` or Hugging Face ``datasets`` behavior.
- --trust_remote_code¶
Allows the execution of Python code defined in remote Hugging Face Hub repositories (e.g., custom code within dataset loading scripts or metrics). Warning: only enable this if you trust the source of the code.
- Default: ``False``
- --disable_hf_cache¶
Disables the caching mechanism used by the Hugging Face ``datasets`` library. This forces datasets to be redownloaded and reprocessed.
- Default: ``False``
- --cache_dir <path>¶
Specifies a custom directory for the Hugging Face ``datasets`` cache. This overrides the default location (usually ``~/.cache/huggingface/datasets``) and the ``HF_DATASETS_CACHE``/``HF_HOME`` environment variables for operations within this script.
- Default: ``None`` (uses the default cache location or environment variables).
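For example (a sketch; the model identifier and cache path are arbitrary):
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf \
--model_args "pretrained=gpt2" \
--trust_remote_code \
--cache_dir /tmp/hf_datasets_cache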
Output Files¶
The CLI generates one or two JSON files in the specified ``--output_path``.
Results Summary File (``<timestamp>_<prefix>.json``): Contains aggregated scores and execution environment details.
- ``environment_info`` (object): Details about the execution context:
  - ``timestamp_utc`` (string): Timestamp of evaluation completion (UTC, ISO 8601).
  - ``command_line_invocation`` (list): The command-line arguments used (``sys.argv``).
  - ``parsed_arguments`` (object): Dictionary representation of the parsed command-line arguments.
  - ``unitxt_version`` (string): Installed ``unitxt`` package version (or "N/A").
  - ``unitxt_commit_hash`` (string): Git commit hash of the ``unitxt`` installation (or "N/A").
  - ``python_version`` (string): Python interpreter version.
  - ``system`` (string): OS name (e.g., "Linux", "Darwin", "Windows").
  - ``system_version`` (string): OS version details.
  - ``installed_packages`` (object): Dictionary mapping installed Python packages to their versions.
- ``results`` (object): Contains the evaluation scores.
  - Keys are the task definition strings exactly as provided in the ``--tasks`` argument.
  - Values are objects containing the calculated metrics for that specific task (e.g., ``"accuracy": 0.85``, ``"score": 0.85``, ``"score_name": "accuracy"``, potentially with confidence intervals like ``"accuracy_ci_low"`` and ``"accuracy_ci_high"``).
  - May also include overall summary metrics across all tasks evaluated (e.g., a top-level ``"score"`` key representing the mean score, often accompanied by ``"score_name": "subsets_mean"``).
Detailed Samples File (``<timestamp>_<prefix>_samples.json``): Generated only if ``--log_samples`` is specified. Contains instance-level details.
- ``environment_info`` (object): Same structure as in the summary file.
- ``samples`` (object): A dictionary where keys are the task definition strings from ``--tasks``. Each value is a list of objects, where each object represents one evaluated instance. Instance object keys typically include:
  - ``source``: Original input data record.
  - ``processed``: Input potentially transformed by the recipe (e.g., the formatted prompt). May not always be present.
  - ``prediction``: Raw output generated by the model.
  - ``references``: List of ground truth reference(s).
  - ``metrics``: Dictionary of scores calculated for this specific instance.
  - ``task_data``: Additional metadata from the ``unitxt`` processing steps.
  - Note: the ``postprocessors`` key used during internal computation is removed before saving.
Frequently Asked Questions (FAQ)¶
Q: Why does ``--tasks`` use ``+`` as a separator? Why not commas or semicolons?
A: The ``+`` separates distinct task definitions. Since each task definition itself is a comma-separated list of key-value pairs (e.g., ``card=c,template=t``), using commas or semicolons to separate multiple tasks would be ambiguous. The ``+`` provides a clear delimiter between full task recipes.
Q: What's the difference between ``--model_args`` and ``--gen_kwargs``?
A: Both allow passing key-value arguments.
* ``--model_args`` are primarily intended for arguments needed to initialize the inference engine (e.g., ``pretrained``, ``device``, ``torch_dtype``, ``model_name``, ``max_tokens``).
* ``--gen_kwargs`` are intended for arguments controlling the generation process itself (e.g., ``temperature``, ``top_p``, ``do_sample``).
* Important: arguments from ``--gen_kwargs`` are merged into ``--model_args`` before the engine is initialized, with ``--gen_kwargs`` values overwriting any conflicting keys from ``--model_args``.
Q: I'm getting ``AttributeError: 'Namespace' object has no attribute 'batch_size'`` (or similar) in my tests.
A: When manually creating an ``argparse.Namespace`` object in a test (e.g., ``args = argparse.Namespace(...)``), ensure you include all attributes that the code under test might access, even if they have default values in the real parser. Check the ``setup_parser`` function for defaults (like ``batch_size=1``).
Q: I'm getting ``UnitxtArtifactNotFoundError: Artifact 'some_name' does not exist...``
A: This means ``unitxt`` cannot find an artifact (such as a card, template, or metric) you specified.
* Double-check the spelling and full name (e.g., ``cards.common_sense.hellaswag``) in your ``--tasks`` definition.
* Ensure the artifact exists in the default ``unitxt`` catalog or in any custom catalog paths you might have configured.
* Check for typos in keys (e.g., ``templete=`` instead of ``template=``).
Q: The CLI fails with an error about invalid JSON for ``--model_args`` (or ``--gen_kwargs`` / ``--chat_template_kwargs``).
A: If providing arguments as a JSON string, ensure it’s valid:
* Wrap the entire JSON string in single quotes (for the shell) or escape double quotes appropriately.
* Use double quotes (``"``) for all keys and string values inside the JSON.
* Example: ``--model_args '{"pretrained": "my/model", "some_flag": true, "count": 10}'``
Q: I get ``ValueError: Argument 'pretrained' is required...`` or ``ValueError: Argument 'model_name' is required...``
A: You must provide the correct identifier key within ``--model_args`` based on your selected ``--model`` type:
* If ``--model hf``, include ``pretrained=<model_id_or_path>`` in ``--model_args``.
* If ``--model cross_provider``, include ``model_name=<provider/model_id>`` in ``--model_args``.
Q: How do global arguments like ``--limit``, ``--num_fewshots``, and ``--apply_chat_template`` interact with task-specific arguments in ``--tasks``?
A: The global CLI arguments generally take precedence.
* If you provide ``--limit N``, it will set ``max_<split>_instances=N`` for all tasks, potentially overwriting values set inside the ``--tasks`` string. The script includes checks that error out if you provide both the CLI argument and a corresponding key within the same task string in ``--tasks`` (e.g., ``--limit 10`` and ``...,max_test_instances=5`` in ``--tasks`` when ``--split test``), as sketched below.
* Similar precedence and conflict checks apply to ``--num_fewshots`` (vs ``num_demos``) and ``--apply_chat_template`` (vs ``format``).
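For example, the first invocation below should raise a conflict error because the test-instance limit is given twice, while the second sets it once globally (a sketch; the model identifier is illustrative):
# Conflict: --limit and max_test_instances target the same parameter
unitxt-evaluate --tasks "card=cards.mmlu,template=templates.mmlu.all,max_test_instances=5" \
--model hf --model_args "pretrained=gpt2" --split test --limit 10
# Preferred: set the limit once, globally
unitxt-evaluate --tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model hf --model_args "pretrained=gpt2" --split test --limit 10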
Q: Where do I put API keys (like an OpenAI API key) for ``--model cross_provider``?
A: For security, do not pass sensitive API keys directly via ``--model_args``. ``CrossProviderInferenceEngine`` typically relies on ``litellm``, which finds keys through standard methods:
* Environment Variables (recommended): Set variables like ``OPENAI_API_KEY``, ``ANTHROPIC_API_KEY``, etc., before running the script, as sketched below.
* LiteLLM Config File: Configure keys in a ``litellm`` configuration file.
* Refer to the ``litellm`` documentation for managing API keys.
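For example (a sketch; the key value is a placeholder and the model identifier is illustrative):
export OPENAI_API_KEY="sk-..."   # set your real key in the environment, not on the command line
unitxt-evaluate \
--tasks "card=cards.mmlu,template=templates.mmlu.all" \
--model cross_provider \
--model_args "model_name=openai/gpt-4o,max_tokens=256"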
Q: The ``unitxt_commit_hash`` in my output is "N/A". Why?
A: The script tries to get the commit hash by running ``git rev-parse HEAD`` in the detected installation directory of the ``unitxt`` package. This might fail if:
* The ``unitxt`` package was not installed from a Git repository (e.g., it was installed from PyPI as a standard package).
* The ``git`` command is not available in your system's PATH.
* The script cannot correctly determine the ``unitxt`` package location, or that location is not within a recognizable Git repository structure.
Troubleshooting¶
- Argument Parsing Errors: Double-check the formatting of JSON/key-value strings, ensure required keys like ``pretrained``/``model_name`` are present, and verify the ``+`` separator for ``--tasks``.
- Artifact Not Found Errors: Verify artifact names (cards, templates, etc.) and catalog accessibility. Check for typos.
- Dependency Errors: Ensure ``unitxt``, ``datasets``, and ``transformers`` are installed. For ``hf`` models, ``torch`` and possibly ``accelerate`` are needed. For ``cross_provider``, ``litellm`` and potentially provider-specific libraries (like ``openai``) are needed.
- Remote Model Errors (``cross_provider``): Verify API keys (via environment variables), model identifiers (e.g., ``openai/gpt-4o``), quotas, network connectivity, and any necessary ``litellm`` configuration.
- CUDA/Device Errors (``hf``): Ensure GPU drivers and the CUDA toolkit are correctly installed and configured if using ``device=cuda`` in ``--model_args``. Check available GPU memory.
- Conflicting Arguments: Avoid specifying the same parameter both globally (e.g., ``--limit``) and within the ``--tasks`` string (e.g., ``max_test_instances``); the script should raise an error if this happens.