unitxt.formats module
- class unitxt.formats.BaseFormat(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, demos_field: str = 'demos')
Bases: Format
- class unitxt.formats.ChatAPIFormat(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, demos_field: str = 'demos')
Bases: BaseFormat
Formats output for LLM APIs using OpenAI’s chat schema.
Many API services use OpenAI’s chat format as a standard for conversational models. ChatAPIFormat prepares the output in this API-compatible format, converting input instances into OpenAI’s structured chat format, which supports both text and multimedia elements such as images.
The formatted output is a JSON string that can be parsed into a list of message dictionaries using json.loads(), making it ready for direct use with OpenAI’s API.
Example
Given an input instance:
    {
        "source": "<img src='https://example.com/image1.jpg'>What's in this image?",
        "target": "A dog",
        "instruction": "Help the user.",
    }
When processed by:
    system_format = ChatAPIFormat()
The resulting formatted output is:
    {
        "target": "A dog",
        "source": '[{"role": "system", "content": "Help the user."}, '
                  '{"role": "user", "content": [{"type": "image_url", '
                  '"image_url": {"url": "https://example.com/image1.jpg", "detail": "low"}}, '
                  '{"type": "text", "text": "What\'s in this image?"}]}]'
    }
This source field is a JSON-formatted string. To make it ready for OpenAI’s API, parse it into a list of message dictionaries using json.loads():
    import json

    # `client` is assumed to be an OpenAI client created elsewhere, and
    # `formatted_output` the instance produced by the format above.
    messages = json.loads(formatted_output["source"])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
The resulting messages is now a list of message dictionaries ready for sending to the OpenAI API.
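The parsing step can be checked offline. The following is a minimal sketch that reuses the formatted output from the example above and inspects the parsed structure with the standard library only; the variable names are chosen for illustration and no API call is made.

```python
import json

# The formatted instance from the example above (variable name is illustrative).
formatted_output = {
    "target": "A dog",
    "source": '[{"role": "system", "content": "Help the user."}, '
    '{"role": "user", "content": [{"type": "image_url", '
    '"image_url": {"url": "https://example.com/image1.jpg", "detail": "low"}}, '
    '{"type": "text", "text": "What\'s in this image?"}]}]',
}

# json.loads turns the JSON string into a list with one dict per chat message.
messages = json.loads(formatted_output["source"])

roles = [message["role"] for message in messages]
# The user message carries a list of content parts: the image first, then the text.
user_part_types = [part["type"] for part in messages[1]["content"]]

print(roles)
print(user_part_types)
```

Note that the parsed value is a list, not a single dictionary: each element is one message, and the user message's content is itself a list of typed parts.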
- class unitxt.formats.Format(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)
Bases: InstanceOperator
- class unitxt.formats.HFSystemFormat(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['transformers', 'Jinja2'], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, demos_field: str = 'demos', model_name: str = __required__)
Bases: ChatAPIFormat
Formats the complete input for the model using the HuggingFace chat template of a given model.
HFSystemFormat expects the input instance to contain:
1. A field named “system_prompt” whose value is a string (potentially empty) that delivers a task-independent opening text.
2. A field named “source” whose value is a string verbalizing the original values in the instance (as read from the source dataset), in the context of the underlying task.
3. A field named “instruction” that contains a (non-None) string.
4. A field named with the value in arg ‘demos_field’, containing a list of dicts, each dict with fields “source” and “target”, representing a single demo.
5. A field named “target_prefix” that contains a string to prefix the target in each demo, and to end the whole generated prompt.
HFSystemFormat formats the above fields into a single string to be passed to the model. This string overwrites field “source” of the instance.
Example
    HFSystemFormat(model_name="HuggingFaceH4/zephyr-7b-beta")
Uses the template defined in the tokenizer_config.json of the model:
    "chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}",
See more details in https://huggingface.co/docs/transformers/main/en/chat_templating
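To make the template's logic easier to follow, here is a hand-written approximation in plain Python. It is only a sketch: it does not call transformers or Jinja2, the function name render_zephyr_style is hypothetical, the "</s>" end-of-sequence token is an assumption, and the whitespace handling assumes that newlines directly after template block tags are trimmed.

```python
def render_zephyr_style(messages, eos_token="</s>", add_generation_prompt=True):
    """Approximate the branching of the chat template above (not library code)."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only has branches for these three roles.
        if role in ("user", "system", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    # After the last message, optionally open the assistant turn.
    if add_generation_prompt and messages:
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = render_zephyr_style(
    [
        {"role": "system", "content": "Help the user."},
        {"role": "user", "content": "1+1"},
    ]
)
print(prompt)
```

Each message becomes a role tag, the content, and the end-of-sequence token; the trailing assistant tag is what prompts the model to generate its reply.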
- class unitxt.formats.SystemFormat(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, demos_field: str = 'demos', demo_format: str = '{source}\\N{target_prefix}{target}\n\n', model_input_format: str = '{system_prompt}\\N{instruction}\\N{demos}{source}\\N{target_prefix}', format_args: Dict[str, str] = {})
Bases: BaseFormat
Generates the whole input to the model, from constant strings that are given as args, and from values found in specified fields of the instance.
Important: format strings can use the ‘\N’ notation, which inserts a new line only if the preceding text is non-empty and does not already end with a new line.
SystemFormat expects the input instance to contain:
1. A field named “system_prompt” whose value is a string (potentially empty) that delivers a task-independent opening text.
2. A field named “source” whose value is a string verbalizing the original values in the instance (as read from the source dataset), in the context of the underlying task.
3. A field named “instruction” that contains a (non-None) string.
4. A field named with the value in arg ‘demos_field’, containing a list of dicts, each dict with fields “source” and “target”, representing a single demo.
5. A field named “target_prefix” that contains a string to prefix the target in each demo, and to end the whole generated prompt.
SystemFormat formats the above fields into a single string to be inputted to the model. This string overwrites field “source” of the instance. Formatting is driven by two args: ‘demo_format’ and ‘model_input_format’. SystemFormat also pops fields “system_prompt”, “instruction”, “target_prefix”, and the field containing the demos out from the input instance.
- Parameters:
demos_field (str) – the name of the field that contains the demos, being a list of dicts, each with “source” and “target” keys
demo_format (str) – formatting string for a single demo, combining fields “source” and “target”
model_input_format (str) – formatting string for the whole model input, combining the instruction and “source” of the input instance together with the demos
format_args (Dict[str, str]) – additional format args to be used when formatting the different format strings
Example
When the input instance:
    {
        "source": "1+1",
        "target": "2",
        "instruction": "Solve the math exercises.",
        "demos": [{"source": "1+2", "target": "3"}, {"source": "4-2", "target": "2"}]
    }
is processed by:
    system_format = SystemFormat(
        demos_field="demos",
        demo_format="Input: {source}\nOutput: {target}\n\n",
        model_input_format="Instruction: {instruction}\n\n{demos}Input: {source}\nOutput: ",
    )
the resulting instance is:
    {
        "target": "2",
        "source": "Instruction: Solve the math exercises.\n\nInput: 1+2\nOutput: 3\n\nInput: 4-2\nOutput: 2\n\nInput: 1+1\nOutput: ",
    }
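The way the two format strings combine can be reproduced with plain str.format. The following is a minimal sketch, not the library's implementation: the helper name format_instance_sketch is hypothetical, missing optional fields are assumed to default to empty strings, and the ‘\N’ notation is not handled.

```python
def format_instance_sketch(instance, demo_format, model_input_format, demos_field="demos"):
    """Illustrative stand-in for the formatting step (hypothetical helper)."""
    target_prefix = instance.get("target_prefix", "")
    # Each demo is rendered with demo_format, and the results are concatenated.
    demos = "".join(
        demo_format.format(
            source=demo["source"], target=demo["target"], target_prefix=target_prefix
        )
        for demo in instance.get(demos_field, [])
    )
    # The concatenated demos are then substituted into the overall input format.
    source = model_input_format.format(
        system_prompt=instance.get("system_prompt", ""),
        instruction=instance.get("instruction", ""),
        demos=demos,
        source=instance["source"],
        target_prefix=target_prefix,
    )
    return {"target": instance["target"], "source": source}


instance = {
    "source": "1+1",
    "target": "2",
    "instruction": "Solve the math exercises.",
    "demos": [{"source": "1+2", "target": "3"}, {"source": "4-2", "target": "2"}],
}
result = format_instance_sketch(
    instance,
    demo_format="Input: {source}\nOutput: {target}\n\n",
    model_input_format="Instruction: {instruction}\n\n{demos}Input: {source}\nOutput: ",
)
print(result["source"])
```

With the inputs from the example, the sketch yields the same “source” string as shown in the resulting instance, and the consumed fields no longer appear in the returned dictionary.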
- unitxt.formats.apply_capital_new_line_notation(text: str) → str
Transforms a given string by applying the Capital New Line Notation.
The Capital New Line Notation (\N) manages newline behavior in a string. It consolidates runs of newline characters (\n) and \N markers into a single newline under specific conditions, with the handling depending on whether any text precedes them. The function distinguishes between two scenarios:
1. If there is text (referred to as a prefix) followed by any number of \n characters and then one or more \N, the entire sequence is replaced with a single \n. This collapses multiple newlines and notation markers into a single newline when there is preceding text.
2. If the string starts with \n characters followed by \N without any text before this sequence, or if \N is at the very beginning of the string, the sequence is removed completely. This applies when the notation should not introduce any newlines because there is no preceding text.
- Parameters:
text (str) – The input string to be transformed, potentially containing the Capital New Line Notation (\N) mixed with actual newline characters (\n).
- Returns:
The string after applying the Capital New Line Notation rules, which either consolidates multiple newlines and notation markers into a single newline when text precedes them, or removes the notation and any preceding newlines entirely if no text is present before the notation.
- Return type:
str
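Examples
In the Python literals below, "\n" is an actual newline character and "\\N" is the two-character \N marker.
>>> apply_capital_new_line_notation("Hello World\n\n\\N")
'Hello World\n'
>>> apply_capital_new_line_notation("\n\n\\NGoodbye World")
'Goodbye World'
>>> apply_capital_new_line_notation("\\N")
''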
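The two rules described above can be expressed with two regular-expression substitutions. This is a minimal sketch written from the description, not the library's source; the function name capital_new_line_sketch is hypothetical, and the \N marker is taken to be a literal backslash followed by a capital N.

```python
import re


def capital_new_line_sketch(text: str) -> str:
    """Illustrative re-implementation of the two rules (not the library code)."""
    # Rule 2: a run of newlines and \N markers at the very start of the string,
    # ending in \N, has no preceding text and is removed entirely.
    text = re.sub(r"^(?:\n*\\N)+", "", text)
    # Rule 1: anywhere else, a run of newlines and \N markers ending in \N
    # collapses into a single newline.
    return re.sub(r"(?:\n*\\N)+", "\n", text)


print(repr(capital_new_line_sketch("Hello World\n\n\\N")))
print(repr(capital_new_line_sketch("\n\n\\NGoodbye World")))
print(repr(capital_new_line_sketch("\\N")))
```

Running rule 2 first matters: once a leading sequence is stripped, the second substitution only ever sees markers that have text before them.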