unitxt.struct_data_operators module¶
This section describes unitxt operators for structured data.
These operators are specialized in handling structured data like tables. For tables, expected input format is: {
“header”: [“col1”, “col2”], “rows”: [[“row11”, “row12”], [“row21”, “row22”], [“row31”, “row32”]]
}
For triples, expected input format is: [[ “subject1”, “relation1”, “object1” ], [ “subject1”, “relation2”, “object2”]]
For key-value pairs, expected input format is: {“key1”: “value1”, “key2”: value2, “key3”: “value3”} ————————
- class unitxt.struct_data_operators.ConvertTableColNamesToSequential(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Replaces actual table column names with static sequential names like col_0, col_1,…
Sample input: {
“header”: [“name”, “age”], “rows”: [[“Alex”, 21], [“Donald”, 34]]
} Sample output: {
“header”: [“col_0”, “col_1”], “rows”: [[“Alex”, 21], [“Donald”, 34]]
}
- class unitxt.struct_data_operators.DumpJson(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
- class unitxt.struct_data_operators.ListToKeyValPairs(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: ~typing.List[str], to_field: str)¶
Bases:
InstanceOperator
Maps list of keys and values into key:value pairs.
Sample input in expected format: {“keys”: [“name”, “age”, “sex”], “values”: [“Alex”, 31, “M”]} Sample output: {“name”: “Alex”, “age”: 31, “sex”: “M”}
- class unitxt.struct_data_operators.LoadJson(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, failure_value: ~typing.Any = None, allow_failure: bool = False)¶
Bases:
FieldOperator
- class unitxt.struct_data_operators.MapHTMLTableToJSON(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, _requirements_list: ~typing.List[str] | ~typing.Dict[str, str] = ['bs4'], caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Converts HTML table format to the basic one (JSON).
JSON format {
“header”: [“col1”, “col2”], “rows”: [[“row11”, “row12”], [“row21”, “row22”], [“row31”, “row32”]]
}
- class unitxt.struct_data_operators.MapTableListsToStdTableJSON(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Converts lists table format to the basic one (JSON).
JSON format {
“header”: [“col1”, “col2”], “rows”: [[“row11”, “row12”], [“row21”, “row22”], [“row31”, “row32”]]
}
- class unitxt.struct_data_operators.SerializeKeyValPairs(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Serializes key, value pairs into a flat sequence.
Sample input in expected format: {“name”: “Alex”, “age”: 31, “sex”: “M”} Sample output: name is Alex, age is 31, sex is M
- class unitxt.struct_data_operators.SerializeTable(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
ABC
,FieldOperator
TableSerializer converts a given table into a flat sequence with special symbols.
Output format varies depending on the chosen serializer. This abstract class defines structure of a typical table serializer that any concrete implementation should follow.
- class unitxt.struct_data_operators.SerializeTableAsDFLoader(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
SerializeTable
DFLoader Table Serializer.
Pandas dataframe based code snippet format serializer. Format(Sample): pd.DataFrame({
“name” : [“Alex”, “Diana”, “Donald”], “age” : [26, 34, 39]
}, index=[0,1,2])
- class unitxt.struct_data_operators.SerializeTableAsIndexedRowMajor(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
SerializeTable
Indexed Row Major Table Serializer.
Commonly used row major serialization format. Format: col : col1 | col2 | col 3 row 1 : val1 | val2 | val3 | val4 row 2 : val1 | …
- class unitxt.struct_data_operators.SerializeTableAsJson(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
SerializeTable
JSON Table Serializer.
Json format based serializer. Format(Sample): {
“0”:{“name”:”Alex”,”age”:26}, “1”:{“name”:”Diana”,”age”:34}, “2”:{“name”:”Donald”,”age”:39}
}
- class unitxt.struct_data_operators.SerializeTableAsMarkdown(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
SerializeTable
Markdown Table Serializer.
Markdown table format is used in GitHub code primarily. Format: |col1|col2|col3| |---|—|---| |A|4|1| |I|2|1| …
- class unitxt.struct_data_operators.SerializeTableRowAsList(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)¶
Bases:
InstanceOperator
Serializes a table row as list.
- Parameters:
fields (str) –
to_field (str) –
max_cell_length (int) –
- class unitxt.struct_data_operators.SerializeTableRowAsText(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)¶
Bases:
InstanceOperator
Serializes a table row as text.
- Parameters:
fields (str) –
to_field (str) –
max_cell_length (int) –
- class unitxt.struct_data_operators.SerializeTriples(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Serializes triples into a flat sequence.
Sample input in expected format: [[ “First Clearing”, “LOCATION”, “On NYS 52 1 Mi. Youngsville” ], [ “On NYS 52 1 Mi. Youngsville”, “CITY_OR_TOWN”, “Callicoon, New York”]]
Sample output: First Clearing : LOCATION : On NYS 52 1 Mi. Youngsville | On NYS 52 1 Mi. Youngsville : CITY_OR_TOWN : Callicoon, New York
- class unitxt.struct_data_operators.ShuffleTableColumns(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Shuffles the table columns randomly.
- Sample Input:
- {
“header”: [“name”, “age”], “rows”: [[“Alex”, 26], [“Raj”, 34], [“Donald”, 39]],
}
- Sample Output:
- {
“header”: [“age”, “name”], “rows”: [[26, “Alex”], [34, “Raj”], [39, “Donald”]],
}
- class unitxt.struct_data_operators.ShuffleTableRows(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False)¶
Bases:
FieldOperator
Shuffles the input table rows randomly.
Sample Input: {
“header”: [“name”, “age”], “rows”: [[“Alex”, 26], [“Raj”, 34], [“Donald”, 39]],
}
Sample Output: {
“header”: [“name”, “age”], “rows”: [[“Donald”, 39], [“Raj”, 34], [“Alex”, 26]],
}
- class unitxt.struct_data_operators.TruncateTableCells(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, max_length: int = 15, table: str = None, text_output: str | None = None)¶
Bases:
InstanceOperator
Limit the maximum length of cell values in a table to reduce the overall length.
- Parameters:
max_length (int) –
answer (For tasks that produce a cell value as) –
replicated (truncating a cell value should be) –
implementation. (with truncating the corresponding answer as well. This has been addressed in the) –
- class unitxt.struct_data_operators.TruncateTableRows(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, rows_to_keep: int = 10)¶
Bases:
FieldOperator
Limits table rows to specified limit by removing excess rows via random selection.
- Parameters:
rows_to_keep (int) –
- unitxt.struct_data_operators.truncate_cell(cell_value, max_len)¶