unitxt.struct_data_operators module¶
This section describes unitxt operators for structured data.
These operators specialize in handling structured data such as tables, triples, and key-value pairs.
For tables, the expected input format is:
{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}
For triples, the expected input format is:
[["subject1", "relation1", "object1"], ["subject1", "relation2", "object2"]]
For key-value pairs, the expected input format is:
{"key1": "value1", "key2": "value2", "key3": "value3"}
- class unitxt.struct_data_operators.ConstructTableFromRowsCols(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: ~typing.List[str], to_field: str)¶
Bases:
InstanceOperator
Maps the column and row fields into a single table field containing both the header and the rows.
fields[0] = header string as List
fields[1] = rows string as List[List]
fields[2] = table caption string (optional)
- class unitxt.struct_data_operators.ConvertTableColNamesToSequential(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Replaces actual table column names with static sequential names such as col_0, col_1, ...
Sample input:
{
    "header": ["name", "age"],
    "rows": [["Alex", 21], ["Donald", 34]]
}
Sample output:
{
    "header": ["col_0", "col_1"],
    "rows": [["Alex", 21], ["Donald", 34]]
}
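The renaming in the sample can be sketched in plain Python. This is an illustrative stand-in rather than the unitxt implementation; the helper name `rename_columns_sequentially` is hypothetical.

```python
from typing import Any, Dict


def rename_columns_sequentially(table: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the table whose header is col_0, col_1, ... (hypothetical helper)."""
    new_header = [f"col_{i}" for i in range(len(table["header"]))]
    # Only the header changes; the rows are copied as they are.
    return {"header": new_header, "rows": [list(row) for row in table["rows"]]}


sample = {"header": ["name", "age"], "rows": [["Alex", 21], ["Donald", 34]]}
renamed = rename_columns_sequentially(sample)
print(renamed)
```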
- class unitxt.struct_data_operators.DumpJson(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
- class unitxt.struct_data_operators.DuplicateTableColumns(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, column_indices: List[int] = [], times: int = 1)¶
Bases:
FieldOperator
Duplicates specific columns of a table a given number of times.
- Parameters:
column_indices (List[int]) – indices of the columns to duplicate
times (int) – number of times to duplicate them
- column_indices: List[int] = []¶
- class unitxt.struct_data_operators.DuplicateTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, row_indices: List[int] = [], times: int = 1)¶
Bases:
FieldOperator
Duplicates specific rows of a table a given number of times.
- Parameters:
row_indices (List[int]) – indices of the rows to duplicate
times (int) – number of times to duplicate them
- row_indices: List[int] = []¶
- class unitxt.struct_data_operators.InsertEmptyTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, times: int = 0)¶
Bases:
FieldOperator
Inserts empty rows at random positions in a table, a given number of times.
- Parameters:
times (int) – number of times an empty row is inserted
- class unitxt.struct_data_operators.ListToKeyValPairs(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: ~typing.List[str], to_field: str)¶
Bases:
InstanceOperator
Maps a list of keys and a list of values into key:value pairs.
Sample input in expected format: {"keys": ["name", "age", "sex"], "values": ["Alex", 31, "M"]}
Sample output: {"name": "Alex", "age": 31, "sex": "M"}
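The mapping in the sample corresponds to pairing the two lists by position. A minimal plain-Python sketch, independent of the operator itself:

```python
keys = ["name", "age", "sex"]
values = ["Alex", 31, "M"]

# zip pairs each key with the value at the same position.
pairs = dict(zip(keys, values))
print(pairs)  # {'name': 'Alex', 'age': 31, 'sex': 'M'}
```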
- class unitxt.struct_data_operators.LoadJson(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, failure_value: Any = None, allow_failure: bool = False)¶
Bases:
FieldOperator
- class unitxt.struct_data_operators.MapHTMLTableToJSON(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['bs4'], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Converts a table in HTML format to the basic JSON table format.
JSON format:
{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}
- class unitxt.struct_data_operators.MapTableListsToStdTableJSON(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Converts a table in lists format to the basic JSON table format.
JSON format:
{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}
- class unitxt.struct_data_operators.SerializeKeyValPairs(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Serializes key-value pairs into a flat sequence.
Sample input in expected format: {"name": "Alex", "age": 31, "sex": "M"}
Sample output: name is Alex, age is 31, sex is M
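The sample output follows a "key is value" pattern joined by commas. The sketch below reproduces that pattern in plain Python; `serialize_key_val_pairs` is a hypothetical name, not the operator's API.

```python
from typing import Any, Dict


def serialize_key_val_pairs(pairs: Dict[str, Any]) -> str:
    """Join "key is value" fragments with ", " (hypothetical helper)."""
    return ", ".join(f"{key} is {value}" for key, value in pairs.items())


serialized = serialize_key_val_pairs({"name": "Alex", "age": 31, "sex": "M"})
print(serialized)  # name is Alex, age is 31, sex is M
```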
- class unitxt.struct_data_operators.SerializeTable(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
ABC, TableSerializer
TableSerializer converts a given table into a flat sequence with special symbols.
The output format varies depending on the chosen serializer. This abstract class defines the structure of a typical table serializer, which any concrete implementation should follow.
- class unitxt.struct_data_operators.SerializeTableAsDFLoader(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
SerializeTable
DFLoader Table Serializer.
Serializes the table as a pandas DataFrame code snippet. Format (sample):
pd.DataFrame({
    "name" : ["Alex", "Diana", "Donald"],
    "age" : [26, 34, 39]
}, index=[0,1,2])
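A rough sketch of how such a snippet could be assembled from the standard table format is shown below. It only builds the string, so pandas is not required; the helper name `serialize_as_df_loader` is hypothetical, and whitespace may differ from what the operator emits.

```python
import json
from typing import Any, Dict


def serialize_as_df_loader(table: Dict[str, Any]) -> str:
    """Render a table as a pd.DataFrame(...) code snippet (hypothetical helper)."""
    # Turn the row-major table into a column-major mapping: column name -> values.
    columns = {
        name: [row[i] for row in table["rows"]]
        for i, name in enumerate(table["header"])
    }
    index = list(range(len(table["rows"])))
    return f"pd.DataFrame({json.dumps(columns)}, index={index})"


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34], ["Donald", 39]]}
snippet = serialize_as_df_loader(table)
print(snippet)
```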
- class unitxt.struct_data_operators.SerializeTableAsHTML(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
SerializeTable
HTML Table Serializer.
HTML table format used for rendering tables in web pages. Format (sample):
<table>
    <thead>
        <tr><th>name</th><th>age</th><th>sex</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice</td><td>26</td><td>F</td></tr>
        <tr><td>Raj</td><td>34</td><td>M</td></tr>
    </tbody>
</table>
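The HTML layout above can be produced from the standard table format with a few string joins. The following is a minimal sketch, not the operator's code; `serialize_as_html` is a hypothetical name, and escaping cell values with `html.escape` is an added assumption.

```python
from html import escape
from typing import Any, Dict


def serialize_as_html(table: Dict[str, Any]) -> str:
    """Render a table as <table>/<thead>/<tbody> markup (hypothetical helper)."""
    head_cells = "".join(f"<th>{escape(str(c))}</th>" for c in table["header"])
    lines = ["<table>", "<thead>", f"<tr>{head_cells}</tr>", "</thead>", "<tbody>"]
    for row in table["rows"]:
        # Assumption: cell values are escaped so they cannot break the markup.
        cells = "".join(f"<td>{escape(str(v))}</td>" for v in row)
        lines.append(f"<tr>{cells}</tr>")
    lines += ["</tbody>", "</table>"]
    return "\n".join(lines)


table = {"header": ["name", "age", "sex"], "rows": [["Alice", 26, "F"], ["Raj", 34, "M"]]}
html_text = serialize_as_html(table)
print(html_text)
```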
- class unitxt.struct_data_operators.SerializeTableAsIndexedRowMajor(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
SerializeTable
Indexed Row Major Table Serializer.
Commonly used row-major serialization format. Format: col : col1 | col2 | col3 row 1 : val1 | val2 | val3 row 2 : val1 | …
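To make the flat format concrete, here is a small plain-Python sketch that emits a header segment followed by 1-indexed row segments. The helper name `serialize_indexed_row_major` is hypothetical, and joining segments with a single space is an assumption based on the format line above.

```python
from typing import Any, Dict


def serialize_indexed_row_major(table: Dict[str, Any]) -> str:
    """Flatten a table into "col : ... row 1 : ... row 2 : ..." (hypothetical helper)."""
    parts = ["col : " + " | ".join(str(c) for c in table["header"])]
    for i, row in enumerate(table["rows"], start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    # Assumption: segments are separated by a single space.
    return " ".join(parts)


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34]]}
flat = serialize_indexed_row_major(table)
print(flat)  # col : name | age row 1 : Alex | 26 row 2 : Diana | 34
```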
- class unitxt.struct_data_operators.SerializeTableAsJson(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
SerializeTable
JSON Table Serializer.
JSON-format-based serializer. Format (sample):
{
    "0":{"name":"Alex","age":26},
    "1":{"name":"Diana","age":34},
    "2":{"name":"Donald","age":39}
}
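The sample maps each row index (as a string) to a record keyed by column name. One way to build that shape with the standard library is sketched below; `serialize_as_json` is a hypothetical name, and the compact separators are chosen only to mirror the sample.

```python
import json
from typing import Any, Dict


def serialize_as_json(table: Dict[str, Any]) -> str:
    """Serialize a table as {"<row index>": {column: value, ...}} (hypothetical helper)."""
    records = {
        str(i): dict(zip(table["header"], row))
        for i, row in enumerate(table["rows"])
    }
    # Compact separators mirror the sample; the operator's spacing may differ.
    return json.dumps(records, separators=(",", ":"))


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34], ["Donald", 39]]}
json_text = serialize_as_json(table)
print(json_text)
```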
- class unitxt.struct_data_operators.SerializeTableAsMarkdown(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: ~typing.List[~typing.List[str]] | ~typing.Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: ~typing.Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, serialized_type: object = <class 'unitxt.types.Table'>, seed: int = 0, shuffle_rows: bool = False, shuffle_columns: bool = False)¶
Bases:
SerializeTable
Markdown Table Serializer.
Markdown table format, used primarily on GitHub. Format:
|col1|col2|col3|
|---|---|---|
|A|4|1|
|I|2|1|
…
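A short sketch of producing this pipe-delimited layout from the standard table format follows. It is illustrative only; `serialize_as_markdown` is a hypothetical name, and the sketch does not escape pipe characters inside cells.

```python
from typing import Any, Dict


def serialize_as_markdown(table: Dict[str, Any]) -> str:
    """Render a table as a compact Markdown table (hypothetical helper)."""
    header = table["header"]
    lines = [
        "|" + "|".join(str(c) for c in header) + "|",
        "|" + "|".join("---" for _ in header) + "|",  # separator row
    ]
    for row in table["rows"]:
        lines.append("|" + "|".join(str(v) for v in row) + "|")
    return "\n".join(lines)


table = {"header": ["col1", "col2", "col3"], "rows": [["A", 4, 1], ["I", 2, 1]]}
markdown_text = serialize_as_markdown(table)
print(markdown_text)
```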
- class unitxt.struct_data_operators.SerializeTableRowAsList(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)¶
Bases:
InstanceOperator
Serializes a table row as a list.
- Parameters:
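fields (str) – fields to include in the serialization
to_field (str) – name of the field that receives the serialized row
max_cell_length (int) – optional limit on the length of a cell value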
- class unitxt.struct_data_operators.SerializeTableRowAsText(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)¶
Bases:
InstanceOperator
Serializes a table row as text.
- Parameters:
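fields (str) – fields to include in the serialization
to_field (str) – name of the field that receives the serialized text
max_cell_length (int) – optional limit on the length of a cell value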
- class unitxt.struct_data_operators.SerializeTriples(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Serializes triples into a flat sequence.
Sample input in expected format: [["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"], ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"]]
Sample output: First Clearing : LOCATION : On NYS 52 1 Mi. Youngsville | On NYS 52 1 Mi. Youngsville : CITY_OR_TOWN : Callicoon, New York
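In the sample, the parts of a triple are joined with " : " and the triples with " | ". A plain-Python sketch of that pattern, using the hypothetical helper name `serialize_triples`:

```python
from typing import List


def serialize_triples(triples: List[List[str]]) -> str:
    """Join triple parts with " : " and triples with " | " (hypothetical helper)."""
    return " | ".join(" : ".join(str(part) for part in triple) for triple in triples)


triples = [
    ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
    ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
]
flat_triples = serialize_triples(triples)
print(flat_triples)
```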
- class unitxt.struct_data_operators.ShuffleTableColumns(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Shuffles the table columns randomly.
Sample input:
{
    "header": ["name", "age"],
    "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]
}
Sample output:
{
    "header": ["age", "name"],
    "rows": [[26, "Alex"], [34, "Raj"], [39, "Donald"]]
}
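The key point of a column shuffle is that the header and every row must be permuted with the same column order. The sketch below illustrates this in plain Python; `shuffle_columns` and its `seed` argument are hypothetical and are not taken from the operator's signature.

```python
import random
from typing import Any, Dict


def shuffle_columns(table: Dict[str, Any], seed: int = 0) -> Dict[str, Any]:
    """Permute header and rows with one shared column order (hypothetical helper)."""
    order = list(range(len(table["header"])))
    random.Random(seed).shuffle(order)  # seeded only to keep the sketch reproducible
    return {
        "header": [table["header"][i] for i in order],
        "rows": [[row[i] for i in order] for row in table["rows"]],
    }


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]}
shuffled = shuffle_columns(table)
print(shuffled)
```

Whatever order is drawn, each value stays under its original column name.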
- class unitxt.struct_data_operators.ShuffleTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Shuffles the input table rows randomly.
Sample input:
{
    "header": ["name", "age"],
    "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]
}
Sample output:
{
    "header": ["name", "age"],
    "rows": [["Donald", 39], ["Raj", 34], ["Alex", 26]]
}
- class unitxt.struct_data_operators.TransposeTable(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False)¶
Bases:
FieldOperator
Transposes a table.
Sample input:
{
    "header": ["name", "age", "sex"],
    "rows": [["Alice", 26, "F"], ["Raj", 34, "M"], ["Donald", 39, "M"]]
}
Sample output:
{
    "header": [" ", "0", "1", "2"],
    "rows": [["name", "Alice", "Raj", "Donald"], ["age", 26, 34, 39], ["sex", "F", "M", "M"]]
}
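Read from the sample, the transposed table uses a blank first header cell followed by the original row indices as strings, and each original column becomes a row that starts with its column name. A plain-Python sketch of that transformation, with the hypothetical helper name `transpose_table`:

```python
from typing import Any, Dict


def transpose_table(table: Dict[str, Any]) -> Dict[str, Any]:
    """Turn columns into rows, as in the sample above (hypothetical helper)."""
    header, rows = table["header"], table["rows"]
    # New header: a blank cell, then the original row indices as strings.
    new_header = [" "] + [str(i) for i in range(len(rows))]
    # Each new row: the column name followed by that column's values.
    new_rows = [[name] + [row[c] for row in rows] for c, name in enumerate(header)]
    return {"header": new_header, "rows": new_rows}


table = {
    "header": ["name", "age", "sex"],
    "rows": [["Alice", 26, "F"], ["Raj", 34, "M"], ["Donald", 39, "M"]],
}
transposed = transpose_table(table)
print(transposed)
```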
- class unitxt.struct_data_operators.TruncateTableCells(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, max_length: int = 15, table: str = None, text_output: str | None = None)¶
Bases:
InstanceOperator
Limits the maximum length of cell values in a table to reduce the overall length.
- Parameters:
max_length (int) – maximum allowed length of a cell value
For tasks that produce a cell value as the answer, truncating a cell value should be replicated by truncating the corresponding answer as well. This has been addressed in the implementation.
- class unitxt.struct_data_operators.TruncateTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool | None = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, rows_to_keep: int = 10)¶
Bases:
FieldOperator
Limits the number of table rows to a specified maximum by removing randomly selected excess rows.
- Parameters:
rows_to_keep (int) – number of rows to keep
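The row limit can be illustrated with a small plain-Python sketch. It is not the operator's code: `truncate_rows` and its `seed` argument are hypothetical, and keeping the surviving rows in their original order is an assumption.

```python
import random
from typing import Any, Dict


def truncate_rows(table: Dict[str, Any], rows_to_keep: int = 10, seed: int = 0) -> Dict[str, Any]:
    """Keep at most rows_to_keep rows, chosen at random (hypothetical helper)."""
    rows = table["rows"]
    if len(rows) <= rows_to_keep:
        kept = list(range(len(rows)))  # nothing to remove
    else:
        # Assumption: surviving rows keep their original relative order.
        kept = sorted(random.Random(seed).sample(range(len(rows)), rows_to_keep))
    return {"header": list(table["header"]), "rows": [list(rows[i]) for i in kept]}


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]}
truncated = truncate_rows(table, rows_to_keep=2)
print(truncated)
```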
- unitxt.struct_data_operators.truncate_cell(cell_value, max_len)¶