Unitxt: streamlining data processing
Unitxt is a Python library for preparing data for training, evaluation, and inference of language models. It provides a set of reusable building blocks and a methodology for defining datasets and metrics.
In one line of code, it converts a dataset or a mixture of datasets into an input-output format for training and evaluation. Our aspiration is to be simple, adaptable, and transparent.
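To make "input-output format" concrete, here is a minimal, standard-library-only sketch of the idea: a raw dataset instance is rendered through a prompt template into a model input and a reference output. It does not use the Unitxt API. The instance fields, the template text, the `to_input_output` helper, and the `source`/`target` key names are all assumptions chosen for illustration.

```python
from string import Template

# Hypothetical raw instance, shaped like a textual-entailment example.
raw_instance = {
    "premise": "The trophy did not fit in the suitcase because it was too big.",
    "hypothesis": "The trophy was too big.",
    "label": "entailment",
}

# Hypothetical prompt template (plain string.Template, not Unitxt's template syntax).
INPUT_TEMPLATE = Template(
    "Premise: $premise\nHypothesis: $hypothesis\nDoes the premise entail the hypothesis?"
)


def to_input_output(instance: dict) -> dict:
    """Render one raw instance into a model input and a reference output."""
    return {
        # Text the model would receive.
        "source": INPUT_TEMPLATE.substitute(
            premise=instance["premise"], hypothesis=instance["hypothesis"]
        ),
        # Text the model is expected to produce.
        "target": instance["label"],
    }


example = to_input_output(raw_instance)
print(example["source"])
print(example["target"])
```

In this sketch the template is the only task-specific piece, so the same raw data could be rendered into a different prompt style by swapping the template. That is the kind of reuse the building blocks described above are meant to enable.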
- Introduction
- Loading Datasets
- Installation
- Adding Datasets
- Adding Stream Operators and Metrics
- Components
- Backend
- Operators
- Contributors Guide
- unitxt
- unitxt package
- Subpackages
- Submodules
- unitxt.api module
- unitxt.artifact module
- unitxt.blocks module
- unitxt.card module
- unitxt.catalog module
- unitxt.collections module
- unitxt.dataclass module
- unitxt.dataset module
- unitxt.dataset_utils module
- unitxt.dict_utils module
- unitxt.eval_utils module
- unitxt.file_utils module
- unitxt.formats module
- unitxt.fusion module
- unitxt.generator_utils module
- unitxt.hf_utils module
- unitxt.instructions module
- unitxt.loaders module
- unitxt.logging_utils module
- unitxt.metric module
- unitxt.metric_utils module
- unitxt.metrics module
- unitxt.normalizers module
- unitxt.operator module
- unitxt.operators module
- unitxt.processors module
- unitxt.random_utils module
- unitxt.recipe module
- unitxt.register module
- unitxt.schema module
- unitxt.split_utils module
- unitxt.splitters module
- unitxt.standard module
- unitxt.stream module
- unitxt.task module
- unitxt.templates module
- unitxt.text_utils module
- unitxt.type_utils module
- unitxt.utils module
- unitxt.validate module
- unitxt.version module
- Module contents
- Catalog
- Augmentors
- Benchmarks
- Cards
- CFPB
- Ai2_arc
- AlmostEvilML_qa_by_lang
- Amazon_mass
- All_1
- af_ZA
- all
- am_ET
- ar_SA
- az_AZ
- bn_BD
- ca_ES
- cy_GB
- da_DK
- de_DE
- el_GR
- en_US
- es_ES
- fa_IR
- fi_FI
- fr_FR
- he_IL
- hi_IN
- hu_HU
- hy_AM
- id_ID
- is_IS
- it_IT
- ja_JP
- jv_ID
- ka_GE
- km_KH
- kn_IN
- ko_KR
- lv_LV
- ml_IN
- mn_MN
- ms_MY
- my_MM
- nb_NO
- nl_NL
- pl_PL
- pt_PT
- ro_RO
- ru_RU
- sl_SL
- sq_AL
- sv_SE
- sw_KE
- ta_IN
- te_IN
- th_TH
- tl_PH
- tr_TR
- ur_PK
- vi_VN
- zh_CN
- zh_TW
- Belebele
- acm_Arab
- afr_Latn
- als_Latn
- amh_Ethi
- apc_Arab
- arb_Arab
- arb_Latn
- ars_Arab
- ary_Arab
- arz_Arab
- asm_Beng
- azj_Latn
- bam_Latn
- ben_Beng
- ben_Latn
- bod_Tibt
- bul_Cyrl
- cat_Latn
- ceb_Latn
- ces_Latn
- ckb_Arab
- dan_Latn
- deu_Latn
- ell_Grek
- eng_Latn
- est_Latn
- eus_Latn
- fin_Latn
- fra_Latn
- fuv_Latn
- gaz_Latn
- grn_Latn
- guj_Gujr
- hat_Latn
- hau_Latn
- heb_Hebr
- hin_Deva
- hin_Latn
- hrv_Latn
- hun_Latn
- hye_Armn
- ibo_Latn
- ilo_Latn
- ind_Latn
- isl_Latn
- ita_Latn
- jav_Latn
- jpn_Jpan
- kac_Latn
- kan_Knda
- kat_Geor
- kaz_Cyrl
- kea_Latn
- khk_Cyrl
- khm_Khmr
- kin_Latn
- kir_Cyrl
- kor_Hang
- lao_Laoo
- lin_Latn
- lit_Latn
- lug_Latn
- luo_Latn
- lvs_Latn
- mal_Mlym
- mar_Deva
- mkd_Cyrl
- mlt_Latn
- mri_Latn
- mya_Mymr
- nld_Latn
- nob_Latn
- npi_Deva
- npi_Latn
- nso_Latn
- nya_Latn
- ory_Orya
- pan_Guru
- pbt_Arab
- pes_Arab
- plt_Latn
- pol_Latn
- por_Latn
- ron_Latn
- rus_Cyrl
- shn_Mymr
- sin_Latn
- sin_Sinh
- slk_Latn
- slv_Latn
- sna_Latn
- snd_Arab
- som_Latn
- sot_Latn
- spa_Latn
- srp_Cyrl
- ssw_Latn
- sun_Latn
- swe_Latn
- swh_Latn
- tam_Taml
- tel_Telu
- tgk_Cyrl
- tgl_Latn
- tha_Thai
- tir_Ethi
- tsn_Latn
- tso_Latn
- tur_Latn
- ukr_Cyrl
- urd_Arab
- urd_Latn
- uzn_Latn
- vie_Latn
- war_Latn
- wol_Latn
- xho_Latn
- yor_Latn
- zho_Hans
- zho_Hant
- zsm_Latn
- zul_Latn
- Clinc_oos
- Head_qa
- Mlsum
- Mmlu
- abstract_algebra
- anatomy
- astronomy
- business_ethics
- clinical_knowledge
- college_biology
- college_chemistry
- college_computer_science
- college_mathematics
- college_medicine
- college_physics
- computer_security
- conceptual_physics
- econometrics
- electrical_engineering
- elementary_mathematics
- formal_logic
- global_facts
- high_school_biology
- high_school_chemistry
- high_school_computer_science
- high_school_european_history
- high_school_geography
- high_school_government_and_politics
- high_school_macroeconomics
- high_school_mathematics
- high_school_microeconomics
- high_school_physics
- high_school_psychology
- high_school_statistics
- high_school_us_history
- high_school_world_history
- human_aging
- human_sexuality
- international_law
- jurisprudence
- logical_fallacies
- machine_learning
- management
- marketing
- medical_genetics
- miscellaneous
- moral_disputes
- moral_scenarios
- nutrition
- philosophy
- prehistory
- professional_accounting
- professional_law
- professional_medicine
- professional_psychology
- public_relations
- security_studies
- sociology
- us_foreign_policy
- virology
- world_religions
- Multidoc2dial
- Reuters21578
- Winogrande
- Wmt
- Xlsum
- amharic
- arabic
- azerbaijani
- bengali
- burmese
- chinese_simplified
- chinese_traditional
- english
- french
- gujarati
- hausa
- hindi
- igbo
- indonesian
- japanese
- kirundi
- korean
- kyrgyz
- marathi
- nepali
- oromo
- pashto
- persian
- pidgin
- portuguese
- punjabi
- russian
- scottish_gaelic
- serbian_cyrillic
- serbian_latin
- sinhala
- somali
- spanish
- swahili
- tamil
- telugu
- thai
- tigrinya
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- yoruba
- Xnli
- Xwinogrande
- 20_newsgroups
- ag_news
- almostEvilML_qa
- argument_topic
- atta_q
- banking77
- bold
- boolq
- claim_stance_topic
- cnn_dailymail
- cola
- copa
- dbpedia_14
- ethos_binary
- financial_tweets
- hellaswag
- law_stack_exchange
- ledgar
- mbpp
- medical_abstracts
- mnli
- mrpc
- openbookQA
- openbook_qa
- piqa
- piqa_all
- piqa_high
- piqa_middle
- pop_qa
- qnli
- qqp
- race_all
- race_high
- race_middle
- rte
- sciq
- squad
- sst2
- stsb
- toxigen
- unfair_tos
- wmt_en_de
- wmt_en_fr
- wmt_en_ro
- wnli
- wsc
- xsum
- yahoo_answers_topics
- Formats
- Instructions
- Metrics
- Bert_score
- Perplexity
- Perplexity_a
- Perplexity_chat
- Perplexity_q
- Rag
- Reward
- Sentence_bert
- accuracy
- bleu
- char_edit_dist_accuracy
- f1_macro
- f1_macro_multi_label
- f1_micro
- f1_micro_multi_label
- f1_weighted
- kpa
- map
- matthews_correlation
- mrr
- ndcg
- ner
- normalized_sacrebleu
- precision_macro_multi_label
- precision_micro_multi_label
- recall_macro_multi_label
- recall_micro_multi_label
- regard
- retrieval_at_k
- rouge
- rouge_with_confidence_intervals
- sacrebleu
- safety
- spearman
- squad
- string_containment
- token_overlap
- token_overlap_with_context
- wer
- Operators
- Processors
- convert_to_boolean
- dict_of_lists_to_value_key_pairs
- first_character
- hate_speech_or_not_hate_speech
- list_to_empty_entity_tuples
- load_json
- lower_case
- lower_case_till_punc
- stance_to_pro_con
- string_or_not_hate
- take_first_non_empty_line
- take_first_word
- to_list_by_comma
- to_pairs
- to_span_label_pairs
- to_span_label_pairs_surface_only
- to_string
- to_string_stripped
- to_yes_or_none
- toxic_or_not_toxic
- yes_no_to_int
- Recipes
- Splitters
- Tasks
- Templates