Unitxt: streamlining data processing
Unitxt is a Python library for preparing data for training, evaluation, and inference of language models. It provides a set of reusable building blocks and a methodology for defining datasets and metrics.
In one line of code, it prepares a dataset or a mixture of datasets into an input-output format for training and evaluation. Our aim is to be simple, adaptable, and transparent.
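As a rough illustration of the "one line of code" idea, the sketch below assembles a recipe string from a catalog card name and passes it to a loader. The helper `build_recipe_query` is hypothetical and exists only for this example. The `card=...,template_item=...` query format, the `template_item` option, and the `unitxt/data` dataset name are assumptions not stated on this page, so check them against the Loading Datasets section before relying on them. The card name `cards.wnli` is assumed to correspond to the `wnli` entry listed under cards in the Catalog.

```python
def build_recipe_query(card: str, **options) -> str:
    """Join a card name and extra options into a comma-separated key=value string.

    Hypothetical helper for illustration; not part of Unitxt.
    """
    parts = [f"card={card}"]
    parts.extend(f"{key}={value}" for key, value in options.items())
    return ",".join(parts)


# Build the recipe string for the (assumed) wnli card with its first template.
query = build_recipe_query("cards.wnli", template_item=0)
print(query)  # card=cards.wnli,template_item=0

# Assumed one-line load call (needs the `datasets` package and network access,
# so it is left commented out here):
# from datasets import load_dataset
# dataset = load_dataset("unitxt/data", query)
```

Keeping the recipe as a single string means a whole preparation pipeline can be named, logged, and shared as one value, which fits the stated goal of transparency.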
- Introduction
- Loading Datasets
- Installation
- Adding Datasets
- Adding Stream Operators and Metrics
- Concepts
- Backend
- Operators
- unitxt
- unitxt package
- Subpackages
- Submodules
- unitxt.artifact module
- unitxt.blocks module
- unitxt.card module
- unitxt.catalog module
- unitxt.collections module
- unitxt.dataclass module
- unitxt.dataset module
- unitxt.dict_utils module
- unitxt.file_utils module
- unitxt.formats module
- unitxt.fusion module
- unitxt.generator_utils module
- unitxt.hf_utils module
- unitxt.instructions module
- unitxt.load module
- unitxt.loaders module
- unitxt.logging module
- unitxt.metric module
- unitxt.metrics module
- unitxt.normalizers module
- unitxt.operator module
- unitxt.operators module
- unitxt.processors module
- unitxt.random_utils module
- unitxt.recipe module
- unitxt.register module
- unitxt.renderers module
- unitxt.schema module
- unitxt.split_utils module
- unitxt.splitters module
- unitxt.standard module
- unitxt.stream module
- unitxt.task module
- unitxt.templates module
- unitxt.text_utils module
- unitxt.type_utils module
- unitxt.utils module
- unitxt.validate module
- unitxt.version module
- Module contents
- Catalog
- recipes
- templates
- operators
- cards
- mlsum
- yahoo_answers_topics
- 20_newsgroups
- banking77
- almostEvilML_qa_by_lang
- hellaswag
- mmlu
- management
- professional_law
- moral_scenarios
- professional_psychology
- international_law
- medical_genetics
- high_school_microeconomics
- high_school_biology
- professional_accounting
- marketing
- high_school_chemistry
- human_sexuality
- jurisprudence
- high_school_world_history
- high_school_macroeconomics
- college_medicine
- human_aging
- business_ethics
- abstract_algebra
- formal_logic
- prehistory
- moral_disputes
- econometrics
- college_computer_science
- college_biology
- sociology
- college_chemistry
- computer_security
- logical_fallacies
- virology
- college_physics
- us_foreign_policy
- public_relations
- high_school_psychology
- security_studies
- global_facts
- elementary_mathematics
- anatomy
- astronomy
- miscellaneous
- professional_medicine
- high_school_physics
- conceptual_physics
- electrical_engineering
- high_school_european_history
- philosophy
- high_school_government_and_politics
- college_mathematics
- nutrition
- high_school_mathematics
- machine_learning
- world_religions
- high_school_us_history
- clinical_knowledge
- high_school_statistics
- high_school_computer_science
- high_school_geography
- amazon_mass
- am_ET
- ja_JP
- da_DK
- sv_SE
- fi_FI
- zh_CN
- id_ID
- jv_ID
- tl_PH
- af_ZA
- he_IL
- az_AZ
- all
- ca_ES
- ur_PK
- de_DE
- hy_AM
- my_MM
- ka_GE
- zh_TW
- ro_RO
- te_IN
- fr_FR
- tr_TR
- km_KH
- ml_IN
- th_TH
- mn_MN
- all_1
- lv_LV
- is_IS
- ms_MY
- ar_SA
- nl_NL
- pt_PT
- sw_KE
- hu_HU
- hi_IN
- sl_SL
- cy_GB
- fa_IR
- es_ES
- ru_RU
- vi_VN
- bn_BD
- el_GR
- en_US
- kn_IN
- sq_AL
- ta_IN
- it_IT
- nb_NO
- ko_KR
- pl_PL
- reuters21578
- argument_topic
- rte
- unfair_tos
- sciq
- financial_tweets
- openbook_qa
- wmt_en_de
- wnli
- ag_news
- race_middle
- cola
- head_qa
- cnn_dailymail
- ethos_binary
- ai2_arc
- medical_abstracts
- stsb
- claim_stance_topic
- qqp
- almostEvilML_qa
- mrpc
- qnli
- race_all
- law_stack_exchange
- piqa_all
- wmt_en_fr
- belebele
- acm_Arab
- shn_Mymr
- pes_Arab
- srp_Cyrl
- mya_Mymr
- guj_Gujr
- sin_Sinh
- tam_Taml
- dan_Latn
- lin_Latn
- kaz_Cyrl
- mal_Mlym
- ceb_Latn
- kan_Knda
- ars_Arab
- hye_Armn
- ben_Latn
- mkd_Cyrl
- sot_Latn
- arz_Arab
- plt_Latn
- hin_Latn
- afr_Latn
- ckb_Arab
- hau_Latn
- fra_Latn
- lit_Latn
- mlt_Latn
- asm_Beng
- kir_Cyrl
- zul_Latn
- tir_Ethi
- nya_Latn
- snd_Arab
- ukr_Cyrl
- nob_Latn
- tgl_Latn
- nld_Latn
- swe_Latn
- zho_Hant
- lvs_Latn
- als_Latn
- ory_Orya
- fin_Latn
- ibo_Latn
- sin_Latn
- isl_Latn
- lug_Latn
- cat_Latn
- bod_Tibt
- slv_Latn
- kac_Latn
- zho_Hans
- mri_Latn
- ssw_Latn
- som_Latn
- ary_Arab
- sna_Latn
- npi_Deva
- nso_Latn
- apc_Arab
- kin_Latn
- tgk_Cyrl
- ita_Latn
- hat_Latn
- wol_Latn
- tel_Telu
- xho_Latn
- azj_Latn
- heb_Hebr
- pan_Guru
- uzn_Latn
- pol_Latn
- luo_Latn
- tha_Thai
- vie_Latn
- grn_Latn
- sun_Latn
- tur_Latn
- amh_Ethi
- fuv_Latn
- rus_Cyrl
- deu_Latn
- arb_Arab
- urd_Latn
- ilo_Latn
- ces_Latn
- swh_Latn
- bul_Cyrl
- tsn_Latn
- jav_Latn
- bam_Latn
- por_Latn
- urd_Arab
- gaz_Latn
- hun_Latn
- hrv_Latn
- yor_Latn
- jpn_Jpan
- hin_Deva
- lao_Laoo
- eus_Latn
- est_Latn
- kor_Hang
- ben_Beng
- kat_Geor
- slk_Latn
- ell_Grek
- pbt_Arab
- khk_Cyrl
- arb_Latn
- eng_Latn
- war_Latn
- tso_Latn
- kea_Latn
- ind_Latn
- zsm_Latn
- mar_Deva
- spa_Latn
- npi_Latn
- ron_Latn
- khm_Khmr
- wmt_en_ro
- winogrande
- wmt
- xlsum
- gujarati
- chinese_traditional
- vietnamese
- nepali
- pashto
- russian
- korean
- punjabi
- thai
- persian
- serbian_cyrillic
- kirundi
- indonesian
- serbian_latin
- turkish
- oromo
- welsh
- burmese
- hausa
- urdu
- bengali
- portuguese
- french
- marathi
- ukrainian
- tigrinya
- azerbaijani
- tamil
- igbo
- swahili
- hindi
- yoruba
- chinese_simplified
- somali
- english
- japanese
- uzbek
- arabic
- telugu
- sinhala
- kyrgyz
- amharic
- pidgin
- spanish
- scottish_gaelic
- clinc_oos
- sst2
- squad
- piqa_high
- mnli
- boolq
- openbookQA
- xnli
- copa
- wsc
- piqa
- dbpedia_14
- xwinogrande
- ledgar
- race_high
- piqa_middle
- instructions
- formats
- tasks
- metrics
- sentence_bert
- f1_weighted
- reward
- token_overlap
- wer
- rouge_with_confidence_intervals
- f1_micro_multi_label
- spearman
- token_overlap_with_context
- char_edit_dist_accuracy
- rouge
- ner
- accuracy
- retrieval_at_k
- string_containment
- f1_macro
- ndcg
- sacrebleu
- squad
- mrr
- map
- f1_macro_multi_label
- normalized_sacrebleu
- bleu
- bert_score
- matthews_correlation
- f1_micro
- augmentors
- benchmarks
- splitters
- processors
- first_character
- to_string
- to_list_by_comma
- load_json
- list_to_empty_entity_tuples
- hate_speech_or_not_hate_speech
- string_or_not_hate
- take_first_non_empty_line
- toxic_or_not_toxic
- lower_case_till_punc
- to_span_label_pairs_surface_only
- to_pairs
- convert_to_boolean
- dict_of_lists_to_value_key_pairs
- lower_case
- to_span_label_pairs
- to_string_stripped