Unitxt: streamlining data processing
Unitxt is a Python library for preparing data for training, evaluation, and inference of language models. It provides reusable building blocks and a methodology for defining datasets and metrics.
In one line of code, it turns a dataset or a mixture of datasets into an input-output format for training and evaluation. We aim for it to be simple, adaptable, and transparent.
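As a rough illustration of how the catalog entries listed below might be referenced, here is a minimal sketch in plain Python. It does not import Unitxt. The helpers `catalog_name` and `recipe_query` are hypothetical. The dotted naming (`cards.wnli`), the location of `wnli.json` directly under `cards`, and the `key=value,key=value` query format with a `template_card_index` field are all assumptions; check them against the Loading Datasets page.

```python
from pathlib import PurePosixPath


def catalog_name(path: str) -> str:
    """Turn a catalog file path into a dotted name (assumed convention).

    Example: 'cards/mmlu/management.json' becomes 'cards.mmlu.management'.
    """
    p = PurePosixPath(path)
    # Drop the '.json' suffix, then join the path components with dots.
    return ".".join(p.with_suffix("").parts)


def recipe_query(**fields: object) -> str:
    """Join keyword arguments into a 'key=value,key=value' string (assumed format)."""
    return ",".join(f"{key}={value}" for key, value in fields.items())


# Hypothetical example: reference the 'wnli' card and pick its first template.
card = catalog_name("cards/wnli.json")
query = recipe_query(card=card, template_card_index=0)
print(query)  # prints: card=cards.wnli,template_card_index=0
```

A string like this is the kind of single-line description that the loading call documented under Loading Datasets would consume. The exact call and its parameters are not shown here because they depend on the installed version.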
- Introduction
- Loading Datasets
- Installation
- Adding Datasets
- Adding Stream Operators and Metrics
- Concepts
- Backend
- Operators
- Contributors Guide
- unitxt
- unitxt package
- Subpackages
- Submodules
- unitxt.artifact module
- unitxt.blocks module
- unitxt.card module
- unitxt.catalog module
- unitxt.collections module
- unitxt.dataclass module
- unitxt.dataset module
- unitxt.dict_utils module
- unitxt.file_utils module
- unitxt.formats module
- unitxt.fusion module
- unitxt.generator_utils module
- unitxt.hf_utils module
- unitxt.instructions module
- unitxt.load module
- unitxt.loaders module
- unitxt.logging_utils module
- unitxt.metric module
- unitxt.metrics module
- unitxt.normalizers module
- unitxt.operator module
- unitxt.operators module
- unitxt.processors module
- unitxt.random_utils module
- unitxt.recipe module
- unitxt.register module
- unitxt.schema module
- unitxt.split_utils module
- unitxt.splitters module
- unitxt.standard module
- unitxt.stream module
- unitxt.task module
- unitxt.templates module
- unitxt.text_utils module
- unitxt.type_utils module
- unitxt.utils module
- unitxt.validate module
- unitxt.version module
- Module contents
- Catalog
- recipes
- templates
- operators
- cards
- mlsum
- yahoo_answers_topics.json
- 20_newsgroups.json
- banking77.json
- almostEvilML_qa_by_lang
- hellaswag.json
- mmlu
- management.json
- professional_law.json
- moral_scenarios.json
- professional_psychology.json
- international_law.json
- medical_genetics.json
- high_school_microeconomics.json
- high_school_biology.json
- professional_accounting.json
- marketing.json
- high_school_chemistry.json
- human_sexuality.json
- jurisprudence.json
- high_school_world_history.json
- high_school_macroeconomics.json
- college_medicine.json
- human_aging.json
- business_ethics.json
- abstract_algebra.json
- formal_logic.json
- prehistory.json
- moral_disputes.json
- econometrics.json
- college_computer_science.json
- college_biology.json
- sociology.json
- college_chemistry.json
- computer_security.json
- logical_fallacies.json
- virology.json
- college_physics.json
- us_foreign_policy.json
- public_relations.json
- high_school_psychology.json
- security_studies.json
- global_facts.json
- elementary_mathematics.json
- anatomy.json
- astronomy.json
- miscellaneous.json
- professional_medicine.json
- high_school_physics.json
- conceptual_physics.json
- electrical_engineering.json
- high_school_european_history.json
- philosophy.json
- high_school_government_and_politics.json
- college_mathematics.json
- nutrition.json
- high_school_mathematics.json
- machine_learning.json
- world_religions.json
- high_school_us_history.json
- clinical_knowledge.json
- high_school_statistics.json
- high_school_computer_science.json
- high_school_geography.json
- amazon_mass
- am_ET.json
- ja_JP.json
- da_DK.json
- sv_SE.json
- fi_FI.json
- zh_CN.json
- id_ID.json
- jv_ID.json
- tl_PH.json
- af_ZA.json
- he_IL.json
- az_AZ.json
- all.json
- ca_ES.json
- ur_PK.json
- de_DE.json
- hy_AM.json
- my_MM.json
- ka_GE.json
- zh_TW.json
- ro_RO.json
- te_IN.json
- fr_FR.json
- tr_TR.json
- km_KH.json
- ml_IN.json
- th_TH.json
- mn_MN.json
- all_1
- lv_LV.json
- is_IS.json
- ms_MY.json
- ar_SA.json
- nl_NL.json
- pt_PT.json
- sw_KE.json
- hu_HU.json
- hi_IN.json
- sl_SL.json
- cy_GB.json
- fa_IR.json
- es_ES.json
- ru_RU.json
- vi_VN.json
- bn_BD.json
- el_GR.json
- en_US.json
- kn_IN.json
- sq_AL.json
- ta_IN.json
- it_IT.json
- nb_NO.json
- ko_KR.json
- pl_PL.json
- reuters21578
- argument_topic.json
- rte.json
- unfair_tos.json
- sciq.json
- financial_tweets.json
- openbook_qa.json
- wmt_en_de.json
- wnli.json
- ag_news.json
- race_middle.json
- cola.json
- head_qa
- cnn_dailymail.json
- ethos_binary.json
- ai2_arc
- medical_abstracts.json
- stsb.json
- claim_stance_topic.json
- qqp.json
- almostEvilML_qa.json
- mrpc.json
- qnli.json
- race_all.json
- law_stack_exchange.json
- piqa_all.json
- wmt_en_fr.json
- belebele
- acm_Arab.json
- shn_Mymr.json
- pes_Arab.json
- srp_Cyrl.json
- mya_Mymr.json
- guj_Gujr.json
- sin_Sinh.json
- tam_Taml.json
- dan_Latn.json
- lin_Latn.json
- kaz_Cyrl.json
- mal_Mlym.json
- ceb_Latn.json
- kan_Knda.json
- ars_Arab.json
- hye_Armn.json
- ben_Latn.json
- mkd_Cyrl.json
- sot_Latn.json
- arz_Arab.json
- plt_Latn.json
- hin_Latn.json
- afr_Latn.json
- ckb_Arab.json
- hau_Latn.json
- fra_Latn.json
- lit_Latn.json
- mlt_Latn.json
- asm_Beng.json
- kir_Cyrl.json
- zul_Latn.json
- tir_Ethi.json
- nya_Latn.json
- snd_Arab.json
- ukr_Cyrl.json
- nob_Latn.json
- tgl_Latn.json
- nld_Latn.json
- swe_Latn.json
- zho_Hant.json
- lvs_Latn.json
- als_Latn.json
- ory_Orya.json
- fin_Latn.json
- ibo_Latn.json
- sin_Latn.json
- isl_Latn.json
- lug_Latn.json
- cat_Latn.json
- bod_Tibt.json
- slv_Latn.json
- kac_Latn.json
- zho_Hans.json
- mri_Latn.json
- ssw_Latn.json
- som_Latn.json
- ary_Arab.json
- sna_Latn.json
- npi_Deva.json
- nso_Latn.json
- apc_Arab.json
- kin_Latn.json
- tgk_Cyrl.json
- ita_Latn.json
- hat_Latn.json
- wol_Latn.json
- tel_Telu.json
- xho_Latn.json
- azj_Latn.json
- heb_Hebr.json
- pan_Guru.json
- uzn_Latn.json
- pol_Latn.json
- luo_Latn.json
- tha_Thai.json
- vie_Latn.json
- grn_Latn.json
- sun_Latn.json
- tur_Latn.json
- amh_Ethi.json
- fuv_Latn.json
- rus_Cyrl.json
- deu_Latn.json
- arb_Arab.json
- urd_Latn.json
- ilo_Latn.json
- ces_Latn.json
- swh_Latn.json
- bul_Cyrl.json
- tsn_Latn.json
- jav_Latn.json
- bam_Latn.json
- por_Latn.json
- urd_Arab.json
- gaz_Latn.json
- hun_Latn.json
- hrv_Latn.json
- yor_Latn.json
- jpn_Jpan.json
- hin_Deva.json
- lao_Laoo.json
- eus_Latn.json
- est_Latn.json
- kor_Hang.json
- ben_Beng.json
- kat_Geor.json
- slk_Latn.json
- ell_Grek.json
- pbt_Arab.json
- khk_Cyrl.json
- arb_Latn.json
- eng_Latn.json
- war_Latn.json
- tso_Latn.json
- kea_Latn.json
- ind_Latn.json
- zsm_Latn.json
- mar_Deva.json
- spa_Latn.json
- npi_Latn.json
- ron_Latn.json
- khm_Khmr.json
- wmt_en_ro.json
- winogrande
- wmt
- xlsum
- gujarati.json
- chinese_traditional.json
- vietnamese.json
- nepali.json
- pashto.json
- russian.json
- korean.json
- punjabi.json
- thai.json
- persian.json
- serbian_cyrillic.json
- kirundi.json
- indonesian.json
- serbian_latin.json
- turkish.json
- oromo.json
- welsh.json
- burmese.json
- hausa.json
- urdu.json
- bengali.json
- portuguese.json
- french.json
- marathi.json
- ukrainian.json
- tigrinya.json
- azerbaijani.json
- tamil.json
- igbo.json
- swahili.json
- hindi.json
- yoruba.json
- chinese_simplified.json
- somali.json
- english.json
- japanese.json
- uzbek.json
- arabic.json
- telugu.json
- sinhala.json
- kyrgyz.json
- amharic.json
- pidgin.json
- spanish.json
- scottish_gaelic.json
- clinc_oos
- sst2.json
- squad.json
- piqa_high.json
- mnli.json
- boolq.json
- openbookQA.json
- xnli
- copa.json
- wsc.json
- piqa.json
- dbpedia_14.json
- xwinogrande
- ledgar.json
- race_high.json
- piqa_middle.json
- instructions
- formats
- tasks
- metrics
- sentence_bert
- f1_weighted.json
- reward
- token_overlap.json
- wer.json
- rouge_with_confidence_intervals.json
- f1_micro_multi_label.json
- precision_micro_multi_label.json
- spearman.json
- kpa.json
- token_overlap_with_context.json
- char_edit_dist_accuracy.json
- precision_macro_multi_label.json
- rouge.json
- ner.json
- accuracy.json
- retrieval_at_k.json
- string_containment.json
- f1_macro.json
- recall_micro_multi_label.json
- ndcg.json
- sacrebleu.json
- squad.json
- mrr.json
- map.json
- f1_macro_multi_label.json
- normalized_sacrebleu.json
- bleu.json
- bert_score
- matthews_correlation.json
- recall_macro_multi_label.json
- f1_micro.json
- augmentors
- benchmarks
- splitters
- processors
- first_character.json
- to_string.json
- to_list_by_comma.json
- load_json.json
- list_to_empty_entity_tuples.json
- to_yes_or_none.json
- stance_to_pro_con.json
- hate_speech_or_not_hate_speech.json
- yes_no_to_int.json
- string_or_not_hate.json
- take_first_non_empty_line.json
- toxic_or_not_toxic.json
- lower_case_till_punc.json
- to_span_label_pairs_surface_only.json
- to_pairs.json
- take_first_word.json
- convert_to_boolean.json
- dict_of_lists_to_value_key_pairs.json
- lower_case.json
- to_span_label_pairs.json
- to_string_stripped.json