

Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking.








How would you like to start?











Why Unitxt?


Unitxt was built by IBM Research to host a large, maintainable collection of evaluation assets. If you care about robust evaluations that last, Unitxt is for you.














The Unitxt catalog is a one-stop shop of well-documented assets for constructing robust evaluation pipelines, including task instructions, data loaders, data type serializers, and inference engines.


Evaluation Tasks: 64
LLM-Ready Datasets: 3,174
Prompts: 342
Metrics: 462
Custom Benchmarks: 6












End to End evaluation made simple



unitxt_example.py
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
