Here you will find complete coding samples showing how to perform different tasks using Unitxt. Each example comes with a self-contained Python file that you can run and later modify.
Demonstrates how to evaluate an existing entailment dataset using Unitxt. Unitxt is used to load the dataset, prepare the inputs to the model, run inference, and evaluate the results.
Related documentation: :ref:`Installation <install_unitxt>` , :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.
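A minimal sketch of this flow, assuming a recent Unitxt version (the model and generation settings are illustrative):

.. code-block:: python

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    # Load WNLI with the relation classification template; limit instances for a quick run.
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        loader_limit=20,
        split="test",
    )

    # Any Unitxt inference engine can be used here; a small HF model keeps the sketch light.
    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-base", max_new_tokens=32
    )
    predictions = model.infer(dataset)

    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])  # aggregated scores over all instances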
This example demonstrates how to evaluate a user question-answering (QA) dataset in a standalone file using a user-defined task and template.
Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`.
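A self-contained sketch, assuming a recent Unitxt version that accepts card and template objects directly (the data and metric are illustrative):

.. code-block:: python

    from unitxt import evaluate, load_dataset
    from unitxt.blocks import InputOutputTemplate, Task, TaskCard
    from unitxt.loaders import LoadFromDictionary

    # In-memory QA data serving as the test split.
    data = {
        "test": [
            {"question": "What is the capital of Texas?", "answer": "Austin"},
            {"question": "What is the color of the sky?", "answer": "Blue"},
        ]
    }

    # User-defined task: one string input field, one string reference field.
    card = TaskCard(
        loader=LoadFromDictionary(data=data),
        task=Task(
            input_fields={"question": str},
            reference_fields={"answer": str},
            prediction_type=str,
            metrics=["metrics.accuracy"],
        ),
    )

    # User-defined template controlling how instances are verbalized.
    template = InputOutputTemplate(
        instruction="Answer the following question.",
        input_format="{question}",
        output_format="{answer}",
    )

    dataset = load_dataset(card=card, template=template, split="test")
    predictions = ["Austin", "Blue"]  # stand-in for real model outputs
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])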
This example demonstrates how to evaluate a user QA dataset using the predefined open QA task and templates. It also shows how to use preprocessing steps to align the raw input of the dataset with the predefined task fields.
Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`Open QA task in catalog <catalog.tasks.qa.open>`, :ref:`Open QA template in catalog <catalog.templates.qa.open.title>`, :ref:`Inference Engines <inference>`.
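A minimal sketch of the field alignment step (the raw field names and data are illustrative):

.. code-block:: python

    from unitxt import load_dataset
    from unitxt.blocks import TaskCard
    from unitxt.loaders import LoadFromDictionary
    from unitxt.operators import Rename

    # Raw data whose field names do not match the predefined open QA task fields.
    data = {
        "test": [
            {"query": "What is the capital of Texas?", "extracted_answers": ["Austin"]},
        ]
    }

    card = TaskCard(
        loader=LoadFromDictionary(data=data),
        # Align the raw field names with the fields of tasks.qa.open (question / answers).
        preprocess_steps=[
            Rename(field_to_field={"query": "question", "extracted_answers": "answers"})
        ],
        task="tasks.qa.open",
    )

    dataset = load_dataset(card=card, template="templates.qa.open.title", split="test")
    print(dataset[0]["source"])  # the rendered prompt for the first instance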
These examples demonstrate how to evaluate datasets of different tasks when predictions are already available and no inference is required.
Example code for classification task
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
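A minimal sketch for the classification case, where predictions produced elsewhere are evaluated directly (the template catalog name and the prediction values are illustrative):

.. code-block:: python

    from unitxt import evaluate, load_dataset

    # Load the data exactly as it was presented to the model that produced the predictions.
    dataset = load_dataset(
        card="cards.sst2",
        template="templates.classification.multi_class.default",
        max_test_instances=4,
        split="test",
    )

    # Predictions obtained from any external system; no inference engine is involved.
    predictions = ["positive", "negative", "positive", "negative"]

    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])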
This example demonstrates how to evaluate a named entity recognition task. The ground truth entities are provided as spans within the provided texts, and the model is prompted to identify these entities. The classical f1_micro, f1_macro, and per-entity-type f1 metrics are reported.
Related documentation: :ref:`Add new dataset tutorial <adding_dataset>`, :ref:`NER task in catalog <catalog.tasks.ner.all_entity_types>`, :ref:`Inference Engines <inference>`.
This example demonstrates how different templates and the number of in-context learning examples impact the performance of a model on an entailment task. It also shows how to register assets into a local catalog and reuse them.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.
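A sketch of registering a template in the local catalog and comparing in-context demonstration counts (the template wording and model are illustrative):

.. code-block:: python

    from unitxt import add_to_catalog, evaluate, load_dataset
    from unitxt.blocks import InputOutputTemplate
    from unitxt.inference import HFPipelineBasedInferenceEngine

    # Register a custom entailment template so it can be reused by catalog name.
    template = InputOutputTemplate(
        input_format="Premise: {text_a}\nHypothesis: {text_b}\nThe relation is:",
        output_format="{label}",
    )
    add_to_catalog(template, "templates.my_relation", overwrite=True)

    model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-base", max_new_tokens=8)

    # Compare the effect of the number of in-context demonstrations.
    for num_demos in [1, 3]:
        dataset = load_dataset(
            card="cards.wnli",
            template="templates.my_relation",
            num_demos=num_demos,
            demos_pool_size=10,
            loader_limit=100,
            split="test",
        )
        predictions = model.infer(dataset)
        results = evaluate(predictions=predictions, data=dataset)
        print(num_demos, results[0]["score"]["global"]["score"])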
This example demonstrates how different formats and system prompts affect the input provided to a llama3 chat model, and evaluates their impact on the obtained scores.
Related documentation: :ref:`Formatting tutorial <adding_format>`.
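A sketch of switching the format and system prompt (the format catalog entry is one of several chat formats; the system prompt text is illustrative):

.. code-block:: python

    from unitxt import load_dataset
    from unitxt.system_prompts import TextualSystemPrompt

    # Render WNLI with a llama3 chat format and an explicit system prompt.
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        format="formats.llama3_instruct",
        system_prompt=TextualSystemPrompt(
            text="You are a careful annotator of textual entailment."
        ),
        loader_limit=20,
        split="test",
    )

    # Inspect how the rendered prompt changes with the chosen format and system prompt.
    print(dataset[0]["source"])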
This example demonstrates how different methods of selecting the demonstrations in in-context learning affect the results. Three methods are considered: fixed selection of example demonstrations for all test instances, random selection of example demonstrations for each test instance, and choosing the demonstration examples most (lexically) similar to each test instance.
Related documentation: :ref:`Formatting tutorial <adding_format>`.
This example demonstrates how to evaluate a dataset using a pool of templates and a varying number of in-context learning demonstrations. It shows how to sample a template and specify the number of demonstrations for each instance from predefined lists.
Related documentation: :ref:`Templates tutorial <adding_template>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.
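A sketch of this per-instance sampling, assuming a recent Unitxt version that accepts lists for both template and num_demos (the template pool shown is illustrative):

.. code-block:: python

    from unitxt import load_dataset

    # Each instance is rendered with a template sampled from the pool and a number of
    # demonstrations sampled from the given list.
    dataset = load_dataset(
        card="cards.wnli",
        template=[
            "templates.classification.multi_class.relation.default",
            "templates.key_val",
        ],
        num_demos=[0, 1, 3],
        demos_pool_size=10,
        loader_limit=100,
        split="test",
    )
    print(dataset[0]["source"])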
This example explores the effect of long context in classification. It converts a standard multi-class classification dataset (sst2 sentiment classification), where single-sentence texts are classified one by one, into a dataset where multiple sentences are classified using a single LLM call. It compares the f1_micro of the two approaches on two models. It uses serializers to verbalize an enumerated list of multiple sentences and labels.
Related documentation: :ref:`Sst2 dataset card in catalog <catalog.cards.sst2>`, :ref:`Types and Serializers Guide <types_and_serializers>`.
This example shows how to construct a benchmark that includes multiple datasets, each with a specific template. It demonstrates how to use these templates to evaluate the datasets and aggregate the results to obtain a final score. This approach provides a comprehensive evaluation across different tasks and datasets.
Related documentation: :ref:`Benchmarks tutorial <adding_benchmark>`, :ref:`Formatting tutorial <adding_format>`, :ref:`Using the Catalog <using_catalog>`, :ref:`Inference Engines <inference>`.
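A sketch of assembling such a benchmark, assuming a recent Unitxt version (the subset names, cards, and templates are illustrative):

.. code-block:: python

    from unitxt.benchmark import Benchmark
    from unitxt.standard import DatasetRecipe

    # Each subset is a full dataset recipe with its own card and template.
    benchmark = Benchmark(
        subsets={
            "wnli": DatasetRecipe(
                card="cards.wnli",
                template="templates.classification.multi_class.relation.default",
                max_test_instances=10,
            ),
            "sst2": DatasetRecipe(
                card="cards.sst2",
                template="templates.classification.multi_class.default",
                max_test_instances=10,
            ),
        },
    )

    # The benchmark behaves like a single Unitxt data source; its test split is evaluated
    # like any other dataset, and per-subset scores are aggregated into one final score.
    test_dataset = list(benchmark()["test"])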
This example demonstrates how to use LLM-as-a-Judge with a predefined criterion, in this case answer_relevance. The Unitxt catalog has more than 40 predefined criteria for direct evaluators.
Related documentation: :ref:`Using LLM as a Judge in Unitxt <llm_as_judge>`
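A sketch of attaching a predefined direct criterion to an LLM judge metric. The judge catalog entry below is one example of the available evaluators, and passing a metrics override to create_dataset assumes a recent Unitxt version:

.. code-block:: python

    from unitxt import create_dataset, evaluate

    # A predefined direct criterion from the catalog, attached to an LLM-as-judge metric.
    criterion = "metrics.llm_as_judge.direct.criteria.answer_relevance"
    judge = "metrics.llm_as_judge.direct.rits.llama3_3_70b_instruct"  # one of several judges in the catalog
    metrics = [f"{judge}[criteria={criterion},context_fields=[question]]"]

    data = [{"question": "Who wrote 'Pride and Prejudice'?"}]
    predictions = ["Jane Austen wrote the novel 'Pride and Prejudice'."]

    dataset = create_dataset(
        task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
    )
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])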
The user can also specify a bespoke criterion that the judge model uses as a guide to evaluate the responses. This example demonstrates how to use LLM-as-a-Judge with a user-defined criterion. The criterion must define options and an option_map.
Related documentation: :ref:`Creating a custom criteria`
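A sketch of defining such a criterion (the module path and class names follow recent Unitxt releases; the criterion itself is illustrative):

.. code-block:: python

    from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions

    # A bespoke criterion: the judge must choose one of the options, and option_map
    # converts the chosen option into a numeric score.
    temperature_criteria = CriteriaWithOptions(
        name="temperature_in_celsius_and_fahrenheit",
        description="Does the response report the temperature in both Celsius and Fahrenheit?",
        options=[
            CriteriaOption(name="Yes", description="The temperature is reported in both units."),
            CriteriaOption(name="No", description="The temperature is missing one or both units."),
        ],
        option_map={"Yes": 1.0, "No": 0.0},
    )

    # The criterion is then passed to a direct LLM-as-judge evaluator in place of a
    # predefined catalog criterion.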
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for direct evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.
Related documentation: :ref:`End to end Direct example`
This example demonstrates how to use LLM-as-a-Judge for pairwise comparison using a predefined criterion from the catalog. The Unitxt catalog has 7 predefined criteria for pairwise evaluators. We also showcase that the criterion does not need to be the same across the entire dataset: the framework can handle a different criterion for each datapoint.
This example demonstrates using LLM-as-a-Judge for pairwise comparison with a single predefined criterion for the entire dataset.
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for pairwise evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.
Related documentation: :ref:`End to end Pairwise example`
This example demonstrates how to use the standard Unitxt RAG response generation task. The response generation task is the following: Given a question and one or more context(s), generate an answer that is correct and faithful to the context(s). The example shows how to map the dataset input fields to the RAG response task fields and use the existing metrics to evaluate model results.
Related documentation: :ref:`RAG Guide <rag_support>`, :ref:`Response generation task <catalog.tasks.rag.response_generation>`, :ref:`Inference Engines <inference>`.
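A sketch of mapping raw fields onto the RAG response generation task (the raw field names and the template choice are illustrative):

.. code-block:: python

    from unitxt import load_dataset
    from unitxt.blocks import TaskCard
    from unitxt.loaders import LoadFromDictionary
    from unitxt.operators import Rename, Set

    # Raw question/context/answer data with arbitrary field names.
    data = {
        "test": [
            {
                "query": "What is Unitxt used for?",
                "passages": ["Unitxt is a library for evaluating language models."],
                "gold": ["Evaluating language models."],
            }
        ]
    }

    card = TaskCard(
        loader=LoadFromDictionary(data=data),
        preprocess_steps=[
            # Align the raw fields with the RAG response generation task fields.
            Rename(
                field_to_field={
                    "query": "question",
                    "passages": "contexts",
                    "gold": "reference_answers",
                }
            ),
            Set(fields={"contexts_ids": []}),
        ],
        task="tasks.rag.response_generation",
    )

    dataset = load_dataset(
        card=card,
        template="templates.rag.response_generation.please_respond",
        split="test",
    )
    print(dataset[0]["source"])  # the question and contexts rendered into one prompt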
This example demonstrates how to evaluate an end-to-end RAG system, given that the RAG system outputs are available.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`
This example demonstrates how to evaluate an image-text to text model using Unitxt. The task involves generating text responses based on both image and text inputs. This is particularly useful for tasks like visual question answering (VQA) where the model needs to understand and reason about visual content to answer questions. The example shows how to:
- Load a pre-trained image-text model (LLaVA in this case)
- Prepare a dataset with image-text inputs
- Run inference on the model
- Evaluate the model's predictions
The code uses the document VQA dataset in English, applies a QA template with context, and formats it for the LLaVA model. It then selects a subset of the test data, generates predictions, and evaluates the results. This approach can be adapted for various image-text to text tasks, such as image captioning, visual reasoning, or multimodal dialogue systems.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.
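A sketch of this flow; the catalog entries and the small LLaVA checkpoint below are assumptions meant to keep the example light:

.. code-block:: python

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFLlavaInferenceEngine

    # Document VQA in English, verbalized with a QA-with-context template and a chat format.
    dataset = load_dataset(
        card="cards.doc_vqa.en",
        template="templates.qa.with_context.title",
        format="formats.chat_api",
        loader_limit=10,
        split="test",
    )

    model = HFLlavaInferenceEngine(
        model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
    )
    predictions = model.infer(dataset)
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])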
This example evaluates image-text to text models with different templates and explores the sensitivity of the model to textual variations.
Related documentation: :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.
This example demonstrates how to evaluate an image key value extraction task. It renders several images of given texts and then prompts a vision model to extract key-value pairs from the images. This requires the vision model to understand the texts in the images and extract the relevant values. It computes an overall F1 score and per-key F1 scores based on the ground truth key-value pairs. Note that the same code can be used for textual key value extraction, simply by providing input texts instead of input images.
Related documentation: :ref:`Key Value Extraction task in catalog <catalog.tasks.key_value_extraction>`, :ref:`Multi-Modality Guide <multi_modality>`, :ref:`Inference Engines <inference>`.
This example shows how to define new data types, as well as how these data types should be handled when serialized into text.
Related documentation: :ref:`Types and Serializers Guide <types_and_serializers>`, :ref:`Inference Engines <inference>`.
This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.
Related documentation: :ref:`Evaluating datasets <evaluating_datasets>`, :ref:`WNLI dataset card in catalog <catalog.cards.wnli>`, :ref:`Relation template in catalog <catalog.templates.classification.multi_class.relation.default>`, :ref:`Inference Engines <inference>`.
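A minimal sketch using only the HuggingFace Datasets and Evaluate APIs (the placeholder predictions stand in for real model outputs):

.. code-block:: python

    from datasets import load_dataset
    from evaluate import load

    # Load WNLI through the Unitxt dataset wrapper on the HuggingFace hub.
    dataset = load_dataset(
        "unitxt/data",
        "card=cards.wnli,template=templates.classification.multi_class.relation.default,"
        "max_test_instances=20",
        split="test",
        trust_remote_code=True,
    )

    # Placeholder predictions; in practice these come from your model.
    predictions = ["entailment" for _ in dataset]

    metric = load("unitxt/metric", trust_remote_code=True)
    results = metric.compute(predictions=predictions, references=dataset)
    print(results[0]["score"]["global"])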