
Commit 4659b60

Merge branch 'main' into add_completeness_judge

2 parents: 76e7502 + 9777799

File tree

906 files changed: +20096 −3864 lines


.github/workflows/docs.yml  (+3 −3)

@@ -9,7 +9,7 @@ on:
 concurrency:
   group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.event.pull_request.number || github.ref_name }}
   cancel-in-progress: true
-
+
 jobs:
   docs:
 
@@ -23,10 +23,10 @@ jobs:
 
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: '3.8'
 
       - run: curl -LsSf https://astral.sh/uv/install.sh | sh
-      - run: uv pip install --system ".[tests,docs]"
+      - run: uv pip install --system ".[docs]"
 
       - name: Compile Docs
         run: make docs

.github/workflows/library_tests.yml  (+5 −3)

@@ -34,7 +34,9 @@ jobs:
       - run: pip install coverage[toml]
 
       - name: Run Tests
-        run: coverage run --omit=*/preparation -m unittest discover -s tests/library -p "test_*.py"
+        run: coverage run -m unittest discover -s tests/library -p "test_*.py"
 
-      - name: Upload Coverage to Codecov
-        uses: codecov/codecov-action@v2
+      - run: coverage report
+
+      - name: Upload Coverage to Coveralls
+        uses: coverallsapp/github-action@v2

.github/workflows/performance.yml  (+15 −12)

@@ -17,32 +17,35 @@ jobs:
     env:
       OS: ubuntu-latest
       UNITXT_DEFAULT_VERBOSITY: error
+      UNITXT_MOCK_INFERENCE_MODE: "True"
       DATASETS_VERBOSITY: error
       HF_HUB_VERBOSITY: error
       HF_DATASETS_DISABLE_PROGRESS_BARS: "True"
       TQDM_DISABLE: "True"
-
     steps:
       - uses: actions/checkout@v4
 
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: '3.10'
 
       - name: Install Requirements
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
-          uv pip install --system -e ".[tests]"
+          uv pip install --system ".[tests,watsonx,inference-tests]"
+          uv pip install --system litellm
+          uv pip install --system diskcache
+          huggingface-cli login --token ${{ secrets.UNITXT_READ_HUGGINGFACE_HUB_FOR_TESTS }}
 
       - name: Prepare the dirs for performance evaluation in main
        run: |
          mkdir -p performance_action
-          mkdir -p performance_action/logs
-          echo "" > performance_action/__init__.py
-          echo " " > performance_action/logs/cards_benchmark.prof
-          echo " " > performance_action/logs/cards_benchmark.json
-          cp performance/card_profiler.py performance_action/card_profiler.py
-          cp performance/compare_performance_results.py performance_action/compare_performance_results.py
+          cp performance/bluebench_profiler.py performance_action/bluebench_profiler.py
+          cp performance/compare_benchmark_performance_results.py performance_action/compare_benchmark_performance_results.py
+
+      - name: Run performance on PR just to warm the cache, output will be overwritten
+        run : |
+          python performance_action/bluebench_profiler.py --output_file performance_action/pr_results.json
 
       - name: Checkout main branch
        uses: actions/checkout@v4

@@ -52,7 +55,7 @@ jobs:
 
       - name: Run performance on main branch
        run: |
-          python performance_action/card_profiler.py --output_file performance_action/main_results.json
+          python performance_action/bluebench_profiler.py --output_file performance_action/main_results.json
 
       - name: Checkout PR branch
        uses: actions/checkout@v4

@@ -62,8 +65,8 @@ jobs:
 
       - name: Run performance on PR branch
        run: |
-          python performance_action/card_profiler.py --output_file performance_action/pr_results.json
+          python performance_action/bluebench_profiler.py --output_file performance_action/pr_results.json
 
       - name: Compare main and PR performance results
        run: |
-          python performance_action/compare_performance_results.py performance_action/main_results.json performance_action/pr_results.json >> $GITHUB_STEP_SUMMARY
+          python performance_action/compare_benchmark_performance_results.py performance_action/main_results.json performance_action/pr_results.json >> $GITHUB_STEP_SUMMARY
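The compare step above appends the script's stdout to $GITHUB_STEP_SUMMARY so the comparison appears on the workflow summary page. The actual compare_benchmark_performance_results.py is not shown in this commit view; the following is only a hypothetical sketch of such a comparison script, with the JSON layout and field names assumed for illustration:

```python
# Hypothetical sketch only -- not the repository's compare_benchmark_performance_results.py.
# Assumes each results file is a flat JSON object mapping a timing name to seconds.
import json
import sys


def load_results(path):
    with open(path) as f:
        return json.load(f)


main_results = load_results(sys.argv[1])  # e.g. performance_action/main_results.json
pr_results = load_results(sys.argv[2])    # e.g. performance_action/pr_results.json

# Emit a Markdown table; the workflow redirects stdout into $GITHUB_STEP_SUMMARY.
print("| measurement | main | PR | PR / main |")
print("|---|---|---|---|")
for name in sorted(set(main_results) & set(pr_results)):
    main_val, pr_val = main_results[name], pr_results[name]
    ratio = pr_val / main_val if main_val else float("nan")
    print(f"| {name} | {main_val:.2f} | {pr_val:.2f} | {ratio:.2f} |")
```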

.github/workflows/test_helm.yml  (+4 −2)

@@ -22,8 +22,10 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: '3.9'
-          cache: 'pip' # caching pip dependencies
-      - run: pip install --upgrade 'crfm-helm[unitxt]>=0.5.3'
+
+      - run: curl -LsSf https://astral.sh/uv/install.sh | sh
+      - run: uv pip install --upgrade --system "crfm-helm[unitxt]>=0.5.3"
+      - run: uv pip install --system "scikit-learn==1.5.2"
 
       - name: Test Helm
         run: utils/run_helm.sh

.gitignore  (+2)

@@ -157,3 +157,5 @@ src/unitxt/catalog/processors/example/to_string.json
 prod_env/*
 benchmark_output/*
 .litellm_cache
+
+docs/_static/data.js

.pre-commit-config.yaml  (−5)

@@ -10,11 +10,6 @@ repos:
         args: [--fix]
         exclude: src/unitxt/metrics.py|examples/evaluate_existing_dataset_no_install.py
       # Run the linter on the specific file with the ignore flag
-      - id: ruff
-        name: ruff (src/unitxt/metrics.py)
-        files: src/unitxt/metrics.py
-        args: [--fix, --ignore, C901]
-      # Run the linter on the specific file with the ignore flag
       - id: ruff
         name: ruff (examples/evaluate_existing_dataset_no_install.py)
         files: examples/evaluate_existing_dataset_no_install.py

README.md  (+37 −56)

@@ -21,7 +21,7 @@ In the dynamic landscape of generative NLP, traditional text processing pipeline
 ![license](https://img.shields.io/github/license/ibm/unitxt)
 ![python](https://img.shields.io/badge/python-3.8%20|%203.9-blue)
 ![tests](https://img.shields.io/github/actions/workflow/status/ibm/unitxt/library_tests.yml?branch=main&label=tests)
-[![codecov](https://codecov.io/gh/IBM/unitxt/branch/main/graph/badge.svg?token=mlrWq9cwz3)](https://codecov.io/gh/IBM/unitxt)
+[![Coverage Status](https://coveralls.io/repos/github/IBM/unitxt/badge.svg)](https://coveralls.io/github/IBM/unitxt)
 ![Read the Docs](https://img.shields.io/readthedocs/unitxt)
 [![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
 
@@ -48,80 +48,61 @@ Then launch the ui by running:
 unitxt-explore
 ```
 
-# 🦄 Example
+# 🦄 Example
 
 This is a simple example of running end-to-end evaluation in self contained python code over user data.
 
 See more examples in examples subdirectory.
 
 ```python
-from unitxt import get_logger
-from unitxt.api import evaluate, load_dataset
-from unitxt.blocks import Task, TaskCard
-from unitxt.inference import HFPipelineBasedInferenceEngine
-from unitxt.loaders import LoadFromDictionary
-from unitxt.templates import InputOutputTemplate, TemplatesDict
-from unitxt.text_utils import print_dict
-
-logger = get_logger()
-
-# Set up question answer pairs in a dictionary
-data = {
-    "test": [
-        {"question": "What is the capital of Texas?", "answer": "Austin"},
-        {"question": "What is the color of the sky?", "answer": "Blue"},
-    ]
-}
-
-card = TaskCard(
-    # Load the data from the dictionary. Data can be also loaded from HF, CSV files, COS and other sources using different loaders.
-    loader=LoadFromDictionary(data=data),
-    # Define the QA task input and output and metrics.
-    task=Task(
-        input_fields={"question": str},
-        reference_fields={"answer": str},
-        prediction_type=str,
-        metrics=["metrics.accuracy"],
-    ),
+# Import required components
+from unitxt import evaluate, create_dataset
+from unitxt.blocks import Task, InputOutputTemplate
+from unitxt.inference import HFAutoModelInferenceEngine
+
+# Question-answer dataset
+data = [
+    {"question": "What is the capital of Texas?", "answer": "Austin"},
+    {"question": "What is the color of the sky?", "answer": "Blue"},
+]
+
+# Define the task and evaluation metric
+task = Task(
+    input_fields={"question": str},
+    reference_fields={"answer": str},
+    prediction_type=str,
+    metrics=["metrics.accuracy"],
 )
 
-# Create a simple template that formats the input.
-# Add lowercase normalization as a post processor on the model prediction.
-
+# Create a template to format inputs and outputs
 template = InputOutputTemplate(
     instruction="Answer the following question.",
     input_format="{question}",
     output_format="{answer}",
     postprocessors=["processors.lower_case"],
 )
-# Verbalize the dataset using the template
-dataset = load_dataset(card=card, template=template)
-test_dataset = dataset["test"]
 
+# Prepare the dataset
+dataset = create_dataset(
+    task=task,
+    template=template,
+    format="formats.chat_api",
+    test_set=data,
+    split="test",
+)
 
-# Infer using flan t5 base using HF API
-# can be replaced with any prediction code,
-# including the built in WMLInferenceEngine and OpenAiInferenceEngine.
-model_name = "google/flan-t5-base"
-inference_model = HFPipelineBasedInferenceEngine(
-    model_name=model_name, max_new_tokens=32
+# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
+model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
 )
-predictions = inference_model.infer(test_dataset)
-evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)
 
-# Print results
-for instance in evaluated_dataset:
-    print_dict(
-        instance,
-        keys_to_print=[
-            "source",  # input to the model
-            "prediction",  # model prediction
-            "processed_prediction",  # model prediction after post processing
-            "references",  # reference answer
-            "score",  # scores (per instance and global)
-        ],
-    )
+# Generate predictions and evaluate
+predictions = model(dataset)
+results = evaluate(predictions=predictions, data=dataset)
 
+# Print results
+print("Global Results:\n", results.global_scores.summary)
+print("Instance Results:\n", results.instance_scores.summary)
 ```
 
 # 🦄 Contributors
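For convenience, here is the updated README example reassembled from the added (+) lines above into a single script. The assembly is ours, but the code itself is exactly what the commit adds:

```python
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
```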

assets/banner.png  (binary image, 26.4 KB)

docs/_static/banner.png  (binary image, 26.4 KB)

docs/blog/inference_engines_blog.rst  (+6 −4)

@@ -1,6 +1,8 @@
 .. title:: Unitxt Embraces Rich Chat Format and Cross API Inference: Simplifying LLM Evaluation
-.. authors:: Elron Bandel
-.. date:: 2024-11-19
+
+:Authors: Elron Bandel
+
+:Date: 2024-11-19
 
 =================================================================================================
 [19/11/2024] Unitxt Embraces Rich Chat Format and Cross API Inference: Simplifying LLM Evaluation

@@ -21,8 +23,8 @@ Introducing Two Major Enhancements
 -----------------------------------
 
 1. **Producing Data in Chat API Format**
-   Unitxt now can produces data in the widely adopted Chat API format.
-   This ensures compatibility with popular LLM Provider APIs and avoid the need from custom per model formatting.
+   Unitxt can produce data in the widely adopted Chat API format.
+   This ensures compatibility with popular LLM Provider APIs and avoid the need for custom per model formatting.
    Additionally, the format supports multiple modalities such as text, images, and videos.
 
 2. **A Comprehensive Array of Inference Engines**
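For context on the "Chat API format" referenced in the changed lines: it is the role/content message list used by most LLM provider APIs. A minimal illustrative sketch (values invented for illustration, not taken from this commit):

```python
# Illustrative only: the role/content message structure referred to as "Chat API format".
chat_formatted_source = [
    {"role": "system", "content": "Answer the following question."},
    {"role": "user", "content": "What is the capital of Texas?"},
]
```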

docs/blog/vision_robustness_blog.rst  (+13 −7)

@@ -1,10 +1,14 @@
 .. title:: If Your LLM sees White Noise, Try Asking Differently: Revealing AI’s Text and Image Sensitivities with Unitxt
-.. sectionauthor:: Elron Bandel and Nimrod Shabtay
-.. date:: 2024-11-01
 
-============================
+:Authors:
+    Elron Bandel
+    Nimrod Shabtay
+
+:Date: 2024-11-01
+
+==========================================================================================================================
 [01/11/2024] If Your LLM sees White Noise, Try Asking Differently: Revealing AI’s Text and Image Sensitivities with Unitxt
-============================
+==========================================================================================================================
 
 **Authors**: Elron Bandel and Nimrod Shabtay
 

@@ -35,7 +39,7 @@ Here’s the code used to set up our tests. This example uses Unitxt to create s
     for card in ["cards.seed_bench", "cards.ai2d"]:
         for enumerator in ["capitals", "lowercase"]:
             for augmentor in [None, "augmentors.image.white_noise"]:
-                subsets[f"{card} {enumerator} {augmentor}"] = StandardRecipe(
+                subsets[f"{card} {enumerator} {augmentor}"] = DatasetRecipe(
                     card=card,
                     template=f"templates.qa.multiple_choice.with_context.lmms_eval[enumerator={enumerator}]",
                     loader_limit=100,

@@ -46,15 +50,17 @@ Here’s the code used to set up our tests. This example uses Unitxt to create s
 
     data = list(benchmark()["test"])
 
-    inference_model = LMMSEvalInferenceEngine(
+    model = LMMSEvalInferenceEngine(
         model_type="llava_onevision",
         model_args={"pretrained": "lmms-lab/llava-onevision-qwen2-7b-ov"},
         max_new_tokens=2,
     )
 
-    predictions = inference_model.infer(data)
+    predictions = model(data)
     results = evaluate(predictions=predictions, data=data)
 
+    print(results.subsets_scores.summary)
+
 In order to run this you will first have to install llms-eval library which might not work on mac.
 
 *Full code example at:* https://github.com/IBM/unitxt/blob/main/examples/robustness_testing_for_vision_text_models.py
