Releases: IBM/unitxt

Unitxt 1.23.0

13 May 13:10
fd97309

Main changes

  1. Revised the tool-calling tasks and metrics introduced in 1.22.4. This is a non-backward-compatible change; existing catalog datasets have been updated accordingly.
  2. Fixed support for running HF models with AutoModelInferenceEngine (multi-GPU and tokenization issues).
  3. Added to_yaml() to create a YAML representation of a card, which can be used for running custom datasets in Granite.build.
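
As a rough illustration, the output of to_yaml() is a plain YAML rendering of a card's fields, along these lines (the loader path, task, and template names below are hypothetical, shown only to sketch the shape of the output):

```yaml
# Hypothetical sketch of a to_yaml() card export; actual keys depend on the card.
__type__: task_card
loader:
  __type__: load_hf
  path: my-org/my-dataset
task: tasks.classification.multi_class
templates:
  - templates.classification.multi_class.default
```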

Full Changelog: 1.22.4...1.23.0

Unitxt 1.22.4

04 May 10:46
831e535

What's Changed

  • Add comprehensive support for tool calling + Berkeley Tool Calling Benchmark by @elronbandel in #1762
  • Add tool calling support + Berkeley Tool Calling Benchmark (simple-v3) by @elronbandel in #1764
  • Remove the rename from test to train by @BenjSz in #1759
  • Fix performance: use a GitHub repository as a replacement for the removed HF 'lmsys/arena-hard-browser' by @dafnapension in #1757
  • Update version to 1.22.4 by @elronbandel in #1766

Full Changelog: 1.22.3...1.22.4

Unitxt 1.22.3

27 Apr 11:34
4d14dc2

Full Changelog: 1.22.2...1.22.3

Unitxt 1.22.2

16 Apr 07:03
31e0e61

Full Changelog: 1.22.1...1.22.2

Unitxt 1.22.1

09 Apr 12:41
85e478e

Full Changelog: 1.22.0...1.22.1

Unitxt 1.22.0

06 Apr 13:47
1d0596f

Main changes

  • Support HFPipelineBasedInferenceEngine with PEFT model by @eladven in #1701

Full Changelog: 1.21.0...1.22.0

Unitxt 1.21.0

19 Mar 18:07
4aad1e0

Full Changelog: 1.20.0...1.21.0

Unitxt 1.20.0

09 Mar 08:49
f1ca43a

Full Changelog: 1.19.0...1.20.0

Unitxt 1.19.0

25 Feb 17:00
2b216f7

Full Changelog: 1.18.0...1.19.0

Unitxt 1.18.0 - Faster Loading

04 Feb 14:20
2ef9091

The main improvements in this version focus on caching strategies, dataset loading, and speed optimizations.

Hugging Face Datasets Caching Policy

We have completely revised our caching policy and how we handle Hugging Face datasets in order to improve performance.

  1. Hugging Face datasets are now cached by default.

     This means that the LoadHF loader will cache downloaded datasets in the HF cache directory (typically ~/.cache/huggingface/datasets).

     To disable this caching mechanism, use:

     unitxt.settings.disable_hf_datasets_cache = True

  2. All Hugging Face datasets are first downloaded and then processed.

     This means the entire dataset is downloaded, which is faster for most datasets. However, if you want to process a huge dataset and the HF dataset supports streaming, you can load it in streaming mode:

     LoadHF(name="my-dataset", streaming=True)

     To enable streaming mode by default for all Hugging Face datasets, use:

     unitxt.settings.stream_hf_datasets_by_default = True

While the new defaults (full download & caching) may make the initial dataset load slower, subsequent loads will be significantly faster.

Unitxt Datasets Caching Policy

By default, when loading datasets with unitxt.load_dataset, the dataset is prepared from scratch each time you call the function.
This ensures that any changes made to the card definition are reflected in the output.

  • This process may take a few seconds, and for large datasets, repeated loading can accumulate overhead.

  • If you are using fixed datasets from the catalog, you can enable caching for Unitxt datasets.
    The cached datasets are stored in the Hugging Face cache (typically ~/.cache/huggingface/datasets).

    from unitxt import load_dataset
    
    ds = load_dataset(card="my_card", use_cache=True)

Faster Unitxt Dataset Preparation

To improve dataset loading speed, we have optimized how Unitxt datasets are prepared.

Background:

Unitxt datasets are converted to Hugging Face datasets because they store data on disk while keeping only the necessary parts in memory (via PyArrow). This enables efficient handling of large datasets without excessive memory usage.

Previously, unitxt.load_dataset used built-in Hugging Face methods for dataset preparation, which included unnecessary type handling and verification, slowing down the process.

Key improvements:

  • We now create the Hugging Face dataset directly, reducing preparation time by almost 50%.
  • With this optimization, Unitxt datasets are now faster than ever!

Full Changelog: 1.17.0...1.18.0