Releases: IBM/unitxt
Unitxt 1.23.0
Main changes
- Revised the tool calling tasks and metrics introduced in 1.22.4 (non backward compatible change; existing catalog datasets have been updated)
- Fixed support for running HF models with HFAutoModelInferenceEngine (multi-GPU and tokenization issues)
- Added to_yaml() to create a YAML representation of a card, which can be used for running custom datasets in Granite.build
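As an illustration of the new to_yaml() shorthand, a card's YAML representation might look roughly like the sketch below. The field names follow Unitxt's TaskCard artifacts, but the loader path, task, and templates here are hypothetical placeholders, not values taken from this release:

```yaml
# Hypothetical sketch of a card serialized via to_yaml();
# values are placeholders for illustration only.
__type__: task_card
loader:
  __type__: load_hf
  path: my-org/my-dataset
task: tasks.qa.open
templates: templates.qa.open.all
```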
What's Changed
- Fix batching support for HF Dataset in HFAutoModelInferenceEngine by @elronbandel in #1771
- Fix litellm inference without task_data by @elronbandel in #1772
- Added to_yaml shorthand function to artifact by @yoavkatz in #1768
- Simplify tool calling base types by @elronbandel in #1773
- Added tool calling to wml chat by @pawelknes in #1782
- Reverting to datasets=351 can solve problems in test catalog preparation by @dafnapension in #1784
- Update ibm wml engine #1775 by @MikolajCharchut in #1781
- Fix HF AutoModel tokenization issue with chat template + issue with multi GPU by @OfirArviv in #1779
- Performance to report accurate times based on end-to-end time() diffs, rather than accumulate cProfile numbers over methods whose names seem relevant by @dafnapension in #1783
- Add support to mix args and textual query in load_dataset by @elronbandel in #1778
- Add installation of spacy as a binary dependency for examples regression tests by @elronbandel in #1787
- Improvements to tool calling - NON BACKWARD COMPATIBLE CHANGES by @Narayanan-V-Eswar in #1770
- Added example for standalone metric evaluation by @yoavkatz in #1769
- Update version to 1.23.0 by @elronbandel in #1789
New Contributors
- @Narayanan-V-Eswar made their first contribution in #1770
Full Changelog: 1.22.4...1.23.0
Unitxt 1.22.4
What's Changed
- Add comprehensive support for tool calling + Berkeley Tool Calling Benchmark by @elronbandel in #1762
- Add tool calling support + Berkeley Tool Calling Benchmark (simple-v3) by @elronbandel in #1764
- Remove the rename from test to train by @BenjSz in #1759
- trying to fix PERFORMANCE: use a github repository as a replacement for the gone HF 'lmsys/arena-hard-browser' by @dafnapension in #1757
- Update version to 1.22.4 by @elronbandel in #1766
Full Changelog: 1.22.3...1.22.4
Unitxt 1.22.3
What's Changed
- Small fixes and touch ups in docs by @elronbandel in #1738
- Fix some docs styling by @elronbandel in #1739
- Remove more red lines from make docs-server, and rehab lost sons by @dafnapension in #1742
- Add instance ID to WML credentials by @pawelknes in #1741
- remove unnecessary llm-as-judge from scigen + max pred in benchmark by @ShirApp in #1746
- Support Byom for RITS inference engine. by @eladven in #1744
- Add GG and deepseek to rits in CrossInferenceEngine by @martinscooper in #1750
- Torr const max prediction by @ShirApp in #1752
- trying networkx 3.2.1 to remove TypeError: entry_points() got an unexpected keyword argument 'group' by @dafnapension in #1754
- Comment out SQL tests until fixed by @elronbandel in #1756
- fix typo in azure_openai_host variable name by @algadhib in #1753
- Fix tablebench data analysis broken split by @elronbandel in #1749
- XSTest (and compliance criteria) by @bnayahu in #1751
- CrossProvider fails if model name doesn't exist in map by @martinscooper in #1747
- Fix relative imports in evaluate cli by @elronbandel in #1758
- Update version to 1.22.3 by @elronbandel in #1761
Full Changelog: 1.22.2...1.22.3
Unitxt 1.22.2
What's Changed
- Fix errors with peft inference engine implementation. by @eladven in #1721
- Allow overriding of CI method in GlobalMetric by @lga-zurich in #1722
- Mark issues stale after 30 days of no interaction. Closed them 15 days later. by @elronbandel in #1727
- Updated models on Replicate by @bnayahu in #1729
- Added example of multi-choice QA by @yoavkatz in #1634
- Add general formatter for chat api, with chat template based on the m… by @eladven in #1728
- Airbench by @bnayahu in #1730
- New Unitxt home page by @elronbandel in #1731
- Add catalog back to website by @elronbandel in #1734
- Disable litellm cache by @martinscooper in #1732
- Add a CLI for end-to-end evaluation by @perlitz in #1708
- Fix some tests by @elronbandel in #1735
- modify some doc-strings, thereby eliminating some red lines in make doc-server by @dafnapension in #1736
- Update version to 1.22.2 by @elronbandel in #1737
Full Changelog: 1.22.1...1.22.2
Unitxt 1.22.1
What's Changed
- Benjams/add watson x by @BenjSz in #1719
- remove metadata_fields in clapnq by @BenjSz in #1718
- Add full data_files support in HFLoader + tests by @elronbandel in #1724
Full Changelog: 1.22.0...1.22.1
Unitxt 1.22.0
Main changes
Catalog changes
- Update exact multiple choice template by @OfirArviv in #1698
- Vision templates by @alfassy in #1700
- Vision bench update by @alfassy in #1704
- update wml llmajj from llama-3-70b to llama-3-1-70b by @OfirArviv in #1703
- Fix Table Bench by @elronbandel in #1709
- Text2sql metrics fixes by @oktie in #1702
- Updates to the provoq card and related artifacts by @bnayahu in #1705
- Fix unitxt assistant context size control and update docs snapshot by @elronbandel in #1707
- Fix llm judge artifacts by @martinscooper in #1695
CI/CD
- Add timeout to github actions by @elronbandel in #1716
- Fix helm test by @elronbandel in #1697
- Add retry policy for huggingface assets downloads by @elronbandel in #1711
Other
- Add unitxt version to inference cache keys by @elronbandel in #1714
- Some touch ups to benchmarks by @elronbandel in #1715
Full Changelog: 1.21.0...1.22.0
Unitxt 1.21.0
What's Changed
- add 'show more' button for imports from unitxt modules by @dafnapension in #1651
- Update head qa dataset by @elronbandel in #1658
- Update few slow datasets by @elronbandel in #1663
- MLCommons AILuminate card and related artifacts by @bnayahu in #1662
- Granite guardian: add raw prompt to the result by @martinscooper in #1671
- Add positional bias summary to the response by @martinscooper in #1640
- Return float instead float32 in granite guardian metric by @martinscooper in #1669
- add qa template exact output by @OfirArviv in #1674
- LLM Judge: add prompts to the result by default by @martinscooper in #1670
- Safety eval updates by @bnayahu in #1668
- Add inference engine caching by @eladven in #1645
- BugFix: Handle cases where all sample scores are the same (yields nan) by @elronbandel in #1660
- CrossInferenceProvider: add more models by @martinscooper in #1676
- Implement get_engine_id where missing by @martinscooper in #1679
- Revisit base dependencies (specifically remove ipadic and absl-py) by @elronbandel in #1681
- Fix LoadHF.load_dataset() when mem-caching is off by @yhwang in #1683
- HFPipelineInferenceEngine - add loaded tokenizer to pipeline by @eladven in #1677
- Add default cache folder to .gitignore by @martinscooper in #1687
- Fix a bug in loading without trust remote code by @elronbandel in #1684
- Add sacrebleu[ja] to test dependencies by @elronbandel in #1685
- Let evaluator name to be a string by @martinscooper in #1665
- Fix: AzureOpenAIInferenceEngine fails if api_version is not set by @martinscooper in #1680
- Fix some bugs in inference engine tests by @elronbandel in #1682
- Improved output message when using inference cache by @yoavkatz in #1686
- Changed API of Key Value Extraction task to use Dict and not List[Tuple] (NON BACKWARD COMPATIBLE CHANGE) by @yoavkatz in #1675
- Support for asynchronous requests for watsonx.ai chat by @pawelknes in #1666
- add tags information - url by @BenjSz in #1691
- Fixes to GraniteGuardian metric, safety evals cleanups by @bnayahu in #1690
- Add docstring to LLMJudge classes by @martinscooper in #1652
- Remove src.lock by @elronbandel in #1692
- Text2sql metrics update and optional caching by @oktie in #1672
- Llm judge use cross provider by @martinscooper in #1673
- Improve LLM as Judge consistency by @martinscooper in #1688
- Update version to 1.21.0 by @elronbandel in #1693
Full Changelog: 1.20.0...1.21.0
Unitxt 1.20.0
What's Changed
- Fix unnecessary attempts in LoadCSV by @elronbandel in #1630
- Fix LLM as Judge direct criteria typo by @martinscooper in #1631
- Fix of typo in usage of attributes inside IntersectCorrespondingFields by @pklpriv in #1637
- Added MILU and Indic BoolQ Support by @murthyrudra in #1639
- Vision bench by @alfassy in #1641
- Add Granite Guardian evaluation on HF example by @martinscooper in #1638
- present catalog entries as pieces of python code by @dafnapension in #1643
- Example for evaluating system message leakage by @elronbandel in #1609
- Benjams/add hotpotqa + change type of metadata field to dict (non backward compatible) by @BenjSz in #1633
- removed the leftout break_point by @dafnapension in #1646
- Added Indic ARC Challenge Support by @murthyrudra in #1654
- Minor bug fix affecting Text2SQL execution accuracy by @oktie in #1657
- WMLInferenceEngineChat fixes by @pawelknes in #1656
- Update version to 1.20.0 by @elronbandel in #1659
New Contributors
- @murthyrudra made their first contribution in #1639
Full Changelog: 1.19.0...1.20.0
Unitxt 1.19.0
What's Changed
- Add RagBench datasets by @elronbandel in #1580
- Fix prompts table benchmark by @ShirApp in #1581
- Fix attempt to missing arrow dataset by @elronbandel in #1582
- Wml comp by @alfassy in #1578
- Key value extraction improvements by @yoavkatz in #1573
- fix: minor bug when only space id is provided for WML inference by @tsinggggg in #1583
- Try fixing csv loader by @elronbandel in #1586
- Fix failing tests by @elronbandel in #1589
- Fix tests by @elronbandel in #1590
- Fix metrics formatting and style by @elronbandel in #1591
- Fix bird dataset by @perlitz in #1593
- Use Lazy Loaders by @dafnapension in #1536
- Fix loading without limit by @elronbandel in #1594
- [Breaking change] Add support for all Granite Guardian risks by @martinscooper in #1576
- Added api call example by @yoavkatz in #1587
- Make MultipleSourceLoader lazy and fix its use of fusion by @elronbandel in #1602
- Prioritize using default templates from card over task by @elronbandel in #1596
- Use faster model for examples by @elronbandel in #1607
- Add clear and minimal settings documentation by @elronbandel in #1606
- Fix some tests by @elronbandel in #1610
- Add download and etag timeout settings to workflow configurations by @elronbandel in #1613
- Allow read timeout error in preparation tests by @elronbandel in #1615
- Fix Ollama inference engine by @eladven in #1611
- Add verify as an option to LoadFromAPI by @perlitz in #1608
- Added example of custom metric by @yoavkatz in #1616
- Granite guardian minor changes by @martinscooper in #1605
- add ragbench faithfulness cards by @lilacheden in #1598
- Update tables benchmark name to torr by @elronbandel in #1617
- Add CoT to LLM as judge assessments by @martinscooper in #1612
- Simplify preparation tests with better error handling by @elronbandel in #1618
- Text2sql execution accuracy metric updates by @oktie in #1604
- Fix Azure OpenAI based LLM judges by @martinscooper in #1619
- Add correctness_based_on_ground_truth criteria by @martinscooper in #1623
- Enable offline mode for huggingface by using local pre-downloaded metrics, datasets and models by @elronbandel in #1603
- Add provider specific args and allow using unrecognized model names by @elronbandel in #1621
- Start implementing assessment for unitxt assistant by @eladven in #1625
- small changes to profiler by @dafnapension in #1627
- Return MultiStream in lazy loaders to avoid copying by @elronbandel in #1628
New Contributors
- @tsinggggg made their first contribution in #1583
Full Changelog: 1.18.0...1.19.0
Unitxt 1.18.0 - Faster Loading
The main improvements in this version focus on caching strategies, dataset loading, and speed optimizations.
Hugging Face Datasets Caching Policy
We have completely revised our caching policy and how we handle Hugging Face datasets in order to improve performance.
- Hugging Face datasets are now cached by default.
  This means that the LoadHF loader will cache the downloaded datasets in the HF cache directory (typically ~/.cache/huggingface/datasets).
  To disable this caching mechanism, use:

  ```python
  unitxt.settings.disable_hf_datasets_cache = True
  ```
- All Hugging Face datasets are first downloaded and then processed.
  This means the entire dataset is downloaded, which is faster for most datasets. However, if you want to process a huge dataset and the HF dataset supports streaming, you can load it in streaming mode:

  ```python
  LoadHF(name="my-dataset", streaming=True)
  ```

  To enable streaming mode by default for all Hugging Face datasets, use:

  ```python
  unitxt.settings.stream_hf_datasets_by_default = True
  ```

While the new defaults (full download & caching) may make the initial dataset load slower, subsequent loads will be significantly faster.
Unitxt Datasets Caching Policy
- By default, when loading datasets with unitxt.load_dataset, the dataset is prepared from scratch each time you call the function.
  This ensures that any changes made to the card definition are reflected in the output.
- This process may take a few seconds, and for large datasets, repeated loading can accumulate overhead.
- If you are using fixed datasets from the catalog, you can enable caching for Unitxt datasets.
  The datasets are cached in the Hugging Face cache (typically ~/.cache/huggingface/datasets):

  ```python
  from unitxt import load_dataset

  ds = load_dataset(card="my_card", use_cache=True)
  ```
Faster Unitxt Dataset Preparation
To improve dataset loading speed, we have optimized how Unitxt datasets are prepared.
Background:
Unitxt datasets are converted to Hugging Face datasets because they store data on disk while keeping only the necessary parts in memory (via PyArrow). This enables efficient handling of large datasets without excessive memory usage.
Previously, unitxt.load_dataset used built-in Hugging Face methods for dataset preparation, which included unnecessary type handling and verification, slowing down the process.
Key improvements:
- We now create the Hugging Face dataset directly, reducing preparation time by almost 50%.
- With this optimization, Unitxt datasets are now faster than ever!
What's Changed
- End of year summary blog post by @elronbandel in #1530
- Updated documentation and examples of LLM-as-Judge by @tejaswini in #1532
- Eval assist documentation by @tejaswini in #1537
- Update notification banner styles and add 2024 summary blog link by @elronbandel in #1538
- Add more granite llm as judge artifacts by @martinscooper in #1516
- Fix Australian legal qa dataset by @elronbandel in #1542
- Set use 1 shot for wikitq in tables_benchmark by @yifanmai in #1541
- Bugfix: indexed row major serialization fails with None cell values by @yifanmai in #1540
- Solve issue of expired token in Unitxt Assistant by @eladven in #1543
- Add Replicate inference support by @elronbandel in #1544
- add a filter to wikitq by @ShirApp in #1547
- Add text2sql tasks by @perlitz in #1414
- Add deduplicate operator by @elronbandel in #1549
- Fix the authentication problem by @eladven in #1550
- Attach assistant answers to their origins with url link by @elronbandel in #1528
- Add mtrag benchmark by @elronbandel in #1548
- Update end of year summary blog by @elronbandel in #1552
- Add data classification policy to CrossProviderInferenceEngine initialization based on selected model by @elronbandel in #1539
- Fix recently broken rag metrics by @elronbandel in #1554
- Renamed criterias in LLM-as-a-Judge metrics to criteria - Breaking change by @tejaswini in #1545
- Finqa hash to top by @elronbandel in #1555
- Refactor safety metric to be faster and updated by @elronbandel in #1484
- Improve assistant by @elronbandel in #1556
- Feature/add global mmlu cards by @eliyahabba in #1561
- Add quality dataset by @eliyahabba in #1563
- Add CollateInstanceByField operator to group data by specific field by @sarathsgvr in #1546
- Fix prompts table benchmark by @ShirApp in #1565
- Create new IntersectCorrespondingFields operator by @pklpriv in #1531
- Add granite documents format by @elronbandel in #1566
- Revisit huggingface cache policy - BREAKING CHANGE by @elronbandel in #1564
- Add global mmlu lite sensitivity cards by @eliyahabba in #1568
- Add schema-linking by @KyleErwin in #1533
- fix the printout of empty strings in the yaml cards of the catalog by @dafnapension in #1567
- Use repr instead of to_json for unitxt dataset caching by @elronbandel in #1570
- Added key value extraction evaluation and example with images by @yoavkatz in #1529
New Contributors
- @tejaswini made their first contribution in #1532
- @KyleErwin made their first contribution in #1533
Full Changelog: 1.17.0...1.18.0