- Dropped Python 3.8, added Python 3.13 (requires-python = ">=3.9")
- License Format: Changed to PEP 639 bare SPDX format (license = "Apache-2.0")
- Build System: Updated setuptools requirement from >=64 to >=77 for PEP 639 support
- mecab-ko: Updated dependency to >=1.0.2,<2.0.0 (now supports Apple Silicon and Python 3.12+)
- New tokenizer: spBLEU-1K
- Mistake in tagging v2.5.0
- WMT24 test sets
- Convert Changelog to markdown format
- Add optimization for compute_bleu precision initialization (#257) Thanks to Ernests Lavrinovics for this contribution.
Added:
- Add printing of domain if present (via --echo)
Fixed:
- Add exports to package `__init__.py`
Added:
- WMT23 test sets (test set `wmt23`)
Fixed:
- Typing issues (#249, #250)
- Improved builds (#252)
Fixed:
- Special treatment of empty references in TER (#232)
- Bump in mecab version for JA (#234)
Added:
- Warning if `-tok spm` is used (use explicit `flores101` instead) (#238)
Bugfix:
- Set lru_cache to 2^16 for SPM tokenizer (was set to infinite)
Features:
- (#203) Added `-tok flores101` and `-tok flores200`, a.k.a. `spbleu`. These are multilingual tokenizations that make use of the multilingual SPM models released by Facebook and described in the following papers:
  - Flores-101: https://arxiv.org/abs/2106.03193
  - Flores-200: https://arxiv.org/abs/2207.04672
- (#213) Added JSON formatting for multi-system output (thanks to Manikanta Inugurthi @me-manikanta)
- (#211) You can now list all test sets for a language pair with `--list SRC-TRG`. Thanks to Jaume Zaragoza (@ZJaume) for adding this feature.
- Added WMT22 test sets (test set `wmt22`)
- System outputs: included with `wmt22`. Also added `wmt21/systems`, which will produce WMT21 submitted systems. To see available systems, give a dummy system to `--echo`, e.g., `sacrebleu -t wmt22 -l en-de --echo ?`
Bugfix: Standard usage was returning (and using) each reference twice.
Features:
- Added WMT21 datasets (thanks to @BrightXiaoHan)
- `--echo` now exposes document metadata where available (e.g., docid, genre, origlang)
- Bugfix: allow empty references (#161)
- Adds a Korean tokenizer (thanks to @NoUnique)
Under the hood:
- Moderate code refactoring
- Processed files have adopted a more sensible internal naming scheme under ~/.sacrebleu (e.g., wmt17_ms.zh-en.src instead of zh-en.zh)
- Processed file extensions correspond to the values passed to `--echo` (e.g., "src")
- Now explicitly representing NoneTokenizer
- Got rid of the ".lock" lockfile for downloading (using the tarball itself)
Many thanks to @BrightXiaoHan (https://github.com/BrightXiaoHan) for the bulk of the code contributions in this release.
Features:
- Added `-tok spm` for multilingual SPM tokenization (#168) (thanks to Naman Goyal and James Cross at Facebook)
Fixes:
- Handle potential memory usage issues due to LRU caching in tokenizers (#167)
- Bugfix: BLEU.corpus_score() now using max_ngram_order (#173)
- Upgraded ja-mecab to 1.0.5 (#196)
- Build: Add Windows and OS X testing to Travis CI.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Relax `portalocker` version pinning; add `regex`, `tabulate`, `numpy` dependencies.
- Drop input type manipulation through `isinstance` checks. If the user does not obey the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (#121)
- Variable # references per segment is supported for all metrics by default. It is still only available through the API.
- Use colored strings in tabular outputs (multi-system evaluation mode) through the `colorama` package.
- Tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. (#46)
- Signature: Formatting changed (mostly to remove the '+' separator as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Signature: Boolean true / false values are shortened to yes / no.
- Signature: Number of references is `var` if a variable number of references is used.
- Signature: Add effective order (yes/no) to BLEU and chrF signatures.
- Metrics: Scale all metrics into the [0, 100] range (#140)
- Metrics API: Use explicit argument names and defaults for the metrics instead of passing obscure `argparse.Namespace` objects.
- Metrics API: A base abstract `Metric` class is introduced to guide further metric development. This class defines the methods that should be implemented in the derived classes and offers boilerplate methods for the common functionality. A new metric implemented this way will automatically support significance testing.
- Metrics API: All metrics now receive an optional `references` argument at initialization time to process and cache the references. Further evaluations of different systems against the same references become faster this way, for example when using significance testing.
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (#141).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
- CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`) (#124)
- CHRF: Add the possibility to disable effective order smoothing (pass `--chrf-eps-smoothing`). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences. (#144)
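For orientation, chrF averages character n-gram F-beta scores (beta=2, weighting recall) over orders 1..6. Below is a minimal single-reference, sentence-level sketch, without the effective-order or eps smoothing discussed above; the helper names are hypothetical, not sacreBLEU's internals:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF removes whitespace before extracting character n-grams
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Average F-beta over character n-gram orders 1..max_order, in [0, 100]."""
    f_scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # no n-grams of this order; skip it
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

An identical hypothesis and reference yield 100.0; fully disjoint strings yield 0.0.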
- CLI: `--input/-i` can now ingest multiple systems. For this reason, the positional `references` should always precede the `-i` flag.
- CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Separate metric-specific arguments for clarity when `--help` is printed.
- CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` in your shell.
- CLI: For multi-system mode, `json` falls back to plain text. `latex` output can only be generated for multi-system mode.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. `--format/-f` argument). The systems can be given either as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and #78).
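As a rough illustration of the paired bootstrap idea behind `--paired-bs` (not sacreBLEU's actual implementation, which resamples sufficient statistics and recomputes corpus-level scores rather than averaging per-segment values): resample segments with replacement and count how often one system's mean score beats the other's.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=12345):
    """Return the fraction of bootstrap resamples where system A outscores B.

    scores_a / scores_b: per-segment scores for two systems on the SAME
    test set (higher is better); pairing is preserved by resampling indices.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_samples):
        # Sample segment indices with replacement (one bootstrap resample)
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_samples
```

A win fraction near 1.0 (or 0.0) suggests the difference between the systems is stable under resampling of the test set.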
- Fix extraction error for WMT18 extra test sets (test-ts) (#142)
- Validation and test datasets are added for multilingual TEDx
- Fix an assertion error in chrF (#121)
- Add missing `__repr__()` methods for BLEU and TER
- TER: Fix exception when `--short` is used (#131)
- Pin Mecab version to 1.0.3 for Python 3.5 support
- [API Change]: Default value for `floor` smoothing is now 0.1 instead of 0.
- [API Change]: `sacrebleu.sentence_bleu()` now uses the `exp` smoothing method, exactly the same as the CLI's `--sentence-level` behavior. This was mainly done to make the two methods behave the same.
- Add smoothing value to BLEU signature (#98)
- dataset: Fix IWSLT links (#128)
- Allow variable number of references for BLEU (only via API) (#130). Thanks to Ondrej Dusek (@tuetschek)
- Added character-based tokenization (`-tok char`). Thanks to Christian Federmann.
- Added TER (`-m ter`). Thanks to Ales Tamchyna! (fixes #90)
- Allow calling the script as a standalone utility (fixes #86)
- Fix type annotation issues (fixes #100) and mark sacrebleu as supporting mypy
- Added WMT20 robustness test sets:
- wmt20/robust/set1 (en-ja, en-de)
- wmt20/robust/set2 (en-ja, ja-en)
- wmt20/robust/set3 (de-en)
- Added WMT20 newstest test sets (#103)
- Make mecab3-python an extra dependency and adapt code to the new mecab3-python. This fixes the recent Windows installation issues as well (#104). Japanese support should now be explicitly installed through the sacrebleu[ja] package.
- Fix return type annotation of corpus_bleu()
- Improve sentence_score's documentation, do not allow single ref string (#98)
- Fix a deployment bug (#96)
- Added Multi30k multimodal MT test set metadata
- Refactored all tokenizers into respective classes (fixes #85)
- Refactored all metrics into respective classes
- Moved utility functions into `utils.py`
- Implemented signatures using `BLEUSignature` and `CHRFSignature` classes
- Simplified checking of Chinese characters (fixes #5)
- Unified common regexp tokenization codes for tokenizers (fixes #27)
- Fixed --detail failing when no test sets are provided
- Fixed multi-reference BLEU failing when tab-delimited reference stream is used
- Removed lowercase option for ChrF which was not functional (#85)
- Simplified ChrF and used the same I/O logic as BLEU to allow for future multi-reference reading
- Added score regression tests for chrF using reference chrF++ implementation
- Added multi-reference & tokenizer & signature tests
- Fixed bug in signature with mecab tokenizer
- Cleaned up deprecation warnings (thanks to Karthikeyan Singaravelan @tirkarthi)
- Now only lists the external `typing` module as a dependency for Python <= 3.4, as it was integrated into the standard library in Python 3.5 (thanks to Erwan de Lépinau @ErwanDL).
- Added LICENSE to PyPI (thanks to Mark Harfouche @hmaarrfk)
- Changed `get_available_testsets()` to return a list
- Remove Japanese MeCab tokenizer from requirements. (Must be installed manually to avoid Windows incompatibility.) Many thanks to Makoto Morishita (@MorinoseiMorizo).
- Added to API:
- get_source_file()
- get_reference_files()
- get_available_testsets()
- get_langpairs_for_testset()
- Some internal refactoring
- Fixed descriptions of some WMT19/google test sets
- Added API test case (test/test_apy.py)
- Added Google's extra wmt19/en-de refs (`-t wmt19/google/{ar,arp,hqall,hqp,hqr,wmtp}`) (Freitag, Grangier, & Caswell, "BLEU might be Guilty but References are not Innocent", https://arxiv.org/abs/2004.06063)
- Restored SACREBLEU_DIR and smart_open to exports (thanks to Thomas Liao @tholiao)
- Large internal reorganization as a module (thanks to Thamme Gowda @thammegowda)
- Added Japanese MeCab tokenizer (`-tok ja-mecab`) (thanks to Makoto Morishita @MorinoseiMorizo)
- Added wmt20/dev test sets (thanks to Martin Popel @martinpopel)
- Smoothing changes (Sebastian Nickels @sn1c)
- Fixed bug that only applied smoothing to n-grams for n > 2
- Added default smoothing values for methods "floor" (0) and "add-k" (1)
- `--list` now returns a list of all language pairs for a task when combined with `-t` (e.g., `sacrebleu -t wmt19 --list`)
- Added missing languages for IWSLT17
- Minor code improvements (Thomas Liao @tholiao)
- Bugfix: handling of result object for CHRF
- Improved API example
- Tokenization variant omitted from the chrF signature; it is relevant only for BLEU (thanks to Martin Popel)
- Bugfix: call to sentence_bleu (thanks to Rachel Bawden)
- Documentation example for Python API (thanks to Vlad Lyalin)
- Calls to corpus_chrf and sentence_chrf now return an object instead of a float (use result.score)
- Added sentence-level scoring via -sl (--sentence-level)
- Many thanks to Martin Popel for all the changes below!
- Added evaluation on concatenated test sets (e.g., `-t wmt17,wmt18`). Works as long as they all have the same language pair.
- Added `sacrebleu --origlang` (both for evaluation on a subset and for `--echo`). Note that while echoing prints just the subset, evaluation expects the complete test set (and just skips the irrelevant parts).
- Added `sacrebleu --detail` for breakdown by domain-specific subsets of the test sets. (Available for WMT19.)
- Minor changes
- Improved display of `sacrebleu -h`
- Added `sacrebleu --list`
- Code refactoring
- Documentation and tests updates
- Fixed a race condition bug (`os.makedirs(outdir, exist_ok=True)` instead of `if os.path.exists`)
- Lazy loading of regexes cuts import time from ~1s to nearly nothing (thanks, @louismartin!)
- Added a simple (non-atomic) lock on downloading
- Can now read multiple refs from a single tab-delimited file. You need to pass `--num-refs N` to tell it to run the split. Only works with a single reference file passed from the command line.
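The tab-delimited splitting that `--num-refs N` enables can be pictured as follows (a simplified sketch with a hypothetical helper, not sacreBLEU's internal code):

```python
def split_tsv_refs(lines, num_refs):
    """Split tab-delimited reference lines into num_refs parallel streams."""
    streams = [[] for _ in range(num_refs)]
    for lineno, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != num_refs:
            raise ValueError(
                f"line {lineno}: expected {num_refs} fields, got {len(fields)}")
        for stream, ref in zip(streams, fields):
            stream.append(ref)
    return streams

# Example: two references per segment in a single file
lines = ["ref one a\tref two a\n", "ref one b\tref two b\n"]
refs = split_tsv_refs(lines, 2)  # two lists of two references each
```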
- Removed another f-string for Python 3.5 compatibility
- Restored Python 3.5 compatibility
- Added MTNT 2019 test sets
- Added a BLEU object
- Added WMT'19 test sets
- Bugfix in test case (thanks to Adam Roberts, @adarob)
- Passing smoothing method through `sentence_bleu`
- Added another smoothing approach (add-k) and a command-line option for choosing the smoothing method (`--smooth exp|floor|add-k|none`) and the associated value (`--smooth-value`), when relevant.
- Changed interface to some functions (backwards incompatible)
  - 'smooth' is now 'smooth_method'
  - 'smooth_floor' is now 'smooth_value'
- Ctrl-M characters are now treated as normal characters, previously treated as newline.
- Tokenization now defaults to "zh" when language pair is known
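As a sketch of how the `--smooth` methods mentioned above differ (simplified from the per-order sufficient-statistics handling in the real implementation; the function name is hypothetical): they only disagree on what to do when an n-gram order has zero matches.

```python
def smooth_precision(matches, total, method="exp", value=0.1, invcnt=1):
    """Return (precision, invcnt) for one n-gram order.

    'exp' halves the contribution for each successive zero count
    (precision = 1 / (2^k * total)); 'floor' substitutes a small
    constant numerator; 'add-k' adds k to numerator and denominator;
    'none' leaves zeros as zeros.
    """
    if method == "add-k":
        return (matches + value) / (total + value), invcnt
    if matches > 0:
        return matches / total, invcnt
    # zero matches from here on
    if method == "floor":
        return value / total, invcnt
    if method == "exp":
        invcnt *= 2  # doubles on every zero-count order encountered
        return 1.0 / (invcnt * total), invcnt
    return 0.0, invcnt  # 'none'
```

For example, with 0 matches out of 10 n-grams, `floor` (value 0.1) gives 0.01, while the first `exp` zero gives 1/(2*10) = 0.05.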
- Updated checksum for wmt19/dev (seems to have changed)
- Fixed checksum for wmt17/dev (copy-paste error)
- Added kk-en and en-kk to wmt19/dev
- Added gu-en and en-gu to wmt19/dev
- Added MD5 checksumming of downloaded files for all datasets.
- Added mtnt1.1/train mtnt1.1/valid mtnt1.1/test data from MTNT
- Added 'wmt19/dev' task for 'lt-en' and 'en-lt' (development data for new tasks).
- Added MD5 checksum for downloaded tarballs.
- Now outputs only one digit after the decimal
- Added a function for sentence-level, smoothed BLEU
- Added wmt18 test set (with references)
- Added zh-en, en-zh, tr-en, and en-tr datasets for wmt18/test-ts
- Added wmt18/test-ts, the test sources (only) for WMT18
- Moved the README out of `sacrebleu.py` and the CHANGELOG into a separate file
- fixed another locale issue (with --echo)
- grudgingly enabled `-tok none` from the command line
- added wmt17/ms (Microsoft's additional ZH-EN references). Try `sacrebleu -t wmt17/ms --cite`.
- `--echo ref` now pastes together all references, if there is more than one
- added wmt18/dev datasets (en-et and et-en)
- fixed logic with --force
- locale-independent installation
- added "--echo both" (tab-delimited)
- metrics (`-m`) are now printed in the order requested
- chrF now prints a version string (including the beta parameter, importantly)
- attempt to remove dependence on locale setting
- added the chrF metric (`-m chrf`, or `-m bleu chrf` for both). See 'CHRF: character n-gram F-score for automatic MT evaluation' by Maja Popovic (WMT 2015) [http://www.statmt.org/wmt15/pdf/WMT49.pdf]
- added IWSLT 2017 test and tuning sets for DE, FR, and ZH (thanks to Mauro Cettolo and Marcello Federico)
- added `--cite` to produce the citation for easy inclusion in papers
- added `--input` (`-i`) to set input to a file instead of STDIN
- removed accent mark after objection from UN official
- corpus_bleu() now raises an exception if input streams are different lengths
- thanks to Martin Popel for:
- small bugfix in tokenization_13a (not affecting WMT references)
- adding `--tok intl` (international tokenization)
- added wmt16/dev and wmt17/dev sets (for languages intro'd those years)
- bugfix for tokenization warning
- added -b option (only output the BLEU score)
- removed fi-en from list of WMT16/17 systems with more than one reference
- added WMT16/tworefs and WMT17/tworefs for scoring with both en-fi references
- added effective order for sentence-level BLEU computation
- added unit tests from sockeye
- Factored code a bit to facilitate API:
- compute_bleu: works from raw stats
- corpus_bleu for use from the command line
- raw_corpus_bleu: turns off tokenization, command-line sanity checks, floor smoothing
- Smoothing (type 'exp', now the default) fixed to produce mteval-v13a.pl results
- Added 'floor' smoothing (adds 0.01 to 0 counts, more versatile via API), 'none' smoothing (via API)
- Small bugfixes, windows compatibility (H/T Christian Federmann)
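The "effective order" mentioned above caps the n-gram orders used for a sentence at the hypothesis length, so a 3-word hypothesis is scored over orders 1–3 instead of being zeroed out by an impossible 4-gram match. A minimal sketch of that step (assuming the usual geometric mean of precisions; name hypothetical):

```python
import math

def bleu_geometric_mean(precisions, hyp_len, max_order=4, effective_order=True):
    """Geometric mean of n-gram precisions, optionally using effective order.

    precisions: list of max_order precision values (orders 1..max_order).
    hyp_len: hypothesis length in tokens.
    """
    order = min(max_order, hyp_len) if effective_order else max_order
    used = precisions[:order]
    if any(p == 0 for p in used):
        return 0.0
    return math.exp(sum(math.log(p) for p in used) / order)
```

With precisions `[1.0, 1.0, 1.0, 0.0]` and a 3-token hypothesis, effective order yields 1.0 while the fixed order-4 mean collapses to 0.0.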
- Contributions from Christian Federmann:
- Added explicit support for encoding
- Fixed Windows support
- Bugfix in handling reference length with multiple refs
- Small bugfix affecting some versions of Python.
- Code reformatting due to Ozan Çağlayan.
- Support for WMT 2008--2017.
- Single tokenization (v13a) with lowercase fix (proper lower() instead of just A-Z).
- Chinese tokenization.
- Tested to match all WMT17 scores on all arcs.