- The language model files have been converted into a new storage format. They are now stored as finite-state transducers (FSTs), which reduces memory consumption drastically at the cost of slightly slower runtime performance. FSTs can be searched on disk without being read entirely into memory, so only a few dozen megabytes of memory are required even when all languages are loaded. The former hashmap-based approach required at least hundreds of megabytes of memory. (#287)
- The language model files are no longer compressed with the Brotli algorithm. This means that they can be loaded into memory much faster, which almost entirely avoids latency issues in, for example, web services. The new FST storage helps in this regard as well. The only downside is that the language model files have grown in size on disk. They now consume approximately 300 MB altogether instead of the previous 110 MB. The file size of the WASM module is affected by this as well.
- The unique and most common ngrams for each language now slightly improve language detection accuracy when the low-accuracy mode is enabled. In previous releases, unique and most common ngrams were only taken into consideration when the single-language mode was active.
- The test data files for Latin and Welsh contained broken characters, which resulted in incorrect accuracy reports for these languages. This has been fixed. (#288)
- The newest Python 3.14 is now officially supported. (#273)
- Support for Python 3.10 and 3.11 has been dropped. The lowest supported Python version is 3.12 now.
- In low accuracy mode, the language detector could produce random results for certain kinds of text. This has been fixed.
- This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. It allows you to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)
- The new absolute confidence metric helps to improve accuracy in low accuracy mode. The mean of average detection accuracy (single words, word pairs and sentences combined) increases from 77% to 80%.
- The rule-based algorithm for the recognition of Japanese texts has been improved. Texts including both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.
- The characters `Щщ` are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
- The enums provided by this library can now be copied and pickled. (#199)
- Members of the enums provided by this library can now be created dynamically with the function `from_str()`. (#225)
- The library can now be used with Azure Artifacts. (#209)
- Text spans created by `LanguageDetector.detect_multiple_languages_of()` sometimes skipped characters in the last span. This has been fixed.
- The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.
- The classes provided by this library are not part of the `builtins` module anymore but of the correct `lingua` module. (#255)
- The newest Python 3.13 is now officially supported.
- Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is 3.10 now.
- The rule-based algorithm for the recognition of Japanese texts has been improved. Texts including both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.
- Text spans created by `LanguageDetector.detect_multiple_languages_of()` sometimes skipped characters in the last span. This has been fixed. (#247)
- This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. It allows you to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)
- The new absolute confidence metric helps to improve accuracy in low accuracy mode. The mean of average detection accuracy (single words, word pairs and sentences combined) increases from 77% to 80%.
- The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.
- The newest Python 3.13 is now officially supported.
- Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is 3.10 now.
- The language models are now stored in dictionaries instead of NumPy arrays. This change leads to significantly improved runtime performance at the cost of higher memory consumption (up to 3 GB for all models). As the runtime performance was much too slow with the former approach, this change makes sense because adding more memory is quite cheap.
- The language model files are now compressed with the Brotli algorithm, which reduces the file size by 15 % on average.
- The characters `Щщ` are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
- All dependencies have been updated to their latest versions.
- Type stubs for the Python bindings are now available, allowing better static code analysis, better code completion in supported IDEs and easier understanding of the library's API. (#197)
- The method `LanguageDetector.detect_multiple_languages_of` still returned character indices instead of byte indices when only a single `DetectionResult` was produced. This has been fixed. (#203, #205)
- The method `LanguageDetector.detect_multiple_languages_of` returns byte indices. For creating string slices in Python and JavaScript, character indices are needed but were not provided. This resulted in incorrect `DetectionResult`s for Python and JavaScript. This has been fixed now by converting the byte indices to character indices. (#192)
- Some minor bugs in the WASM module have been fixed to prepare the first release of Lingua for JavaScript.
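
The difference between byte indices and character indices mentioned above can be illustrated with a minimal sketch. It is not the library's code: the helper `byte_to_char_index` and the sample offsets are hypothetical and assume that the byte offsets refer to the UTF-8 encoding of the text.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Convert an offset into the UTF-8 encoding of `text` to a character offset."""
    encoded = text.encode("utf-8")
    if not 0 <= byte_index <= len(encoded):
        raise ValueError("byte index out of range")
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # falls inside a multi-byte character.
    return len(encoded[:byte_index].decode("utf-8"))


text = "Grüße und hello"

# Hypothetical byte offsets of the word "hello" in the UTF-8 encoded text.
start_byte, end_byte = 12, 17

start = byte_to_char_index(text, start_byte)
end = byte_to_char_index(text, end_byte)

print(text[start:end])            # slice built from converted character indices
print(text[start_byte:end_byte])  # slice built from raw byte indices is shifted
```

Because `ü` and `ß` each occupy two bytes in UTF-8, the raw byte offsets point two characters too far into the string, which is the kind of mismatch that leads to incorrect slices.
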
- Python bindings for the Rust implementation of Lingua have now replaced the pure Python implementation in order to benefit from Rust's performance in any Python software.
- Parallel equivalents for all methods in `LanguageDetector` have been added to give the user the choice of using the library single-threaded or multi-threaded.
- This release resolves some dependency issues so that the latest versions of the dependencies NumPy, Pandas and Matplotlib can be used with Python >= 3.9 while older versions are used with Python 3.8.
- All dependencies have been updated to their latest versions.
- Processing the language models is now a little faster thanks to binary search on the language model NumPy arrays.
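
The idea of looking up ngrams by binary search can be sketched as follows. This is only an illustration under stated assumptions: plain sorted Python lists and the standard `bisect` module stand in for the NumPy arrays, and the names `NGRAMS`, `PROBABILITIES` and `lookup_probability` as well as the values are hypothetical.

```python
from bisect import bisect_left
from typing import Optional

# Hypothetical model data: ngrams in ascending order and a parallel
# list holding the probability of each ngram.
NGRAMS = ["ab", "de", "er", "ie", "th"]
PROBABILITIES = [0.02, 0.05, 0.07, 0.03, 0.04]


def lookup_probability(ngram: str) -> Optional[float]:
    """Return the probability of `ngram`, or None if the model does not contain it."""
    # bisect_left finds the leftmost insertion point in O(log n).
    index = bisect_left(NGRAMS, ngram)
    if index < len(NGRAMS) and NGRAMS[index] == ngram:
        return PROBABILITIES[index]
    return None


print(lookup_probability("er"))
print(lookup_probability("zz"))
```

A sorted array needs no per-entry hashing overhead, so it stays compact in memory, while binary search keeps each lookup logarithmic in the number of ngrams.
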
- Several bugs in multiple-language detection that caused incomplete results to be returned in several cases have been fixed. (#143, #154)
- A significant number of Kazakh texts were incorrectly classified as Mongolian. This has been fixed. (#160)
- A new section on performance tips has been added to the README.
- All dependencies have been updated to their latest versions.
- After applying some internal optimizations, language detection is now approximately 20% to 30% faster. For long input texts, the speed improvement is greater than for short input texts.
- For long input texts, an error occurred while computing the confidence values due to numerical underflow when converting probabilities. This has been fixed. Thanks to @jordimas for reporting this bug. (#102)
- The min-max normalization method for the confidence values has been replaced with applying the softmax function. This gives more realistic probabilities. Big thanks to @Alex-Kopylov for proposing and implementing this change. (#99)
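
The difference between min-max normalization and softmax can be illustrated with a small sketch. It is not the library's implementation: the function names and the log-probability values are hypothetical, and the example assumes that each candidate language is scored by a summed log-probability.

```python
import math


def min_max_normalize(values):
    """Scale values linearly so that the smallest becomes 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [1.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def softmax(log_probs):
    """Turn log-probabilities into a distribution that sums to 1."""
    # Subtracting the maximum keeps every exponent <= 0, so the largest
    # term is exp(0) = 1 and the sum can never underflow to zero.
    highest = max(log_probs)
    exps = [math.exp(v - highest) for v in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical summed log-probabilities for three candidate languages.
log_probs = [-1200.0, -1201.0, -1210.0]

print(math.exp(log_probs[0]))        # direct conversion underflows for long texts
print(min_max_normalize(log_probs))  # best is always 1.0, worst always 0.0
print(softmax(log_probs))            # values sum to 1
```

With min-max normalization, the top language always receives 1.0 and the bottom one 0.0 regardless of how close the scores are, whereas softmax produces values that sum to one and reflect the relative distances between the scores.
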
- Under certain circumstances, calling the method `LanguageDetector.detect_multiple_languages_of()` raised an `IndexError`. This has been fixed. Thanks to @Saninsusanin for reporting this bug. (#98)
- The new method `LanguageDetector.detect_multiple_languages_of()` has been introduced. It allows you to detect multiple languages in mixed-language text. (#4)
- The new method `LanguageDetector.compute_language_confidence()` has been introduced. It allows you to retrieve the confidence value for one specific language only, given the input text. (#86)
- The computation of the confidence values has been revised and the min-max normalization algorithm is now applied to the values, making them easier to compare because they behave more like real probabilities. (#78)
- The library now has a fresh and colorful new logo. Why? Well, why not? (-:
- An `__all__` variable has been added indicating which types are exported by the library. This helps with type checking programs using Lingua. Big thanks to @bscan for the pull request. (#76)
- The rule-based language filter has been improved for German texts. (#71)
- A further bottleneck in the code has been removed, making language detection approximately 30 % faster compared to version 1.1.2.
- The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly.
- A bottleneck in the language detection code has been removed, making language detection approximately 40 % faster.
- The `py.typed` file that activates static type checking was missing. Big thanks to @Vasniktel for reporting this problem. (#63)
- The ISO 639-3 code for Urdu was wrong. Big thanks to @pluiez for reporting this bug. (#64)
- For certain ngrams, wrong probabilities were returned. This has been fixed. Big thanks to @3a77 for reporting this bug. (#62)
- The new method `LanguageDetectorBuilder.with_low_accuracy_mode()` has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance.
- The memory footprint has been reduced significantly by storing the language models in structured NumPy arrays instead of dictionaries. This reduces memory consumption from 2600 MB to 800 MB, approximately.
- Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint.
- The lowest supported Python version is 3.8 now. Python 3.7 is no longer compatible with this library.
- This patch release makes the library compatible with Python >= 3.7.1. Previously, it could be installed from PyPI only with Python >= 3.9.
- The very first release of Lingua. Enjoy!