Releases: pemistahl/lingua-py
Lingua 1.4.0
Features
- This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. It makes it possible to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)
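
The idea of a single-language binary classifier can be illustrated with a toy sketch. This is not Lingua's implementation or API: the trigram set, the scoring formula, the threshold and all function names below are hypothetical and exist only to show how a score computed against a single language can drive a yes/no decision.

```python
# Toy illustration only: NOT how Lingua computes its absolute confidence.
# The trigram set is hand-picked and hypothetical; a real model would be
# learned from a large corpus.
COMMON_ENGLISH_TRIGRAMS = {"the", "he ", " th", "ing", "and", " an", "nd ", "ion"}


def char_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping character n-grams of the lowercased text."""
    cleaned = text.lower()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def absolute_confidence(text: str) -> float:
    """Fraction of the text's trigrams found in the reference set (0.0 to 1.0)."""
    ngrams = char_ngrams(text)
    if not ngrams:
        return 0.0
    hits = sum(1 for ngram in ngrams if ngram in COMMON_ENGLISH_TRIGRAMS)
    return hits / len(ngrams)


def is_english(text: str, threshold: float = 0.2) -> bool:
    """Binary decision: the score depends on one language only, not on rivals."""
    return absolute_confidence(text) >= threshold
```

Because the score does not depend on comparing several candidate languages, a detector of this kind can answer "is this text in language X?" without any other language model being loaded.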
Improvements
- The new absolute confidence metric helps to improve accuracy in low accuracy mode. The mean of average detection accuracy (single words, word pairs and sentences combined) increases from 77% to 80%.
Bug Fixes
- The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.
Compatibility
- The newest Python 3.13 is now officially supported.
- Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is now 3.10.
Please note: All new features and bug fixes will also be part of the next Rust-based Python extension release 2.1.0.
Lingua 1.3.5
Improvements
- The language models are now stored in dictionaries instead of NumPy arrays. This change leads to significantly improved runtime performance at the cost of higher memory consumption (up to 3 GB for all models). As the runtime performance was much too slow with the former approach, this change makes sense because adding more memory is quite cheap.
- The language model files are now compressed with the Brotli algorithm, which reduces the file size by 15% on average.
- The characters Щщ are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
Miscellaneous
- All dependencies have been updated to their latest versions.
Lingua 2.0.2
Improvements
- Type stubs for the Python bindings are now available, allowing better static code analysis, better code completion in supported IDEs and easier understanding of the library's API. (#197)
Bug Fixes
- The method LanguageDetector.detect_multiple_languages_of still returned character indices instead of byte indices when only a single DetectionResult was produced. This has been fixed. (#203, #205)
Please note: Due to project size limits on PyPI, the Python wheels for previous version 2.0.1 had to be deleted. Please use 2.0.2 instead.
Lingua 2.0.1
Bug Fixes
- The method LanguageDetector.detect_multiple_languages_of returned byte indices. For creating string slices in Python, character indices are needed but were not provided. This resulted in incorrect DetectionResults for Python. This has been fixed now by converting the byte indices to character indices. Big thanks to @boltonn for the bug report. (#192)
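
To see why the distinction matters, here is a minimal sketch of converting a UTF-8 byte offset into a character offset that Python string slicing can use. The helper name byte_to_char_index is hypothetical and is not part of Lingua's API; the sketch also assumes that the byte offset falls on a character boundary.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Convert a UTF-8 byte offset into `text` to a character offset.

    Assumes `byte_index` lies on a character boundary; otherwise decoding
    the truncated byte string raises UnicodeDecodeError.
    """
    prefix = text.encode("utf-8")[:byte_index]
    return len(prefix.decode("utf-8"))


text = "Grüße, world"
# "ü" and "ß" each occupy two bytes in UTF-8, so "Grüße" is 5 characters
# but 7 bytes long.
end_in_bytes = len("Grüße".encode("utf-8"))
end_in_chars = byte_to_char_index(text, end_in_bytes)
first_word = text[:end_in_chars]
```

Slicing with the raw byte offset (text[:7]) would cut the string two characters too late, which is the kind of mismatch described in the bug report.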
Please note: Due to project size limits on PyPI, the Python wheels for previous version 2.0.0 had to be deleted. Please use 2.0.1 instead.
Lingua 2.0.0
Features
- Python bindings for the Rust implementation of Lingua have now replaced the pure Python implementation in order to benefit from Rust's performance in any Python software.
- Parallel equivalents for all methods in LanguageDetector have been added to give the user the choice of using the library single-threaded or multi-threaded.
Lingua 1.3.4
Miscellaneous
- This release resolves some dependency issues so that the latest versions of the dependencies NumPy, Pandas and Matplotlib can be used with Python >= 3.9 while older versions are used with Python 3.8.
- All dependencies have been updated to their latest versions.
Lingua 1.3.3
Improvements
- Processing the language models is now a little faster because lookups use binary search on the language model NumPy arrays.
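
As a rough illustration of the technique, the sketch below looks up an ngram in a sorted list with binary search. It uses the standard library's bisect module on plain lists rather than NumPy arrays (numpy.searchsorted is the NumPy analogue), and the data layout, values and function name are assumptions, not Lingua's internal format.

```python
from bisect import bisect_left
from typing import Optional

# Hypothetical model data: ngrams sorted lexicographically, with their
# probabilities stored at the same positions in a parallel list.
SORTED_NGRAMS = ["and", "ing", "the"]
PROBABILITIES = [0.02, 0.03, 0.05]


def lookup_probability(ngram: str) -> Optional[float]:
    """Find the probability of `ngram` in O(log n) via binary search."""
    position = bisect_left(SORTED_NGRAMS, ngram)
    if position < len(SORTED_NGRAMS) and SORTED_NGRAMS[position] == ngram:
        return PROBABILITIES[position]
    return None  # the ngram is not part of the model
```

Compared with a linear scan, the number of comparisons grows logarithmically with the size of the model, which is where the speed-up comes from.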
Bug Fixes
- Several bugs in multiple languages detection have been fixed that caused incomplete results to be returned in several cases. (#143, #154)
- A significant number of Kazakh texts were incorrectly classified as Mongolian. This has been fixed. (#160)
Miscellaneous
- A new section on performance tips has been added to the README.
- All dependencies have been updated to their latest versions.
Lingua 1.3.2
Improvements
- After applying some internal optimizations, language detection is now approximately 20% to 30% faster. For long input texts, the speed improvement is greater than for short input texts.
Lingua 1.3.1
Lingua 1.3.0
Improvements
- The min-max normalization method for the confidence values has been replaced with applying the softmax function. This gives more realistic probabilities. Big thanks to @Alex-Kopylov for proposing and implementing this change. (#99)
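
The difference between the two normalization methods can be sketched as follows. The input scores are hypothetical log-probabilities for three candidate languages, and the function names are illustrative; this is not Lingua's actual code.

```python
import math


def min_max(scores: list) -> list:
    """Rescale scores linearly so that the lowest is 0.0 and the highest 1.0."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [1.0 for _ in scores]
    return [(score - lowest) / (highest - lowest) for score in scores]


def softmax(scores: list) -> list:
    """Turn scores into a probability distribution that sums to 1.0."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exponentials = [math.exp(score - peak) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical summed log-probabilities for three candidate languages.
scores = [-10.0, -10.5, -14.0]
min_max_values = min_max(scores)
softmax_values = softmax(scores)
```

With min-max normalization the best candidate always receives exactly 1.0 and the worst exactly 0.0, no matter how close the raw scores are. Softmax output sums to 1.0 and preserves the relative gaps between candidates, so two nearly tied languages receive similar values.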