Features
- This release introduces an absolute confidence metric based on the unique and most common n-grams of each supported language. This makes it possible to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether a text is written in your selected language or not. (#235)
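As a rough illustration of the single-language use case, the sketch below builds a detector from English only and uses it as a yes/no check. It assumes that version 2.1 or later of the library is installed. The helper names english_only_detector and is_english are made up for this example, and whether a given text is accepted depends on the language models.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def english_only_detector():
    """Build the single-language detector once and reuse it afterwards."""
    # Imported lazily so that this module loads even without the library.
    from lingua import Language, LanguageDetectorBuilder

    # Passing a single language relies on the feature described above.
    return LanguageDetectorBuilder.from_languages(Language.ENGLISH).build()


def is_english(text: str) -> bool:
    """Return True if the English-only detector accepts the text."""
    from lingua import Language

    # detect_language_of() returns a Language member or None.
    return english_only_detector().detect_language_of(text) == Language.ENGLISH
```

Calling is_english("some text") then answers only the question "English or not"; it does not say which other language the text might be written in.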
Improvements
- The new absolute confidence metric improves accuracy in low accuracy mode. The mean of average detection accuracy (single words, word pairs and sentences combined) increases from 77% to 80%.
- The rule-based algorithm for recognizing Japanese texts has been improved. Texts that include both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.
- The characters Щщ are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
- The enums provided by this library can now be copied and pickled. (#199)
- Members of the enums provided by this library can now be created dynamically with the function from_str(). (#225)
- The library can now be used with Azure Artifacts. (#209)
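The enum changes in this list (copying, pickling and from_str()) can be tried out with a small sketch like the following. The roundtrip helper uses only the standard library and works for any picklable object. check_language_roundtrip is a hypothetical helper that needs the library installed, and the spelling of the name passed to from_str() is an assumption.

```python
import copy
import pickle
from typing import Any, Tuple


def roundtrip(member: Any) -> Tuple[Any, Any, Any]:
    """Return a shallow copy, a deep copy and an unpickled copy of `member`."""
    return (
        copy.copy(member),
        copy.deepcopy(member),
        pickle.loads(pickle.dumps(member)),
    )


def check_language_roundtrip(name: str = "German") -> bool:
    """Create a Language member from a string and check that it survives copying."""
    # Lazy import: requires a library version that provides from_str().
    from lingua import Language

    # Assumption: from_str() accepts the language name in this spelling.
    language = Language.from_str(name)
    return all(clone == language for clone in roundtrip(language))
```

Pickling matters mostly when enum members are sent to worker processes or stored in caches, which is where the previous behavior tended to surface.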
Bug Fixes
- Text spans created by LanguageDetector.detect_multiple_languages_of() sometimes skipped characters in the last span. This has been fixed.
- The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.
- The classes provided by this library are no longer part of the builtins module but of the correct lingua module. (#255)
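To check your own inputs for the span problem mentioned in this list, a sketch along these lines can report which character ranges are not covered by any span. It relies only on the start_index and end_index attributes of the detection results. The names Span, uncovered_ranges and find_skipped_characters are hypothetical, and the choice of languages is arbitrary.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical stand-in with the two index attributes used below."""
    start_index: int
    end_index: int


def uncovered_ranges(text_length: int, spans: Iterable) -> List[Tuple[int, int]]:
    """Return the (start, end) ranges of the text that no span covers."""
    gaps = []
    position = 0
    for span in sorted(spans, key=lambda s: s.start_index):
        if span.start_index > position:
            gaps.append((position, span.start_index))
        position = max(position, span.end_index)
    if position < text_length:
        gaps.append((position, text_length))
    return gaps


def find_skipped_characters(text: str) -> List[Tuple[int, int]]:
    """Run multi-language detection and list the ranges left uncovered."""
    # Lazy import so that the helper above works without the library.
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN
    ).build()
    results = detector.detect_multiple_languages_of(text)
    return uncovered_ranges(len(text), results)
```

An empty list from find_skipped_characters means every character of the text belongs to some span. Whether the spans are meant to cover all whitespace between sentences is an assumption of this check.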
Compatibility
- The newest Python 3.13 is now officially supported.
- Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is now 3.10.