Lingua 2.1.0

@pemistahl released this 20 Mar 18:45

Features

  • This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. It allows you to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)
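As a rough illustration of the binary-classifier use case, the following sketch builds a detector from English only. It assumes lingua 2.1.0 or later is installed and that a single-language detector returns either the selected language or `None` from `detect_language_of()`. The helper name `is_english` is hypothetical and not part of the library.

```python
def is_english(text: str) -> bool:
    """Return True if a single-language detector accepts the text as English.

    Hypothetical helper for illustration; requires lingua >= 2.1.0.
    """
    # Imported lazily so this module can be loaded without lingua installed.
    from lingua import Language, LanguageDetectorBuilder

    # Building from exactly one language is what this release makes possible.
    detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH).build()

    # Assumption: the result is Language.ENGLISH or None in this mode.
    return detector.detect_language_of(text) == Language.ENGLISH
```

In real code the detector would be built once and reused, since building it loads the language models.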

Improvements

  • The new absolute confidence metric helps to improve accuracy in low accuracy mode. The mean detection accuracy (averaged over single words, word pairs and sentences) increases from 77% to 80%.

  • The rule-based algorithm for the recognition of Japanese texts has been improved. Texts including both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.

  • The characters Щщ are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.

  • The enums provided by this library can now be copied and pickled. (#199)

  • Members of the enums provided by this library can now be created dynamically with the function from_str(). (#225)

  • The library can now be used with Azure Artifacts. (#209)
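The enum improvements above can be tried out with a short sketch like the one below. It assumes lingua 2.1.0 or later is installed. The function name `roundtrip_language` is hypothetical, and the exact spellings that `from_str()` accepts are an assumption here ("English" is used as one plausible input).

```python
import copy
import pickle


def roundtrip_language():
    """Copy, pickle and parse a Language enum member (illustrative only)."""
    # Imported lazily so this module can be loaded without lingua installed.
    from lingua import Language

    original = Language.ENGLISH

    # Copying and pickling enum members is supported as of this release.
    copied = copy.copy(original)
    restored = pickle.loads(pickle.dumps(original))

    # Assumption: from_str() accepts the language name spelled "English".
    parsed = Language.from_str("English")

    return original, copied, restored, parsed
```

Pickling support matters, for example, when enum members are passed to worker processes through `multiprocessing`.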

Bug Fixes

  • Text spans created by LanguageDetector.detect_multiple_languages_of() sometimes skipped characters in the last span. This has been fixed.

  • The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.

  • The classes provided by this library no longer belong to the builtins module but to the correct lingua module. (#255)
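To check on your own inputs whether the spans from `detect_multiple_languages_of()` cover the whole text, a small helper along these lines may be useful. It is a sketch, not part of the library: the function names are hypothetical, and it assumes each result exposes `start_index` and `end_index` as character offsets with an exclusive end. Whether every uncovered range indicates a problem is also an assumption.

```python
def find_uncovered_ranges(spans, text_length):
    """Return the (start, end) ranges of a text that no span covers.

    spans: iterable of (start, end) pairs with an exclusive end.
    """
    gaps = []
    cursor = 0  # first position not yet covered by any span
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < text_length:
        # Characters after the last span were skipped.
        gaps.append((cursor, text_length))
    return gaps


def uncovered_ranges_for(text, detector):
    """Apply the check to a detector built from at least two languages."""
    results = detector.detect_multiple_languages_of(text)
    spans = [(r.start_index, r.end_index) for r in results]
    return find_uncovered_ranges(spans, len(text))
```

An empty list from `uncovered_ranges_for()` means the spans account for every character of the input.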

Compatibility

  • Python 3.13 is now officially supported.
  • Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is now 3.10.