Hacked together high/low confidence. (#110)
* Add 0.99 or 0.01 confidence for high/low confidence.

* Fix docs. 

* Simplify union.
john-parton authored Oct 18, 2023
1 parent ba55f1d commit 4b0ec98
Showing 10 changed files with 88 additions and 61 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -67,7 +67,9 @@ The easiest way to get started is to use the :meth:`detect` method.
```

There is also a `detect` method available for compatibility with `chardet`,
but it will always report `None` for the language and a confidence value of `0.99`.
but it will always report `None` for the language. The confidence value will either
be `0.99` or `0.01` depending on whether chardetng returns a "high" or "low"
confidence boolean.

```python
>>> from chardetng_py.compat import detect
6 changes: 3 additions & 3 deletions chardetng_docs/EncodingDetector.md
@@ -1,10 +1,10 @@
A Web browser-oriented detector for guessing what character
encoding a stream of bytes is encoded in.

The bytes are fed to the detector incrementally using the `feed`
The bytes are fed to the detector incrementally using the :code:`feed`
method. The current guess of the detector can be queried using
the `guess` method. The guessing parameters are arguments to the
`guess` method rather than arguments to the constructor in order
the :code:`guess` method. The guessing parameters are arguments to the
:code:`guess` method rather than arguments to the constructor in order
to enable the application to check if the arguments affect the
guessing outcome. (The specific use case is to disable UI for
re-running the detector with UTF-8 allowed and the top-level
26 changes: 14 additions & 12 deletions chardetng_docs/feed.md
@@ -7,32 +7,34 @@ chooses to chunk the stream. It is OK to call this method with
a zero-length byte slice.

The end of the stream is indicated by calling this method with
`last` set to `True`. In that case, the end of the stream is
considered to occur after the last byte of the `buffer` (which
:code:`last` set to :code:`True`. In that case, the end of the stream is
considered to occur after the last byte of the :code:`buffer` (which
may be zero-length) passed in the same call. Once this method
has been called with `last` set to `True` this method must not
has been called with :code:`last` set to :code:`True` this method must not
be called again.

If you want to perform detection on just the prefix of a longer
stream, do not pass `last=true` after the prefix if the stream
stream, do not pass :code:`last=True` after the prefix if the stream
actually still continues.

Returns `True` if after processing `buffer` the stream has
contained at least one non-ASCII byte and `False` if only
Returns :code:`True` if after processing :code:`buffer` the stream has
contained at least one non-ASCII byte and :code:`False` if only
ASCII has been seen so far.

## Parameters

buffer : bytes or bytearray
last : bool
buffer : :code:`bytes` or :code:`bytearray`
The next chunk of the byte stream.
last : :code:`bool`
Whether this is the last chunk of the byte stream.

## Returns

bool
`True` if the stream has contained at least one non-ASCII byte
and `False` if only ASCII has been seen so far.
:code:`bool`
:code:`True` if the stream has contained at least one non-ASCII byte
and :code:`False` if only ASCII has been seen so far.

## Raises

pyo3_runtime.PanicException
If this method has previously been called with `last` set to `True`.
If this method has previously been called with :code:`last` set to :code:`True`.
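The `feed` contract documented above (any number of chunks, then exactly one call with `last=True`, which may carry a zero-length buffer) can be sketched as follows. This is only an illustration: `feed_chunks` and `FakeDetector` are hypothetical names that are not part of chardetng_py, and the fake merely imitates the documented return value so that the sketch runs without the compiled extension. A real `EncodingDetector` is assumed to accept the same `feed(buffer, last=...)` call.

```python
class FakeDetector:
    """Hypothetical stand-in that imitates the documented feed() contract."""

    def __init__(self) -> None:
        self.finished = False
        self.non_ascii = False

    def feed(self, buffer: bytes, last: bool) -> bool:
        if self.finished:
            # The real detector is documented to raise PanicException here.
            raise RuntimeError("feed() called again after last=True")
        self.non_ascii = self.non_ascii or any(byte >= 0x80 for byte in buffer)
        self.finished = last
        return self.non_ascii


def feed_chunks(detector, chunks, *, stream_ended=True):
    """Feed each chunk in order; close the stream only if it really ended.

    Returns True if at least one non-ASCII byte has been seen so far.
    """
    seen_non_ascii = False
    for chunk in chunks:
        seen_non_ascii = detector.feed(chunk, last=False)
    if stream_ended:
        # A zero-length final buffer is allowed and marks the end of the stream.
        seen_non_ascii = detector.feed(b"", last=True)
    return seen_non_ascii
```

Passing `stream_ended=False` corresponds to the prefix case described above: the detector sees only part of a longer stream, so `last=True` is never sent.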
23 changes: 12 additions & 11 deletions chardetng_docs/guess.md
@@ -1,33 +1,34 @@
Guess the encoding given the bytes pushed to the detector so far
(via `feed()`), the top-level domain name from which the bytes were
(via :code:`feed()`), the top-level domain name from which the bytes were
loaded, and an indication of whether to consider UTF-8 as a permissible
guess.

## Parameters

tld : bytes or bytearray or None
tld : :code:`bytes` or :code:`bytearray` or :code:`None`
The rightmost DNS label of the hostname of the
host the stream was loaded from in lower-case ASCII form. That is, if
the label is an internationalized top-level domain name, it must be
provided in its Punycode form. If the TLD that the stream was loaded
from is unavailable, `None` may be passed instead, which is equivalent
to passing `b"com"`.
allow_utf8 : bool
If set to `False`, the return value of
this method won't be `"UTF-8"`. When performing detection
on `text/html` on non-`file:` URLs, Web browsers must pass `False`,
unless the user has taken a specific contextual action to request an
from is unavailable, :code:`None` may be passed instead, which is equivalent
to passing :code:`b"com"`.
allow_utf8 : :code:`bool`
If set to :code:`False`, the return value of
this method won't be :code:`"UTF-8"`. When performing detection
on :code:`text/html` on non-:code:`file:` URLs, Web browsers must pass
:code:`False`, unless the user has taken a specific contextual action to request an
override. This way, Web developers cannot start depending on UTF-8
detection. Such reliance would make the Web Platform more brittle.

## Returns

str
:code:`str`
The guessed encoding.

## Raises

pyo3_runtime.PanicException
If `tld` contains non-ASCII, period, or upper-case letters. The exception
If :code:`tld` contains non-ASCII, period, or upper-case letters. The exception
condition is intentionally limited to signs of failing to extract the
label correctly, failing to provide it in its Punycode form, and failure
to lower-case it. Full DNS label validation is intentionally not performed
28 changes: 14 additions & 14 deletions chardetng_docs/guess_assess.md
@@ -1,38 +1,38 @@
Performs the same function as `guess()` with the same parameters, but
Performs the same function as :code:`guess()` with the same parameters, but
additionally returns whether the guessed encoding had a higher score than
at least one other candidate. If this method returns `False`, the
at least one other candidate. If this method returns :code:`False`, the
guessed encoding is likely to be wrong.

## Parameters

tld : bytes or bytearray or None
tld : :code:`bytes` or :code:`bytearray` or :code:`None`
The rightmost DNS label of the hostname of the
host the stream was loaded from in lower-case ASCII form. That is, if
the label is an internationalized top-level domain name, it must be
provided in its Punycode form. If the TLD that the stream was loaded
from is unavailable, `None` may be passed instead, which is equivalent
to passing `b"com"`.
allow_utf8 : bool
If set to `False`, the return value of
this method won't be `"UTF-8"`. When performing detection
on `text/html` on non-`file:` URLs, Web browsers must pass `False`,
unless the user has taken a specific contextual action to request an
from is unavailable, :code:`None` may be passed instead, which is equivalent
to passing :code:`b"com"`.
allow_utf8 : :code:`bool`
If set to :code:`False`, the return value of
this method won't be :code:`"UTF-8"`. When performing detection
on :code:`text/html` on non-:code:`file:` URLs, Web browsers must pass
:code:`False`, unless the user has taken a specific contextual action to request an
override. This way, Web developers cannot start depending on UTF-8
detection. Such reliance would make the Web Platform more brittle.

## Returns

encoding: str
encoding: :code:`str`
The guessed encoding.
higher_score: bool
higher_score: :code:`bool`
Whether the guessed encoding had a higher score than at least
one other candidate. If this value is `False`, the guessed encoding
one other candidate. If this value is :code:`False`, the guessed encoding
is likely to be wrong.

## Raises

pyo3_runtime.PanicException
If `tld` contains non-ASCII, period, or upper-case letters. The exception
If :code:`tld` contains non-ASCII, period, or upper-case letters. The exception
condition is intentionally limited to signs of failing to extract the
label correctly, failing to provide it in its Punycode form, and failure
to lower-case it. Full DNS label validation is intentionally not performed
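The three documented panic conditions for `tld` (non-ASCII bytes, a period, or upper-case letters) can be checked on the Python side before calling `guess` or `guess_assess`. The helper below is a hypothetical sketch, not part of chardetng_py. It mirrors only the conditions stated in the docs, and the actual check in the Rust crate may differ in detail.

```python
from typing import Optional


def tld_would_be_rejected(tld: Optional[bytes]) -> bool:
    """Return True if `tld` matches one of the documented panic conditions."""
    if tld is None:
        # None is documented as equivalent to passing b"com".
        return False
    has_non_ascii = any(byte >= 0x80 for byte in tld)
    has_period = b"." in tld
    has_upper = any(0x41 <= byte <= 0x5A for byte in tld)  # ASCII 'A'..'Z'
    return has_non_ascii or has_period or has_upper
```

For example, `b"co.uk"` is rejected because only the rightmost label (`b"uk"`) should be passed, and an internationalized label must be given in its Punycode form such as `b"xn--p1ai"`.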
5 changes: 3 additions & 2 deletions docs/usage.rst
@@ -13,8 +13,9 @@ The easiest way to get started is to use the ``detect`` method.
'windows-1254'
There is also a ``detect`` method available for compatibility with
``chardet``, but it will always report ``None`` for the language and a
confidence value of ``0.99``.
``chardet``, but it will always report ``None`` for the language. The confidence
value will either be ``0.99`` or ``0.01`` depending on whether chardetng returns
a "high" or "low" confidence flag.

.. code:: python
6 changes: 4 additions & 2 deletions python/chardetng_py/__init__.py
@@ -4,15 +4,17 @@
be imported using :code:`from chardetng_py import detect`.
"""

from typing import Final, List
from __future__ import annotations

from typing import Final

from chardetng_py.detector import EncodingDetector
from chardetng_py.shortcuts import detect

# Documentation for the detect function is on the rust function
# in lib.rs

__all__: Final[List[str]] = [
__all__: Final[list[str]] = [
"detect",
"EncodingDetector",
]
27 changes: 16 additions & 11 deletions python/chardetng_py/compat.py
@@ -1,9 +1,12 @@
"""Functions to aid in migrating from chardet or charset_normalizer to chardetng_py."""

from __future__ import annotations

import sys
from typing import Final, Union

from chardetng_py.shortcuts import detect as _detect
from chardetng_py.detector import EncodingDetector

from .constants import CONFIDENCE_HIGH, CONFIDENCE_LOW

# Older versions of Python have a TypedDict with limited functionality
# This helps documentations tools and type checkers
@@ -13,11 +16,6 @@
from typing_extensions import TypedDict


# chardetng does not return a confidence value
# This is the value which is unconditionally returned
DEFAULT_CONFIDENCE: Final[float] = 0.99


# TypedDict was introduced in Python 3.8.
class ResultDict(TypedDict):
"""Return value for detect compatibility function."""
@@ -27,20 +25,27 @@ class ResultDict(TypedDict):
language: None


def detect(byte_str: Union[bytes, bytearray]) -> ResultDict:
def detect(byte_str: bytes | bytearray) -> ResultDict:
"""Detect the encoding of a string and return additional information.
Detect the encoding of the given byte string. Language detection is not implemented
and will always return :code:`None`. Confidence is always :code:`0.99`.
and will always return :code:`None`. Confidence will be either :code:`0.99` or
:code:`0.01` depending on if the encoding is detected with high confidence or low
confidence.
Parameters
----------
byte_str : :code:`bytes` or :code:`bytearray`
Input buffer to detect the encoding of.
"""
encoding_detector = EncodingDetector()
encoding_detector.feed(byte_str, last=True)

encoding, higher_score = encoding_detector.guess_assess(tld=None, allow_utf8=False)

return {
"encoding": _detect(byte_str),
"confidence": DEFAULT_CONFIDENCE,
"encoding": encoding,
"confidence": CONFIDENCE_HIGH if higher_score else CONFIDENCE_LOW,
# chardetng does not return a language
"language": None,
}
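Because the confidence is now one of two fixed values rather than a graded score, callers migrating from `chardet` may want to treat it as a flag. The sketch below shows one way to do that. `to_result` copies the mapping from this commit, while `choose_encoding`, its `fallback`, and its `threshold` are hypothetical and not part of chardetng_py.

```python
CONFIDENCE_HIGH = 0.99
CONFIDENCE_LOW = 0.01


def to_result(encoding: str, higher_score: bool) -> dict:
    """Build a chardet-style result from a guess_assess()-shaped pair."""
    return {
        "encoding": encoding,
        "confidence": CONFIDENCE_HIGH if higher_score else CONFIDENCE_LOW,
        # chardetng does not return a language
        "language": None,
    }


def choose_encoding(result: dict, fallback: str = "utf-8", threshold: float = 0.5) -> str:
    """Use the guessed encoding only when it was reported with high confidence."""
    if result["confidence"] >= threshold:
        return result["encoding"]
    return fallback
```

Any threshold strictly between `0.01` and `0.99` gives the same behaviour, so existing `chardet` code that compares the confidence against a cut-off in that range should keep working.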
11 changes: 11 additions & 0 deletions python/chardetng_py/constants.py
@@ -0,0 +1,11 @@
"""Constants used by chardetng_py."""

from typing import Final

CONFIDENCE_HIGH: Final[float] = 0.99
"""chardetng does not return a confidence value, but does return a `higher_score`
boolean. This is the value that is returned when the confidence is high."""

CONFIDENCE_LOW: Final[float] = 0.01
"""chardetng does not return a confidence value, but does return a `higher_score`
boolean. This is the value that is returned when the confidence is low."""
13 changes: 8 additions & 5 deletions python/chardetng_py/shortcuts.py
@@ -1,20 +1,18 @@
"""Functions for dealing with byte strings of unknown encoding."""

from typing import Union
from __future__ import annotations

from chardetng_py.detector import EncodingDetector


def detect(
byte_str: Union[bytes, bytearray],
byte_str: bytes | bytearray,
*,
allow_utf8: bool = False,
tld: Union[bytes, bytearray, None] = None,
tld: bytes | bytearray | None = None,
) -> str:
"""Detect the encoding of :code:`byte_str`.
Returned encoding is suitable for use with :code:`str.decode`.
Parameters
----------
byte_str : :code:`bytes` or :code:`bytearray`
@@ -33,6 +31,11 @@ def detect(
label correctly, failing to provide it in its Punycode form, and failure
to lower-case it. Full DNS label validation is intentionally not performed
to avoid panics when the reality doesn't match the specs.
Returns
-------
:code:`str`
The encoding of :code:`byte_str`, suitable for use with :code:`str.decode`.
"""
encoding_detector = EncodingDetector()
encoding_detector.feed(byte_str, last=True)