Change internal representation to UTF-8 #9

Open
foxik opened this issue Apr 29, 2015 · 2 comments


foxik commented Apr 29, 2015

Currently, we are using UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.

We should change the internal representation; the current plan is to use UTF-8:

  • we will use char and string datatypes
  • input and output will be in UTF-8 (as it is today)
  • the tokenizer will work on the input UTF-8 string and the created tokens will be pointers into the original text
  • the lexicon will contain words in UTF-8 (and transitively the language models and morphology will use UTF-8)
  • the error model will be in UTF-8, i.e. it will contain variable-length strings instead of pairs or triples of Unicode characters
  • SimWordsFinder::Find will have to interpret the UTF-8 encoding and take into account that one Unicode character can be represented by multiple code units. The input word could be converted to UTF-32, but I do not think that will be needed, because both the lexicon and the error model will be in UTF-8
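
To illustrate the point about multiple code units per character, here is a minimal sketch of sequential UTF-8 decoding. The names `next_code_point` and `decode_utf8` are hypothetical and not part of Korektor, and the code assumes well-formed input (no validation of overlong or truncated sequences).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Korektor: decodes the code point that
// starts at byte offset `pos` and advances `pos` past it.
// Assumption: `text` is well-formed UTF-8.
char32_t next_code_point(const std::string& text, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(text[pos++]);
    if (lead < 0x80) return lead;                         // 1 byte: 0xxxxxxx
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;  // continuation bytes
    char32_t cp = lead & (0x3F >> extra);                 // payload bits of lead
    while (extra-- > 0 && pos < text.size())
        cp = (cp << 6) | (static_cast<unsigned char>(text[pos++]) & 0x3F);
    return cp;
}

// Walks the string strictly left to right, one code point at a time.
std::vector<char32_t> decode_utf8(const std::string& text) {
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < text.size();)
        out.push_back(next_code_point(text, pos));
    return out;
}
```

Only sequential access is needed here: each call consumes one to four bytes and leaves `pos` at the start of the next character.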

The alternative to UTF-8 is to use UTF-32, but

  • using UTF-8 is a standard solution; it is used in Python/Perl (Python, for example, used UTF-16/UTF-32 at some point in the past)
  • the UTF-8 representation is much more compact
  • even though UTF-8 does not allow constant-time random access, we only access word characters sequentially in Korektor; moreover, we can always perform UTF-8 <-> UTF-32 conversion
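
The compactness argument and the UTF-32 -> UTF-8 direction of the conversion can be sketched as follows. `to_utf8` and `utf32_bytes` are hypothetical names introduced for illustration, and the encoder assumes valid scalar values (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
// Assumption: every code point is a valid Unicode scalar value.
std::string to_utf8(const std::u32string& word) {
    std::string out;
    for (char32_t cp : word) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Storage in bytes needed for the same word kept as UTF-32.
std::size_t utf32_bytes(const std::u32string& word) {
    return word.size() * sizeof(char32_t);
}
```

For the ASCII word `korektor` this gives 8 bytes in UTF-8 against 32 bytes in UTF-32; for the Czech word `příliš` it gives 9 bytes against 24.
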
@michalisek

UTF-8 is a good choice for the internal representation. Nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:

  • Lexicon - the current implementation based on a trie requires characters of fixed length
  • SimWordsFinder - this class uses direct character access by index
  • ErrorModel - should use the same encoding as the SimWordsFinder (since error model is queried by SimWordsFinder)

Pros of using UTF-32 internally in the above classes

  • faster code
  • simpler code
  • fewer code changes required

Cons of using UTF-32 internally in the above classes

  • higher memory consumption (only Lexicon matters, error models are small in comparison)

I think that the pros far outweigh the cons.
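
As a rough illustration of the "simpler code" argument, this is what index-based access looks like with fixed-width code units. `substitute_at` is a hypothetical function written for this example; it does not reflect how SimWordsFinder is actually implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: with fixed-width code units, replacing the character
// at `index` is a single assignment and never changes the string length.
std::u32string substitute_at(std::u32string word, std::size_t index,
                             char32_t replacement) {
    if (index < word.size())
        word[index] = replacement;   // constant-time access by index
    return word;
}
```

In UTF-8 the same operation would first have to locate the byte offset of the character and might change the length of the string.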


foxik commented May 20, 2015

From my point of view:

  • I am not sure the code will be faster with UTF-32, as the keys of the ErrorModel will be larger (4x for ASCII, ~3x for Czech)

  • UTF-8 will require more complicated code, but:

    • Lexicon will be unaffected (it will store bytes of UTF-8 encoding without understanding), except for GetSimilarWords_impl
    • ErrorModel will be unaffected (it will store bytes of UTF-8 encoding without understanding)
    • SimWordsFinder (which only handles casing) accesses the characters sequentially, so it will be simple to modify

    The most complicated method will be Lexicon::GetSimilarWords_impl, because:

    • when adding/replacing a character, it may have to add multiple bytes from the Lexicon trie
    • when deleting/replacing a character from the input string, it may have to remove multiple bytes (from the end of the string)
  • the language models will eventually be in UTF-8 (either when we use a library like kenlm, or when we rewrite them to use hashes)

  • eventually I want to rewrite the Lexicon structure (it currently takes more time to find the suggestions than to query the language models), and UTF-8 will be much better suited to the new representation I have in mind
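
The multi-byte removal mentioned above for Lexicon::GetSimilarWords_impl can be sketched like this. `pop_last_code_point` is a hypothetical helper, not existing Korektor code, and it assumes the string holds well-formed UTF-8.

```cpp
#include <string>

// Hypothetical helper: removes the last code point from a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx, so we strip those first
// and then the lead byte. Assumption: `word` is well-formed UTF-8.
void pop_last_code_point(std::string& word) {
    while (!word.empty() &&
           (static_cast<unsigned char>(word.back()) & 0xC0) == 0x80)
        word.pop_back();   // drop trailing continuation bytes
    if (!word.empty())
        word.pop_back();   // drop the lead byte (or a single ASCII byte)
}
```

Because UTF-8 is self-synchronizing, the start of the last character can be found by scanning backwards from the end without decoding the rest of the string.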
