Cannot handle double byte characters #1

Open
brianfreud opened this issue Sep 8, 2021 · 7 comments
Labels: bug (Something isn't working)

brianfreud commented Sep 8, 2021

I'm using a .ged file encoded in UTF-8. The double-byte character "�" was (erroneously) in the content of a couple of NOTE+CONT tags.

The parser threw an "Invalid format" error for the line and halted processing of the file.

FlorianCassayre (Member) commented Sep 8, 2021

Thanks for taking the time to open a ticket.
This seems weird indeed; normally the library will attempt to find a CHAR tag in the header and use it, otherwise it will fall back to UTF-8 as a default charset. That means the parser likely attempted to decode it from UTF-8.
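
The detection order described above (look for a CHAR tag in the header, otherwise assume UTF-8) can be sketched roughly as follows. This is an illustration only, not the library's actual code: `detectCharset` is a hypothetical helper, and the 4096-byte scan window is an arbitrary assumption.

```typescript
// Hypothetical helper, not part of the library's API.
// Returns the value of the header's CHAR tag, or the fallback if none is found.
function detectCharset(bytes: Uint8Array, fallback = "UTF-8"): string {
  // Skip a UTF-8 byte order mark if one is present.
  const start = bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf ? 3 : 0;
  // Header tags are plain ASCII, so mapping each byte to one character
  // is good enough for scanning (assumed window: first 4096 bytes).
  const end = Math.min(bytes.length, 4096);
  let prefix = "";
  for (let i = start; i < end; i++) {
    prefix += String.fromCharCode(bytes[i]);
  }

  let inHeader = false;
  for (const rawLine of prefix.split(/\r\n|\r|\n/)) {
    const line = rawLine.trim();
    if (/^0\s+HEAD\b/.test(line)) {
      inHeader = true;
      continue;
    }
    // The next level-0 record ends the header.
    if (inHeader && /^0\s/.test(line)) break;
    const match = inHeader ? /^1\s+CHAR\s+(\S.*)$/.exec(line) : null;
    if (match) return match[1].trim();
  }
  return fallback;
}
```

With this sketch, a file whose header has no CHAR line is decoded as UTF-8, which matches the behaviour described in this comment.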

Could you please share the bogus .ged file - or a minimal version of it - to: [email protected]. I'm the only recipient of this inbox and I will delete the file as soon as I can reproduce the issue.

As a side note, the parsing errors aren't very helpful in 0.2.0; they are improved in 0.2.1 (not yet released).

FlorianCassayre added the bug label Sep 8, 2021

FlorianCassayre (Member) commented

Hi @brianfreud, do you have any updates on this issue?
There is no hurry, but as it stands I can't reproduce the issue, so I'll leave the ticket open until more details are added.

kefniark commented

I'm getting a similar issue with a GEDCOM file generated by Heredis:

ErrorTokenization: Invalid format for line 590644: "3 CONT ** disc\☺ ; * disc\☺ ;; ; * \disc\☺ ;; * \circle\☺ ;; ; * Peronnelle de Lezversault, dame de Brélidy"

FlorianCassayre (Member) commented

Thanks, I'll look into that in a week (currently on vacation).

FlorianCassayre (Member) commented

@kefniark I'm not sure exactly what went wrong; I should definitely improve these error messages to include the detected encoding and other potentially relevant information.
Would you mind sharing the file with which you encountered the issue? You may privately mail it to: [email protected]. Thanks in advance.

kefniark commented Aug 9, 2022

No worries. In the end I ran into other issues and realized it would probably not work with this library.
The file I have to work with is also too big for the embedded usage I wanted (60k+ people).

So in the end, I went another way.

FlorianCassayre (Member) commented Aug 9, 2022

Thanks for the reply; I understand better now.

This package is known to work on large files. For instance the file DYNASTIE.ged (34 MB) contains 65k individuals and loads in about 10 seconds on https://mon.arbre.app. If you are only interested in the JSON representation of the file, this is also possible (see parseGedcom).

This package was in fact designed to address some of the limitations of parse-gedcom; for instance the lack of character encoding detection.
It seems that the source of these issues is that the tokenizer is too strict and will reject inputs that do not align with the specification, although they could be parsed normally.
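
To make the strict-versus-lenient distinction concrete, here is a small sketch of a line tokenizer with both modes. It is not the library's tokenizer: `tokenizeLine`, `STRICT` and `LENIENT` are hypothetical names, and the strict rule used here (no control characters in the value) is only an assumed example of a specification constraint, not a confirmed cause of the errors reported in this thread.

```typescript
interface GedcomLine {
  level: number;
  xref?: string;
  tag: string;
  value?: string;
}

// Assumed strict rule, for illustration: the value may not contain control characters.
const STRICT = /^(\d+) (?:(@[^@]+@) )?([A-Za-z0-9_]+)(?: ([^\u0000-\u001f]*))?$/;
// Lenient variant: tolerates extra whitespace and accepts any characters in the value.
const LENIENT = /^\s*(\d+)\s+(?:(@[^@]+@)\s+)?([A-Za-z0-9_]+)(?: ([\s\S]*))?$/;

// Returns the parsed line, or null if the line does not match the chosen pattern.
function tokenizeLine(line: string, lenient = false): GedcomLine | null {
  const m = (lenient ? LENIENT : STRICT).exec(line);
  if (m === null) return null;
  return { level: Number(m[1]), xref: m[2], tag: m[3], value: m[4] };
}

// A CONT line whose value contains a stray control character (U+0001).
const line = "3 CONT disc\u0001 ; dame de Brélidy";
console.log(tokenizeLine(line));       // rejected by the strict pattern
console.log(tokenizeLine(line, true)); // accepted by the lenient pattern
```

A lenient mode like this trades validation for robustness: the odd character is passed through in the value instead of halting the whole file.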
