Improve parsing of names that include diacritics #100

Open
tukusejssirs opened this issue Oct 19, 2019 · 4 comments
@tukusejssirs

As we discussed in #95 (from which I copied portions into this issue), we should improve the parsing of names that include diacritics (like ľščťžýáíéúäôňďěŕĺöüűő).

As discussed there, Lingua::EN::NameParse (which you use for parsing names) currently does not support names with diacritics. However, its POD documentation includes the following notes:

FUTURE DIRECTIONS

Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-english languages.

BUGS

Names with accented characters (acute, circumfelx etc) will not be parsed correctly. A work around is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.
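To illustrate the trade-off that note describes, here is a minimal Python sketch (not the module's actual Perl grammar; the pattern names and the sample inputs are made up for this example). An ASCII-only character class rejects accented names, while `\w` accepts them but also accepts digits and underscores, which is presumably where the loss of accuracy comes from.

```python
import re

# Hypothetical stand-ins for a grammar rule that matches a single name token.
ASCII_NAME = re.compile(r"^[A-Za-z]+$")  # ASCII letters only
WORD_NAME = re.compile(r"^\w+$")         # any Unicode word character


def matches(pattern, name):
    """Return True if the whole name matches the given pattern."""
    return pattern.match(name) is not None


for name in ["Maria", "Mária", "R2D2"]:
    print(name, matches(ASCII_NAME, name), matches(WORD_NAME, name))
```

The accented name only passes the `\w` rule, but so does a token like `R2D2`, which is not a plausible name.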

So I think that workaround would be good enough for now, but it would be nice (if it is possible) to restore the original spelling of the names after parsing, that is:

  • remove the diacritics (Mária → Maria),
  • parse the names as usual,
  • replace the parsed names with their original form (Maria → Mária).
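The three steps above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's code: `ascii_only_parse` is a hypothetical stand-in for an ASCII-only parser such as Lingua::EN::NameParse, and the restore step assumes every accented letter strips to exactly one base letter, so that positions in the stripped string line up with the original.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_only_parse(name):
    """Hypothetical parser: treats the last word as the surname."""
    given, _, surname = name.rpartition(" ")
    return {"given": given, "surname": surname}


def parse_preserving_diacritics(name):
    original = unicodedata.normalize("NFC", name)
    stripped = strip_diacritics(original)            # step 1: remove diacritics
    if len(stripped) != len(original):
        # Positions no longer line up; fall back to the stripped result.
        return ascii_only_parse(stripped)
    parsed = ascii_only_parse(stripped)              # step 2: parse as usual
    restored, pos = {}, 0
    for key, value in parsed.items():                # step 3: restore spelling
        start = stripped.find(value, pos)
        if start == -1:
            restored[key] = value                    # parser altered the text
            continue
        restored[key] = original[start:start + len(value)]
        pos = start + len(value)
    return restored


print(parse_preserving_diacritics("Mária Kováčová"))
```

Mapping by position rather than by a lookup table keeps the original spelling even when two different accented names strip to the same ASCII form.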

However, it would be much better to implement Lingua::SK::NameParse, as suggested under Future directions. I’d like to contact Kim Ryan (the developer of Lingua::EN::NameParse) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer; I could mainly assist with the linguistic and algorithmic part. Are you willing to help with coding this parser, or are you busy enough with other stuff? :)

@nigelhorne
Owner

Commit 0a8d026 uses Unicode::Diacritic::Strip, though it doesn't yet work, at least not with a test case that I have.

nigelhorne self-assigned this Oct 25, 2019
@nigelhorne
Owner

I don’t know why Unicode::Diacritic::Strip (u:d:s) doesn’t work with gedcom. All of my test code outside of it works fine. Still investigating.

@nigelhorne
Owner

It looks like u:d:s doesn't work with UTF-8 fields from gedcoms (perhaps only those from ACOM). I've recently made some improvements in ged2site which should carry over here.
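One possible explanation, which nothing in this thread confirms, is that the UTF-8 fields reach the stripping step without being decoded into characters first. The Python sketch below only illustrates that general failure mode; it says nothing about how Unicode::Diacritic::Strip or the gedcom reader actually behave, and `strip_diacritics` is a helper made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


raw = "Mária".encode("utf-8")           # bytes as they might sit in a file

# Assumption for illustration: the bytes get read as Latin-1 instead of UTF-8.
mis_decoded = raw.decode("latin-1")
properly_decoded = raw.decode("utf-8")

print(repr(strip_diacritics(mis_decoded)))       # not the plain name
print(repr(strip_diacritics(properly_decoded)))  # the plain name
```

If something like this is going on, decoding the field as UTF-8 before stripping would be the thing to check.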

@nigelhorne
Owner

The code still doesn't handle all diacritics, but it should be better than it was, for both UTF-8 and Unicode.
