Improve parsing of names that include diacritics #100

tukusejssirs · 2019-10-19T13:55:13Z

As we discussed in #95 (from which I simply copied portions into this issue), we should improve the parsing of names that include diacritics (like ľščťžýáíéúäôňďěŕĺöüűő).

As we discussed there, Lingua::EN::NameParse (which you use for parsing names) currently does not support parsing names with diacritics. However, Lingua::EN::NameParse has the following notes in its POD documentation:

FUTURE DIRECTIONS

Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-english languages.

BUGS

Names with accented characters (acute, circumflex etc) will not be parsed correctly. A workaround is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.
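To illustrate the trade-off that note describes, here is a minimal sketch in Python (not the Perl grammar itself, which is not shown in this thread). The pattern names and the `matches_whole` helper are made up for illustration. An ASCII-only class rejects accented names, while `\w` accepts them but also lets through digits and underscores:

```python
import re

# ASCII-only letters, analogous to the [a-z] class the docs mention
ASCII_WORD = re.compile(r"[A-Za-z]+")
# On Python str patterns, \w matches Unicode letters, digits and underscore
UNICODE_WORD = re.compile(r"\w+")


def matches_whole(pattern, name):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(name) is not None


accented = "M\u00e1ria"  # "Mária" with a precomposed á

print(matches_whole(ASCII_WORD, "Maria"))     # plain ASCII name
print(matches_whole(ASCII_WORD, accented))    # rejected by the ASCII class
print(matches_whole(UNICODE_WORD, accented))  # accepted by \w
print(matches_whole(UNICODE_WORD, "R2D2"))    # also accepted: the accuracy cost
```

Note that `\w` in this sketch only covers precomposed characters; a name stored as a base letter plus a combining mark would still fail to match as a whole.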

So I think that workaround would be good enough for now, but it would be nice (if possible) to restore the names’ original spelling after parsing, that is:

remove the diacritics (Mária → Maria),
parse the names as usual,
replace the parsed names with their original form (Maria → Mária).
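The three steps above could be sketched roughly as follows. This is an illustration in Python, not the project's code. `toy_parse` is a hypothetical stand-in for the real name parser, and the other function names are also made up. Diacritics are stripped by Unicode NFD decomposition, which covers the characters listed above but not letters with no decomposition (for example ł or ø).

```python
import unicodedata


def strip_diacritics(text):
    # NFD splits e.g. "á" into "a" + a combining acute; drop the combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_parse(full_name):
    # Hypothetical stand-in for the real parser: handles "Given Surname" only
    given, surname = full_name.split(" ", 1)
    return {"given_name": given, "surname": surname}


def parse_preserving_diacritics(full_name):
    original_tokens = full_name.split()
    # Step 1: remove the diacritics, remembering each token's original spelling
    stripped_tokens = [strip_diacritics(tok) for tok in original_tokens]
    restore = dict(zip(stripped_tokens, original_tokens))
    # Step 2: parse the ASCII-only form as usual
    parsed = toy_parse(" ".join(stripped_tokens))
    # Step 3: put the original spelling back, token by token
    return {
        field: " ".join(restore.get(tok, tok) for tok in value.split())
        for field, value in parsed.items()
    }


print(parse_preserving_diacritics("Mária Kováčová"))
```

One limitation of this token map: if two different tokens in the same name strip to the same ASCII form (say "Maria" and "Mária"), the later one wins, so a position-based mapping would be more robust.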

However, it would be much better to implement Lingua::SK::NameParse, as suggested in the Future Directions section. I’d like to contact Kim Ryan (the developer of Lingua::EN::NameParse) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer; I could mainly assist with the linguistic/algorithmic part. Would you be willing to help with coding this parser, or are you busy enough with other stuff? :)


nigelhorne · 2019-10-24T13:30:18Z

Commit 0a8d026 uses Unicode::Diacritic::Strip, though it doesn't yet work, at least not with a test case that I have.

nigelhorne · 2019-11-21T15:27:58Z

I don’t know why Unicode::Diacritic::Strip (u:d:s) doesn’t work with gedcom. All of my test code outside of it works fine. Still investigating.

nigelhorne · 2023-07-31T15:31:47Z

It looks like u:d:s doesn't work with UTF-8 fields from gedcoms (perhaps those only from ACOM). I've recently put in some improvements in ged2site which should permeate here.

nigelhorne · 2024-06-26T00:27:04Z

The code still doesn't handle all diacritics, but it should be better than it was, for both UTF-8 and Unicode.

tukusejssirs mentioned this issue Oct 19, 2019

Building dependencies for carton install fails #95

Closed

nigelhorne self-assigned this Oct 25, 2019
