As we discussed in #95 (from which I simply copied portions into this issue), we should improve the parsing of names that include diacritics (like ľščťžýáíéúäôňďěŕĺöüűő).
As discussed there, Lingua::EN::NameParse (which you use for parsing names) currently does not support names with diacritics. However, its POD documentation contains the following notes:
Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-english languages.
Names with accented characters (acute, circumfelx etc) will not be parsed correctly. A work around is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text.
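To make the trade-off in that note concrete, here is a small sketch in Python (not Perl, and not the module's actual grammar) comparing an ASCII-only character class with `\w`. The pattern names and the `matches` helper are hypothetical, chosen only for this illustration.

```python
import re

# ASCII-only letters, as in the [a-z] class the note refers to.
ascii_word = re.compile(r"[a-z]+", re.IGNORECASE)

# The suggested workaround: \w matches any Unicode word character,
# which also includes digits and the underscore.
unicode_word = re.compile(r"\w+")

# A tighter alternative (an assumption, not from the module's docs):
# word characters minus digits and underscore, i.e. letters only.
letters_only = re.compile(r"[^\W\d_]+")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


for word in ("Maria", "Mária", "R2D2_"):
    print(word, matches(ascii_word, word), matches(unicode_word, word),
          matches(letters_only, word))
```

The last column shows one way the accuracy loss might be limited: a letters-only class accepts "Mária" but still rejects tokens containing digits or underscores. In Perl the equivalent would probably be a Unicode property class such as `\p{L}`, though whether that fits the module's grammar rules is untested here.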
So I think that workaround would be good enough for now. Alternatively, if it is possible, it would be nice to restore the original spelling of the names after parsing, that is:
remove the diacritics (Mária → Maria),
parse the names as usual,
replace the parsed names with their original form (Maria → Mária).
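The three steps above can be sketched as follows. This is a minimal illustration in Python rather than Perl; `toy_parse` is a hypothetical stand-in for the real parser (it just takes the first and last word), and `strip_diacritics` and `parse_with_diacritics` are names invented for this example. The restoration step assumes the parser returns whole words from the input unchanged.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Step 1: decompose to NFD and drop combining marks (Mária -> Maria)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_parse(name: str) -> dict:
    """Hypothetical stand-in for the real name parser (step 2)."""
    parts = name.split()
    return {"given_name": parts[0], "surname": parts[-1]}


def parse_with_diacritics(name: str, parse=toy_parse) -> dict:
    name = unicodedata.normalize("NFC", name)
    # Remember which original word each stripped word came from.
    originals = {strip_diacritics(word): word for word in name.split()}
    parsed = parse(strip_diacritics(name))
    # Step 3: put the original spelling back, word by word.
    return {
        key: " ".join(originals.get(word, word) for word in value.split())
        for key, value in parsed.items()
    }


print(parse_with_diacritics("Mária Kováčová"))
```

One limitation of the word-level lookup: if two different words in the same name strip to the same ASCII form (say "Maria" and "Mária"), only one spelling is kept. Letters that have no canonical decomposition (for example "ł" or "ø") pass through unchanged, so the parser would still see them. In Perl, the decomposition step could likely be done with the core `Unicode::Normalize` module.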
However, it would be much better to implement Lingua::SK::NameParse, as suggested in the "Future directions" section of the docs. I'd like to contact Kim Ryan (the developer of Lingua::EN::NameParse) to ask whether he is interested. Although I can code in Perl a bit, I am not a professional programmer; I could mainly assist with the linguistic/algorithmic part. Would you be willing to help with coding this parser? Or are you busy enough with other stuff? :)
It looks like u:d:s doesn't work with UTF-8 fields from gedcoms (perhaps those only from ACOM). I've recently put in some improvements in ged2site which should permeate here.