improve first last name split by using language specific rules #17

Open

chrished opened this issue Sep 16, 2022 · 1 comment

Collaborator

chrished commented Sep 16, 2022

For example, for Spanish names we currently get:

firstname : juan
lastname : gonzalez
middlename : eugenio iglesias

However, the main last name is iglesias (the first of the two last names), not gonzalez.

Proposal: use https://nationalize.io to predict which country or language a name comes from, and implement specific splitting rules for each.

Caveat: with Spanish names, people sometimes give only the first last name and sometimes both, so it is not obvious how to handle this automatically.
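
A minimal sketch of what a country-specific rule could look like, assuming the country code comes from some upstream prediction (for instance nationalize.io). The function name `split_name`, the `second_lastname` field, and the four-token threshold are hypothetical choices for illustration, not existing code in this project. The threshold is one way to sidestep the caveat above: with only three tokens we cannot tell "first middle last" from "first last last", so the sketch falls back to the default rule there.

```python
from typing import Dict, Optional


def split_name(full_name: str, country: Optional[str] = None) -> Dict[str, str]:
    """Split a full name into parts, with a special rule for Spanish names.

    `country` is an ISO 3166-1 alpha-2 code such as "ES" (assumed input).
    """
    tokens = full_name.lower().split()
    parts = {"firstname": "", "middlename": "", "lastname": "", "second_lastname": ""}
    if not tokens:
        return parts
    parts["firstname"] = tokens[0]
    if len(tokens) == 1:
        return parts

    # Hypothetical Spanish rule: with four or more tokens, treat the last two
    # as the two last names and use the first of them as the main last name.
    if country == "ES" and len(tokens) >= 4:
        parts["middlename"] = " ".join(tokens[1:-2])
        parts["lastname"] = tokens[-2]
        parts["second_lastname"] = tokens[-1]
        return parts

    # Default rule (current behaviour as described in this issue):
    # the last token is the last name, everything in between is the middle name.
    parts["middlename"] = " ".join(tokens[1:-1])
    parts["lastname"] = tokens[-1]
    return parts
```

Called as `split_name("juan eugenio iglesias gonzalez", "ES")`, this puts iglesias in `lastname` and gonzalez in `second_lastname`; without a country code it reproduces the current split.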

chrished added the enhancement label

Owner

f-hafner commented Nov 25, 2022

More generally, improve gender assignment.

Use census data? See https://github.com/bhofstra/diversity_innovation_paradox. They also have some additional material on ethnicity, which may complement nationalize.io.

Moreover, from #21:

First names may be only one letter -> we can use a dictionary-style lookup of the full name, based on the names we already have from genderize and on the census names.
For any given linked subsample, we can use the first name from the other dataset if it is spelled out there.
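
The second idea could be sketched roughly as follows. This assumes each linked pair exposes the first name from both datasets as plain strings. The helper names `is_initial` and `resolve_firstname` are hypothetical, and requiring the initial to match the spelled-out name is an extra safeguard added here rather than something stated above.

```python
def is_initial(name: str) -> bool:
    """Return True if the name is empty or a single letter, with or without a trailing dot."""
    return len(name.strip().rstrip(".")) <= 1


def resolve_firstname(own: str, linked: str) -> str:
    """Pick the best available first name for a linked pair of records.

    `own` is the first name in the record being cleaned, `linked` the first
    name of the matched record in the other dataset.
    """
    own_clean = own.strip().lower().rstrip(".")
    linked_clean = linked.strip().lower().rstrip(".")

    # A spelled-out name in our own record is kept as is.
    if not is_initial(own):
        return own_clean

    # Borrow the linked name only if it is spelled out and consistent
    # with the initial we have.
    if not is_initial(linked) and own_clean and linked_clean.startswith(own_clean):
        return linked_clean

    return own_clean
```

For example, `resolve_firstname("J.", "Juan")` would yield the spelled-out name, which can then be passed to the gender lookup; a pair with conflicting initials is left unchanged.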

f-hafner mentioned this issue

understand low advisor links in some fields, and fix #21

Closed
