Improving of name parsing (parse and clean) #18
Comments
Thanks for the great work on this. The |
Thank you for taking on this issue. Here are some more ;-)
|
Here are some more names from Meise data:
|
Yes, it is getting better, very good. Here are two cases: the first is parsed and cleaned correctly and interpreted as I would expect, while in the second one name is unexpectedly dropped:
NAMELIST=(
"M. and C. Jaschhof & Project"
"S. Yu. and N. V. Kuznetsov"
)
# Quote the array expansion so each name stays one word; no IFS changes needed
for text in "${NAMELIST[@]}"; do
echo "input: $text"
dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done
… gives:
{"family":"Jaschhof","given":"M."}
{"family":"Jaschhof","given":"C."}
{"family":"Kuznetsov","given":"N.V."}
So here are some “and” concatenations; I hope the comments make sense (by “midst-abbreviated spelling” I mean an abbreviation in the midst of the name list):
|
The example The example The rest of the mixed uses of &/and alongside family members with interlopers I fear will be too challenging to tackle. However, the examples like |
Prematurely closed via a commit message. |
Yes, true, it’s difficult—I’m not sure if it’s the same person (because of the botanical data background)— one S. Yu. Kuznetsov is here as an example: https://www.researchgate.net/profile/S-Kuznetsov-3. So at least a real “life”-example. |
Here are some from dwcagent 3.0.12.0, with botanical names in the context of BGBM Berlin:
|
dwcagent "Galán,P. & Montenegro,S.M."
# []
… unexpectedly gets parsed to an empty result in dwcagent. Checking it:
NAMELIST=(
"Galán,P."
"Montenegro,S.M."
"Galán,P. & Montenegro,S.M."
)
# Quote the array expansion so each name stays one word; no IFS changes needed
for text in "${NAMELIST[@]}"; do
echo -en "\n-----------\ninput: $text\n"
dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done
… we get:
|
A few new updates now in v 3.0.13.0:
|
Hi David, sometimes, if there are square brackets in the middle of the names, with probable letters, e.g. The attachment contains many of these examples from Meise name data with square brackets, where they are in the name string itself: I think a good strategy would be to remove the |
We are getting there — almost 😁 … here with dwc_agent: 3.0.14.0 I found some unexpected results:
#!/bin/bash
NAMELIST=(
"C. Be[n]ed. [St]enn" # would expect Stenn
"Abb. No[yr]ey, abb. Faure" # would expect Abb. Faure
"Abbé F[r]. Hy in Ch. Flahault" # would expect dot in Fr.
"An[s]on" # would expect Anson
"Arm. A[n]spach" # would expect dot to keep in Arm.
"Attila Meste[r]há[zy]" # would expect Mesterházy
"Conzatti, [Holned] & Ordó[ñ]er" # would expect parsed: Conzatti<SEP>Holned<SEP>Ordóñer
"Corn. C[oriz]e" # would expect dot to keep
"Dr. B[on]bier" # would expect: Bonbier
"Dr. [B] [B]oiglaender-[T]ekn[]" # would expect parsed: Dr. B Boiglaender-Tekn
"G.[Char][][g][] in F. Crépin - Herbarium Rosarum" # would expect: G.Charg in F. Crépin
"Ga[r]dama[uic][][te]" # would expect: Gardamauicte
# nothing parsed so far the following:
"R. Ba[renda]/J.Willemsen" # is parsed all right the following not
"R. Barendse/[J]. Willemse" # nothing parsed?
"V.[S]. Nikitin - T.I Zhilenko" # nothing parsed?
"L. [G][][trofi][]" # would expect: L. Gtrofi
"B.E. & J.[G]. Juniper" # would expect parsed: B.E.<SEP>J.G. Juniper
# or interpreted parsed: B.E. Juniper<SEP>J.G. Juniper
# is dwcagent "B.E." expected to be parsed or not, I guess not?
"D. Ba[nsk], M. Mathais, R.E.W. jr" # nothing parsed?
"C[o]l" # nothing parsed?
"C.[R]. Gerard" # nothing parsed?
"Comm. K.[j]. Cameron" # nothing parsed?
"L.[G]. Ravaud" # nothing parsed?
"D.[f]. [G]i[lf]illa[n]" # nothing parsed?
"M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal" # nothing parsed?
"N.S.[q]. in M.R. Clarke" # nothing parsed?
)
# Quote the array expansion so each name stays one word; no IFS changes needed
for text in "${NAMELIST[@]}"; do
echo -en "-----------\ninput: $text\n"
results=$(dwcagent "${text}")
if [[ "${results-}" == "[]" ]]; then
echo "output: $results"
else
echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
fi
done
|
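Most of the expectations in the comments above amount to dropping the bracket characters and keeping the letters they enclose. As a minimal sketch of such a pre-cleaning step, done in plain bash before the string is handed to the parser: the function name `strip_square_brackets` is hypothetical and not part of dwcagent, and whether dwcagent then parses the cleaned string as hoped is not something this sketch can promise.

```shell
#!/bin/bash
# Hypothetical pre-cleaning helper (an assumption, not part of dwcagent):
# drop the square brackets themselves but keep whatever letters they enclose.
strip_square_brackets() {
  local s=$1
  s=${s//\[/}   # remove every "["
  s=${s//\]/}   # remove every "]"
  printf '%s\n' "$s"
}

strip_square_brackets 'An[s]on'                # prints: Anson
strip_square_brackets 'Attila Meste[r]há[zy]'  # prints: Attila Mesterházy
strip_square_brackets 'L. [G][][trofi][]'      # prints: L. Gtrofi
```

Empty pairs like `[]` simply vanish, which matches the expectation noted for `L. [G][][trofi][]`. Cases such as `Abb. No[yr]ey, abb. Faure` need more than this, since the expected result there is not just the unbracketed string.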
With the v.3.0.15.0 release, I think we're getting pretty close to accommodating many of the issues identified above. Some were intractable & some meant re-evaluating the strict spec tests. |
Parsing of name particles
In 3.0.16.0 I parsed an example with a comma (a real example from the data) and my interpretation of the same name without a comma, and the results differ. Should they not be the same? Using the
…
It is difficult to analyse the name particles — for instance the https://en.wikipedia.org/wiki/Nobiliary_particle or https://en.wikipedia.org/wiki/German_nobility#Nobiliary_particles — and they can describe a place or another family name so to say, I think, e.g.
… so it is difficult to assign the name parts to either …
I think this is not solvable 100%, because there are too many combinations, I just want to mention it here. |
Looks like your variously arranged names, particles and initials:
Confuse the underlying |
By the way: brackets, this time round parentheses
Here are examples with parentheses that I found in WikiData and considered candidates to standardize by
-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}
I did not quite expect that round parentheses would be removed entirely; I expected that some cases would be filtered out and others left in. One possible compromise would be to simply add the content of the last parenthesis to the parsing field |
That was the general idea, yes. Most of the examples you have would seem to support the rationale, though admittedly there are some examples that appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expressions of family names / identity. |
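If the parenthetical content should not be lost even though the parser drops it, one option is to pull it out in a wrapper before parsing and keep it alongside the result. A minimal sketch in bash, assuming only the last `(…)` group is of interest: the function name `last_parenthetical` is hypothetical, not part of dwcagent, and which output field the value would go into is left open here.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of dwcagent):
# print the content of the last "(...)" group in a name string, if any.
last_parenthetical() {
  # Greedy ".*" pushes the match to the right, so the captured group
  # is the last parenthetical that contains no nested parentheses.
  local re='.*\(([^()]*)\)'
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

last_parenthetical 'Hildur von Rettig (Lindberg)'    # prints: Lindberg
last_parenthetical 'William Vernon (c. 1666-1711)'   # prints: c. 1666-1711
last_parenthetical 'Josep María Vidal(-Frigola)'     # prints: -Frigola
last_parenthetical 'Thomas Cooper'                   # prints nothing
```

The extracted string still mixes life dates, professions, and alternative names, so deciding what it means would remain a separate step.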
I completely agree with this, so perhaps leave it that way. I have further analysed the botanist names from WikiData and grouped the cases that appear: most of them are actually different versions of a name squeezed into one line, so to speak. From a data point of view, these would have to be corrected by unravelling the two or more names with the same meaning. An interesting case is the names with only one letter in the parentheses. Here are some examples of botanical names from WikiData:
Alan (C.) McKay
Anton(i) Wróblewski
Arthur C. (II) Grupe
Adolph(Adolf) Osterwalder
Aloysius (Luigi) Meschinelli
1 (…) - ('Wilson') Sze Wing Wong
2 (…) - (Axel) Helge (Svensson) Stenar
3 (…) - (Johan) Fredrik(Friedrich) (Eberhard) Svanlund |
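Grouping names by how many parenthetical parts they contain, as in the 1/2/3 examples above, can be sketched by counting the opening parentheses. This is only an illustration in bash: the function name `count_parentheticals` is hypothetical, and it assumes the parentheses are balanced and not nested.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration):
# count the "(...)" groups in a name by counting "(" characters.
count_parentheticals() {
  local only_open=${1//[^(]/}   # delete everything except "("
  printf '%s\n' "${#only_open}"
}

count_parentheticals "('Wilson') Sze Wing Wong"                          # prints: 1
count_parentheticals '(Axel) Helge (Svensson) Stenar'                    # prints: 2
count_parentheticals '(Johan) Fredrik(Friedrich) (Eberhard) Svanlund'    # prints: 3
```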
Names of rulers or nobles, and so on
In our botanical data (https://dr.jacq.org/DR014960) there are also names such as
… the last case should output something, but does not 🤔 … If you take the name „Friedrich August II,König von Sachsen“ strictly, you could understand it, or see it like this:
or (if necessary)
... since August and Friedrich are actually first names in German.
The same naming problem arises with e.g. „August der Starke“: https://explore.gnd.network/en/gnd/118505084 contains the standard data, where the name is standardised as „August II, Polen, König“, which makes parsing very tricky. So in both cases there is no family name in the strict sense, is there? The same might apply to English names and so on. |
Hej-hej,
I’m aware how difficult it is to parse and clean all the name-list cases out there. So here are some more names that sometimes get parsed but then get lost during cleaning (see attachment; dwcagent 3.0.8.0, I used the wrapper https://github.com/infinite-dao/collector-matching/blob/main/bin/agent_parse4tsv.rb)
In our BGBM data the name-list separator often corresponds to a regex like /(, | & )/, e.g.:
Anonymous collector & Humboldt,F.W.H.A. von
gets parsed differently with “Anonymous collector” and without it:
So there are cases where parsing of the particle could be improved. A difficult case, which I think the parser can only partly solve, is:
Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C.
Here, Köhler,E., Á,Leiva,Lepper,L., is actually a mistake in the input (not in the parsing): the source data should be corrected to Köhler,E., Leiva,Á, Lepper,L., but anyway, it gets to:
Álvarez de Zayas,A. would, if the particle were handled correctly, be Alberto Álvarez de Zayas (wikidata.org/wiki/Q13497940).
The attached files are names from BGBM and Meise; in the logfile you can look in the column related_parsed_name, and cleaned_index_name_of_empty_result is the index of a cleaned result that ends up empty. Most names are herbarium entries or institutions, but there are also some real names, e.g. in Meise:
… the 3rd name seems missing:
Attached log files from parsing with the wrapper agent_parse4tsv.rb (see https://github.com/infinite-dao/collector-matching/tree/main/bin; I hope the column names of the files are self-explanatory):
VHde_doi-10.15468-dl.tued2e (Virtual Herbarium Germany) occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv_dwcagent_3.0.8.0.log
Meise_doi-10.15468-dl.ax9zkh occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv_dwcagent_3.0.8.0.log
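To make the separator convention mentioned above concrete, here is a minimal sketch that splits a BGBM-style name list on ", " or " & " in plain bash, mirroring the regex /(, | & )/. The function name `split_name_list` is hypothetical and this is not how dwcagent splits names internally. A comma without a following space, as in `Humboldt,F.W.H.A. von`, is deliberately left alone.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not dwcagent's own logic):
# split a name list on ", " or " & " and print one name per line.
split_name_list() {
  local rest=${1// & /, }          # treat " & " like ", "
  while [[ $rest == *", "* ]]; do
    printf '%s\n' "${rest%%, *}"   # text before the first ", "
    rest=${rest#*, }               # drop that name and its separator
  done
  printf '%s\n' "$rest"            # last (or only) name
}

split_name_list 'Galán,P. & Montenegro,S.M.'
split_name_list 'Anonymous collector & Humboldt,F.W.H.A. von'
```

Input errors like `Á,Leiva,Lepper,L.,` are not repaired by such a split and would still need correcting in the source data.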