Improving of name parsing (parse and clean) #18

Open
infinite-dao opened this issue Sep 11, 2023 · 19 comments

@infinite-dao

Hej-hej,

I’m aware how difficult it is to parse and clean every kind of name list out there. Here are some more names that are sometimes parsed but then get lost during cleaning (see attachment; dwcagent 3.0.8.0, run through the wrapper https://github.com/infinite-dao/collector-matching/blob/main/bin/agent_parse4tsv.rb).

In our BGBM data, name lists are usually separated in a way that matches the regex /(, | & )/, e.g.:

  • Anonymous collector & Humboldt,F.W.H.A. von is parsed differently with and without the anonymous collector:
    dwcagent "Anonymous collector & Humboldt,F.W.H.A. von" \
    | jq -c '.[] | with_entries(select(.value |.!=null))'
    {"family":"Humboldt","given":"F.W.H.A."}
    … without the anonymous collector the result is closer, but still not fully right:
    dwcagent "Humboldt,F.W.H.A. von" \
    | jq -c '.[] | with_entries(select(.value |.!=null))'
    {"family":"Humboldt","given":"F. W. H. A. von"}
    … only when the „von“ is placed before the family name does it come out right:
    dwcagent "von Humboldt,F.W.H.A." \
    | jq -c '.[] | with_entries(select(.value |.!=null))'
    {"family":"Humboldt","given":"F.W.H.A.","particle":"von"}
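
Until the parser handles this, a caller could move a trailing particle in front of the family name before handing the string to dwcagent. The following is a minimal Ruby sketch under stated assumptions: `move_trailing_particle` and the `PARTICLES` list are hypothetical, illustrative names and are not part of dwc_agent.

```ruby
# Hypothetical pre-processing helper, not part of dwc_agent.
# Illustrative list only; real data will need more particles.
# Longer particles come first so that "von der" wins over "von".
PARTICLES = ["von der", "van de", "von", "van", "de"].freeze

# Rewrites "Family,Given particle" as "particle Family,Given".
# Strings without a comma or without a known trailing particle
# are returned unchanged.
def move_trailing_particle(name)
  family, given = name.split(",", 2)
  return name if given.nil?

  PARTICLES.each do |particle|
    suffix = " #{particle}"
    next unless given.end_with?(suffix)

    return "#{particle} #{family},#{given.delete_suffix(suffix)}"
  end
  name
end

puts move_trailing_particle("Humboldt,F.W.H.A. von") # von Humboldt,F.W.H.A.
```

The sketch only covers a particle at the very end of the given part; a particle written after the family name (as in «Poel van de,J.») would need a separate rule.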

So there are cases where particle parsing could be improved. A difficult case (one that I think the parser can only partly solve) is:

  • Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C. Here «Köhler,E., Á,Leiva,Lepper,L.,» is actually a mistake in the input (not in the parsing): the source data should be corrected to «Köhler,E., Leiva,Á, Lepper,L.,». Anyway, it is parsed to:
    dwcagent "Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., \
    Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C." \
     | jq -c '.[] | with_entries(select(.value |.!=null))'
    {"family":"Zayas","given":"A.","particle":"Álvarez de"}
    {"family":"Beurton","given":"C."}
    {"family":"Díaz","given":"M.A."}
    {"family":"Dietrich","given":"H."}
    {"family":"Duharte","given":"Góngora"}
    {"family":"Gutiérrez","given":"M.E."}
    {"family":"Köhler","given":"J."}
    {"family":"Leiva","given":"Lepper"}
    {"family":"Rankin","given":"L."}
    {"family":"Sánchez","given":"C."}
    … I’m not sure the particle is right for Álvarez de Zayas,A.; the person is Alberto Álvarez de Zayas (wikidata.org/wiki/Q13497940)

The attached files contain names from BGBM and Meise. In the logfile, look at the column related_parsed_name; cleaned_index_name_of_empty_result is the index of a cleaned result that comes out empty. Most names are herbarium-related terms or institutions, but there are also some real names, e.g. in Meise:

dwcagent "A. Charpin, P. Hainard & R. Salanon"  \
 | jq -c '.[] | with_entries(select(.value |.!=null))'

… the 3rd name seems missing:

{"family":"Charpin","given":"A."}
{"family":"Hainard","given":"P."}

Attached log files from parsing with the wrapper agent_parse4tsv.rb (see https://github.com/infinite-dao/collector-matching/tree/main/bin; I hope the column names of the files are self-explanatory):

@dshorthouse
Member

Thanks for the great work on this. The A. Charpin, P. Hainard & R. Salanon case was caused by an overly greedy match on 'anon' in a BLACKLIST, and that's now fixed. The variable placement of 'von' in the source should also be accommodated now in v3.0.9, just pushed. The Álvarez de Zayas,A. example requires some more investigation.
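
To illustrate the kind of problem behind the Salanon case, here is a small Ruby sketch comparing an unanchored pattern with a word-bounded one. Both patterns are assumptions made up for this illustration; they are not the expressions actually used in dwc_agent's BLACKLIST.

```ruby
# Illustrative patterns only; not the expressions used inside dwc_agent.
GREEDY  = /anon/i                   # matches anywhere, even inside a family name
BOUNDED = /\banon(?:ymous)?\b/i     # matches only the standalone word

["Anonymous collector", "R. Salanon", "anon."].each do |text|
  puts format("%-20s greedy=%-5s bounded=%s",
              text, GREEDY.match?(text), BOUNDED.match?(text))
end
```

With the bounded pattern, «R. Salanon» is no longer treated as an anonymous collector, while «Anonymous collector» and «anon.» still are.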

@infinite-dao
Author

infinite-dao commented Sep 13, 2023

Thank you for taking on this issue.

Here are some more ;-)

related_parsed_name | after cleaning | source string | comment
L.E. Bureau | at cleaned_0index:0 | Bureau,L.E. |
Émil Bureau | at cleaned_0index:0 | Bureau,Émil |
E.R. Guaglianone | at cleaned_0index:2 | Burkart A., Troncoso,N.S., Guaglianone,E.R., Rotman,A., Botta,S. & Buck,H. |
P. Classe | at cleaned_0index:0 | Classe,P. & Gebauer,R. |
G. Classen | at cleaned_0index:0 | Classen,G. |
R. Claßen | at cleaned_0index:0 | Claßen,R. & Hagemann,I. |
J.B.L. Companyo | at cleaned_0index:0 | Companyo,J.B.L. |
Farmer Braun | at cleaned_0index:0 | Farmer Braun |
Theodor Magnus Fries | at cleaned_0index:0 | Fries,Theodor (Thore) Magnus & & al. | probably also faulty input
Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. |
E.R. Guaglianone | at cleaned_0index:0 | Guaglianone,E.R. & Múlgura,M.E. |
H u. S | at cleaned_0index:0 | H u. S |
H. d. D. T. | at cleaned_0index:0 | H. d. D. T. |
R. Claßen | at cleaned_0index:1 | Hagemann,I. & Claßen,R. |
J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. |
Heldreich De | at cleaned_0index:0 | Heldreich De |
De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von |
Theodor Heinrich | at cleaned_0index:1 | Heldreich,Theodor Heinrich von |
I. B. | at cleaned_0index:0 | I. B. |
L. B. | at cleaned_0index:0 | L. B. |
von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | probably also faulty input
Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) |
D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. |
G. Classen | at cleaned_0index:1 | Raadts,E. & Classen,G. |
August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von |
Theodor Schube | at cleaned_0index:0 | Schube,Theodor |
A. Senoner | at cleaned_0index:0 | Senoner,A. |
D. Stafford | at cleaned_0index:0 | Stafford,D. |
Theodor Strauss | at cleaned_0index:0 | Strauss,Theodor |
Theusch | at cleaned_0index:0 | Theusch |
Theusz | at cleaned_0index:0 | Theusz |
Theusz | at cleaned_0index:0 | Theusz (no. 517) |
Thevenon | at cleaned_0index:0 | Thevenon |
Toward | at cleaned_0index:0 | Toward |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de |
Winter de | at cleaned_0index:0 | Winter de & Hardy |
Theodor Wolf | at cleaned_0index:0 | Wolf,Theodor |

@infinite-dao
Author

Here some more names from Meise data:

related_parsed_name | after cleaning empty | source string | comment
J. B. | at cleaned_0index:0 | ? J. B. |
F.M.C. V. | at cleaned_0index:0 | ?F.M.C. V. |
Jean Malvaux sc. | at cleaned_0index:0 | ?Jean Malvaux sc. |
B. K. | at cleaned_0index:1 | ?P.E.G. & B. K. |
A F. | at cleaned_0index:0 | A F. |
A W. | at cleaned_0index:0 | A W., Herb. François Crépin |
A | at cleaned_0index:0 | A, i J. Kornasiowie |
A. B | at cleaned_0index:0 | A. B |
A. B | at cleaned_0index:0 | A. B[auy] [= Braun A.?] |
P.M. A. | at cleaned_0index:2 | A. Besga, M.A. Domingo, A., P.M. & X. Uribe-Echebarría |
A. Blytt in Musée Christiania in F. Crépin | at cleaned_0index:0 | A. Blytt in Musée Christiania in F. Crépin - Herbarium rosarum | OK, difficult: a collection name of some kind
A. R | at cleaned_0index:0 | A. R[ac][] |
A. R | at cleaned_0index:0 | A. R[aesdonk] |
A. R | at cleaned_0index:0 | A. R[oc][] |
ABaillot | at cleaned_0index:0 | ABaillot | perhaps very abbreviated, or a mistake
ABeck | at cleaned_0index:0 | ABeck | perhaps very abbreviated, or a mistake
ABecke | at cleaned_0index:0 | ABecke | perhaps very abbreviated, or a mistake
ABecre | at cleaned_0index:0 | ABecre | perhaps very abbreviated, or a mistake
ABeneschi | at cleaned_0index:0 | ABeneschi | perhaps very abbreviated, or a mistake
ABlol | at cleaned_0index:0 | ABlol | perhaps very abbreviated, or a mistake
ABunge | at cleaned_0index:0 | ABunge | perhaps very abbreviated, or a mistake
AHes | at cleaned_0index:0 | AHes | perhaps very abbreviated, or a mistake
AHlvay | at cleaned_0index:0 | AHlvay | perhaps very abbreviated, or a mistake
AKessemantt | at cleaned_0index:0 | AKessemantt | perhaps very abbreviated, or a mistake
AKranz | at cleaned_0index:0 | AKranz | perhaps very abbreviated, or a mistake
ALetourneux | at cleaned_0index:0 | ALetourneux | perhaps very abbreviated, or a mistake
ALinae | at cleaned_0index:0 | ALinae | perhaps very abbreviated, or a mistake
ALongo | at cleaned_0index:0 | ALongo | perhaps very abbreviated, or a mistake
ALux | at cleaned_0index:0 | ALux | perhaps very abbreviated, or a mistake
ANlemtin | at cleaned_0index:0 | ANlemtin | perhaps very abbreviated, or a mistake
ARothms | at cleaned_0index:0 | ARothms | perhaps very abbreviated, or a mistake
ARukimo | at cleaned_0index:0 | ARukimo | perhaps very abbreviated, or a mistake
ARusland | at cleaned_0index:0 | ARusland | perhaps very abbreviated, or a mistake
ASchumacher | at cleaned_0index:0 | ASchumacher | perhaps very abbreviated, or a mistake
ASecri | at cleaned_0index:0 | ASecri | perhaps very abbreviated, or a mistake
ASeegreg | at cleaned_0index:0 | ASeegreg | perhaps very abbreviated, or a mistake
AVDHangls | at cleaned_0index:0 | AVDHangls | perhaps very abbreviated, or a mistake
AValon | at cleaned_0index:0 | AValon | perhaps very abbreviated, or a mistake
AVirans | at cleaned_0index:0 | AVirans | perhaps very abbreviated, or a mistake
AViranz | at cleaned_0index:0 | AViranz | perhaps very abbreviated, or a mistake
AVranz | at cleaned_0index:0 | AVranz | perhaps very abbreviated, or a mistake
AWal | at cleaned_0index:0 | AWal[raven] | perhaps very abbreviated, or a mistake
AWalcana | at cleaned_0index:0 | AWalcana | perhaps very abbreviated, or a mistake
AWariouy | at cleaned_0index:0 | AWariouy | perhaps very abbreviated, or a mistake
AWeatherby | at cleaned_0index:0 | AWeatherby | perhaps very abbreviated, or a mistake
D. | at cleaned_0index:0 | Abbé D. [Duyany] |
P | at cleaned_0index:0 | Abbé P[]ant |
Ad de | at cleaned_0index:0 | Ad de [Genssen] |
Farmer D. | at cleaned_0index:2 | Akeroyd J., Brookes S., Farmer D. & Jury S. |
Arnaud Ch. | at cleaned_0index:0 | Arnaud Ch. |
Arnold Cl. | at cleaned_0index:0 | Arnold Cl. |
Arnold Fr. | at cleaned_0index:0 | Arnold Fr. |
Becker J. Ph. | at cleaned_0index:0 | Becker J.Ph. |
Bell C.R | at cleaned_0index:0 | Bell C.R & Bell S.F. |
Bequet A.B | at cleaned_0index:0 | Bequet A.B |
Bertolani Ch. | at cleaned_0index:0 | Bertolani (o) Ch. |
Bidgood S. at al. | at cleaned_0index:0 | Bidgood S. at al. |
Binder Em. | at cleaned_0index:0 | Binder Em. |
M.L. Blickenstaff | at cleaned_0index:0 | Blickenstaff M.L. |
Bouillenne Cl. | at cleaned_0index:0 | Bouillenne Cl. |
Granville J.J. De | at cleaned_0index:0 | Granville J.J. De, Acevedo P., Boyer A., Hollenberg L. |
Reichgelt Th. | at cleaned_0index:1 | van Ooststroom S.J. & Reichgelt Th. |

@infinite-dao
Author

Yes, it is getting better, very good.

With dwcagent version 3.0.11.0 I have parsed names again. Some contain two sets of given-name initials joined by “and” with only one family name, a kind of abbreviated spelling, for example «R. Mizuno & C. W. and L. B. O'Brien», which could be resolved as «R. Mizuno & C. W. O'Brien & L. B. O'Brien». This is difficult, though, because there are other “and” strings, as in «Puerto Rico and the Mona Passage & Desecheo Is.», which looks like mixed input and therefore cannot be resolved correctly.

Here are two cases: the first is parsed, cleaned and interpreted as I would expect; in the second, one name is unexpectedly dropped:

NAMELIST=(
  "M. and C. Jaschhof & Project"
  "S. Yu. and N. V. Kuznetsov"
)
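# Quoting the array expansion keeps each name as one word,
# so IFS does not need to be changed (and later restored).
for text in "${NAMELIST[@]}"; do
  echo "input: $text"
  dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done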

… gives:
input: M. and C. Jaschhof & Project

{"family":"Jaschhof","given":"M."}
{"family":"Jaschhof","given":"C."}

input: S. Yu. and N. V. Kuznetsov

{"family":"Kuznetsov","given":"N.V."}

So here are some “and”-concatenations; I hope the comments make sense (by “midst-abbreviated spelling” I mean an abbreviation in the midst of the name list):

related_parsed_name | after cleaning empty | source string | comment
H. W. | at cleaned_0index:5 | Baltic Amber & Prussian Fm. & C. and H. W. Hoffeins & C. & H. W. | kind of midst-abbreviated spelling
M. S. | at cleaned_0index:2 | Blackdown Tableland & Expedition Rg. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling
R. H. | at cleaned_0index:2 | C. Liang & W. LaBerge & R. H. and L. D. Beamer | kind of midst-abbreviated spelling
C. W. | at cleaned_0index:0 | C. W. and L. B. O’Brien | kind of midst-abbreviated spelling
C. W. | at cleaned_0index:0 | C. W. and L. B. O'Brien & R. Mizuno | kind of midst-abbreviated spelling
C. W. | at cleaned_0index:0 | C. W. and L. O'Brien | kind of midst-abbreviated spelling
C. W. | at cleaned_0index:0 | C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling
D.T. Le | at cleaned_0index:0 | D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen | here the name separator seems as regex: /(,\s+|,\s+and\s+)/
M. V. | at cleaned_0index:1 | E. A. Yagmur & M. V. and S. V. Nabozhenko & B. Keskin & I. Chigr | kind of midst-abbreviated spelling
No | at cleaned_0index:2 | F. W. and S. K. Gess & No | kind of midst-abbreviated spelling
M. V. | at cleaned_0index:1 | Islahiye District, W & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling
team | at cleaned_0index:1 | J. A McGuire and team | correctly cleaned
R. J. B. Rockshop field crew | at cleaned_0index:1 | Japh Boyce and R. J. B. Rockshop field crew | correctly cleaned, one could use the non-cleaned parsed result
J. Cl. | at cleaned_0index:0 | J. Cl. and P. Gauthier | kind of midst-abbreviated spelling
J. H. | at cleaned_0index:0 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling
A. M. | at cleaned_0index:1 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling
A.M. J.H. | at cleaned_0index:0 | J.H., A.M. and A.W. Skevington | kind of midst-abbreviated spelling
J. C. R. | at cleaned_0index:2 | J. P. W. Hall & K. R. Willmott & J. C. R. and J. I. R. Willmott | kind of midst-abbreviated spelling
team | at cleaned_0index:1 | Lamas, Nihei and team | correctly cleaned
L. B. | at cleaned_0index:0 | L. B. and C. W. O' Brien | kind of midst-abbreviated spelling
J. H. | at cleaned_0index:0 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling
A. W. | at cleaned_0index:1 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling
L. E. L. | at cleaned_0index:0 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling
M. F. V. | at cleaned_0index:1 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling
M. F. V. Skeleton | at cleaned_0index:3 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling
M. B. | at cleaned_0index:2 | Male, WA & Fortescue R. & M. B. and B. J. Moulds | kind of midst-abbreviated spelling
Project | at cleaned_0index:2 | M. and C. Jaschhof & Project | correctly parsed and cleaned
McIlwrath Ra. | at cleaned_0index:0 | McIlwrath Ra. & G. and A. Daniels. R. Eastwood | kind of midst-abbreviated spelling
M. J. | at cleaned_0index:0 | M. J. and M. - L. Penrith | kind of abbreviated spelling
M. V. | at cleaned_0index:1 | M. V. Nabozhenko & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling
T. H. | at cleaned_0index:1 | New Zealand H. B. Mohi Bush Waimarama & T. H. and J. M. Davies & | kind of midst-abbreviated spelling
C. W. | at cleaned_0index:1 | N Metzquititlan & C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling
A. M. | at cleaned_0index:1 | Nullagine & A. M. and M. J. Douglas | kind of midst-abbreviated spelling
Pedro M. Ruiz-Carranza. Adult | at cleaned_0index:0 | Pedro M. Ruiz-Carranza. Adult male (ICN 19727), and adult female | mixed input, hence unable to resolve correctly
C. W. | at cleaned_0index:1 | P. N. Braulio Carrillo & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling
the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly
Desecheo Is. | at cleaned_0index:2 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly
the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & F. Fisk | mixed input, hence unable to resolve correctly
R. L. | at cleaned_0index:0 | R. L., and B. B. Brown | here the name separator seems as regex: /(,\s+|,\s+and\s+)/
C. W. | at cleaned_0index:1 | R. Mizuno & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling
M. S. | at cleaned_0index:1 | Roper R. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling
M. E. | at cleaned_0index:1 | S. San Pedro Sula & M. E. and P. D. Perkins | kind of midst-abbreviated spelling
S. Yu. | at cleaned_0index:0 | S. Yu. and N. V. Kuznetsov | kind of abbreviated spelling
M. V. | at cleaned_0index:2 | W Antakya, S & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling
M. S. | at cleaned_0index:1 | Wheeny Ck. & M. S. and B. J. Moulds & G. Williams & W. Newham & | kind of midst-abbreviated spelling
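
A note on the separator /(,\s+|,\s+and\s+)/ mentioned in the comments: regex alternation tries the left alternative first, so the «and» variant would have to come first, otherwise «and» stays attached to the last name. A small Ruby sketch of this (the constant names are made up for the illustration; this is plain string splitting, not dwc_agent's parser):

```ruby
# Sketch only: the longer alternative must come first, otherwise ", " wins
# and the "and" stays glued to the last name.
NAIVE_SEPARATOR   = /,\s+|,\s+and\s+/
ORDERED_SEPARATOR = /,\s+and\s+|,\s+/

source = "D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen"

p source.split(NAIVE_SEPARATOR).last    # "and A.N. Nguyen"
p source.split(ORDERED_SEPARATOR).last  # "A.N. Nguyen"
p "R. L., and B. B. Brown".split(ORDERED_SEPARATOR) # ["R. L.", "B. B. Brown"]
```

Splitting alone does not restore the shared family name for «R. L.»; that would be a separate step.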

@dshorthouse
Member

The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu or is it a pair of collectors that share the same family name Kuznetsov like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.

The example C. W. and L. O'Brien has been fixed in the working code with a spec test and will be released soon. The late Charlie O'Brien and his very much living wife Lois O'Brien would be pleased.

The rest of the mixed uses of &/and alongside family members with interlopers I fear will be too challenging to tackle. However, the examples like J. H., A. M. and A. W. Skevington should be easy enough to accommodate (though rare). It's not at all common that three family members collect together, but Jeff Skevington and family might also be pleased if I can do this.
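
For illustration, the shared-family-name pattern could be expanded roughly as in the Ruby sketch below. `expand_shared_family` and `INITIAL_RUN` are hypothetical names, and this is not the approach taken in dwc_agent. The sketch deliberately assumes that every group of initials shares the final family name, which is exactly the assumption that is doubtful for S. Yu. and N. V. Kuznetsov.

```ruby
# Hypothetical helper, not dwc_agent code.
# A run of initials such as "C. W.", "S. Yu." or "J.H."
INITIAL_RUN = /(?:\p{Lu}\p{Ll}{0,2}\.\s?)+/

# Expands "X, Y and Z Family" into ["X Family", "Y Family", "Z Family"].
# Anything that does not fit the pattern is returned unchanged as [text].
def expand_shared_family(text)
  pattern = /\A(?<head>.+?),?\s+and\s+(?<last>#{INITIAL_RUN})\s*(?<family>\p{Lu}[\p{L}'’ -]+)\z/
  match = pattern.match(text)
  return [text] unless match

  givens = (match[:head].split(/,\s*/) << match[:last]).map(&:strip)
  # Every part before the family name must consist of initials only.
  return [text] unless givens.all? { |given| given.match?(/\A#{INITIAL_RUN}\z/) }

  givens.map { |given| "#{given} #{match[:family]}" }
end

p expand_shared_family("C. W. and L. O'Brien")
p expand_shared_family("J. H., A. M. and A. W. Skevington")
p expand_shared_family("Puerto Rico and the Mona Passage")
```

The last call returns its input untouched, because «the» is not a run of initials; mixed lists joined with «&» are out of scope for this sketch.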

@dshorthouse
Member

Prematurely closed via a commit message.

@dshorthouse reopened this Sep 18, 2023

@infinite-dao
Author

The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu or is it a pair of collectors that share the same family name Kuznetsov like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.

Yes, true, it’s difficult. I’m not sure whether it’s the same person (because of the botanical data background), but one S. Yu. Kuznetsov is here as an example: https://www.researchgate.net/profile/S-Kuznetsov-3. So at least it is a real-life example.

@infinite-dao
Author

Here are some from dwcagent 3.0.12.0, with botanical names in the context of BGBM Berlin:

related_parsed_name | after cleaning empty | source string | comment
F.A. Marschall v. Bieberstein | at cleaned_0index:0 | Bieberstein,F.A. Marschall v. | unexpectedly dropped from cleaning; is it a bug in dwcagent version 3.0.12.0?
N.H. Le | at cleaned_0index:2 | Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. | unexpectedly dropped related_parsed_name; «N.H. Le» should probably not be dropped on cleaning
Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. | unexpectedly dropped related_parsed_name; is it regarded as too general a name, like Princess of Bavaria or something?
J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. | unexpectedly dropped related_parsed_name
Heldreich De | at cleaned_0index:0 | Heldreich De | better to curate the source data
De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von | better to curate the source data
I. B. | at cleaned_0index:0 | I. B. | better to curate the source data
L. B. | at cleaned_0index:0 | L. B. | better to curate the source data
von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | is probably Johann Wilhelm Carl Theodor von der Marck
Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) | is probably Carl Friedrich Philipp von Martius, but the family name should be curated so that it does not look abbreviated
D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. | unexpectedly dropped related_parsed_name
August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von | difficult to guess: only two names joined by «&», probably a wrong parser interpretation; it also seems better to curate the source data
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes | I guess information is missing in the source data; better to curate the source data
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de | I guess information is missing in the source data; better to curate the source data
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de | I guess information is missing in the source data; better to curate the source data
Winter de | at cleaned_0index:0 | Winter de & Hardy | I guess information is missing in the source data; better to curate the source data
Á E. | at cleaned_0index:7 | Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C. | as discussed previously, wrong input data at «Köhler, E. …»; no way to guess it right from the parser's perspective

@infinite-dao
Author

infinite-dao commented Sep 20, 2023

dwcagent "Galán,P. & Montenegro,S.M."
# []

… is unexpectedly parsed to an empty result in dwcagent 3.0.12.0

Checking it:

NAMELIST=(
  "Galán,P."
  "Montenegro,S.M."
  "Galán,P. & Montenegro,S.M."
)
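# Quoting the array expansion keeps each name as one word,
# so IFS does not need to be changed (and later restored).
for text in "${NAMELIST[@]}"; do
  echo -en "\n-----------\ninput: $text\n"
  dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done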

… we get:

-----------
input: Galán,P.
{"family":"Galán","given":"P."}

-----------
input: Montenegro,S.M.

-----------
input: Galán,P. & Montenegro,S.M.

@dshorthouse
Member

A few new updates now in v 3.0.13.0:

The Bieberstein,F.A. Marschall v. regression is fixed.

Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. now parses. The Le was aggressively cleaned from a blacklist.

Galán,P. & Montenegro,S.M. now parses. Montenegro was in a country blacklist.

Nelson,J.B. & Poindexter,D.B. now works. A greedy regex matched “index” inside Poindexter.

@infinite-dao
Author

Hi David,

sometimes there are square brackets in the middle of a name, holding probable letters, e.g. Buto[m]a. When this is parsed in dwc_agent 3.0.13.0, it is split into parsed:Buto a and cleaned:Buto A, so an artificial first name is created. It would be better if the brackets were simply ignored, so that Buto[m]a came out as Butoa and not Buto A, or Malaisse Matera;Wa[s]terlain as Malaisse Matera<SEP>Waterlain, right?

The attachment contains many of these examples from Meise name data with square brackets, where they are in the name string itself:

I think a good strategy would be to remove [...] when it sits directly inside a word, between letters, and to do so without inserting a space.
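
As a rough Ruby sketch of that strategy (`strip_inner_brackets` is a hypothetical helper for illustration, not existing dwc_agent code): only brackets that have a letter directly on both sides are touched, and an optional `keep_content` switch covers the alternative reading in which the bracketed letters are kept.

```ruby
# Hypothetical helper, not part of dwc_agent.
# Matches [...] only when a letter stands directly before and after it.
INNER_BRACKETS = /(?<=\p{L})\[([^\[\]]*)\](?=\p{L})/

# Removes inner brackets without inserting a space. By default the bracketed
# letters are dropped; with keep_content: true they are kept.
def strip_inner_brackets(name, keep_content: false)
  name.gsub(INNER_BRACKETS) { keep_content ? Regexp.last_match(1) : "" }
end

puts strip_inner_brackets("Buto[m]a")                      # Butoa
puts strip_inner_brackets("Buto[m]a", keep_content: true)  # Butoma
puts strip_inner_brackets("Malaisse Matera;Wa[s]terlain")  # Malaisse Matera;Waterlain
```

Brackets at the start or end of a word, or next to other brackets, are left alone by this pattern and would need their own rule.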

@infinite-dao
Author

infinite-dao commented Oct 11, 2023

We are getting there — almost 😁 … here with dwc_agent: 3.0.14.0 I found some unexpected results:

  • often the dot is removed
  • some inputs yield no parse result at all
  • [] is difficult to interpret, perhaps just remove it completely and join adjacent characters
#!/bin/bash
NAMELIST=(
  "C. Be[n]ed. [St]enn"       # would expect Stenn
  "Abb. No[yr]ey, abb. Faure" # would expect Abb. Faure
  "Abbé F[r]. Hy in Ch. Flahault" # would expect dot in Fr.
  "An[s]on"                   # would expect Anson
  "Arm. A[n]spach"            # would expect dot to keep in Arm.
  "Attila Meste[r]há[zy]"     # would expect Mesterházy
  "Conzatti, [Holned] & Ordó[ñ]er" # would expect parsed: Conzatti<SEP>Holned<SEP>Ordóñer
  "Corn. C[oriz]e"            # would expect dot to keep
  "Dr. B[on]bier"             # would expect: Bonbier
  "Dr. [B] [B]oiglaender-[T]ekn[]" # would expect parsed: Dr. B Boiglaender-Tekn
  "G.[Char][][g][] in F. Crépin - Herbarium Rosarum" # would expect: G.Charg in F. Crépin
  "Ga[r]dama[uic][][te]" # would expect: Gardamauicte

  # nothing parsed so far the following:
  "R. Ba[renda]/J.Willemsen"  # is parsed all right the following not
  "R. Barendse/[J]. Willemse" # nothing parsed?
  "V.[S]. Nikitin - T.I Zhilenko" # nothing parsed?
  "L. [G][][trofi][]"         # would expect: L. Gtrofi
  "B.E. & J.[G]. Juniper"     # would expect parsed: B.E.<SEP>J.G. Juniper 
      # or interpreted parsed: B.E. Juniper<SEP>J.G. Juniper
      # is dwcagent "B.E." expected to be parsed or not, I guess not?
  "D. Ba[nsk], M. Mathais, R.E.W. jr" # nothing parsed?
  "C[o]l"                     # nothing parsed?
  "C.[R]. Gerard"             # nothing parsed?
  "Comm. K.[j]. Cameron"      # nothing parsed?
  "L.[G]. Ravaud"             # nothing parsed?
  "D.[f]. [G]i[lf]illa[n]"    # nothing parsed?
  "M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal" # nothing parsed?
  "N.S.[q]. in M.R. Clarke"   # nothing parsed?
)
# Quoting the array expansion keeps each name as one word,
# so IFS does not need to be changed (and later restored).
for text in "${NAMELIST[@]}"; do
  echo -en "-----------\ninput: $text\n"
  results=$(dwcagent "${text}")
  if [[ "${results-}" == "[]" ]]; then
    echo "output: $results"
  else
    echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
  fi
done
-----------
input: C. Be[n]ed. [St]enn
{"family":"Enn","given":"C. Bened"}
-----------
input: Abb. No[yr]ey, abb. Faure
{"family":"Noyrey","given":"Abb"}
{"family":"Faure","given":"AbB."}
-----------
input: Abbé F[r]. Hy in Ch. Flahault
{"family":"Ch. Flahault","given":"Fr  Hy","particle":"in","title":"Abbé"}
-----------
input: An[s]on
{"family":"Ans"}
-----------
input: Arm. A[n]spach
{"family":"Anspach","given":"Arm"}
-----------
input: Attila Meste[r]há[zy]
{"family":"Mesterh","given":"Attila"}
-----------
input: Conzatti, [Holned] & Ordó[ñ]er
{"family":"Conzatti"}
{"family":"Er","given":"Ord"}
-----------
input: Corn. C[oriz]e
{"family":"Corize","given":"Corn"}
-----------
input: Dr. B[on]bier
{"family":"Bier","given":"B.","title":"Dr."}
-----------
input: Dr. [B] [B]oiglaender-[T]ekn[]
{"family":"Ekn","given":"Oiglaender","title":"Dr."}
-----------
input: G.[Char][][g][] in F. Crépin - Herbarium Rosarum
{"family":"F. Crépin","given":"G.","particle":"in"}
-----------
input: Ga[r]dama[uic][][te]
{"family":"Te","given":"Gardamauic"}
-----------
input: R. Ba[renda]/J.Willemsen
{"family":"Barenda","given":"R."}
{"family":"Willemsen","given":"J."}
-----------
input: R. Barendse/[J]. Willemse
output: []
-----------
input: V.[S]. Nikitin - T.I Zhilenko
output: []
-----------
input: L. [G][][trofi][]
output: []
-----------
input: B.E. & J.[G]. Juniper
output: []
-----------
input: D. Ba[nsk], M. Mathais, R.E.W. jr
output: []
-----------
input: C[o]l
output: []
-----------
input: C.[R]. Gerard
output: []
-----------
input: Comm. K.[j]. Cameron
output: []
-----------
input: L.[G]. Ravaud
output: []
-----------
input: D.[f]. [G]i[lf]illa[n]
output: []
-----------
input: M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal
output: []
-----------
input: N.S.[q]. in M.R. Clarke
output: []

@dshorthouse
Member

With the v.3.0.15.0 release, I think we're getting pretty close to accommodating many of the issues identified above. Some were intractable & some meant re-evaluating the strict spec tests.

@infinite-dao
Author

Parsing of name particles

In 3.0.16.0 I parsed a name with a comma (a real example from the data) and the same name rearranged without the comma, and the results differ. Should they not be the same? Using the dwcagent wrapper I get:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

Reyna de Aguilar,M.L. is the actual example (e.g. from «Flores,J., Montalvo,E.A., Reyna de Aguilar,M.L. & Calderón,M.») and M.L. Reyna de Aguilar is just the comma-separated version reversed, like so:

Reyna de Aguilar,M.L.
………………………………………… ____
  ↘             ↙
    ⋅         ↙
      ⋅     ↙
        ⋅ ↙
        ↙ ⋅
      ↙     ⋅
    ↙         ⋅
  ↙             ↘
____ …………………………………………
M.L. Reyna de Aguilar
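
The swap in the diagram can be written as a tiny Ruby sketch, which is handy for generating both spellings as test inputs. `invert_comma_name` is a hypothetical helper introduced only for this illustration.

```ruby
# Hypothetical helper for producing test inputs; not part of dwc_agent.
# Turns "Family,Given" into "Given Family"; strings without a comma
# (or with nothing after it) are returned unchanged.
def invert_comma_name(name)
  family, given = name.split(",", 2).map(&:strip)
  given.nil? || given.empty? ? name : "#{given} #{family}"
end

puts invert_comma_name("Reyna de Aguilar,M.L.") # M.L. Reyna de Aguilar
```

Feeding both forms to the parser side by side makes differences like the one above easy to spot.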

It is difficult to analyse name particles (see for instance https://en.wikipedia.org/wiki/Nobiliary_particle or https://en.wikipedia.org/wiki/German_nobility#Nobiliary_particles). They can describe a place or, so to speak, another family name, I think, e.g.

  • Richard von Weizsäcker https://www.deutsche-biographie.de/pnd118766570.html?language=en, where Weizsäcker seems to be a family name, literally Weiz(en) + Sack = wheat + bag, someone who bags wheat, so probably from the miller's profession ;-). In Germany and Austria, „von“ means more or less “descending from” (so it points rather to a real family name, but it can also be a locality ;-), and „zu“ means “resident at”, and so on.
  • «De Gracia,J.» → “J. De Gracia” seems more like a place
  • and so on.

… so it is difficult to assign the name parts to either …

  • the particle part (and keep it perhaps strictly connected with the parsed name particle part), or to
  • the family or given name part

I think this cannot be solved 100%, because there are too many combinations; I just want to mention it here.

@dshorthouse
Member

Looks like your variously arranged names, particles and initials:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

… confuse the underlying Namae gem (https://github.com/berkmancenter/namae/issues), on which dwc_agent depends.

[Screenshot: Screen Shot 2023-11-02 at 1 22 24 PM]

@infinite-dao
Author

infinite-dao commented Nov 13, 2023

By the way: brackets, this time round parentheses name (...) 😁 — in dwc_agent (3.0.16.0) all names in parentheses disappear, is that the general idea, the concept, that all contents with parentheses are removed?

The following examples with parentheses come from WikiData; I considered them candidates for standardizing with dwc_agent as well:

-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}

I did not quite expect that round parentheses would be removed entirely; I expected that some cases would be filtered out and others kept. One possible compromise would be to append the content of the last parenthesis to the family field, unless it contains dates or occupation terms such as botanist. Or should it be documented that all content in round parentheses is removed?
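The compromise described above could be sketched roughly as follows. This is only an illustration in Python, not dwc_agent code: the function name `classify_trailing_parenthesis`, the term lists, and the category labels are all assumptions, and the occupation terms are simply the ones that occur in the examples above.

```python
import re

# Hypothetical term lists, collected from the examples in this thread only.
OCCUPATION_TERMS = {"bot.", "botaniker", "botanist", "diatomist",
                    "instruisto", "mycologist"}
GENERATIONAL = {"sr.", "jr.", "sr", "jr"}

# A single parenthetical group at the very end of the string.
TRAILING_PAREN = re.compile(r"^(?P<name>.*?)\s*\((?P<inner>[^()]*)\)\s*$")


def classify_trailing_parenthesis(text):
    """Return (name_without_parenthesis, inner_text, kind)."""
    match = TRAILING_PAREN.match(text)
    if match is None:
        return text.strip(), None, "none"
    name = match.group("name")
    inner = match.group("inner").strip()
    lowered = inner.lower()
    if re.search(r"\d{4}", inner):
        kind = "dates"           # e.g. (1835-1906), (c. 1666-1711)
    elif lowered in GENERATIONAL:
        kind = "suffix"          # e.g. (Sr.)
    elif re.sub(r"-\d+$", "", lowered) in OCCUPATION_TERMS:
        kind = "occupation"      # e.g. (botanist), (botanist-1)
    else:
        kind = "name_candidate"  # e.g. (Lindberg), (-Frigola)
    return name, inner, kind


for example in ["Søren Sørensen (1873-1926)",
                "Yi Huang (botanist-1)",
                "Hildur von Rettig (Lindberg)"]:
    print(classify_trailing_parenthesis(example))
```

Only the `name_candidate` cases would be worth passing on to the family field; whether that is desirable is exactly the open question here.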

@dshorthouse
Member

By the way: brackets, this time round parentheses name (...) 😁 — in dwc_agent (3.0.16.0) all names in parentheses disappear, is that the general idea, the concept, that all contents with parentheses are removed?

That was the general idea, yes. Most of the examples you have would seem to support the rationale, though admittedly there are some examples that appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expression of family names / identity.

@infinite-dao
Author

That was the general idea, yes. Most of the examples you have would seem to support the rationale, though admittedly there are some examples that appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expression of family names / identity.

I completely agree with this. So perhaps leave it that way.


I have analysed the botanist names from WikiData further and sorted the cases that appear. Most of them are different versions of a name squeezed into one line, so to speak. From a data point of view, these would have to be corrected at the source, with two or more names of the same meaning unravelled into separate entries. An interesting case is names with only a single letter in the parentheses. Here are some examples of botanist names from WikiData:

  • grep --invert-match --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1.txt
  • grep --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1_Auszug.txt

Alan (C.) McKay
Carolyn [Caroline] (A.) Young
H.(Kh.) Karis
Kh.(Ch.) G. Kulieva
Yu.(Ju.) E. Petrov
D.(T.) O'Gorman
Robert J. Ferry (Jr.).
Tatiana (Yu.) Gagkaeva

  • grep --invert-match --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2.txt
  • grep --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2_Auszug.txt

Anton(i) Wróblewski
Dian Min(e) Chang
Jacob Frederic(k) Brenckle

  • grep --invert-match --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3.txt
  • grep --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3_Auszug.txt

Arthur C. (II) Grupe
Bruno E.C. (de) Miranda
Cun Ti (Di) Xiang
Davi Mesquita (de) Macedo
Julian(us) Hendrik Molkenboer
Manuel (de) Assunção Diniz
Marcos Antonio de (Jr) Morais
Marcus (de) Melo Teixeira
Priscila Sanjuan (de) Medeiros
You (Yu) Wen Tsui

  • grep --invert-match --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4.txt
  • grep --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4_Auszug.txt
  • grep --extended-regexp '^[^()]+\w\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_4_Auszug.txt

Adolph(Adolf) Osterwalder
Chris(toffel) F.J. Spies
Constantine Demetry(Dmitriev) Sherbakoff
Dénes(Dionisie) Pázmány
Peng Yong(Yun) Zhang
Ze(Tse) Xiang Peng
Zu(Tsu) Tang Yin

Aloysius (Luigi) Meschinelli
Aldworth William (Tommy) Thompson
Tina (Antje) Hofmann
Wen (Wan) Jia Zhu
Yin Tong (Tang) Xie
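The successive grep filters above act like a sieve: each stage pulls out the names matching one pattern and hands the rest to the next stage. A minimal Python sketch of the same idea follows. The stage labels and the function name `sieve` are made up for illustration; the patterns are copied from the grep commands, and Python's `\w` may differ slightly from grep's depending on locale.

```python
import re

# (label, pattern) pairs in the same order as the grep stages above.
# The labels are hypothetical names chosen for this sketch.
STAGES = [
    ("abbreviation", r"^[^()]+\(\w{1,3}\W\)[^()]+$"),   # Alan (C.) McKay
    ("single_letter", r"^[^()]+\(\w{1}\)[^()]+$"),      # Anton(i) Wróblewski
    ("two_letters", r"^[^()]+\(\w{1,2}\)[^()]+$"),      # Cun Ti (Di) Xiang
    ("single_word", r"^[^()]+\(\w{1,}\)[^()]+$"),       # Tina (Antje) Hofmann
]


def sieve(names):
    """Assign each name to the first stage whose pattern matches it."""
    buckets = {label: [] for label, _ in STAGES}
    buckets["rest"] = []
    for name in names:
        for label, pattern in STAGES:
            if re.search(pattern, name):
                buckets[label].append(name)
                break
        else:
            # No stage matched: this is what the last grep leaves over.
            buckets["rest"].append(name)
    return buckets


sample = ["Alan (C.) McKay", "Anton(i) Wróblewski", "Cun Ti (Di) Xiang",
          "Tina (Antje) Hofmann", "Albert (Jacob Josef) Vandevelde"]
for label, members in sieve(sample).items():
    print(label, members)
```

Because every name lands in exactly one bucket, the order of the stages matters in the same way as the order of the grep calls.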

sed -r -e 's@^[^()]*\([^()]*\)[^()]*$@1 (…) - &@' -e t -e 's@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@2 (…) - &@' -e t -e 's@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@3 (…) - &@' Liste_ursprünglich_vermindert_4.txt | sort

1 (…) - ('Wilson') Sze Wing Wong
1 (…) - (Alexandre Alexis) George Le Monnier
1 (…) - (Bartolomeo Giacomo) Rinaldo Corradi
1 (…) - (Léon Marie Joseph) Gustave Nicolas
1 (…) - (Antonius) Theodoor Wegelin
1 (…) - (B.) Alfred Steinmann
1 (…) - (M.)F. Weyhe
1 (…) - (Mrs.) Leslie Hofer [Mrs. Vincent W.] Lanfear
1 (…) - (Q.)S.Y. Yeung
1 (…) - (Sister) Little Flower
1 (…) - (Wilhelm) William Winkler
1 (…) - A.H.G. ('Bert') Gerrits van den Ende
1 (…) - Albert (Jacob Josef) Vandevelde
1 (…) - August (François Marie Antoine) Tonglet
1 (…) - Amalesh Choudhury (bot.)
1 (…) - Bror Pettersson (botaniker)
1 (…) - István Balázs (instruisto)
1 (…) - James Smith (diatomist)
1 (…) - Robert W. Jones (botanist)
1 (…) - Ting-Ting Zhang (mycologist)
1 (…) - Yi Huang (botanist-1)
1 (…) - Analy Salles (de Azevedo) Melo
1 (…) - Bill Kasongo (Wa Ngoy Kashiki)
1 (…) - Franz August (`Friedrich') Müller
1 (…) - Friedrich (`Franz') Joseph Schelver
1 (…) - Frédéric-Edouard (`Fritz') Kampmann
1 (…) - Félix de Azara (1742-1821)
1 (…) - G. (of Nancy) Gardet
1 (…) - G.(of Bavaria) Gerber
1 (…) - G.I. (H.J.) Sëmina
1 (…) - Geoffrey S. ('Geoff') Hall
1 (…) - Georges André (1888–1973)
1 (…) - Gustav Adolf Ferdinand Eichler (1835-1906)
1 (…) - Giles E. (St J.) Hardy
1 (…) - H.(of Freiburg) Schmidt
1 (…) - H.J. (`Harry') Hudson
1 (…) - Ion(Ioan,Joan) C. Constantineanu
1 (…) - Jac (N.J.) Gelderblom
1 (…) - James Michael ('Jim') Miller
1 (…) - Jennifer Anne ('Jenny') Davidson
1 (…) - Robert J. Ferry (Sr.)
1 (…) - Robert Leroy ('Bob') Hanrahan
1 (…) - Ronald L. ('Ron') Exeter
1 (…) - Russell J. ('Rusty') Rodriguez
1 (…) - Rüdiger Felix (Ruggero Felice) Solla
1 (…) - Sandra L. ('Sandie') Baldauf
1 (…) - Saun-ichirô (Shun-ichirô) Imamura
1 (…) - Stig (Gunnar Anton) Waldheim
1 (…) - Stip (B.R.) Helleman
1 (…) - Søren Sørensen (1873-1926)
1 (…) - Thomas Cooper (botanist)
1 (…) - Vlk Valenta (1925-2010)
1 (…) - William Vernon (c. 1666-1711)
1 (…) - Yin Chan(Ch'An) Wu
1 (…) - Yvette Berenice ('Tivvy') Harvey

2 (…) - (Axel) Helge (Svensson) Stenar
2 (…) - (Carl) Julius (Adolf) Scharlock
2 (…) - Geoff(rey) (S.) Pegg
2 (…) - (Georg) Emil (Carl Christoph) Schuez
2 (…) - (J.A.A.)M.(H.) Goossens-Fontana
2 (…) - (Ludwig) Bernhard (Ehregott) Schmid
2 (…) - Matt(hew) (J.) Trappe
2 (…) - (Philippe) Victoire (Lévêque) de Vilmorin
2 (…) - (Theodor) Julius (Reinhold) von Schröder

3 (…) - (Johan) Fredrik(Friedrich) (Eberhard) Svanlund
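The grouping by number of parenthetical parts can also be done without a chain of substitutions. The following Python sketch is only an illustration under that assumption; the names `group_by_parenthesis_count` and `parenthetical_contents` are hypothetical. Unlike the sed patterns it counts any number of non-nested groups, not only one to three.

```python
import re
from collections import defaultdict

# One non-nested parenthetical group, e.g. "(Axel)" or "('Bert')".
GROUP = re.compile(r"\(([^()]*)\)")


def parenthetical_contents(name):
    """Return the text inside each parenthetical group, in order."""
    return GROUP.findall(name)


def group_by_parenthesis_count(names):
    """Map the number of parenthetical groups to the sorted names having it."""
    grouped = defaultdict(list)
    for name in names:
        grouped[len(parenthetical_contents(name))].append(name)
    return {count: sorted(members) for count, members in sorted(grouped.items())}


sample = ["Stig (Gunnar Anton) Waldheim",
          "(Axel) Helge (Svensson) Stenar",
          "(Johan) Fredrik(Friedrich) (Eberhard) Svanlund"]
for count, members in group_by_parenthesis_count(sample).items():
    print(count, members)
```

Keeping the contents per group makes it easier to decide later which groups are alternative spellings, nicknames, or additional given names.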

@infinite-dao
Author

infinite-dao commented Nov 16, 2023

Names of rulers, nobles, and so on

In our botanical data (https://dr.jacq.org/DR014960) there are also names such as
Friedrich August II.,König von Sachsen s.n., which are currently not processed quite right. There is also a name standardisation in Germany (see https://explore.gnd.network/en/gnd/118917218). Parsing this name in different forms gives the following results (dwc_agent 3.0.16.0):

-----------
input: Friedrich August II.
{"family":"August","given":"Friedrich","suffix":"II."}
-----------
input: Friedrich August II.,König von Sachsen
{"family":"August","given":"Friedrich","suffix":"II."}
{"family":"Sachsen","given":"König","particle":"von"}
-----------
input: Friedrich August II.,König von Sachsen s.n.
output: []

… the last case should output something, but does not 🤔 …

If you take the name „Friedrich August II,König von Sachsen“ strictly, you could read it like this:

{"family":"", "given": "August Friedrich", "suffix": "II.", "title": "König von Sachsen"}

or (if necessary)

{"family": "König von Sachsen", "given": "August Friedrich", "suffix": "II."}

... since August and Friedrich are actually first names in German

The same naming problem arises with e.g. „August der Starke“: https://explore.gnd.network/en/gnd/118505084 contains the standard data, where the name is standardised as „August II, Polen, König“, which makes parsing very tricky. So in both cases there is no family name in the strict sense, is there? The same might apply to English names and so on.
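To make the first reading concrete, here is a rough Python sketch of how a ruler name with a regnal number and a title could be held in a record without a family name. Nothing in it is dwc_agent API: `RulerName`, `parse_ruler`, and both regular expressions are assumptions for illustration, and the approach is naive (it only handles the form "given names, Roman numeral, comma, title" and drops a trailing "s.n.").

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns for illustration only.
SINE_NUMERO = re.compile(r"\s+s\.\s*n\.\s*$", re.IGNORECASE)  # trailing "s.n."
REGNAL = re.compile(r"^(?P<given>.+?)\s+(?P<numeral>[IVX]+)\.?$")


@dataclass
class RulerName:
    given: str
    suffix: Optional[str] = None
    title: Optional[str] = None
    family: Optional[str] = None  # stays empty: no family name in the strict sense


def parse_ruler(text):
    """Return a RulerName, or None if the text has no regnal number."""
    cleaned = SINE_NUMERO.sub("", text.strip())
    head, _, tail = cleaned.partition(",")
    match = REGNAL.match(head.strip())
    if match is None:
        return None
    return RulerName(given=match.group("given"),
                     suffix=match.group("numeral") + ".",
                     title=tail.strip() or None)


print(parse_ruler("Friedrich August II.,König von Sachsen s.n."))
```

The given names are kept in input order here; whether the title belongs in its own field or in the family field is the open question raised above.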
