Improving name parsing (parse and clean) #18

infinite-dao · 2023-09-11T13:24:12Z

Hej-hej,

I’m aware of how difficult it is to parse and clean every kind of name list out there. So here are some more names that are sometimes parsed but then get lost during cleaning (see attachment; dwcagent 3.0.8.0; I used the wrapper https://github.com/infinite-dao/collector-matching/blob/main/bin/agent_parse4tsv.rb).

In our BGBM data the name-list separator often corresponds to a regex like /(, | & )/, e.g.:

Anonymous collector & Humboldt,F.W.H.A. von is parsed differently with and without the anonymous collector:

dwcagent "Anonymous collector & Humboldt,F.W.H.A. von" \
| jq -c '.[] | with_entries(select(.value |.!=null))'

{"family":"Humboldt","given":"F.W.H.A."}

… but without the anonymous collector the result is closer to correct, though not fully:

dwcagent "Humboldt,F.W.H.A. von" \
| jq -c '.[] | with_entries(select(.value |.!=null))'

{"family":"Humboldt","given":"F. W. H. A. von"}

… only when the „von“ is placed before the family name is it parsed correctly:

dwcagent "von Humboldt,F.W.H.A." \
| jq -c '.[] | with_entries(select(.value |.!=null))'

{"family":"Humboldt","given":"F.W.H.A.","particle":"von"}

So there are cases where the parsing of the particle could be improved. A difficult case, and one I think the parser can only partly solve, is:

Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C.

Here Köhler,E., Á,Leiva,Lepper,L., is actually a mistake in the input (not in the parsing): the source data should be corrected to Köhler,E., Leiva,Á, Lepper,L., Anyway, it is parsed to:

dwcagent "Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., \
Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C." \
 | jq -c '.[] | with_entries(select(.value |.!=null))'

{"family":"Zayas","given":"A.","particle":"Álvarez de"}
{"family":"Beurton","given":"C."}
{"family":"Díaz","given":"M.A."}
{"family":"Dietrich","given":"H."}
{"family":"Duharte","given":"Góngora"}
{"family":"Gutiérrez","given":"M.E."}
{"family":"Köhler","given":"J."}
{"family":"Leiva","given":"Lepper"}
{"family":"Rankin","given":"L."}
{"family":"Sánchez","given":"C."}

… I’m not sure whether the particle is right for Álvarez de Zayas,A.; the person is Alberto Álvarez de Zayas (wikidata.org/wiki/Q13497940)

The attached files contain names from BGBM and Meise. In the log file, look at the column related_parsed_name; cleaned_index_name_of_empty_result is the index of a cleaned result that came out empty. Most entries are herbarium or institution names, but there are also some real names, e.g. in Meise:

dwcagent "A. Charpin, P. Hainard & R. Salanon"  \
 | jq -c '.[] | with_entries(select(.value |.!=null))'

… the 3rd name seems missing:

{"family":"Charpin","given":"A."}
{"family":"Hainard","given":"P."}

Attached log files from parsing with the wrapper agent_parse4tsv.rb (see https://github.com/infinite-dao/collector-matching/tree/main/bin; I hope the column names of the files are self-explanatory):

BGBM (VHde_doi-10.15468-dl.tued2e (Virtual Herbarium Germany)) occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv_dwcagent_3.0.8.0.log
Meise (Meise_doi-10.15468-dl.ax9zkh) occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv_dwcagent_3.0.8.0.log

dshorthouse · 2023-09-11T16:30:28Z

Thanks for the great work on this. The A. Charpin, P. Hainard & R. Salanon case was caused by an overly greedy use of 'anon' in a BLACKLIST, and that's now fixed. The variable placement of 'von' in the source should also be accommodated now in v3.0.9, just pushed. The Álvarez de Zayas,A. example requires some more investigation.
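
The 'anon' problem can be illustrated with a minimal sketch. Both patterns below are assumptions made up for illustration and are not the actual BLACKLIST in dwc_agent: an unanchored pattern also fires inside Salanon, while a pattern that requires non-letter neighbours does not.

```ruby
# Illustrative only: these patterns are assumptions, not dwc_agent's BLACKLIST.
GREEDY_ANON  = /anon/i
BOUNDED_ANON = /(?<!\p{L})anon(?:ymous)?(?!\p{L})/i

# True when the pattern flags the text as an anonymous-collector marker.
def flagged?(text, pattern)
  text.match?(pattern)
end

["R. Salanon", "Anonymous collector", "anon."].each do |name|
  puts format("%-20s greedy=%-5s bounded=%s", name,
              flagged?(name, GREEDY_ANON), flagged?(name, BOUNDED_ANON))
end
```

With the bounded pattern, R. Salanon is no longer flagged, while Anonymous collector and anon. still are.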

infinite-dao · 2023-09-13T08:14:13Z

Thank you for taking on this issue.

Here are some more ;-)

Attached from BGBM (Virtual Herbarium Germany):
VHde_doi-10.15468-dl.tued2e_occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv_dwcagent_3.0.9.0.log
I also noticed that in rare cases Farmer is actually a name (perhaps more in the past than at present)
- e.g. Hugh Farmer (1714-1787) https://en.wikipedia.org/wiki/Hugh_Farmer
- John S. Farmer (e.g. https://daten.digitale-sammlungen.de/~db/0007/bsb00077261/images/index.html a book from 1897)
- so it is difficult to decide whether or not to apply keywords that drop a name during cleaning

related_parsed_name	after cleaning	source string	comment
`L.E. Bureau`	at cleaned_0index:0	`Bureau,L.E.`
`Émil Bureau`	at cleaned_0index:0	`Bureau,Émil`
`E.R. Guaglianone`	at cleaned_0index:2	`Burkart A., Troncoso,N.S., Guaglianone,E.R., Rotman,A., Botta,S. & Buck,H.`
`P. Classe`	at cleaned_0index:0	`Classe,P. & Gebauer,R.`
`G. Classen`	at cleaned_0index:0	`Classen,G.`
`R. Claßen`	at cleaned_0index:0	`Claßen,R. & Hagemann,I.`
`J.B.L. Companyo`	at cleaned_0index:0	`Companyo,J.B.L.`
`Farmer Braun`	at cleaned_0index:0	`Farmer Braun`
`Theodor Magnus Fries`	at cleaned_0index:0	`Fries,Theodor (Thore) Magnus & & al.`	probably also faulty input
`Goetzen Graf von`	at cleaned_0index:0	`Goetzen Graf von & Maire,C.`
`E.R. Guaglianone`	at cleaned_0index:0	`Guaglianone,E.R. & Múlgura,M.E.`
`H u. S`	at cleaned_0index:0	`H u. S`
`H. d. D. T.`	at cleaned_0index:0	`H. d. D. T.`
`R. Claßen`	at cleaned_0index:1	`Hagemann,I. & Claßen,R.`
`J. Poel van de`	at cleaned_0index:3	`Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R.`
`Heldreich De`	at cleaned_0index:0	`Heldreich De`
`De`	at cleaned_0index:1	`Heldreich De & Halácsy,E. von`
`Theodor Heinrich`	at cleaned_0index:1	`Heldreich,Theodor Heinrich von`
`I. B.`	at cleaned_0index:0	`I. B.`
`L. B.`	at cleaned_0index:0	`L. B.`
`von der`	at cleaned_0index:1	`Marck,J.W.C.T., von der`	probably also faulty input
`Martius. C.F.P. von`	at cleaned_0index:0	`Martius. C.F.P. von (no. Herb. Fl. Bras. 483)`
`D.B. Poindexter`	at cleaned_0index:1	`Nelson,J.B. & Poindexter,D.B.`
`G. Classen`	at cleaned_0index:1	`Raadts,E. & Classen,G.`
`August Leopold von`	at cleaned_0index:1	`Reuss,August Leopold von & Reuss,A.L. von`
`Theodor Schube`	at cleaned_0index:0	`Schube,Theodor`
`A. Senoner`	at cleaned_0index:0	`Senoner,A.`
`D. Stafford`	at cleaned_0index:0	`Stafford,D.`
`Theodor Strauss`	at cleaned_0index:0	`Strauss,Theodor`
`Theusch`	at cleaned_0index:0	`Theusch`
`Theusz`	at cleaned_0index:0	`Theusz`
`Theusz`	at cleaned_0index:0	`Theusz (no. 517)`
`Thevenon`	at cleaned_0index:0	`Thevenon`
`Toward`	at cleaned_0index:0	`Toward`
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes`
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes de`
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes,B.E.E. de`
`Winter de`	at cleaned_0index:0	`Winter de & Hardy`
`Theodor Wolf`	at cleaned_0index:0	`Wolf,Theodor`

infinite-dao · 2023-09-13T13:27:34Z

Here are some more names from the Meise data:

occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv_dwcagent_3.0.9.0.log

related_parsed_name	after cleaning empty	source string	comment
`J. B.`	at cleaned_0index:0	`? J. B.`
`F.M.C. V.`	at cleaned_0index:0	`?F.M.C. V.`
`Jean Malvaux sc.`	at cleaned_0index:0	`?Jean Malvaux sc.`
`B. K.`	at cleaned_0index:1	`?P.E.G. & B. K.`
`A F.`	at cleaned_0index:0	`A F.`
`A W.`	at cleaned_0index:0	`A W., Herb. François Crépin`
`A`	at cleaned_0index:0	`A, i J. Kornasiowie`
`A. B`	at cleaned_0index:0	`A. B`
`A. B`	at cleaned_0index:0	`A. B[auy] [= Braun A.?]`
`P.M. A.`	at cleaned_0index:2	`A. Besga, M.A. Domingo, A., P.M. & X. Uribe-Echebarría`
`A. Blytt in Musée Christiania in F. Crépin`	at cleaned_0index:0	`A. Blytt in Musée Christiania in F. Crépin - Herbarium rosarum`	OK, difficult: a collection name of some kind
`A. R`	at cleaned_0index:0	`A. R[ac][]`
`A. R`	at cleaned_0index:0	`A. R[aesdonk]`
`A. R`	at cleaned_0index:0	`A. R[oc][]`
`ABaillot`	at cleaned_0index:0	`ABaillot`	perhaps very abbreviated, or mistake
`ABeck`	at cleaned_0index:0	`ABeck`	perhaps very abbreviated, or mistake
`ABecke`	at cleaned_0index:0	`ABecke`	perhaps very abbreviated, or mistake
`ABecre`	at cleaned_0index:0	`ABecre`	perhaps very abbreviated, or mistake
`ABeneschi`	at cleaned_0index:0	`ABeneschi`	perhaps very abbreviated, or mistake
`ABlol`	at cleaned_0index:0	`ABlol`	perhaps very abbreviated, or mistake
`ABunge`	at cleaned_0index:0	`ABunge`	perhaps very abbreviated, or mistake
`AHes`	at cleaned_0index:0	`AHes`	perhaps very abbreviated, or mistake
`AHlvay`	at cleaned_0index:0	`AHlvay`	perhaps very abbreviated, or mistake
`AKessemantt`	at cleaned_0index:0	`AKessemantt`	perhaps very abbreviated, or mistake
`AKranz`	at cleaned_0index:0	`AKranz`	perhaps very abbreviated, or mistake
`ALetourneux`	at cleaned_0index:0	`ALetourneux`	perhaps very abbreviated, or mistake
`ALinae`	at cleaned_0index:0	`ALinae`	perhaps very abbreviated, or mistake
`ALongo`	at cleaned_0index:0	`ALongo`	perhaps very abbreviated, or mistake
`ALux`	at cleaned_0index:0	`ALux`	perhaps very abbreviated, or mistake
`ANlemtin`	at cleaned_0index:0	`ANlemtin`	perhaps very abbreviated, or mistake
`ARothms`	at cleaned_0index:0	`ARothms`	perhaps very abbreviated, or mistake
`ARukimo`	at cleaned_0index:0	`ARukimo`	perhaps very abbreviated, or mistake
`ARusland`	at cleaned_0index:0	`ARusland`	perhaps very abbreviated, or mistake
`ASchumacher`	at cleaned_0index:0	`ASchumacher`	perhaps very abbreviated, or mistake
`ASecri`	at cleaned_0index:0	`ASecri`	perhaps very abbreviated, or mistake
`ASeegreg`	at cleaned_0index:0	`ASeegreg`	perhaps very abbreviated, or mistake
`AVDHangls`	at cleaned_0index:0	`AVDHangls`	perhaps very abbreviated, or mistake
`AValon`	at cleaned_0index:0	`AValon`	perhaps very abbreviated, or mistake
`AVirans`	at cleaned_0index:0	`AVirans`	perhaps very abbreviated, or mistake
`AViranz`	at cleaned_0index:0	`AViranz`	perhaps very abbreviated, or mistake
`AVranz`	at cleaned_0index:0	`AVranz`	perhaps very abbreviated, or mistake
`AWal`	at cleaned_0index:0	`AWal[raven]`	perhaps very abbreviated, or mistake
`AWalcana`	at cleaned_0index:0	`AWalcana`	perhaps very abbreviated, or mistake
`AWariouy`	at cleaned_0index:0	`AWariouy`	perhaps very abbreviated, or mistake
`AWeatherby`	at cleaned_0index:0	`AWeatherby`	perhaps very abbreviated, or mistake
`D.`	at cleaned_0index:0	`Abbé D. [Duyany]`
`P`	at cleaned_0index:0	`Abbé P[]ant`
`Ad de`	at cleaned_0index:0	`Ad de [Genssen]`
`Farmer D.`	at cleaned_0index:2	`Akeroyd J., Brookes S., Farmer D. & Jury S.`
`Arnaud Ch.`	at cleaned_0index:0	`Arnaud Ch.`
`Arnold Cl.`	at cleaned_0index:0	`Arnold Cl.`
`Arnold Fr.`	at cleaned_0index:0	`Arnold Fr.`
`Becker J. Ph.`	at cleaned_0index:0	`Becker J.Ph.`
`Bell C.R`	at cleaned_0index:0	`Bell C.R & Bell S.F.`
`Bequet A.B`	at cleaned_0index:0	`Bequet A.B`
`Bertolani Ch.`	at cleaned_0index:0	`Bertolani (o) Ch.`
`Bidgood S. at al.`	at cleaned_0index:0	`Bidgood S. at al.`
`Binder Em.`	at cleaned_0index:0	`Binder Em.`
`M.L. Blickenstaff`	at cleaned_0index:0	`Blickenstaff M.L.`
`Bouillenne Cl.`	at cleaned_0index:0	`Bouillenne Cl.`
`Granville J.J. De`	at cleaned_0index:0	`Granville J.J. De, Acevedo P., Boyer A., Hollenberg L.`
`Reichgelt Th.`	at cleaned_0index:1	`van Ooststroom S.J. & Reichgelt Th.`

infinite-dao · 2023-09-18T12:11:52Z

Yes, it is getting better, very good.

With dwcagent version 3.0.11.0 I have parsed the names again. Some contain two sets of given names connected with “and” and only one family name, a kind of abbreviated spelling. For example, «R. Mizuno & C. W. and L. B. O'Brien» could be resolved as «R. Mizuno & C. W. O'Brien & L. B. O'Brien». This is difficult, though, because there are other “and” chains, as in «Puerto Rico and the Mona Passage & Desecheo Is.», which seems to be mixed input and hence cannot be resolved correctly.

Here are two cases: the first is parsed and cleaned correctly and interpreted as I would expect; in the second, one name is unexpectedly dropped:

NAMELIST=(
  "M. and C. Jaschhof & Project"
  "S. Yu. and N. V. Kuznetsov"
)
# Quote the array expansion so each element stays one word (no IFS changes needed)
for text in "${NAMELIST[@]}"; do
  echo "input: $text"
  dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done

… gives:
input: M. and C. Jaschhof & Project

{"family":"Jaschhof","given":"M."}
{"family":"Jaschhof","given":"C."}

input: S. Yu. and N. V. Kuznetsov

{"family":"Kuznetsov","given":"N.V."}

So here are some “and” concatenations. I hope the comments make sense (by “midst-abbreviated spelling” I mean an abbreviated spelling in the midst of the name list):

related_parsed_name	after cleaning empty	source string	comment
`H. W.`	at cleaned_0index:5	`Baltic Amber & Prussian Fm. & C. and H. W. Hoffeins & C. & H. W.`	kind of midst-abbreviated spelling
`M. S.`	at cleaned_0index:2	`Blackdown Tableland & Expedition Rg. & M. S. and B. J. Moulds`	kind of midst-abbreviated spelling
`R. H.`	at cleaned_0index:2	`C. Liang & W. LaBerge & R. H. and L. D. Beamer`	kind of midst-abbreviated spelling
`C. W.`	at cleaned_0index:0	`C. W. and L. B. O’Brien`	kind of midst-abbreviated spelling
`C. W.`	at cleaned_0index:0	`C. W. and L. B. O'Brien & R. Mizuno`	kind of midst-abbreviated spelling
`C. W.`	at cleaned_0index:0	`C. W. and L. O'Brien`	kind of midst-abbreviated spelling
`C. W.`	at cleaned_0index:0	`C. W. and L. O'Brien & G. Wibmer`	kind of midst-abbreviated spelling
`D.T. Le`	at cleaned_0index:0	`D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen`	here the name separator seems to be the regex: `/(,\s+and\s+|,\s+)/`
`M. V.`	at cleaned_0index:1	`E. A. Yagmur & M. V. and S. V. Nabozhenko & B. Keskin & I. Chigr`	kind of midst-abbreviated spelling
`No`	at cleaned_0index:2	`F. W. and S. K. Gess & No`	kind of midst-abbreviated spelling
`M. V.`	at cleaned_0index:1	`Islahiye District, W & M. V. and S. V. Nabozhenko & B. Keskin`	kind of midst-abbreviated spelling
`team`	at cleaned_0index:1	`J. A McGuire and team`	correctly cleaned
`R. J. B. Rockshop field crew`	at cleaned_0index:1	`Japh Boyce and R. J. B. Rockshop field crew`	correctly cleaned, one could use the non-cleaned parsed result
`J. Cl.`	at cleaned_0index:0	`J. Cl. and P. Gauthier`	kind of midst-abbreviated spelling
`J. H.`	at cleaned_0index:0	`J. H., A. M. and A. W. Skevington`	kind of midst-abbreviated spelling
`A. M.`	at cleaned_0index:1	`J. H., A. M. and A. W. Skevington`	kind of midst-abbreviated spelling
`A.M. J.H.`	at cleaned_0index:0	`J.H., A.M. and A.W. Skevington`	kind of midst-abbreviated spelling
`J. C. R.`	at cleaned_0index:2	`J. P. W. Hall & K. R. Willmott & J. C. R. and J. I. R. Willmott`	kind of midst-abbreviated spelling
`team`	at cleaned_0index:1	`Lamas, Nihei and team`	correctly cleaned
`L. B.`	at cleaned_0index:0	`L. B. and C. W. O' Brien`	kind of midst-abbreviated spelling
`J. H.`	at cleaned_0index:0	`Leg. & J. H., A. W. and A. M. Skevington`	mixed input, but kind of 3-fold abbreviated spelling
`A. W.`	at cleaned_0index:1	`Leg. & J. H., A. W. and A. M. Skevington`	mixed input, but kind of 3-fold abbreviated spelling
`L. E. L.`	at cleaned_0index:0	`L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton`	kind of midst-abbreviated spelling
`M. F. V.`	at cleaned_0index:1	`L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton`	kind of midst-abbreviated spelling
`M. F. V. Skeleton`	at cleaned_0index:3	`L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton`	kind of midst-abbreviated spelling
`M. B.`	at cleaned_0index:2	`Male, WA & Fortescue R. & M. B. and B. J. Moulds`	kind of midst-abbreviated spelling
`Project`	at cleaned_0index:2	`M. and C. Jaschhof & Project`	correctly parsed and cleaned
`McIlwrath Ra.`	at cleaned_0index:0	`McIlwrath Ra. & G. and A. Daniels. R. Eastwood`	kind of midst-abbreviated spelling
`M. J.`	at cleaned_0index:0	`M. J. and M. - L. Penrith`	kind of abbreviated spelling
`M. V.`	at cleaned_0index:1	`M. V. Nabozhenko & M. V. and S. V. Nabozhenko & B. Keskin`	kind of midst-abbreviated spelling
`T. H.`	at cleaned_0index:1	`New Zealand H. B. Mohi Bush Waimarama & T. H. and J. M. Davies &`	kind of midst-abbreviated spelling
`C. W.`	at cleaned_0index:1	`N Metzquititlan & C. W. and L. O'Brien & G. Wibmer`	kind of midst-abbreviated spelling
`A. M.`	at cleaned_0index:1	`Nullagine & A. M. and M. J. Douglas`	kind of midst-abbreviated spelling
`Pedro M. Ruiz-Carranza. Adult`	at cleaned_0index:0	`Pedro M. Ruiz-Carranza. Adult male (ICN 19727), and adult female`	mixed input, hence unable to resolve correctly
`C. W.`	at cleaned_0index:1	`P. N. Braulio Carrillo & C. W. and L. B. O'Brien`	kind of midst-abbreviated spelling
`the Mona Passage`	at cleaned_0index:1	`Puerto Rico and the Mona Passage & Desecheo Is.`	mixed input, hence unable to resolve correctly
`Desecheo Is.`	at cleaned_0index:2	`Puerto Rico and the Mona Passage & Desecheo Is.`	mixed input, hence unable to resolve correctly
`the Mona Passage`	at cleaned_0index:1	`Puerto Rico and the Mona Passage & F. Fisk`	mixed input, hence unable to resolve correctly
`R. L.`	at cleaned_0index:0	`R. L., and B. B. Brown`	here the name separator seems to be the regex: `/(,\s+and\s+|,\s+)/`
`C. W.`	at cleaned_0index:1	`R. Mizuno & C. W. and L. B. O'Brien`	kind of midst-abbreviated spelling
`M. S.`	at cleaned_0index:1	`Roper R. & M. S. and B. J. Moulds`	kind of midst-abbreviated spelling
`M. E.`	at cleaned_0index:1	`S. San Pedro Sula & M. E. and P. D. Perkins`	kind of midst-abbreviated spelling
`S. Yu.`	at cleaned_0index:0	`S. Yu. and N. V. Kuznetsov`	kind of abbreviated spelling
`M. V.`	at cleaned_0index:2	`W Antakya, S & M. V. and S. V. Nabozhenko & B. Keskin`	kind of midst-abbreviated spelling
`M. S.`	at cleaned_0index:1	`Wheeny Ck. & M. S. and B. J. Moulds & G. Williams & W. Newham &`	kind of midst-abbreviated spelling

dshorthouse · 2023-09-18T23:40:39Z

The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu, or is it a pair of collectors that share the same family name Kuznetsov, like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]? I'd assume it's the former and not the latter, but this is one of those examples where making an assumption to accommodate it will break accommodation elsewhere.

The example C. W. and L. O'Brien has been fixed in the working code with a spec test and will be released soon. The late Charlie O'Brien and his very much living wife Lois O'Brien would be pleased.

The rest of the mixed uses of &/and alongside family members with interlopers will, I fear, be too challenging to tackle. However, examples like J. H., A. M. and A. W. Skevington should be easy enough to accommodate (though rare). It's not at all common that three family members collect together, but Jeff Skevington and family might also be pleased if I can do this.
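
As a rough illustration of the shared-family-name cases discussed here, the following sketch expands a list of initials that share one trailing family name. It is not how dwc_agent implements this; the method name `expand_shared_family` and all patterns are hypothetical. It deliberately takes the "shared family name" reading, which, as noted for S. Yu. and N. V. Kuznetsov, is ambiguous, and it leaves any input containing & or non-initial parts untouched.

```ruby
# Sketch only: assumes every part before the last one consists of initials.
INITIAL     = /\p{Lu}\p{Ll}{0,2}\.\s*/              # "C.", "Yu.", "Cl."
INITIALS    = /\A(?:#{INITIAL})+\z/                 # initials and nothing else
LAST_MEMBER = /\A((?:#{INITIAL})+)(\p{Lu}[^&,]+)\z/ # initials + family name
LIST_SPLIT  = /\s*,\s*and\s+|\s*,\s*|\s+and\s+/

# Returns one string per person, or [text] unchanged when the pattern
# "initials, initials and initials Family" does not apply.
def expand_shared_family(text)
  parts = text.strip.split(LIST_SPLIT)
  last  = LAST_MEMBER.match(parts.last.to_s)
  heads = parts[0...-1]
  return [text] if last.nil? || heads.empty? || !heads.all? { |p| p.match?(INITIALS) }

  family = last[2].strip
  heads.map { |p| "#{p.strip} #{family}" } + ["#{last[1].strip} #{family}"]
end

p expand_shared_family("C. W. and L. O'Brien")
p expand_shared_family("J. H., A. M. and A. W. Skevington")
p expand_shared_family("Puerto Rico and the Mona Passage & Desecheo Is.")
```

The first two calls yield one entry per family member; the third returns the input unchanged because it does not fit the pattern.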

dshorthouse · 2023-09-18T23:51:27Z

Prematurely closed via a commit message.

infinite-dao · 2023-09-20T07:43:03Z

> The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu or is it a pair of collectors that share the same family name Kuznetsov like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.

Yes, true, it’s difficult. I’m not sure whether it’s the same person (given the botanical data background), but one S. Yu. Kuznetsov can be found here as an example: https://www.researchgate.net/profile/S-Kuznetsov-3. So it is at least a real-life example.

infinite-dao · 2023-09-20T09:11:19Z

Here are some from dwcagent 3.0.12.0, with botanical names in the context of BGBM Berlin:

related_parsed_name	after cleaning empty	source string	comment
`F.A. Marschall v. Bieberstein`	at cleaned_0index:0	`Bieberstein,F.A. Marschall v.`	unexpectedly dropped from cleaning—is it a ?bug in dwcagent version 3.0.12.0
`N.H. Le`	at cleaned_0index:2	`Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V.`	unexpectedly dropped related_parsed_name somehow: should «N.H. Le» not be dropped on cleaning(?)
`Goetzen Graf von`	at cleaned_0index:0	`Goetzen Graf von & Maire,C.`	unexpectedly dropped related_parsed_name—is it regarded as a name too general, like Princess of Bavaria or something?
`J. Poel van de`	at cleaned_0index:3	`Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R.`	unexpectedly dropped related_parsed_name
`Heldreich De`	at cleaned_0index:0	`Heldreich De`	better curate source data
`De`	at cleaned_0index:1	`Heldreich De & Halácsy,E. von`	better curate source data
`I. B.`	at cleaned_0index:0	`I. B.`	better curate source data
`L. B.`	at cleaned_0index:0	`L. B.`	better curate source data
`von der`	at cleaned_0index:1	`Marck,J.W.C.T., von der`	is probably `Johann Wilhelm Carl Theodor von der Marck`
`Martius. C.F.P. von`	at cleaned_0index:0	`Martius. C.F.P. von (no. Herb. Fl. Bras. 483)`	is probably `Carl Friedrich Philipp von Martius` but family name should be curated, not being abbreviated
`D.B. Poindexter`	at cleaned_0index:1	`Nelson,J.B. & Poindexter,D.B.`	unexpectedly dropped related_parsed_name
`August Leopold von`	at cleaned_0index:1	`Reuss,August Leopold von & Reuss,A.L. von`	difficult to guess the only two names concatenated by «&», probably a wrong parser interpretation—it seems also better to curate source data
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes`	I guess missing information in source data, better curate source data
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes de`	I guess missing information in source data, better curate source data
`Wilde de`	at cleaned_0index:0	`Wilde de & Wilde-Duyfjes,B.E.E. de`	I guess missing information in source data, better curate source data
`Winter de`	at cleaned_0index:0	`Winter de & Hardy`	I guess missing information in source data, better curate source data
`Á E.`	at cleaned_0index:7	`Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C.`	as discussed previously, wrong input data at «Köhler, E. …», no way to guess it right from the parser's perspective

infinite-dao · 2023-09-20T10:52:30Z

dwcagent "Galán,P. & Montenegro,S.M."
# []

… is unexpectedly parsed to an empty result in dwcagent 3.0.12.0

Checking it:

NAMELIST=(
  "Galán,P."
  "Montenegro,S.M."
  "Galán,P. & Montenegro,S.M."
)
# Quote the array expansion so each element stays one word (no IFS changes needed)
for text in "${NAMELIST[@]}"; do
  echo -en "\n-----------\ninput: $text\n"
  dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done

… we get:

-----------
input: Galán,P.
{"family":"Galán","given":"P."}

-----------
input: Montenegro,S.M.

-----------
input: Galán,P. & Montenegro,S.M.

dshorthouse · 2023-09-23T14:38:49Z

A few new updates now in v 3.0.13.0:

The Bieberstein,F.A. Marschall v. regression is fixed.

Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. now parses. The Le was aggressively cleaned from a blacklist.

Galán,P. & Montenegro,S.M. now parses. Montenegro was in a country blacklist.

Nelson,J.B. & Poindexter,D.B. now works. The greedy regex matched 'index' inside Poindexter.
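
For the Le and Montenegro cases, one conceivable guard is to drop a stoplisted family name only when no given name or initials accompany it. This is purely an illustrative assumption: how v3.0.13.0 actually resolves these cases is not stated here, and the stoplist entries and the method name `drop_on_clean?` below are hypothetical.

```ruby
# Hypothetical stoplist entries, taken from the cases in this thread.
STOPLIST = %w[le montenegro].freeze

# Drop a parsed name only when its family name is on the stoplist AND
# no given name or initials accompany it.
def drop_on_clean?(family:, given: nil)
  STOPLIST.include?(family.to_s.downcase) && given.to_s.strip.empty?
end

puts drop_on_clean?(family: "Montenegro", given: "S.M.")
puts drop_on_clean?(family: "Le", given: "N.H.")
puts drop_on_clean?(family: "Montenegro")
```

Under this rule a bare Montenegro would still be dropped as a probable country, while Montenegro,S.M. and Le,N.H. would survive cleaning.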

infinite-dao · 2023-10-09T13:35:38Z

Hi David,

sometimes there are square brackets in the middle of a name, enclosing probable letters, e.g. Buto[m]a. When parsed in dwc_agent 3.0.13.0, this is split into parsed: Buto a and cleaned: Buto A, so an artificial first name is created. It would be better if the brackets were just ignored, so that Buto[m]a came out as Butoa and not Buto A, or Malaisse Matera;Wa[s]terlain as Malaisse Matera<SEP>Waterlain, right?

The attachment contains many of these examples from Meise name data with square brackets, where they are in the name string itself:

source_names_with_brackets_inbetween.md

I think a good strategy would be to remove the [...] when it sits directly inside a word, between letters, and to do so without inserting a space.
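
A minimal sketch of that strategy, as a pre-processing step before parsing, could look like the following. It is not part of dwc_agent; the method name `join_inline_brackets` and its `keep_content` switch are hypothetical. The switch covers both readings: drop only the bracket characters and keep the letters, or drop the bracketed letters entirely as proposed above. Bracket groups not attached to a letter are left alone.

```ruby
# One or more bracket groups that touch a letter on at least one side.
INLINE_BRACKETS = /(?<=\p{L})(?:\[[^\[\]]*\])+|(?:\[[^\[\]]*\])+(?=\p{L})/

# keep_content: true  -> remove only "[" and "]" and join the letters
# keep_content: false -> remove the brackets together with their content
def join_inline_brackets(text, keep_content: true)
  text.gsub(INLINE_BRACKETS) { |run| keep_content ? run.delete("[]") : "" }
end

puts join_inline_brackets("Buto[m]a")                        # Butoma
puts join_inline_brackets("Buto[m]a", keep_content: false)   # Butoa
puts join_inline_brackets("Wa[s]terlain", keep_content: false) # Waterlain
```

Neither variant inserts a space, so no artificial given name such as "A" can be split off.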

infinite-dao · 2023-10-11T09:27:39Z

We are getting there — almost 😁 … here with dwc_agent: 3.0.14.0 I found some unexpected results:

- often the dot is removed
- some inputs yield no result at all
- [] is difficult to interpret; perhaps just remove it completely and join the adjacent characters

#!/bin/bash
NAMELIST=(
  "C. Be[n]ed. [St]enn"       # would expect Stenn
  "Abb. No[yr]ey, abb. Faure" # would expect Abb. Faure
  "Abbé F[r]. Hy in Ch. Flahault" # would expect dot in Fr.
  "An[s]on"                   # would expect Anson
  "Arm. A[n]spach"            # would expect dot to keep in Arm.
  "Attila Meste[r]há[zy]"     # would expect Mesterházy
  "Conzatti, [Holned] & Ordó[ñ]er" # would expect parsed: Conzatti<SEP>Holned<SEP>Ordóñer
  "Corn. C[oriz]e"            # would expect dot to keep
  "Dr. B[on]bier"             # would expect: Bonbier
  "Dr. [B] [B]oiglaender-[T]ekn[]" # would expect parsed: Dr. B Boiglaender-Tekn
  "G.[Char][][g][] in F. Crépin - Herbarium Rosarum" # would expect: G.Charg in F. Crépin
  "Ga[r]dama[uic][][te]" # would expect: Gardamauicte

  # of the following, only the first is parsed so far:
  "R. Ba[renda]/J.Willemsen"  # this one is parsed all right, the following are not
  "R. Barendse/[J]. Willemse" # nothing parsed?
  "V.[S]. Nikitin - T.I Zhilenko" # nothing parsed?
  "L. [G][][trofi][]"         # would expect: L. Gtrofi
  "B.E. & J.[G]. Juniper"     # would expect parsed: B.E.<SEP>J.G. Juniper 
      # or interpreted parsed: B.E. Juniper<SEP>J.G. Juniper
      # is dwcagent "B.E." expected to be parsed or not, I guess not?
  "D. Ba[nsk], M. Mathais, R.E.W. jr" # nothing parsed?
  "C[o]l"                     # nothing parsed?
  "C.[R]. Gerard"             # nothing parsed?
  "Comm. K.[j]. Cameron"      # nothing parsed?
  "L.[G]. Ravaud"             # nothing parsed?
  "D.[f]. [G]i[lf]illa[n]"    # nothing parsed?
  "M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal" # nothing parsed?
  "N.S.[q]. in M.R. Clarke"   # nothing parsed?
)
# Quote the array expansion: unquoted, the [..] parts could be glob-expanded
for text in "${NAMELIST[@]}"; do
  echo -en "-----------\ninput: $text\n"
  results=$(dwcagent "${text}")
  if [[ "${results-}" == "[]" ]]; then
    echo "output: $results"
  else
    echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
  fi
done

-----------
input: C. Be[n]ed. [St]enn
{"family":"Enn","given":"C. Bened"}
-----------
input: Abb. No[yr]ey, abb. Faure
{"family":"Noyrey","given":"Abb"}
{"family":"Faure","given":"AbB."}
-----------
input: Abbé F[r]. Hy in Ch. Flahault
{"family":"Ch. Flahault","given":"Fr  Hy","particle":"in","title":"Abbé"}
-----------
input: An[s]on
{"family":"Ans"}
-----------
input: Arm. A[n]spach
{"family":"Anspach","given":"Arm"}
-----------
input: Attila Meste[r]há[zy]
{"family":"Mesterh","given":"Attila"}
-----------
input: Conzatti, [Holned] & Ordó[ñ]er
{"family":"Conzatti"}
{"family":"Er","given":"Ord"}
-----------
input: Corn. C[oriz]e
{"family":"Corize","given":"Corn"}
-----------
input: Dr. B[on]bier
{"family":"Bier","given":"B.","title":"Dr."}
-----------
input: Dr. [B] [B]oiglaender-[T]ekn[]
{"family":"Ekn","given":"Oiglaender","title":"Dr."}
-----------
input: G.[Char][][g][] in F. Crépin - Herbarium Rosarum
{"family":"F. Crépin","given":"G.","particle":"in"}
-----------
input: Ga[r]dama[uic][][te]
{"family":"Te","given":"Gardamauic"}
-----------
input: R. Ba[renda]/J.Willemsen
{"family":"Barenda","given":"R."}
{"family":"Willemsen","given":"J."}
-----------
input: R. Barendse/[J]. Willemse
output: []
-----------
input: V.[S]. Nikitin - T.I Zhilenko
output: []
-----------
input: L. [G][][trofi][]
output: []
-----------
input: B.E. & J.[G]. Juniper
output: []
-----------
input: D. Ba[nsk], M. Mathais, R.E.W. jr
output: []
-----------
input: C[o]l
output: []
-----------
input: C.[R]. Gerard
output: []
-----------
input: Comm. K.[j]. Cameron
output: []
-----------
input: L.[G]. Ravaud
output: []
-----------
input: D.[f]. [G]i[lf]illa[n]
output: []
-----------
input: M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal
output: []
-----------
input: N.S.[q]. in M.R. Clarke
output: []

dshorthouse · 2023-10-14T04:23:14Z

With the v.3.0.15.0 release, I think we're getting pretty close to accommodating many of the issues identified above. Some were intractable & some meant re-evaluating the strict spec tests.

infinite-dao · 2023-11-02T15:47:22Z

Parsing of name particles

In 3.0.16.0 I parsed a real example written with a comma, and the same name rewritten without the comma, and the results differ. Should they not be the same? Using the dwcagent wrapper I get:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

… Reyna de Aguilar,M.L. is the actual example (e.g. from «Flores,J., Montalvo,E.A., Reyna de Aguilar,M.L. & Calderón,M.») and M.L. Reyna de Aguilar is just the comma-separated version reversed, like so:

Reyna de Aguilar,M.L.
………………………………………… ____
  ↘             ↙
    ⋅         ↙
      ⋅     ↙
        ⋅ ↙
        ↙ ⋅
      ↙     ⋅
    ↙         ⋅
  ↙             ↘
____ …………………………………………
M.L. Reyna de Aguilar

It is difficult to analyse name particles (see for instance https://en.wikipedia.org/wiki/Nobiliary_particle or https://en.wikipedia.org/wiki/German_nobility#Nobiliary_particles). I think they can denote a place or, so to speak, another family name, e.g.

- Richard von Weizsäcker https://www.deutsche-biographie.de/pnd118766570.html?language=en: Weizsäcker seems to be a family name, literally Weiz(en) + Sack = wheat + bag, someone who bags wheat, so probably from the miller's profession ;-). In Germany and Austria, „von“ means more or less “descending from” (so what follows is more likely a real family name, but it can also be a locality ;-), and „zu“ means “resident at”, and so on.
- «De Gracia,J.» → “J. De Gracia” seems more like a place
- and so on.

… so it is difficult to assign the name parts to either …

- the particle part (perhaps keeping it strictly connected with the parsed name particle), or
- the family or given name part

I think this is not 100% solvable because there are too many combinations; I just want to mention it here.

dshorthouse · 2023-11-02T17:26:26Z

Looks like your variously arranged names, particles and initials:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

confuse the underlying Namae gem (https://github.com/berkmancenter/namae/issues), on which dwc_agent depends.

infinite-dao · 2023-11-13T14:31:08Z

By the way, brackets again, this time round parentheses (...) 😁: in dwc_agent (3.0.16.0) all names in parentheses disappear. Is that the general idea, the concept, that all content in parentheses is removed?

The following examples with parentheses come from Wikidata; I thought of standardizing these candidates with dwc_agent as well:

-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}

I did not quite expect round parentheses to be removed entirely; I expected that some cases would be filtered out and others kept. One possible compromise is to simply add the content of the last parentheses to the parsed family field if it contains no dates or occupation terms such as botanist. Or should it be documented that all content in round parentheses is removed?
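
A minimal sketch of that compromise, in plain Ruby and independent of dwc_agent. All method names here are hypothetical, the occupation list is an assumption drawn only from the examples above, and treating any content with a digit as a date is a simplification.

```ruby
# Hypothetical helpers, not part of dwc_agent or Namae.

# Assumed, incomplete list of occupation terms taken from the examples above.
def occupation_term?(content)
  content.match?(/\A(bot\.|botaniker|botanist|diatomist|instruisto|mycologist)(-\d+)?\z/i)
end

# Split "Name (extra)" into ["Name", "extra"].
# Returns [input, nil] when there is no trailing parenthesized part.
def split_trailing_parens(input)
  m = input.match(/\A(.*?)\s*\(([^()]*)\)\s*\z/)
  m ? [m[1], m[2]] : [input, nil]
end

# Classify the parenthesized content. Occupations are checked first so that
# "botanist-1" is not mistaken for a date because of its digit.
def classify_parens(content)
  return :occupation if occupation_term?(content)
  return :date if content.match?(/\d/)
  return :suffix if content.match?(/\A(Sr|Jr)\.?\z/)
  :name
end

# Append the content to the family name only when it looks like a name.
# A leading hyphen, as in "Vidal(-Frigola)", is joined without a space.
def family_with_parens(family, content)
  return family if content.nil? || classify_parens(content) != :name
  content.start_with?("-") ? family + content : "#{family} (#{content})"
end
```

With these assumptions, `family_with_parens("Rettig", "Lindberg")` gives `"Rettig (Lindberg)"`, while dates, occupations and Sr./Jr. leave the family name untouched. Whether a value like `"Rettig (Lindberg)"` is desirable in the family field remains an open question.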

dshorthouse · 2023-11-14T19:00:13Z

> By the way, brackets again, this time round parentheses, as in name (...) 😁. In dwc_agent (3.0.16.0) everything inside parentheses disappears. Is that the general idea, that all content in parentheses is removed?

That was the general idea, yes. Most of your examples seem to support the rationale, though admittedly some appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expressions of family name or identity.

infinite-dao · 2023-11-15T13:06:30Z

> That was the general idea, yes. Most of your examples seem to support the rationale, though admittedly some appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expressions of family name or identity.

I completely agree with this. So perhaps leave it that way.

I have further analysed the botanist names from WikiData and sorted the cases that appear. Most of them are actually different versions of a name squeezed into one line, so to speak. From a data point of view these would need correcting: two or more names with the same meaning would have to be unravelled. An interesting case is the names with only one letter in the parentheses. Here are some examples of botanist names from WikiData:

grep --invert-match --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1.txt
grep --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1_Auszug.txt

Alan (C.) McKay
Carolyn [Caroline] (A.) Young
H.(Kh.) Karis
Kh.(Ch.) G. Kulieva
Yu.(Ju.) E. Petrov
D.(T.) O'Gorman
Robert J. Ferry (Jr.).
Tatiana (Yu.) Gagkaeva

grep --invert-match --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2.txt
grep --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2_Auszug.txt

Anton(i) Wróblewski
Dian Min(e) Chang
Jacob Frederic(k) Brenckle
…

grep --invert-match --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3.txt
grep --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3_Auszug.txt

Arthur C. (II) Grupe
Bruno E.C. (de) Miranda
Cun Ti (Di) Xiang
Davi Mesquita (de) Macedo
Julian(us) Hendrik Molkenboer
Manuel (de) Assunção Diniz
Marcos Antonio de (Jr) Morais
Marcus (de) Melo Teixeira
Priscila Sanjuan (de) Medeiros
You (Yu) Wen Tsui

grep --invert-match --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4.txt
grep --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4_Auszug.txt
grep --extended-regexp '^[^()]+\w\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_4_Auszug.txt

Adolph(Adolf) Osterwalder
Chris(toffel) F.J. Spies
Constantine Demetry(Dmitriev) Sherbakoff
Dénes(Dionisie) Pázmány
Peng Yong(Yun) Zhang
Ze(Tse) Xiang Peng
Zu(Tsu) Tang Yin
…

Aloysius (Luigi) Meschinelli
Aldworth William (Tommy) Thompson
Tina (Antje) Hofmann
Wen (Wan) Jia Zhu
Yin Tong (Tang) Xie
…
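
The successive grep filters above can also be sketched as a single classifier. This is an approximation under stated assumptions: `paren_bucket` and the bucket names are hypothetical, and `[[:word:]]` is used in place of `\w` so that non-ASCII letters count as word characters in Ruby. As in the grep patterns, a name only qualifies if it has exactly one parenthesized group with text before and after it.

```ruby
# Hypothetical classifier mirroring the order of the grep steps above.
def paren_bucket(name)
  # Exactly one "(...)" group, with text on both sides.
  m = name.match(/\A([^()]+)\(([^()]*)\)[^()]+\z/)
  return :other unless m

  before = m[1]
  content = m[2]
  case content
  when /\A[[:word:]]{1,3}[^[:word:]]\z/ then :abbreviation  # Alan (C.) McKay
  when /\A[[:word:]]\z/ then :one_char                      # Anton(i) Wróblewski
  when /\A[[:word:]]{1,2}\z/ then :two_chars                # Cun Ti (Di) Xiang
  when /\A[[:word:]]+\z/
    # Distinguish "Adolph(Adolf)" from "Tina (Antje)".
    before.match?(/[[:word:]]\z/) ? :word_attached : :word_spaced
  else
    :other                                                  # e.g. Ion(Ioan,Joan) ...
  end
end
```

Names with trailing parentheses such as `Thomas Cooper (botanist)` fall into `:other` here, just as they pass through all four grep steps untouched.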

cat Liste_ursprünglich_vermindert_4.txt | sed -r 's@^[^()]*\([^()]*\)[^()]*$@1 (…) - &@; s@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@2 (…) - &@; s@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@3 (…) - &@;' | sort

1 (…) - ('Wilson') Sze Wing Wong
1 (…) - (Alexandre Alexis) George Le Monnier
1 (…) - (Bartolomeo Giacomo) Rinaldo Corradi
1 (…) - (Léon Marie Joseph) Gustave Nicolas
1 (…) - (Antonius) Theodoor Wegelin
1 (…) - (B.) Alfred Steinmann
1 (…) - (M.)F. Weyhe
1 (…) - (Mrs.) Leslie Hofer [Mrs. Vincent W.] Lanfear
1 (…) - (Q.)S.Y. Yeung
1 (…) - (Sister) Little Flower
1 (…) - (Wilhelm) William Winkler
1 (…) - A.H.G. ('Bert') Gerrits van den Ende
1 (…) - Albert (Jacob Josef) Vandevelde
1 (…) - August (François Marie Antoine) Tonglet
1 (…) - Amalesh Choudhury (bot.)
1 (…) - Bror Pettersson (botaniker)
1 (…) - István Balázs (instruisto)
1 (…) - James Smith (diatomist)
1 (…) - Robert W. Jones (botanist)
1 (…) - Ting-Ting Zhang (mycologist)
1 (…) - Yi Huang (botanist-1)
1 (…) - Analy Salles (de Azevedo) Melo
1 (…) - Bill Kasongo (Wa Ngoy Kashiki)
1 (…) - Franz August (`Friedrich') Müller
1 (…) - Friedrich (`Franz') Joseph Schelver
1 (…) - Frédéric-Edouard (`Fritz') Kampmann
1 (…) - Félix de Azara (1742-1821)
1 (…) - G. (of Nancy) Gardet
1 (…) - G.(of Bavaria) Gerber
1 (…) - G.I. (H.J.) Sëmina
1 (…) - Geoffrey S. ('Geoff') Hall
1 (…) - Georges André (1888–1973)
1 (…) - Gustav Adolf Ferdinand Eichler (1835-1906)
1 (…) - Giles E. (St J.) Hardy
1 (…) - H.(of Freiburg) Schmidt
1 (…) - H.J. (`Harry') Hudson
1 (…) - Ion(Ioan,Joan) C. Constantineanu
1 (…) - Jac (N.J.) Gelderblom
1 (…) - James Michael ('Jim') Miller
1 (…) - Jennifer Anne ('Jenny') Davidson
1 (…) - Robert J. Ferry (Sr.)
1 (…) - Robert Leroy ('Bob') Hanrahan
1 (…) - Ronald L. ('Ron') Exeter
1 (…) - Russell J. ('Rusty') Rodriguez
1 (…) - Rüdiger Felix (Ruggero Felice) Solla
1 (…) - Sandra L. ('Sandie') Baldauf
1 (…) - Saun-ichirô (Shun-ichirô) Imamura
1 (…) - Stig (Gunnar Anton) Waldheim
1 (…) - Stip (B.R.) Helleman
1 (…) - Søren Sørensen (1873-1926)
1 (…) - Thomas Cooper (botanist)
1 (…) - Vlk Valenta (1925-2010)
1 (…) - William Vernon (c. 1666-1711)
1 (…) - Yin Chan(Ch'An) Wu
1 (…) - Yvette Berenice ('Tivvy') Harvey

2 (…) - (Axel) Helge (Svensson) Stenar
2 (…) - (Carl) Julius (Adolf) Scharlock
2 (…) - Geoff(rey) (S.) Pegg
2 (…) - (Georg) Emil (Carl Christoph) Schuez
2 (…) - (J.A.A.)M.(H.) Goossens-Fontana
2 (…) - (Ludwig) Bernhard (Ehregott) Schmid
2 (…) - Matt(hew) (J.) Trappe
2 (…) - (Philippe) Victoire (Lévêque) de Vilmorin
2 (…) - (Theodor) Julius (Reinhold) von Schröder

3 (…) - (Johan) Fredrik(Friedrich) (Eberhard) Svanlund
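
The grouping by number of parenthesized parts shown above can be sketched in Ruby as well. The method names are hypothetical. Unlike the sed one-liner, which only labels lines with one, two or three groups, this version labels every line with its count, including 0.

```ruby
# Count the non-nested "(...)" groups in a name string.
def paren_group_count(name)
  name.scan(/\([^()]*\)/).size
end

# Prefix each name with its group count and sort the resulting lines.
def label_by_paren_groups(names)
  names.map { |n| "#{paren_group_count(n)} (…) - #{n}" }.sort
end
```

For example, `paren_group_count("Geoff(rey) (S.) Pegg")` counts two groups, so that line would be labelled `2 (…) - Geoff(rey) (S.) Pegg`.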

infinite-dao · 2023-11-16T14:49:59Z

Names of rulers, nobles, and so on

Our botanical data (https://dr.jacq.org/DR014960) also contain names such as Friedrich August II.,König von Sachsen s.n., which is currently not processed quite right. Germany also has an authority record that standardises this name (see https://explore.gnd.network/en/gnd/118917218). Parsing variants of this name gives the following results (dwc_agent 3.0.16.0):

-----------
input: Friedrich August II.
{"family":"August","given":"Friedrich","suffix":"II."}
-----------
input: Friedrich August II.,König von Sachsen
{"family":"August","given":"Friedrich","suffix":"II."}
{"family":"Sachsen","given":"König","particle":"von"}
-----------
input: Friedrich August II.,König von Sachsen s.n.
output: []

… the last case should output something, but does not 🤔 …

If you take the name „Friedrich August II,König von Sachsen“ strictly, you could read it like this:

{"family": "", "given": "Friedrich August", "suffix": "II.", "title": "König von Sachsen"}

or (if necessary)

{"family": "König von Sachsen", "given": "Friedrich August", "suffix": "II."}

... since August and Friedrich are actually first names in German

The same naming problem arises with e.g. „August der Starke“: https://explore.gnd.network/en/gnd/118505084 holds the authority data, where the name is standardised as „August II, Polen, König“, which makes parsing very tricky. So in both cases there is no family name in the strict sense, is there? The same might apply to English names and so on.
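
One way to picture the first reading is a small pre-parser that runs before the regular name parsing. This is only a sketch under several assumptions: `parse_regnal_name` is a hypothetical name, the `title` key is not something dwc_agent is shown to output in this thread, the regnal number is assumed to be a Roman numeral built from I, V and X, and a trailing "s.n." is assumed to be a collection-number placeholder that can be dropped.

```ruby
# Hypothetical pre-parser for "given names + Roman numeral[, title]".
# Returns a hash with string keys, or nil when the pattern does not apply.
def parse_regnal_name(input)
  # Assumption: a trailing "s.n." is not part of the name.
  cleaned = input.sub(/\s+s\.\s*n\.\s*\z/, "")
  m = cleaned.match(
    /\A(?<given>\p{L}+(?: \p{L}+)*) (?<suffix>[IVX]+\.?)(?:\s*,\s*(?<title>.+))?\z/
  )
  return nil unless m

  # compact drops the "title" key when no title follows the comma.
  { "family" => "", "given" => m[:given], "suffix" => m[:suffix], "title" => m[:title] }.compact
end
```

Ordinary names such as `Thomas Cooper` return `nil` and would go through the normal parser unchanged. Forms without a numeral after the given names, such as „August der Starke“, are not covered by this pattern.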

dshorthouse added a commit that referenced this issue Sep 13, 2023

Tidy-up some more issues identified in #18

8052896

dshorthouse closed this as completed in fa19a8d Sep 18, 2023

dshorthouse reopened this Sep 18, 2023

dshorthouse added a commit that referenced this issue Oct 9, 2023

Remove square brackets to accommodate #18

0b6f6cb

dshorthouse added a commit that referenced this issue Oct 13, 2023

More work on issues uncovered in #18

99a6595

infinite-dao mentioned this issue Nov 2, 2023

Improve the preparation for the name match (given name, family, particle, etc.) infinite-dao/collector-matching#1

Open
