Name Matching Algorithm Re-design
Doug Palmer edited this page May 2, 2017 · 4 revisions
The name matching algorithm is a rather complex collection of hard-coded rules attempting to handle the various lunacies that get thrown at it.
Rather than attempt to expand it and add to its complexity, a re-think is, possibly, in order.
The available information for name matching consists of one or more of the following:
- A scientific name, possibly partial or aggregate
- An author
- A vernacular name
- Higher-order taxonomic information, such as kingdom, phylum, class, etc.
- A rank
- Location information
- Date information
- A taxonID, scientificNameID or taxonConceptID, potentially from a different namespace. These provide hints as to previous matches.
- The matched name represents the lowest-ranked taxon that is most compatible with all the information available
- Or, if you prefer, the lowest-ranked taxon that does not conflict with the information available
- Exactly how we handle contradictory information is yet to be developed
- Use the entirety of the information available at all times. This includes higher-order taxonomic information
- Allow spatial information to be used for things like excluded and misapplied names, as well as sanity checking
- Allow date information to be used to match names that existed at the time. This includes things like resolving parent-child synonyms where the original species has been moved to be a subspecies.
- Allow homonym resolution
- Allow old IDs to be mapped onto new IDs
- Allow synonym resolution. This includes resolution of annoying pro-parte synonyms to the least upper bound
- Handle spelling/orthographic variations gracefully, including switches between Latin genders
- Handle rancid garbage such as aff., cf. and sp. qualifiers, and voucher names.
- Handle author abbreviations
- Things that don't fit into the general flow of the algorithm should be rules-based and driven by an engine, rather than hard-coded.
- The source name index should be assembled from multiple sources and accumulate information
See https://github.com/AtlasOfLivingAustralia/bie-index/wiki/Index-building-re-design
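One of the requirements above asks for pro-parte synonyms to be resolved to the least upper bound. A minimal sketch of that idea follows, assuming the taxonomy is available as a simple acyclic child-to-parent map of taxon IDs. The class `LeastUpperBound` and its methods are hypothetical names made up for illustration; they are not part of any existing library or of the eventual design.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: resolve a pro-parte synonym to the least upper bound of its accepted taxa. */
class LeastUpperBound {
    /** Child taxon ID -> parent taxon ID. A root taxon has no entry. Assumed to be acyclic. */
    private final Map<String, String> parent = new HashMap<>();

    /** Register a taxon; pass a null parentId for a root. */
    void addTaxon(String id, String parentId) {
        if (parentId != null) {
            parent.put(id, parentId);
        }
    }

    /** Path from the taxon up to its root, starting with the taxon itself. */
    List<String> lineage(String id) {
        List<String> path = new ArrayList<>();
        for (String t = id; t != null; t = parent.get(t)) {
            path.add(t);
        }
        return path;
    }

    /** Lowest taxon that is equal to, or an ancestor of, every candidate; null if there is none. */
    String resolve(Collection<String> candidates) {
        Iterator<String> it = candidates.iterator();
        if (!it.hasNext()) {
            return null;
        }
        // The first lineage is ordered lowest to highest; retainAll keeps that order.
        List<String> common = lineage(it.next());
        while (it.hasNext()) {
            common.retainAll(new HashSet<>(lineage(it.next())));
        }
        return common.isEmpty() ? null : common.get(0);
    }
}
```

With two species in the same genus as candidates, `resolve` returns the genus; with species from two genera in the same family, it returns the family. A single candidate resolves to itself, so ordinary synonyms fall out as a special case.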
- This is pretty much required to be a vanilla Java library, so that it can be embedded in anything that needs name matching.
- A new API is probably in order, allowing more information to be supplied. The old API needs to be kept for backwards compatibility.
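One possible shape for such an API is sketched below: a query object that carries whatever information is available, a single rich entry point, and an old-style single-name call that delegates to it. Everything here (`NameQuery`, `NameMatcher`, the field names, the use of a taxon ID string as the result) is an assumption for illustration only; it does not reflect the signatures of the existing API.

```java
import java.util.Optional;

/** Hypothetical query object; only a few of the possible fields are shown. */
final class NameQuery {
    final String scientificName;
    final String author;
    final String vernacularName;
    final String kingdom;
    final String rank;

    private NameQuery(Builder b) {
        this.scientificName = b.scientificName;
        this.author = b.author;
        this.vernacularName = b.vernacularName;
        this.kingdom = b.kingdom;
        this.rank = b.rank;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Builder so that callers supply only the information they actually have. */
    static final class Builder {
        private String scientificName;
        private String author;
        private String vernacularName;
        private String kingdom;
        private String rank;

        Builder scientificName(String v) { this.scientificName = v; return this; }
        Builder author(String v) { this.author = v; return this; }
        Builder vernacularName(String v) { this.vernacularName = v; return this; }
        Builder kingdom(String v) { this.kingdom = v; return this; }
        Builder rank(String v) { this.rank = v; return this; }

        NameQuery build() { return new NameQuery(this); }
    }
}

/** Hypothetical matcher: one rich entry point plus a legacy-style convenience method. */
interface NameMatcher {
    /** New-style call: all available information is supplied together. Returns a taxon ID if matched. */
    Optional<String> match(NameQuery query);

    /** Old-style call kept for backwards compatibility; delegates to the new entry point. */
    default Optional<String> match(String scientificName) {
        return match(NameQuery.builder().scientificName(scientificName).build());
    }
}
```

Keeping the old call as a thin wrapper means there is only one matching code path to maintain, while existing callers continue to work unchanged.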
To be erected here. A shiny new algorithm. See https://github.com/AtlasOfLivingAustralia/data-management/issues/176