Name Matching Algorithm Re-design

Doug Palmer edited this page May 2, 2017 · 4 revisions

Motivation

The name matching algorithm is a rather complex collection of hard-coded rules attempting to handle the various lunacies that get thrown at it.

Rather than attempt to expand it and add to its complexity, a re-think is, possibly, in order.

Input

The available information for name matching consists of one or more of the following:

  • A scientific name, possibly partial or aggregate
  • An author
  • A vernacular name
  • Higher-order taxonomic information, such as kingdom, phylum, class, etc.
  • A rank
  • Location information
  • Date information
  • A taxonID, scientificNameID or taxonConceptID, potentially from a different namespace. These provide hints as to previous matches.
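As a rough illustration of the list above, the inputs could be carried in a single immutable bag of evidence where every item is optional but at least one must be present. This is only a sketch: the `Evidence` and `NameQuery` names, and the choice to hold every value as a plain string, are assumptions made for the example and not part of any existing API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical kinds of evidence, mirroring the input list; names are illustrative. */
enum Evidence {
    SCIENTIFIC_NAME, AUTHOR, VERNACULAR_NAME, KINGDOM, PHYLUM, CLASS, RANK,
    LOCATION, DATE, TAXON_ID, SCIENTIFIC_NAME_ID, TAXON_CONCEPT_ID
}

/** Immutable bag of whatever evidence the caller has (assumed design, not the real API). */
final class NameQuery {
    private final Map<Evidence, String> values;

    private NameQuery(Map<Evidence, String> values) {
        this.values = Map.copyOf(values);
    }

    static Builder builder() {
        return new Builder();
    }

    /** Returns the supplied value for this kind of evidence, if there was one. */
    Optional<String> get(Evidence kind) {
        return Optional.ofNullable(values.get(kind));
    }

    static final class Builder {
        private final Map<Evidence, String> values = new EnumMap<>(Evidence.class);

        /** Records a piece of evidence; null or blank values are silently ignored. */
        Builder with(Evidence kind, String value) {
            if (value != null && !value.isBlank()) {
                values.put(kind, value.trim());
            }
            return this;
        }

        /** Enforces the "one or more of the following" rule. */
        NameQuery build() {
            if (values.isEmpty()) {
                throw new IllegalArgumentException("At least one piece of evidence is required");
            }
            return new NameQuery(values);
        }
    }
}
```

A caller would then supply only what it knows, for example `NameQuery.builder().with(Evidence.SCIENTIFIC_NAME, "Acacia dealbata").with(Evidence.KINGDOM, "Plantae").build()`.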

Requirements

  • The matched name represents the lowest-ranked taxon that is most compatible with all the information available
    • Or, if you prefer, the lowest-ranked taxon that does not conflict with the information available
    • Exactly how we handle contradictory information is yet to be developed
  • Use the entirety of the information available at all times. This includes higher-order taxonomic information
  • Allow spatial information to be used for things like excluded and misapplied names, as well as sanity checking
  • Allow date information to be used to match names that existed at the time. This includes things like resolving parent-child synonyms where the original species has been moved to be a subspecies.
  • Allow homonym resolution
  • Allow old IDs to be mapped onto new IDs
  • Allow synonym resolution. This includes resolution of annoying pro-parte synonyms to the least upper bound
  • Handle spelling/orthographic variations gracefully, including switches between Latin genders
  • Handle rancid garbage such as aff., cf. and sp. qualifiers, and voucher names.
  • Handle author abbreviations
  • Things that don't fit into the general flow of the algorithm should be rules-based and driven by an engine, rather than hard-coded.
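The first requirement, "the lowest-ranked taxon that does not conflict with the information available", can be sketched as a filter followed by a pick. The example below is a simplification under stated assumptions: `Rank`, `Candidate` and `LowestCompatible` are hypothetical names, the rank list is heavily abbreviated, and a candidate is treated as conflicting only when it states a different value for a rank the caller supplied. It does not attempt the still-undecided handling of contradictory information.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical, abbreviated rank ordering, highest rank first. */
enum Rank { KINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES, SUBSPECIES }

/** Hypothetical candidate taxon: a name, its rank and its known higher classification. */
record Candidate(String name, Rank rank, Map<Rank, String> classification) {

    /** Conflicts only if the candidate states a different value for a supplied rank. */
    boolean conflictsWith(Map<Rank, String> supplied) {
        return supplied.entrySet().stream().anyMatch(e -> {
            String known = classification.get(e.getKey());
            return known != null && !known.equalsIgnoreCase(e.getValue());
        });
    }
}

final class LowestCompatible {
    private LowestCompatible() {
    }

    /** Returns the lowest-ranked candidate that does not contradict the supplied hints. */
    static Optional<Candidate> select(List<Candidate> candidates, Map<Rank, String> supplied) {
        return candidates.stream()
                .filter(c -> !c.conflictsWith(supplied))
                // Enum order runs from KINGDOM down to SUBSPECIES, so max() is the lowest rank.
                .max(Comparator.comparing(Candidate::rank));
    }
}
```

With the homonym Morus (a plant genus and a bird genus) as candidates, supplying a kingdom of Animalia leaves only the bird genus standing, which is the kind of homonym resolution by higher-order information that the requirements ask for.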

Index building

  • The source name index should be assembled from multiple sources and accumulate information

See https://github.com/AtlasOfLivingAustralia/bie-index/wiki/Index-building-re-design

Backwards compatibility

  • The matcher is pretty much required to be a vanilla Java library, so that it can be embedded in anything that needs name matching.
  • A new API is probably in order, allowing more information to be supplied. The old API needs to be kept for backwards compatibility.
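One way the two APIs might coexist is for the old entry point to become a thin adapter over the new one. The sketch below is purely illustrative: `LegacyNameMatcher`, `NameMatcher`, their method signatures and the `"scientificName"` key are all invented for the example and do not describe the existing library.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the old API: a name goes in, a matched identifier comes out. */
interface LegacyNameMatcher {
    Optional<String> match(String scientificName);
}

/** Hypothetical richer API: accepts any evidence, keyed by term name. */
interface NameMatcher {
    Optional<String> match(Map<String, String> evidence);

    /** Adapts this matcher to the old single-argument signature. */
    default LegacyNameMatcher asLegacy() {
        return name -> match(Map.of("scientificName", name));
    }
}
```

Existing callers would keep calling the single-argument form, while new callers pass the extra evidence directly. Only the new matcher would need maintaining.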

Matching

To be erected here. A shiny new algorithm. See https://github.com/AtlasOfLivingAustralia/data-management/issues/176