
General information

Jonathan A Rees edited this page Nov 13, 2019 · 10 revisions

This page describes the taxonomy used by the Open Tree of Life project.

There is a writeup on the assembly process and on the taxonomy: Automated assembly of a reference taxonomy for phylogenetic data synthesis. This wiki page predates that article.

The taxonomy is a merge of SILVA, NCBI/GenBank Taxonomy, the GBIF "nub" taxonomy, Index Fungorum, IRMNG, and a few others. There is a list of requirements for potential new inputs into OTT; see below.

Release notes

Release notes, with download locations, reside in the repository, here.

Gotchas

This taxonomy is not authoritative by any stretch of the imagination. It is a product of expedience meant to fill the particular immediate needs of Open Tree of Life, nothing else.

Mistakes might come from any of the source taxonomies or be introduced by us.

When there is any question about parent/child taxon relationships, higher-priority taxonomies always take precedence over lower-priority ones. SILVA and Index Fungorum are highest priority, then NCBI, GBIF, and IRMNG, in that order.
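The precedence rule can be illustrated with a minimal sketch. This is not the actual merge code: the function name `choose_parent`, the tags "silva", "if", and "irmng", and the treatment of SILVA and Index Fungorum as equal in priority are all assumptions made for illustration.

```python
# Priority order as described above; a lower number means higher priority.
# Assumption: the tags "silva", "if" and "irmng" are hypothetical, and
# SILVA and Index Fungorum are treated as tied for first place.
PRIORITY = {"silva": 0, "if": 0, "ncbi": 1, "gbif": 2, "irmng": 3}


def choose_parent(candidates):
    """Return the parent proposed by the highest-priority source.

    candidates: dict mapping a source tag to the parent that source proposes.
    """
    known = [tag for tag in candidates if tag in PRIORITY]
    if not known:
        raise ValueError("no candidate from a known source")
    best = min(known, key=lambda tag: PRIORITY[tag])
    return candidates[best]
```

For example, if GBIF and NCBI disagree about the parent of a taxon, `choose_parent({"gbif": "A", "ncbi": "B"})` picks NCBI's answer.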

When mapping a source taxonomy into the union taxonomy, if a name occurring in both places is deemed to be a match, all of the source children that don't map to some other union taxon are added as children of the union node. Usually that will mean all of them, but there are a number of cases (about 1500) where the source taxon is "paraphyletic" and the decision as to where to place the children (when they don't already belong to the corresponding union taxon) is somewhat arbitrary.

A name is sometimes judged, by a process of elimination, to name a single unified taxon - that is, the only reason to think there is one taxon rather than two is that the name is the same in both sources, but there is no evidence to the contrary either. This is the case for about 4000 tips (usually species) and 500 internal taxa. This kind of argument is weak (especially in the case of genera), and the name might in fact name two different taxa homonymously, one from each source taxonomy. Many such cases have been found, and for now they are corrected manually.

Contrariwise, sometimes there is evidence that a name means different taxa in different sources, with no evidence that it names only one taxon, and so the merge process creates homonyms that weren't homonyms in either input taxonomy. This determination is heuristic and may be wrong in some cases (in fact, probably most of the time; typical example: Parauronematidae), with the effect that a single taxon appears to occur in multiple places in the tree. There are about 6000 of these names.
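Anyone who wants to review such names can list every name that is attached to more than one taxon. The sketch below assumes the taxonomy has already been read into (OTT id, name) pairs; `find_homonyms` is a hypothetical helper, not part of any Open Tree tool, and it cannot tell genuine homonyms from the spurious ones described above.

```python
from collections import defaultdict


def find_homonyms(taxa):
    """Return {name: sorted ids} for names carried by more than one taxon.

    taxa: iterable of (ott_id, name) pairs.
    """
    ids_by_name = defaultdict(set)
    for ott_id, name in taxa:
        ids_by_name[name].add(ott_id)
    # Keep only the names that occur on two or more distinct taxa.
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }
```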

Representation

[TBD: Move this information to the Interim taxonomy file format page!]

Taxonomy

File taxonomy.tsv = the taxonomy itself. There is one row per taxon. The column separator (following NCBI's example) is tab-stroke-tab.

Columns:

  1. OTT identifier - these have been kept stable relative to OTToL 1.0
  2. OTT identifier for the parent of this taxon, or empty if none
  3. Name (e.g. "Rana palustris")
  4. Rank ("genus" etc.) Note: OTT is a merger of multiple taxonomies. The ranks can provide hints about whether taxa in input taxonomies correspond, but the output rank column is not reliable.
  5. Sources - this takes the form tag:id,tag:id where tag is a short string identifying the source taxonomy (currently just "ncbi" or "gbif") and id is the numeric accession number within that taxonomy. Examples: ncbi:8404,gbif:2427185 ncbi:1235509
  6. Unique name - if the name is a homonym, then the name qualified with its rank and the name of its parent taxon, e.g. "Roperia (genus in family Hemidiscaceae)"
  7. Flags - see https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/org/opentree/taxonomy/OTTFlag.java
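A row in this layout can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the `Taxon` type and the function names are hypothetical, and stripping a trailing tab-stroke at the end of each row is an assumption borrowed from the NCBI dump convention rather than something stated above.

```python
from typing import NamedTuple, Optional

SEPARATOR = "\t|\t"  # tab-stroke-tab


class Taxon(NamedTuple):
    ott_id: str
    parent_id: Optional[str]  # None when the column is empty
    name: str
    rank: str
    sources: list  # list of (tag, id) pairs, e.g. ("ncbi", "8404")
    unique_name: str
    flags: str


def parse_sources(field):
    """Split 'tag:id,tag:id' into a list of (tag, id) pairs."""
    pairs = []
    for item in field.split(","):
        if item:
            tag, _, ident = item.partition(":")
            pairs.append((tag, ident))
    return pairs


def parse_taxon_line(line):
    """Parse one row of taxonomy.tsv into a Taxon."""
    line = line.rstrip("\r\n")
    # Assumption: a row may end with a trailing tab-stroke.
    if line.endswith("\t|"):
        line = line[:-2]
    fields = line.split(SEPARATOR)
    fields += [""] * (7 - len(fields))  # pad missing trailing columns
    ott_id, parent_id, name, rank, sources, unique_name, flags = fields[:7]
    return Taxon(ott_id, parent_id or None, name, rank,
                 parse_sources(sources), unique_name, flags)
```

Identifiers are kept as strings so that they round-trip unchanged; convert them with `int()` if numeric comparison is needed.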

Synonyms

File synonyms.tsv - this is a simple mapping of synonym to OTT identifier. The content derives from NCBI; currently we don't harvest synonyms from GBIF (although it has a ton of them). Two columns, separated by tab-stroke-tab:

  1. Name
  2. OTT identifier
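For lookups it can help to load the file into a name index. The sketch below is one possible approach under stated assumptions: `load_synonyms` is a hypothetical name, a trailing tab-stroke at the end of a row is tolerated but not required, and each name maps to a set of identifiers because the same synonym string may point at more than one taxon.

```python
from collections import defaultdict


def load_synonyms(lines):
    """Build {synonym name: set of OTT ids} from rows of synonyms.tsv."""
    index = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: a row may end with a trailing tab-stroke.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        name, ott_id = fields[0], fields[1]
        index[name].add(ott_id)
    return dict(index)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, e.g. `load_synonyms(open("synonyms.tsv", encoding="utf-8"))`.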

Deprecated

File deprecated.tsv - taxa that are in version 1.0 but need to be deleted because they were deemed incorrect in some regard (incorrectly placed, ambiguous, synonyms, etc.)

Column separator is just tab (beware!). Of primary interest is the first column, which is the OTT identifier of a taxon from a previous version of OTT/OTToL that has been deprecated in this version. Any uses of such an id ought to be reprocessed by a TNRS or similar mechanism.
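A simple way to find ids that need reprocessing is to load the first column into a set and check stored ids against it. This is a minimal sketch; both function names are hypothetical, and only the first column is read because the remaining columns are not described here.

```python
def load_deprecated_ids(lines):
    """Collect the first (plain-tab-separated) column of deprecated.tsv."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0]:
            ids.add(fields[0])
    return ids


def partition_ids(used_ids, deprecated):
    """Split ids into (still current, deprecated and needing reprocessing)."""
    current = [i for i in used_ids if i not in deprecated]
    stale = [i for i in used_ids if i in deprecated]
    return current, stale
```

The ids returned in the second list are the ones to send back through a TNRS.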

Aux (Pre-OTToL mapping)

File aux.tsv - mapping of PreOTToL ids into OTT 2.0. There is an entry for every PreOTToL id that maps to OTT 2.0, and in addition entries for PreOTToL ids for which the OTToL 1.0 file provided a mapping. Column separator is tab.

No longer maintained.

  1. PreOTToL identifier
  2. OTT identifier, if the PreOTToL id maps to OTT 2.0, or empty, if OTToL gave a mapping of the PreOTToL id to OTToL 1.0 but there is no mapping to OTT 2.0.
  3. Comment

Log file

File log.tsv - detailed trace of merge algorithm for those names for which the process was "interesting". Currently this is probably only readable by me (JAR). Can be used for diagnosing problems and explaining mapping decisions.

Future work

  • Incorporate additions from source trees proposed via treemachine and/or phylografter, as needed
  • Add more source taxonomies
  • Taxonomy level metadata

Requirements

Following is the analysis that led to the current design of OTT, copied from the minutes of a meeting of the software group held in January 2013:

Requirements on inputs to the opentree taxonomy synthesis step

  • Source of our requirements = ingest (matching tree tips) and query (searching for parts of synthetic tree)
  • We (opentree) can do a limited amount of programmatic synthesis/stitching (no manual steps) but...
  • Minimize number of input taxonomies that feed into opentree taxonomy synthesis process... we want someone else to be responsible for being comprehensive
  • Combined set of input taxonomies should be comprehensive
  • NCBI at .4M is not comprehensive enough
  • Should pass our informal spot checks
  • Must be of adequate precision (in particular should not treat IRMNG homonym list as valid)
  • Functional hierarchy - each should be a tree (not a forest, not a graph, no orphan taxa)
  • Each should have a commitment to active maintenance, should be responsive to our bug reports
  • Should be open (probably we need public domain) (there are possible problems with some candidate input taxonomies)
  • We can repair problematic backbone issues in inputs, by overriding bad sources with good ones (cf. synthesis above)
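The "functional hierarchy" requirement above can be checked mechanically once an input taxonomy is reduced to a child-to-parent mapping. The sketch below is illustrative only and is not part of the OTT build: `check_single_tree` is a hypothetical name, and the mapping format (node id to parent id, with `None` for the root) is an assumption. Representing the hierarchy as a dict already guarantees one parent per node, so the check looks for multiple roots, orphans whose parent is unknown, and cycles.

```python
def check_single_tree(parent_of):
    """Return a list of problems; an empty list means a single rooted tree.

    parent_of: dict mapping node id -> parent id (None for the root).
    """
    problems = []

    roots = [n for n, p in parent_of.items() if p is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    for node, parent in parent_of.items():
        if parent is not None and parent not in parent_of:
            problems.append(f"{node}: parent {parent} is not a known taxon")

    # Walk upward from every node; revisiting a node on the same walk
    # means the parent pointers form a cycle.
    checked = set()
    for start in parent_of:
        path = []
        seen = set()
        node = start
        while node is not None and node in parent_of and node not in checked:
            if node in seen:
                problems.append(f"cycle through {node}")
                break
            seen.add(node)
            path.append(node)
            node = parent_of[node]
        checked.update(path)

    return problems
```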