
Data Processing

Fit-for-purpose data

Almost always, you will want to post-process your GBIF download in some way to fit your purposes. Sometimes you will have to make difficult judgement calls for your particular use case. Whenever you are dealing with thousands or millions of records, you will never quite know the true quality of the source data. It is important to keep in mind that you are always mitigating data quality issues, not eliminating them.

The data in a GBIF download will come from a range of sources and will likely vary in its correctness and consistency. Correctness and consistency are two ways of documenting data errors and are measures of data quality: they describe how well the data gatherer was able to capture the true value being investigated. The nature of GBIF’s data publication workflow means that the correctness and consistency of the data can vary depending on the data publishers and the source of the data. Knowing these properties of the data you have will help you understand the ways in which you can and cannot clean, validate and process the data.

  • Correctness (Accuracy) - closeness of measured values, observations or estimates to the real or true value, e.g. has the species or the collection locality been identified correctly?

Correctness

For instance, if we are studying plant biogeography in Indonesia and want to do a specific analysis for only one of the islands within the archipelago, then an appropriate question might be: have localities on the island been correctly georeferenced?

  • Consistency (Precision) - level of resolution of the data, e.g. the precision of the coordinates or of the taxonomic determination.

Consistency

In the Indonesian example, an appropriate question might be - Does the uncertainty in the coordinate estimate allow for the occurrence record to not be on the island?

As a general rule, for most analyses you want highly accurate data, although the level of precision required may vary depending on your analysis. GBIF can help you determine the accuracy and precision of the data through, for example, filters and issue flags; however, you must always double-check!

Flags and Issues

The GBIF network publishes datasets and integrates them into a common access system, where users can retrieve data through common search and download services. During the indexing of the raw data, GBIF adds issues and flags to records with common data quality problems. These may contain useful information for you as a user when creating fit-for-purpose datasets.

[Image: workflow2]

Remarks are shown on the individual occurrence pages to explain what was done during interpretation:

  • Excluded - the original data could not be interpreted, so it is excluded from the interpreted fields.

  • Altered - the original data was modified during interpretation in order to be indexed on GBIF.org.

  • Inferred - the original data was empty, so the indexed value was inferred from other information in the record.

Excluding all records with a particular issue is not currently possible in the occurrence search interface. You can select a particular issue and hit the reverse button to exclude it, but reversing will still only return all other flagged occurrences, not issue-free records. This is something that GBIF is working to improve.

[Image: filtering]

A full overview of all issues and flags can be found here: https://data-blog.gbif.org/post/issues-and-flags/
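If you work in R, you can do this kind of exclusion yourself after downloading. A minimal sketch, assuming a SIMPLE_CSV download read into a data frame `occ`, where the `issue` column holds semicolon-separated flag names:

[source,r]
----
# Exclude records carrying a specific issue flag after download.
# Assumes `occ` is a data frame from a SIMPLE_CSV download, with
# semicolon-separated flags in the `issue` column.
library(dplyr)

occ_filtered <- occ %>%
  filter(!grepl("COUNTRY_COORDINATE_MISMATCH", issue))
----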

Data Processing Tools

While GBIF filters allow for some data processing, i.e. selecting only those data for download that fulfil certain criteria, it is highly recommended that you carry out additional data processing, and a range of tools is available for this. Your choice of tool will be based on personal access, familiarity and the utility of each of the tools. Tools that can be used for this are:

  • Spreadsheet editing software e.g. Excel, Google Sheets (smaller datasets)

  • OpenRefine

  • R packages and scripts e.g. rgbif, CoordinateCleaner, scrubr and biogeo (automated data processing)

  • Geographical Information Systems (GIS) e.g. ArcGIS, QGIS and MapInfo

It is always important to include a data visualisation step in your data processing so that you can identify anomalous data points that may have been missed during the processing stage. If you are processing in R, this can be done within R; with OpenRefine or spreadsheet editing software you may have to use GIS software or even tools such as Google Earth for data visualisation.
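As a quick visual check in R, a sketch like the following (assuming a data frame `occ` with decimalLongitude/decimalLatitude columns, and the maps package installed for borders()) will often reveal obviously misplaced points:

[source,r]
----
# A quick map of occurrence points to eyeball anomalies.
# Assumes `occ` has decimalLongitude/decimalLatitude columns;
# borders() requires the maps package to be installed.
library(ggplot2)

ggplot(occ, aes(decimalLongitude, decimalLatitude)) +
  borders("world", fill = "grey90") +
  geom_point(alpha = 0.4, colour = "red") +
  coord_quickmap()
----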

Handling Taxonomic Uncertainty

Uncertainty surrounding the taxonomy of a data point can arise for several reasons:

  • Species mis-identification

  • Synonymy

  • New names

Species mis-identification

Species identification is a complex process: species are typically described from a certain set of characters identified in a published species description and linked to a type specimen held within a scientific collection, which can be used for validating species identifications. Where taxa are very similar, or a set of complex traits is required for correct identification, specific taxonomic expertise may be needed that data publishers do not possess, leading to mis-identification of a species. As a user, you must have a clear understanding of how taxonomic determinations for your group of interest are made:

  • What are the characters used for defining the species?

  • Are these characters easily captured, or easily confused, when the species is observed or collected?

  • Are there related species that could be easily confused with the species you are interested in?

If you think there is a risk that the species may be incorrectly identified, you can take a conservative approach and only use those data linked to specimens in collections, where taxonomic validation would be possible, eliminating other data sources. Another approach is to use associated data such as collector information, media, DNA sequences etc. to validate the taxonomic determination.

Synonymy

Synonymy can arise when the same species has been described several times, with a new name given each time it is described, or when there is a change in the taxonomy of a species, for example when a species is moved from one genus to another. Only one species name can be accepted; the other names are what we call synonyms. These synonyms may still be in use to a greater or lesser extent, so when getting data from GBIF you should be sure to obtain data for the taxonomic name you need. GBIF’s taxonomic backbone differentiates between accepted scientific names and synonyms, and assigns unique identifiers in the form of taxon keys. Species searches https://www.gbif.org/species/search allow for filtering for accepted names and synonyms, and taxon keys can be used for programmatic searches of GBIF.

Taxon Keys

Scientific names can be messy. If you are accessing GBIF-mediated data programmatically, as opposed to via the website, taxon keys provide an effective way of defining searches based on taxonomy. Taxon keys are issued at the species, genus, family, order, phylum and kingdom levels. Unique identifiers are issued to accepted names, with synonyms of those accepted names issued the same identifier. So, it may make sense to sort out the species by the unique taxon keys provided during the indexing of the dataset by GBIF.
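For example, a minimal rgbif sketch (the species name here is only illustrative):

[source,r]
----
# Look up the backbone taxon key for a name, then search on the key.
# The species name is only an example.
library(rgbif)

key <- name_backbone("Calopteryx splendens")$usageKey
res <- occ_search(taxonKey = key, limit = 100)
----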

New names

There may be instances where the scientific name does not match any name in the GBIF backbone, perhaps because the species is newly described, or is not within a checklist used by GBIF to construct its backbone. These names are flagged with the TAXON_MATCH_HIGHERRANK flag, indicating that the scientific name has not been recognised but that the data point has matched at a higher taxonomic level, e.g. genus or family. This flag can be used for identifying and filtering these data. When names have been misspelled or badly formatted, there is also a TAXON_MATCH_FUZZY flag that can be used for identifying and filtering names that only match the taxonomic backbone using a fuzzy, non-exact match.
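As a sketch, rgbif can retrieve records carrying a given flag directly, which is handy for inspecting how such names were matched:

[source,r]
----
# Retrieve a sample of records flagged as matching the backbone
# only at a higher rank, to inspect how their names were handled.
library(rgbif)

occ_search(issue = "TAXON_MATCH_HIGHERRANK", limit = 20)
----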

Handling Data Quality

Filtering the data allows you as a user to obtain the data that is most fit for purpose. All searches have a set of filters that can be used for finding the data you need, and occurrence searches have an additional set of 'Advanced' search filters for users that need to do more advanced filtering. While filters may allow you to filter out data that may not be relevant, or may be of lower quality for your purposes, additional filtering may be required, either manually or programmatically, to deal with additional data quality issues that arise from the GBIF data publishing model. Below are some common data filters that you as a user might consider to make the data more fit for purpose.

Geospatial Filters & Issues

The data can be filtered spatially in an occurrence search in one of three ways:

  • Country or area/Continent - data is filtered by country and will include data within the Exclusive Economic Zone (EEZ)

  • Administrative area - this filter uses the GADM database https://gadm.org/data.html of administrative areas for all the countries in the world to allow filtering at sub-national levels.

  • Location - this filter allows you to filter for data with coordinates and/or draw your own polygon shape filter or use a GeoJSON file to delimit your own shape filter. If you filter for those data with coordinates, a number of geospatial issues associated with the data publishing workflow will be eliminated (see the sketch after this list). These are:

    • Zero coordinates - the coordinates are exactly (0,0), sometimes called "null island". The zero-zero coordinate is a very common geospatial issue. GBIF removes (0,0) records when hasGeospatialIssue is set to FALSE.

    • Country coordinate mis-match - data publishers will often supply GBIF with a country code (US, TW, SE, JP…). GBIF uses the two-letter ISO 3166-1 alpha-2 coding system - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2. When a point does not fall within the country’s polygon or EEZ, but the record says that it should occur within the country, it gets flagged as having a “country coordinate mis-match” and will be removed if data are filtered for locations.

    • Coordinate invalid - GBIF is unable to interpret the supplied coordinates.

    • Coordinate out of range - the coordinates are outside of the valid range for decimal latitude/longitude values: (-90, 90) and (-180, 180).
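In rgbif, this corresponds to a search along the lines of the sketch below (hasCoordinate and hasGeospatialIssue are the relevant parameters):

[source,r]
----
# Request only georeferenced records without known geospatial issues.
library(rgbif)

occ_search(
  taxonKey = key,             # e.g. from name_backbone(), see above
  hasCoordinate = TRUE,       # only records with coordinates
  hasGeospatialIssue = FALSE  # drops (0,0), country mismatches, etc.
)
----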

Country centroids

Country centroids are where the observation is pinned to the centre of the country instead of where the taxon was observed or recorded. Country centroids are usually records that have been retrospectively given a lat-lon value based on a textual description of where the original record was located. Geocoding software uses gazetteers - geographical dictionaries or directories used in conjunction with a map or atlas - to attribute coordinates to place names. So, if the record simply says “Brazil”, some publishers will put the record in the center of Brazil. Similarly, if the record simply says “Texas” or “Paris”, the record will go in the center of those regions. This is almost exclusively a feature of museum data (PRESERVED_SPECIMEN), but it can happen with other types of records as well.

Identifying country centroid data is currently not possible using GBIF filters; however, the R package CoordinateCleaner can be used for identifying and filtering out country centroids.
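A minimal sketch, assuming a data frame `occ` with decimalLongitude/decimalLatitude columns:

[source,r]
----
# Flag and remove records near country/province centroids.
library(CoordinateCleaner)

occ_clean <- cc_cen(
  occ,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  buffer = 2000  # metres around each centroid; adjust to your needs
)
----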

Points along the equator or prime meridian

Some publishers consider zero and NULL to be equivalent so that empty latitude and longitude fields for a record are given a zero value. As a result, records end up being plotted along the equator and prime meridian lines.
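CoordinateCleaner can flag these too; a sketch with the same assumed `occ` data frame:

[source,r]
----
# Flag records with zero longitude or latitude, which plot along
# the equator or the prime meridian.
library(CoordinateCleaner)

occ_clean <- cc_zero(occ, lon = "decimalLongitude", lat = "decimalLatitude")
----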

Uncertain location

Often you will want to be sure that the coordinates give a certain location and are not really thousands of km away from where the organism was observed or collected. There are two Darwin Core fields - coordinatePrecision and coordinateUncertaintyInMeters - that you get with a SIMPLE CSV download and that you can use to filter by “uncertainty”. However, these fields are not used very often by publishers, who feel that their records are fairly certain (from a GPS), and we would recommend not filtering out missing values.

There are also a few “fake” values for coordinate uncertainty that you should be aware of. These values are errors produced by geocoding software and do not represent real uncertainty values. These "fake" values are 301, 3036, 999 and 9999. In the case of the value 301, the uncertainty is often much greater than 301 and the record actually represents a country centroid.
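A sketch of such a filter in R (the 10 km threshold is purely illustrative):

[source,r]
----
# Keep records with small or missing uncertainty, and drop the
# known "fake" geocoding values. The 10 km cut-off is illustrative.
library(dplyr)

occ_certain <- occ %>%
  filter(is.na(coordinateUncertaintyInMeters) |
           coordinateUncertaintyInMeters <= 10000) %>%
  filter(!coordinateUncertaintyInMeters %in% c(301, 3036, 999, 9999))
----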

Gridded datasets

Gridded datasets are a known problem at GBIF. Many datasets have equally-spaced points in a regular pattern. These datasets are usually systematic national surveys or data taken from an atlas (so-called “rasterized collection designs”), where georeferenced occurrences are snapped to the central point of their grid cell.

[Image: gridSnap]

Most publishers of gridded datasets actually fill in one of the following columns: coordinateUncertaintyInMeters, coordinatePrecision or footprintWKT, so filtering by these columns can be a good way to remove gridded datasets. The R package CoordinateCleaner also has a function for flagging gridded datasets. GBIF has an experimental API for identifying datasets which exhibit a certain amount of "griddedness". You can read more here: https://data-blog.gbif.org/post/finding-gridded-datasets/
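A sketch of the dataset-level test in CoordinateCleaner, assuming `occ` has a datasetKey column identifying the source dataset:

[source,r]
----
# Flag datasets whose coordinates fall on a regular grid.
library(CoordinateCleaner)

grid_summary <- cd_round(
  occ,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "datasetKey",
  value = "dataset"  # return one summary row per dataset
)
----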

Absence records

By default, both presence and absence records are shown when you search www.gbif.org. Absence records confirm that a species was not found at a specific locality when that area was surveyed and this information can be useful in, for example, developing ecological niche models. However, you may only be interested in presence records and in this instance you can filter for only presence records using the Occurrence Status filter.
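After download, presence-only filtering is a one-liner; a sketch assuming the usual `occ` data frame:

[source,r]
----
# Keep presence records only.
library(dplyr)

occ_presence <- occ %>% filter(occurrenceStatus == "PRESENT")
----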

Establishment Means

The Darwin Core term establishmentMeans identifies the process by which the biological individual(s) represented in the occurrence became established at the location. As such, it can serve as a useful filtering tool for identifying records that are outside of a species' native range, with the accepted terms for this field being native, nativeReintroduced, introduced, introducedAssistedColonisation, vagrant and uncertain. Currently, GBIF records can be searched using the older vocabulary terms native, introduced, naturalised, invasive, managed and uncertain - https://rs.gbif.org/vocabulary/gbif/establishment_means.xml - and these will be updated in late 2022. In some instances, removing “MANAGED” records will remove zoo records.

Use this filter cautiously, however, as most records do not contain this information and so would be excluded from a search with this filter on. We would recommend using the information within the establishmentMeans term for filtering after download.
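A sketch of such post-download filtering (value casing can vary between processing versions, hence the toupper()):

[source,r]
----
# Keep records that are explicitly native or have no
# establishmentMeans value at all (most records leave it empty).
library(dplyr)

occ_native <- occ %>%
  filter(is.na(establishmentMeans) |
           toupper(establishmentMeans) == "NATIVE")
----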

Basis of Record

Basis of record is a Darwin Core term that refers to the specific nature of the record and can take one of six classes:

  • Living Specimen - a specimen that is alive, for example, a living plant in a botanical garden or a living animal in a zoo.

  • Preserved Specimen - a specimen that has been preserved, for example, a plant on an herbarium sheet or a cataloged lot of fish in a jar.

  • Fossil Specimen - a preserved specimen that is a fossil, for example, a body fossil, a coprolite, a gastrolith, an ichnofossil or a piece of petrified tree.

  • Material Citation - a reference to, or citation of, one, a part of, or multiple specimens in scholarly publications, for example, a citation of a physical specimen from a scientific collection in a taxonomic treatment in a scientific publication, or an occurrence mentioned in a field notebook.

  • Human Observation - an output of a human observation process, e.g. evidence of an occurrence taken from field notes or literature, or a record of an occurrence without physical evidence or evidence captured with a machine.

  • Machine Observation - an output of a machine observation process, for example, a photograph, a video, an audio recording, a remote sensing image or an occurrence record based on telemetry.

Basis of record should allow users to filter out those individuals in ex-situ collections, such as zoos and botanic gardens, or fossils, as well as filter records based on whether they derive from a specimen or an observation, which can support taxonomic validation. You should note that, even though this can be a useful filter, data publishers do not always fill in the basis of record field correctly, or there may be nuances in the data that may not be immediately obvious to a user, e.g. https://data-blog.gbif.org/post/living-specimen-to-preserved-specimen-understanding-basis-of-record/, and you should always double-check your data before use.
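A sketch of excluding ex-situ and fossil records after download:

[source,r]
----
# Drop living (zoo/botanic garden) and fossil records.
library(dplyr)

occ_wild <- occ %>%
  filter(!basisOfRecord %in% c("LIVING_SPECIMEN", "FOSSIL_SPECIMEN"))
----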

Old Records

GBIF has many museum records that might be older than what is desired for some studies. The year filter, or the year column in a download, can be used to exclude them.
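A sketch, with an illustrative cut-off:

[source,r]
----
# Drop records collected before a cut-off year; 1970 is only an
# illustrative threshold. Records without a year are dropped too.
library(dplyr)

occ_recent <- occ %>% filter(!is.na(year) & year >= 1970)
----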

Duplicates

Duplication of records can occur when several records of the same individual are made. This can happen when, for instance, a researcher deposits several specimens from an individual tree in herbaria around the world, all of which then publish these data on GBIF, or when an individual deposited in a natural history collection was also sampled for its DNA. In the latter case, there will be one record for the specimen in the collection and one for the DNA sequence.

GBIF has recently introduced a clustering function in its advanced search that allows users to identify clusters of records, i.e. records that appear to be derived from the same source. This allows users to identify potentially duplicated data and filter it out of your download. Note that if you filter out those records that are in a cluster, you will lose all records found within that cluster, including potentially useful data. The filter may be better used to indicate the extent of duplication in the dataset, or for independent downloads of the clustered and non-clustered datasets for comparison.
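Alternatively, a coarse post-download de-duplication keeps one record per putative duplicate group instead of dropping whole clusters; a sketch:

[source,r]
----
# Keep one record per combination of species, coordinates and date.
# Coarser than GBIF's clustering, but retains a representative
# record from each duplicate group.
library(dplyr)

occ_dedup <- occ %>%
  distinct(species, decimalLatitude, decimalLongitude, eventDate,
           .keep_all = TRUE)
----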