This repository contains code for processing datasets before importing them into Georg, a georeferencing tool built on top of Pelias. The data processing is carried out with the workflow management system Snakemake and a few Python scripts.
There are currently two data pipelines: one for GBIF Sweden datasets and one for Sweden's Virtual Herbarium.
For the GBIF workflow, we use Darwin Core archives obtained from http://gbif.se/ipt/. Currently, three occurrence datasets are downloaded and processed:
| Dataset | Description |
|---|---|
| nhrs-nrm | GBIF-Sweden, Entomological Collections (NHRS), Swedish Museum of Natural History (NRM). DOI: 10.15468/fpzyjx |
| s-fbo | GBIF-Sweden, Phanerogamic Botanical Collections (S). DOI: 10.15468/yo3mmu |
| uppsala-botany | GBIF-Sweden, Botany (UPS). DOI: 10.15468/ufmslw |
From Sweden's Virtual Herbarium we use one dataset for socknar (singular: socken) in SQL format. Before processing, the dataset is exported into a single TSV file. The source data can be obtained from: https://github.com/mossnisse/Virtuella-Herbariet/blob/master/SQL/samhall_district.sql.
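The export step itself is not part of this repository, but as a minimal sketch, flattening such an SQL dump into TSV rows could look like the following (the function names are hypothetical, and the parsing assumes simple INSERT statements with un-nested, single-quoted value tuples; a real export may need a proper SQL parser):

```python
import csv
import re


def sql_inserts_to_rows(sql_text):
    """Naive sketch: pull value tuples out of simple INSERT statements.

    Assumes tuples contain no nested parentheses and that strings are
    single-quoted, e.g. INSERT INTO t VALUES (1,'Uppsala'),(2,'Lund');
    """
    rows = []
    for match in re.finditer(r"\(([^()]*)\)", sql_text):
        # Reuse csv.reader to split the tuple body on commas while
        # respecting single-quoted strings.
        rows.append(next(csv.reader([match.group(1)], quotechar="'")))
    return rows


def write_tsv(rows, path):
    """Write the extracted rows as tab-separated values."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
```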
An easy way to get Python working on your computer is to install the free Anaconda distribution.
You can install the required Python libraries with the following command:
pip install pandas snakemake spacy
Input files should be placed at the following locations:
./gbif/data/raw/{dataset}/occurrence.txt
./virtual-herbarium/data/raw/
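The occurrence.txt files in Darwin Core archives are plain tab-separated text with a header row, so they can be loaded directly with pandas. A minimal sketch (the two column names below are illustrative Darwin Core terms, not the full schema):

```python
from io import StringIO

import pandas as pd

# Illustrative two-column sample standing in for a real occurrence.txt;
# in the pipeline, read_csv would instead point at
# ./gbif/data/raw/{dataset}/occurrence.txt.
sample = "occurrenceID\tlocality\nurn:cat:1\tUppsala, Vaksala\n"
df = pd.read_csv(StringIO(sample), sep="\t", dtype=str)
```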
After executing the workflows, you should be able to find the output files in the following directories:
./gbif/data/processed/
./virtual-herbarium/data/processed/
Navigate to the relevant subdirectory and enter the following on the command-line (adjust the number of CPU cores to fit your environment):
snakemake --cores 4
The file ./gbif/config.yaml determines which GBIF datasets are included and how they are processed.
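As a rough illustration only (the key names here are hypothetical; consult ./gbif/config.yaml itself for the actual schema), such a configuration might look like:

```yaml
# Hypothetical sketch -- the real file defines the actual keys
datasets:
  - nhrs-nrm
  - s-fbo
  - uppsala-botany
```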
Named Entity Recognition (NER) is used to extract place names from texts in the GBIF pipeline. A language model that has been trained on transcripts of mainly Swedish labels is included in this repository.
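Illustratively, extracting place entities with such a spaCy model might look like the following sketch (the model path, helper name, and entity label are assumptions for illustration, not taken from this repository):

```python
import spacy


def extract_places(texts, model_dir):
    """Hypothetical helper: load a trained spaCy model from model_dir
    and return the LOC-labelled entity spans found in each text."""
    nlp = spacy.load(model_dir)  # model_dir is an assumed path
    return [
        [ent.text for ent in nlp(text).ents if ent.label_ == "LOC"]
        for text in texts
    ]
```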
The two workflows have been executed under Python 3.7 with the following Python packages installed:
appdirs==1.4.4
attrs==19.3.0
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
ConfigArgParse==1.2.3
cymem==2.0.3
datrie==0.8.2
decorator==4.4.2
docutils==0.16
gitdb==4.0.5
GitPython==3.1.3
idna==2.9
importlib-metadata==1.6.1
ipython-genutils==0.2.0
jsonschema==3.2.0
jupyter-core==4.6.3
murmurhash==1.0.2
nbformat==5.0.6
numpy==1.18.5
pandas==1.0.4
plac==1.1.3
preshed==3.0.2
psutil==5.7.0
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
ratelimiter==1.2.0.post0
requests==2.23.0
six==1.15.0
smmap==3.0.4
snakemake==5.19.2
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
toposort==1.5
tqdm==4.46.1
traitlets==4.3.3
urllib3==1.25.9
wasabi==0.6.0
wrapt==1.12.1
zipp==3.1.0
The code in this repository is distributed under the MIT license.
Markus Englund