This repository contains code for processing datasets before importing them into Georg, a georeferencing tool built on top of Pelias. The data processing is carried out with the workflow management system Snakemake and a few Python scripts.
There are currently two data pipelines: one for GBIF Sweden datasets and one for Sweden's Virtual Herbarium.
For the GBIF workflow, we use Darwin Core archives obtained from http://gbif.se/ipt/. Currently, three occurrence datasets are downloaded and processed:
| Dataset | Description |
|---|---|
| nhrs-nrm | GBIF-Sweden, Entomological Collections (NHRS), Swedish Museum of Natural History (NRM). DOI: 10.15468/fpzyjx |
| s-fbo | GBIF-Sweden, Phanerogamic Botanical Collections (S). DOI: 10.15468/yo3mmu |
| uppsala-botany | GBIF-Sweden, Botany (UPS). DOI: 10.15468/ufmslw |
From Sweden's Virtual Herbarium we use one dataset for socknar (singular: socken) in SQL format. Before processing, the dataset is exported into a single TSV file. The source data can be obtained from: https://github.com/mossnisse/Virtuella-Herbariet/blob/master/SQL/samhall_district.sql.
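The export step itself is not part of this repository, but as a minimal sketch, flattening such an SQL dump into TSV rows could look like the following (the function names are hypothetical, and the parsing assumes simple INSERT statements with un-nested, single-quoted value tuples; a real export may need a proper SQL parser):

```python
import csv
import re


def sql_inserts_to_rows(sql_text):
    """Naive sketch: pull value tuples out of simple INSERT statements.

    Assumes tuples contain no nested parentheses and that strings are
    single-quoted, e.g. INSERT INTO t VALUES (1,'Uppsala'),(2,'Lund');
    """
    rows = []
    for match in re.finditer(r"\(([^()]*)\)", sql_text):
        # Reuse csv.reader to split the tuple body on commas while
        # respecting single-quoted strings.
        rows.append(next(csv.reader([match.group(1)], quotechar="'")))
    return rows


def write_tsv(rows, path):
    """Write the extracted rows as tab-separated values."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
```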
An easy way to get Python working on your computer is to install the free Anaconda distribution.
You can install the required Python libraries with the following command:
pip install pandas snakemake spacy
Input files should be placed at the following locations:
./gbif/data/raw/{dataset}/occurrence.txt
./virtual-herbarium/data/raw/
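The occurrence.txt files in Darwin Core archives are plain tab-separated text with a header row, so they can be loaded directly with pandas. A minimal sketch (the two column names below are illustrative Darwin Core terms, not the full schema):

```python
from io import StringIO

import pandas as pd

# Illustrative two-column sample standing in for a real occurrence.txt;
# in the pipeline, read_csv would instead point at
# ./gbif/data/raw/{dataset}/occurrence.txt.
sample = "occurrenceID\tlocality\nurn:cat:1\tUppsala, Vaksala\n"
df = pd.read_csv(StringIO(sample), sep="\t", dtype=str)
```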
After executing the workflows, you should be able to find the output files in the following directories:
./gbif/data/processed/
./virtual-herbarium/data/processed/
Navigate to the relevant subdirectory and enter the following on the command-line (adjust the number of CPU cores to fit your environment):
snakemake --cores 4
The file ./gbif/config.yaml determines which GBIF datasets are included and how they are processed.
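As a rough illustration only (the key names here are hypothetical; consult ./gbif/config.yaml itself for the actual schema), such a configuration might look like:

```yaml
# Hypothetical sketch -- the real file defines the actual keys
datasets:
  - nhrs-nrm
  - s-fbo
  - uppsala-botany
```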
Named Entity Recognition (NER) is used to extract place names from texts in the GBIF pipeline. A language model that has been trained on transcripts of mainly Swedish labels is included in this repository.
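Illustratively, extracting place entities with such a spaCy model might look like the following sketch (the model path, helper name, and entity label are assumptions for illustration, not taken from this repository):

```python
import spacy


def extract_places(texts, model_dir):
    """Hypothetical helper: load a trained spaCy model from model_dir
    and return the LOC-labelled entity spans found in each text."""
    nlp = spacy.load(model_dir)  # model_dir is an assumed path
    return [
        [ent.text for ent in nlp(text).ents if ent.label_ == "LOC"]
        for text in texts
    ]
```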
The two workflows have been executed under Python 3.7 with the following Python packages installed:
appdirs==1.4.4
attrs==19.3.0
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
ConfigArgParse==1.2.3
cymem==2.0.3
datrie==0.8.2
decorator==4.4.2
docutils==0.16
gitdb==4.0.5
GitPython==3.1.3
idna==2.9
importlib-metadata==1.6.1
ipython-genutils==0.2.0
jsonschema==3.2.0
jupyter-core==4.6.3
murmurhash==1.0.2
nbformat==5.0.6
numpy==1.18.5
pandas==1.0.4
plac==1.1.3
preshed==3.0.2
psutil==5.7.0
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
ratelimiter==1.2.0.post0
requests==2.23.0
six==1.15.0
smmap==3.0.4
snakemake==5.19.2
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
toposort==1.5
tqdm==4.46.1
traitlets==4.3.3
urllib3==1.25.9
wasabi==0.6.0
wrapt==1.12.1
zipp==3.1.0
The code in this repository is distributed under the MIT license.
Markus Englund