Processing data for use in the georeferencing tool Georg

Naturhistoriska/georg-data

Preparation of data for Georg

This repository contains code for processing datasets before importing them into Georg, a georeferencing tool built on top of Pelias. The data processing is carried out with the workflow management system Snakemake and a few Python scripts.

There are currently two data pipelines: one for GBIF Sweden datasets and one for Sweden's Virtual Herbarium.

For the GBIF workflow, we use Darwin Core archives obtained from http://gbif.se/ipt/. For the time being, three occurrence datasets are downloaded and processed:

  • nhrs-nrm: GBIF-Sweden, Entomological Collections (NHRS), Swedish Museum of Natural History (NRM). DOI: 10.15468/fpzyjx
  • s-fbo: GBIF-Sweden, Phanerogamic Botanical Collections (S). DOI: 10.15468/yo3mmu
  • uppsala-botany: GBIF-Sweden, Botany (UPS). DOI: 10.15468/ufmslw
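As a rough illustration of what the raw input looks like, the sketch below reads the occurrence file of a Darwin Core archive using only the standard library. It assumes that occurrence.txt is tab-delimited, UTF-8 encoded, and starts with a header row of Darwin Core term names. The function name read_occurrences and the choice of the columns occurrenceID and locality are illustrative assumptions, not taken from the scripts in this repository.

```python
import csv
from pathlib import Path


def read_occurrences(path, columns=("occurrenceID", "locality")):
    """Yield one dict per record, keeping only the requested columns.

    Assumes a tab-delimited file with a header row (hypothetical helper,
    not part of the repository's own scripts).
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters in label text as ordinary data.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Missing or empty values are normalised to an empty string.
            yield {name: row.get(name) or "" for name in columns}
```

Because the function is a generator, large occurrence files can be iterated without loading them into memory at once.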

From Sweden's Virtual Herbarium we use one dataset of socknar (Swedish parishes; singular: socken), which is distributed in SQL format. Before processing, the dataset is exported into a single TSV file. The source data can be obtained from: https://github.com/mossnisse/Virtuella-Herbariet/blob/master/SQL/samhall_district.sql

Prerequisites

An easy way to get Python working on your computer is to install the free Anaconda distribution.

You can install the required libraries with the following command:

pip install pandas snakemake spacy

Input files

Input files should be placed at the following locations:

  • ./gbif/data/raw/{dataset}/occurrence.txt
  • ./virtual-herbarium/data/raw/
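Before starting a workflow, it can be useful to confirm that the GBIF input files are where the layout above expects them. The following is a minimal sketch. It assumes that the {dataset} placeholder corresponds to the dataset identifiers listed earlier (for example nhrs-nrm); the function name missing_gbif_inputs is hypothetical.

```python
from pathlib import Path


def missing_gbif_inputs(datasets, root="."):
    """Return the expected occurrence.txt paths that do not exist.

    `datasets` is an iterable of dataset directory names (assumed to match
    the {dataset} placeholder); `root` is the repository root.
    """
    expected = [
        Path(root) / "gbif" / "data" / "raw" / name / "occurrence.txt"
        for name in datasets
    ]
    return [path for path in expected if not path.is_file()]
```

Calling missing_gbif_inputs(["nhrs-nrm", "s-fbo", "uppsala-botany"]) from the repository root returns an empty list when all three files are in place.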

Output files

After executing the workflows, you should be able to find the output files in the following directories:

  • ./gbif/data/processed/
  • ./virtual-herbarium/data/processed/

Running the workflows

Navigate to the relevant subdirectory and enter the following on the command line (adjust the number of CPU cores to fit your environment):

snakemake --cores 4
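If you prefer to start a workflow from a Python script, the command above can be wrapped with the standard subprocess module. This is only a sketch: the helper names snakemake_command and run_workflow are made up for illustration, and the core count falls back to whatever os.cpu_count() reports. The -n flag asks Snakemake for a dry run, which lists the planned jobs without executing them.

```python
import os
import subprocess


def snakemake_command(cores=None, dry_run=False):
    """Build the argument list for a Snakemake invocation."""
    if cores is None:
        # Fall back to the number of CPUs reported by the OS (at least 1).
        cores = os.cpu_count() or 1
    command = ["snakemake", "--cores", str(cores)]
    if dry_run:
        command.append("-n")  # dry run: show planned jobs only
    return command


def run_workflow(workdir, cores=None, dry_run=False):
    """Run Snakemake inside `workdir`, e.g. "gbif" or "virtual-herbarium"."""
    return subprocess.run(snakemake_command(cores, dry_run), cwd=workdir, check=True)
```

For example, run_workflow("gbif", cores=4, dry_run=True) previews the GBIF pipeline; it requires that snakemake is available on the PATH.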

The file ./gbif/config.yaml determines which GBIF datasets to include, and how the included datasets are processed.

Named Entity Recognition (NER) is used to extract place names from texts in the GBIF pipeline. A language model that has been trained on transcripts of mainly Swedish labels is included in this repository.
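The general idea of extracting place names with a spaCy model can be sketched as follows. This is not the code used in the pipeline: the function names are hypothetical, the entity label "LOC" is an assumption about how the bundled model tags place names, and the model directory has to be supplied by the caller.

```python
def extract_place_names(nlp, text, labels=("LOC",)):
    """Return the text of every entity whose label is in `labels`.

    `nlp` is a loaded spaCy pipeline (any callable that returns an object
    with an `ents` attribute works). The default label is an assumption.
    """
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in labels]


def load_model(model_dir):
    """Load a spaCy model from a directory (requires spaCy to be installed)."""
    import spacy  # imported lazily so extract_place_names has no hard dependency

    return spacy.load(model_dir)
```

A typical call would be extract_place_names(load_model(model_dir), label_text), where model_dir points at the language model shipped with this repository and label_text is a transcribed specimen label.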

The two workflows have been executed under Python 3.7 with the following Python packages installed:

appdirs==1.4.4
attrs==19.3.0
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
ConfigArgParse==1.2.3
cymem==2.0.3
datrie==0.8.2
decorator==4.4.2
docutils==0.16
gitdb==4.0.5
GitPython==3.1.3
idna==2.9
importlib-metadata==1.6.1
ipython-genutils==0.2.0
jsonschema==3.2.0
jupyter-core==4.6.3
murmurhash==1.0.2
nbformat==5.0.6
numpy==1.18.5
pandas==1.0.4
plac==1.1.3
preshed==3.0.2
psutil==5.7.0
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
ratelimiter==1.2.0.post0
requests==2.23.0
six==1.15.0
smmap==3.0.4
snakemake==5.19.2
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
toposort==1.5
tqdm==4.46.1
traitlets==4.3.3
urllib3==1.25.9
wasabi==0.6.0
wrapt==1.12.1
zipp==3.1.0
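To see how your own environment differs from the versions listed above, the pins can be compared with the installed versions. The sketch below only covers the comparison itself. It assumes that the list has been saved as lines of the form name==version and that the installed versions are supplied as a dictionary (for instance built from the output of pip freeze). Both function names are hypothetical.

```python
def parse_pins(lines):
    """Turn lines such as "pandas==1.0.4" into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.lower()] = version
    return pins


def version_mismatches(pins, installed):
    """Map each differing package to (pinned, installed or None).

    `installed` is a {name: version} dict with lower-case package names.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }
```

An empty result means that every listed package is installed at exactly the listed version.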

License

The code in this repository is distributed under the MIT license.

Author

Markus Englund
