This project, developed as part of the Open Data Trentino initiative, is a suite of tools for importing batches of datasets from data providers into data catalogs.
Build status:

Branch | Status
--- | ---
master |
develop |
Simply install the tarball from GitHub:

```
pip install https://github.com/opendatatrentino/opendata-harvester/tarball/master
```

Or use the "vanity" URL:

```
pip install https://git.io/harvester.tar.gz
```
If you plan to use it to import data into Ckan, you'll also need the Ckan API client. To install the stable version from PyPI:

```
pip install ckan-api-client
```

Or the latest from git:

```
pip install http://git.io/ckan-api-client.tar.gz
```
Several libraries are required to build the dependencies. On Debian:

```
apt-get install python-dev libxslt1-dev libxml2-dev
```
This package installs a command-line script named `harvester`, which can be used to perform all the needed operations.
The command is extensible: additional plugins can be provided through entry points. There are four plugin types:
- storage -- abstraction for different types of storage
- crawler -- download data from source, store in storage
- converter -- convert data from a storage to another one
- importer -- import data from a storage to a catalog
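Plugins registered through entry points can be discovered at runtime. The sketch below shows one way to do this with the standard library; the group name `harvester.ext.storage` is an assumption for illustration, not a documented name from this project.

```python
from importlib.metadata import entry_points


def load_plugins(group):
    """Return {name: entry point} for a given entry-point group.

    NOTE: the group name used below is hypothetical; check the
    package's setup.py for the actual groups it defines.
    """
    try:
        eps = entry_points(group=group)  # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])  # older Python
    return {ep.name: ep for ep in eps}


# Returns an empty dict if no matching plugins are installed.
storage_plugins = load_plugins("harvester.ext.storage")
```

Each entry point can then be instantiated with `ep.load()` to get the plugin class.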
Storages:

- memory -- keep data in memory (mainly for testing)
- jsondir -- keep data as JSON files in a directory (for local testing)
- sqlite -- keep data in a SQLite database (for local testing)
- mongodb -- keep data in a MongoDB database (recommended for production)
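Storages are addressed by URL, e.g. `mongodb://database.local/harvester_data/statistica` in the examples below. A minimal sketch of parsing such a URL, assuming the `scheme://host/database/collection` layout inferred from the README's examples (not a documented spec):

```python
from urllib.parse import urlsplit


def parse_storage_url(url):
    """Split a storage URL into its parts.

    Assumes scheme://host/database/collection, as seen in the
    mongodb examples; other storages may interpret the path
    differently.
    """
    parts = urlsplit(url)
    path = [p for p in parts.path.split("/") if p]
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "database": path[0] if path else None,
        "collection": path[1] if len(path) > 1 else None,
    }


info = parse_storage_url("mongodb://database.local/harvester_data/statistica")
# info["database"] == "harvester_data", info["collection"] == "statistica"
```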
Crawlers:

- pat_statistica -- for ODT / servizio statistica
- pat_statistica_subpro -- for ODT / servizio statistica
- pat_geocatalogo -- for ODT / GeoCatalogo PAT
- comunweb -- for ComunWeb sites
Converters:

- pat_statistica_subpro_to_ckan -- for ODT / servizio statistica
- pat_statistica_to_ckan -- for ODT / servizio statistica
- pat_geocatalogo_to_ckan -- for ODT / GeoCatalogo PAT
- comunweb_to_ckan -- convert from ComunWeb to Ckan
Importers:

- ckan -- import data into a Ckan catalog
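In the import examples below, the importer is specified as `ckan+http://127.0.0.1:5000`, i.e. the importer name joined to its target URL with a `+`. A small sketch of splitting that spec; the format is inferred from the examples, not from documentation:

```python
def parse_importer_spec(spec):
    """Split an importer spec like "ckan+http://127.0.0.1:5000".

    Assumes "<name>+<target-url>", splitting on the first "+";
    a bare name yields (name, None).
    """
    name, _, target = spec.partition("+")
    return name, (target or None)


name, target = parse_importer_spec("ckan+http://127.0.0.1:5000")
# name == "ckan", target == "http://127.0.0.1:5000"
```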
Download data to MongoDB:

```
harvester -vvv --debug crawl \
    --crawler pat_statistica \
    --storage mongodb://database.local/harvester_data/statistica

harvester -vvv --debug crawl \
    --crawler pat_statistica_subpro \
    --storage mongodb://database.local/harvester_data/statistica_subpro
```
Prepare the data for insertion into Ckan:

```
harvester -vvv --debug convert \
    --converter pat_statistica_to_ckan \
    --input mongodb://database.local/harvester_data/statistica \
    --output mongodb://database.local/harvester_data/statistica_clean

harvester -vvv --debug convert \
    --converter pat_statistica_subpro_to_ckan \
    --input mongodb://database.local/harvester_data/statistica_subpro \
    --output mongodb://database.local/harvester_data/statistica_subpro_clean
```
Finally, load the data into Ckan:

```
harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica

harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_subpro_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica_subpro
```
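The three stages above (crawl, convert, import) can also be driven from a script. A sketch that builds the same command lines as Python argument lists (run them with `subprocess.run` if the `harvester` tool is installed; all names and URLs are taken from the examples above):

```python
# Build harvester CLI invocations for the crawl -> convert -> import
# pipeline as argument lists, ready to pass to subprocess.run().

BASE = ["harvester", "-vvv", "--debug"]


def crawl_cmd(crawler, storage):
    return BASE + ["crawl", "--crawler", crawler, "--storage", storage]


def convert_cmd(converter, input_url, output_url):
    return BASE + ["convert", "--converter", converter,
                   "--input", input_url, "--output", output_url]


def import_cmd(storage, importer, **options):
    cmd = BASE + ["import", "--storage", storage, "--importer", importer]
    for key, value in options.items():
        cmd += ["--importer-option", "%s=%s" % (key, value)]
    return cmd


db = "mongodb://database.local/harvester_data"
pipeline = [
    crawl_cmd("pat_statistica", db + "/statistica"),
    convert_cmd("pat_statistica_to_ckan",
                db + "/statistica", db + "/statistica_clean"),
    import_cmd(db + "/statistica_clean", "ckan+http://127.0.0.1:5000",
               api_key="00112233-4455-6677-8899-aabbccddeeff",
               source_name="statistica"),
]
# for cmd in pipeline:
#     subprocess.run(cmd, check=True)
```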
To run the script under the Python debugger, use something like this:

```
pdb $( which harvester ) -vvv --debug ....
```