This project, developed as part of the Open Data Trentino initiative, is a suite of tools for importing batches of datasets from data providers into data catalogs.
Build status:

Branch | Status
--- | ---
master |
develop |
Simply install the tarball from GitHub:

```
pip install https://github.com/opendatatrentino/opendata-harvester/tarball/master
```

Or use the "vanity" URL:

```
pip install https://git.io/harvester.tar.gz
```
If you plan to use it to import data into Ckan, you'll also need the Ckan API client. To install the stable version from PyPI:

```
pip install ckan-api-client
```

Or the latest from git:

```
pip install http://git.io/ckan-api-client.tar.gz
```
Several libraries are required to build the dependencies. On Debian:

```
apt-get install python-dev libxslt1-dev libxml2-dev
```
This package installs a command-line script named `harvester`, which can be used to perform all the needed operations.
The command is extensible: additional plugins can be provided through entry points. There are four plugin types:
- storage -- abstraction for different types of storage
- crawler -- download data from source, store in storage
- converter -- convert data from a storage to another one
- importer -- import data from a storage to a catalog
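Plugins registered through entry points can be discovered at runtime. The sketch below shows one way to do this with the standard library; the group name `harvester.ext.storage` is an assumption for illustration, not a documented name from this project.

```python
from importlib.metadata import entry_points


def load_plugins(group):
    """Return {name: entry point} for a given entry-point group.

    NOTE: the group name used below is hypothetical; check the
    package's setup.py for the actual groups it defines.
    """
    try:
        eps = entry_points(group=group)  # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])  # older Python
    return {ep.name: ep for ep in eps}


# Returns an empty dict if no matching plugins are installed.
storage_plugins = load_plugins("harvester.ext.storage")
```

Each entry point can then be instantiated with `ep.load()` to get the plugin class.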
Storages:

- memory -- keep data in memory (mainly for testing)
- jsondir -- keep data as JSON files in a directory (for local testing)
- sqlite -- keep data in a SQLite database (for local testing)
- mongodb -- keep data in a MongoDB database (recommended for production)
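Storages are addressed by URL, e.g. `mongodb://database.local/harvester_data/statistica` in the examples below. A minimal sketch of parsing such a URL, assuming the `scheme://host/database/collection` layout inferred from the README's examples (not a documented spec):

```python
from urllib.parse import urlsplit


def parse_storage_url(url):
    """Split a storage URL into its parts.

    Assumes scheme://host/database/collection, as seen in the
    mongodb examples; other storages may interpret the path
    differently.
    """
    parts = urlsplit(url)
    path = [p for p in parts.path.split("/") if p]
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "database": path[0] if path else None,
        "collection": path[1] if len(path) > 1 else None,
    }


info = parse_storage_url("mongodb://database.local/harvester_data/statistica")
# info["database"] == "harvester_data", info["collection"] == "statistica"
```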
Crawlers:

- pat_statistica -- for ODT / servizio statistica
- pat_statistica_subpro -- for ODT / servizio statistica
- pat_geocatalogo -- for ODT / GeoCatalogo PAT
- comunweb -- for ComunWeb sites
Converters:

- pat_statistica_subpro_to_ckan -- for ODT / servizio statistica
- pat_statistica_to_ckan -- for ODT / servizio statistica
- pat_geocatalogo_to_ckan -- for ODT / GeoCatalogo PAT
- comunweb_to_ckan -- convert from ComunWeb to Ckan
Importers:

- ckan -- import data into a Ckan catalog
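In the import examples below, the importer is specified as `ckan+http://127.0.0.1:5000`, i.e. the importer name joined to its target URL with a `+`. A small sketch of splitting that spec; the format is inferred from the examples, not from documentation:

```python
def parse_importer_spec(spec):
    """Split an importer spec like "ckan+http://127.0.0.1:5000".

    Assumes "<name>+<target-url>", splitting on the first "+";
    a bare name yields (name, None).
    """
    name, _, target = spec.partition("+")
    return name, (target or None)


name, target = parse_importer_spec("ckan+http://127.0.0.1:5000")
# name == "ckan", target == "http://127.0.0.1:5000"
```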
Download data to MongoDB:

```
harvester -vvv --debug crawl \
    --crawler pat_statistica \
    --storage mongodb://database.local/harvester_data/statistica

harvester -vvv --debug crawl \
    --crawler pat_statistica_subpro \
    --storage mongodb://database.local/harvester_data/statistica_subpro
```
Prepare the data for insertion into Ckan:

```
harvester -vvv --debug convert \
    --converter pat_statistica_to_ckan \
    --input mongodb://database.local/harvester_data/statistica \
    --output mongodb://database.local/harvester_data/statistica_clean

harvester -vvv --debug convert \
    --converter pat_statistica_subpro_to_ckan \
    --input mongodb://database.local/harvester_data/statistica_subpro \
    --output mongodb://database.local/harvester_data/statistica_subpro_clean
```
Finally, load the data into Ckan:

```
harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica

harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_subpro_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica_subpro
```
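The three stages above (crawl, convert, import) can also be driven from a script. A sketch that builds the same command lines as Python argument lists (run them with `subprocess.run` if the `harvester` tool is installed; all names and URLs are taken from the examples above):

```python
# Build harvester CLI invocations for the crawl -> convert -> import
# pipeline as argument lists, ready to pass to subprocess.run().

BASE = ["harvester", "-vvv", "--debug"]


def crawl_cmd(crawler, storage):
    return BASE + ["crawl", "--crawler", crawler, "--storage", storage]


def convert_cmd(converter, input_url, output_url):
    return BASE + ["convert", "--converter", converter,
                   "--input", input_url, "--output", output_url]


def import_cmd(storage, importer, **options):
    cmd = BASE + ["import", "--storage", storage, "--importer", importer]
    for key, value in options.items():
        cmd += ["--importer-option", "%s=%s" % (key, value)]
    return cmd


db = "mongodb://database.local/harvester_data"
pipeline = [
    crawl_cmd("pat_statistica", db + "/statistica"),
    convert_cmd("pat_statistica_to_ckan",
                db + "/statistica", db + "/statistica_clean"),
    import_cmd(db + "/statistica_clean", "ckan+http://127.0.0.1:5000",
               api_key="00112233-4455-6677-8899-aabbccddeeff",
               source_name="statistica"),
]
# for cmd in pipeline:
#     subprocess.run(cmd, check=True)
```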
To run the script under the Python debugger, use something like this:

```
pdb $( which harvester ) -vvv --debug ....
```