py2store datasets #54

thorwhalen · 2020-03-19T13:09:28Z

This would be a separate repository that depends on py2store.

The objective of this project is to offer easy and consistent access to various datasets.

We'll start with dataset providers that have a lot of data, so that we get the most out of the py2store wrapper we make for each one.

The interface should start off like other hierarchical explorers, such as those for files (folders, subfolders, files) or DBs (e.g. mongo host>dbs>collections or sql connection>dbs>tables). The first level of listing would list the data providers or other named groups (with a misc group as the catch-all for anything unclassified). For example:

>>> list(data_malls)
['kaggle', 'who', 'roda', 'misc']
>>> datasets = data_malls['kaggle']
>>> list(datasets)
['us_food_habits', 'covid_19', ...
>>> dataset = datasets['covid_19']
# etc.
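
To make the listing behavior above concrete, here's a minimal sketch of the two-level mapping using only the standard library. It doesn't use py2store itself: `DictStore` is a hypothetical stand-in backed by in-memory dicts, and the dataset contents are placeholders.

```python
from collections.abc import Mapping


class DictStore(Mapping):
    """Read-only mapping over an in-memory dict (stand-in for a real backend)."""

    def __init__(self, contents):
        self._contents = dict(contents)

    def __getitem__(self, key):
        return self._contents[key]

    def __iter__(self):
        return iter(self._contents)

    def __len__(self):
        return len(self._contents)


# Hypothetical sample content: provider -> dataset name -> placeholder bytes
data_malls = DictStore({
    'kaggle': DictStore({
        'us_food_habits': b'placeholder',
        'covid_19': b'placeholder',
    }),
    'misc': DictStore({}),
})

datasets = data_malls['kaggle']
print(list(data_malls))
print(list(datasets))
```

Since each level is just a Mapping, listing (`list`), membership (`in`) and access (`[]`) work the same way at the provider level and the dataset level.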

Check whether there's already a Python lib to connect to the data provider (mall).
Check out the provider's API.
If the API is easy to use with py2request (all we need is listing and download capabilities), use the raw API. If not, use the Python lib if available.
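
Since listing and download are the only two capabilities needed, a provider wrapper could be built from just two callables, whichever way they end up being implemented (raw API or existing Python lib). A rough sketch, assuming hypothetical `list_names` and `download` callables; it does not use py2request or any real provider client:

```python
from collections.abc import Mapping


class ProviderStore(Mapping):
    """Mapping view over one provider, built from two callables only."""

    def __init__(self, list_names, download):
        self._list_names = list_names  # () -> iterable of dataset names
        self._download = download      # (name) -> dataset contents

    def __iter__(self):
        return iter(self._list_names())

    def __len__(self):
        return sum(1 for _ in self._list_names())

    def __getitem__(self, name):
        # Check the listing first so unknown names raise KeyError
        # instead of triggering a download attempt.
        if name not in set(self._list_names()):
            raise KeyError(name)
        return self._download(name)


# Fake provider for illustration: an in-memory dict instead of a remote API.
_fake_remote = {'covid_19': b'placeholder', 'us_food_habits': b'placeholder'}
store = ProviderStore(
    list_names=lambda: list(_fake_remote),
    download=lambda name: _fake_remote[name],
)
print(list(store))
```

Swapping the raw API for a Python lib (or the other way around) would then only change the two callables, not the mapping interface.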

Caching

We want to use caching smartly and automatically (with automatic refreshes on a schedule, and/or warnings when a refresh hasn't happened for a while).
We want to cache both listings as well as metadata and data.
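
One way the refresh-or-warn behavior could look, as a sketch only: the cached value is refetched once it is older than `max_age`, and if the refetch fails, the stale value is returned with a warning. `RefreshingCache`, `fetch`, `max_age` and `clock` are hypothetical names, and treating `OSError` as "provider unreachable" is an assumption.

```python
import time
import warnings


class RefreshingCache:
    """Cache one value; refetch when it is older than `max_age` seconds."""

    def __init__(self, fetch, max_age, clock=time.time):
        self._fetch = fetch        # () -> value (listing, metadata or data)
        self._max_age = max_age
        self._clock = clock        # injectable so the behavior can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at > self._max_age
        )
        if expired:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except OSError:
                if self._fetched_at is None:
                    raise  # nothing cached yet, so nothing to fall back on
                age = now - self._fetched_at
                warnings.warn(
                    f"Refresh failed; using value cached {age:.0f}s ago"
                )
        return self._value
```

The same object could hold a listing, a metadata record or the data itself, since it only cares about the `fetch` callable.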

Depending on the context, the cache could work in many ways. For example:

If listings are long, better cache them (and hope the API has some "anything new since DATE" function).
Even if listings are short, we'd like to cache them for offline use of the object.
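
The two cases above could be covered by one on-disk listing cache: it asks the provider only for what is new since the last refresh, and falls back to the stored listing when the provider can't be reached. A sketch under assumptions: `fetch_since` is a hypothetical callable wrapping whatever "anything new since DATE" function the API offers (if any), and `OSError` stands in for "offline".

```python
import json
from pathlib import Path


def update_listing(cache_path, fetch_since, today):
    """Merge newly listed names into the listing on disk; return the listing.

    `fetch_since(date)` returns the names added since `date`
    (all names when `date` is None).
    """
    path = Path(cache_path)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {'last_refresh': None, 'names': []}
    try:
        new_names = fetch_since(state['last_refresh'])
    except OSError:
        return state['names']  # offline: use what is already on disk
    known = set(state['names'])
    state['names'] += [name for name in new_names if name not in known]
    state['last_refresh'] = today
    path.write_text(json.dumps(state))
    return state['names']
```

If the API has no "since" function, `fetch_since` can ignore its argument and return the full listing; the merge step still works, just without the savings.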

Dataset providers

kaggle (see kaggle python lib)
WorldBank (see https://data.worldbank.org/)
WHO (https://www.who.int/gho/database/en/)
Google public data (https://www.google.com/publicdata/directory)
RODA (https://registry.opendata.aws/)
