py2store datasets #54

Open
thorwhalen opened this issue Mar 19, 2020 · 0 comments

This would be a separate, py2store-dependent repository.

The objective of this project is to offer easy and consistent access to various datasets.

We'll start with dataset providers that have a lot of data (so that we get a lot out of the py2store wrappers we'll make for them).

The interface should start off like other hierarchical explorers, such as those for files (folders, subfolders, files) or DBs (e.g. mongo host > dbs > collections, or sql connection > dbs > tables). The first level would list the data providers or other named groups (with a misc group as the catch-all for anything unclassified). For example:

>>> list(data_malls)
['kaggle', 'who', 'roda', 'misc']
>>> datasets = data_malls['kaggle']
>>> list(datasets)
['us_food_habits', 'covid_19', ...
>>> dataset = datasets['covid_19']
# etc.
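
One way the two-level listing above could be modeled is as a read-only mapping of mall names to dataset mappings, where each mall is only opened when first accessed. This is a minimal sketch, not py2store code: the `DataMalls` class, the factory arguments, and the placeholder dataset values are all hypothetical.

```python
from collections.abc import Mapping


class DataMalls(Mapping):
    """Read-only mapping: mall name -> mapping of that mall's datasets.

    Hypothetical sketch. Each mall is given as a zero-argument factory,
    so nothing is opened (or fetched) until the mall is first accessed.
    """

    def __init__(self, mall_factories):
        self._factories = dict(mall_factories)
        self._opened = {}  # malls that have already been instantiated

    def __getitem__(self, mall_name):
        if mall_name not in self._opened:
            factory = self._factories[mall_name]  # KeyError if unknown mall
            self._opened[mall_name] = factory()
        return self._opened[mall_name]

    def __iter__(self):
        return iter(self._factories)

    def __len__(self):
        return len(self._factories)


# Placeholder malls: plain dicts stand in for real provider wrappers.
data_malls = DataMalls({
    'kaggle': lambda: {'us_food_habits': '<placeholder>', 'covid_19': '<placeholder>'},
    'who': dict,
    'roda': dict,
    'misc': dict,
})

print(list(data_malls))            # first level: the malls
print(list(data_malls['kaggle']))  # second level: that mall's datasets
```

Subclassing `Mapping` gives `keys`, `items`, `in`, and `get` for free once `__getitem__`, `__iter__`, and `__len__` are defined, which keeps the explorer consistent with dict-like access.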

Check whether there's already a Python lib to connect to the data provider (mall).
Check out its API.
If the API is easy to use with py2request (all we need is listing and download capabilities), use the raw API. If not, use the Python lib if one is available.
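
Since listing and download are the only two capabilities needed, a mall wrapper could be written against just those two functions, whether they end up calling a raw API or an existing Python lib. The sketch below assumes that shape; `DatasetStore`, `list_names`, `download`, and the fake in-memory backend are illustrative names, not part of py2store or py2request.

```python
from collections.abc import Mapping


class DatasetStore(Mapping):
    """Mapping over one mall, built from two callables (hypothetical sketch).

    list_names() -> iterable of dataset names
    download(name) -> the dataset's contents
    """

    def __init__(self, list_names, download):
        self._list_names = list_names
        self._download = download

    def __iter__(self):
        return iter(self._list_names())

    def __len__(self):
        return sum(1 for _ in self._list_names())

    def __getitem__(self, name):
        # Re-listing on every access is simple but costly for long listings.
        if name not in set(self._list_names()):
            raise KeyError(name)
        return self._download(name)


# Fake backend standing in for a real provider (made-up contents).
_FAKE_REMOTE = {
    'us_food_habits': b'state,habit\n',
    'covid_19': b'date,cases\n',
}

store = DatasetStore(
    list_names=lambda: list(_FAKE_REMOTE),
    download=lambda name: _FAKE_REMOTE[name],
)

print(list(store))
print(store['covid_19'])
```

Swapping the raw API for a Python lib (or the reverse) would then only change the two callables, not the code that consumes the store.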

Caching

We want to use caching smartly and automatically (with automatic refreshes on a schedule, and/or warnings when a refresh hasn't happened for a while).
We want to cache listings as well as metadata and data.

Depending on the context, the cache could work in many ways. For example:

  • If listings are long, better cache them (and hope the API has some "anything new since DATE" function).
  • Even if listings are short, we'd like to cache them for offline use of the object.
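
The listing cache described above could look roughly like the following: refresh when the cached copy is older than a set age, and if the refresh fails, keep serving the old copy with a warning. This is only a sketch under assumptions: `CachedListing`, `max_age_seconds`, and the injectable `clock` are hypothetical names, the cache lives in memory only (real offline use would need it persisted to disk), and treating `OSError` as "provider unreachable" is a simplification.

```python
import time
import warnings


class CachedListing:
    """In-memory cache of a listing with age-based refresh (hypothetical)."""

    def __init__(self, fetch, max_age_seconds, clock=time.time):
        self._fetch = fetch            # callable returning an iterable of names
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the age logic is testable
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at > self._max_age
        if stale:
            try:
                self._value = list(self._fetch())
                self._fetched_at = now
            except OSError:
                if self._fetched_at is None:
                    raise  # nothing cached yet, so there is nothing to fall back on
                warnings.warn(
                    "Listing could not be refreshed; serving a stale cached copy",
                    stacklevel=2,
                )
        return self._value


# Usage with a fake fetcher and a fake clock (both stand-ins).
calls = []

def fake_fetch():
    calls.append(1)
    return ['us_food_habits', 'covid_19']

now = [0.0]
listing = CachedListing(fake_fetch, max_age_seconds=3600, clock=lambda: now[0])
listing.get()
listing.get()  # served from the cache; fake_fetch is not called again
```

If the provider's API offers an "anything new since DATE" call, `fetch` could use `_fetched_at` to request only the changes instead of the full listing.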

Dataset providers

More links

https://www.freecodecamp.org/news/https-medium-freecodecamp-org-best-free-open-data-sources-anyone-can-use-a65b514b0f2d/

thorwhalen pushed a commit that referenced this issue Apr 7, 2020