Cache data sources upon access #54

Open
acbart opened this issue Jan 30, 2020 · 3 comments

Comments

@acbart
Contributor

acbart commented Jan 30, 2020

Hi, I was trying out some of the data sources, and I noticed that some of them can take a while to run and also require an active internet connection. I know this suggestion introduces further headaches, but perhaps you should consider setting up a cache for the non-real-time datasets?

For the requests-based datasets, this would be as trivial as adding in requests-cache:

import requests_cache
requests_cache.install_cache('bridges_datasets')

It would also help to provide some kind of expire_cache() call that students can use if the remote data changes for whatever reason.

The SPARQLWrapper stuff would probably be a bit messier, since that's using urllib under the hood. But it probably wouldn't be too hard to just make a little decorator for it. Heck, you could probably even reuse the architecture for requests_cache and keep it all in one place.
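To make the decorator idea concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the names disk_cached, expire_cache, CACHE_DIR, and fetch_dataset are hypothetical, not part of BRIDGES, requests_cache, or SPARQLWrapper, and it only works for results that can be serialized as JSON. The placeholder function stands in for whatever actually issues the SPARQL query.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical cache location; not a path taken from the BRIDGES code base.
CACHE_DIR = Path(tempfile.gettempdir()) / "bridges_datasets_cache"


def disk_cached(func):
    """Store a function's JSON-serializable result on disk, keyed by its arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Build a stable key from the function name and the call arguments.
        key_source = json.dumps([func.__name__, args, kwargs],
                                sort_keys=True, default=str)
        key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            # Cache hit: no network access needed.
            return json.loads(path.read_text(encoding="utf-8"))
        result = func(*args, **kwargs)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
    return wrapper


def expire_cache():
    """Delete every cached result so the next call fetches fresh data."""
    if CACHE_DIR.exists():
        for path in CACHE_DIR.glob("*.json"):
            path.unlink()


calls = []  # Records how often the "network" is actually hit.


@disk_cached
def fetch_dataset(endpoint, query):
    # Placeholder for the real network call (e.g. a SPARQLWrapper query).
    calls.append(query)
    return {"endpoint": endpoint, "query": query, "rows": []}
```

With this in place, a second call to fetch_dataset with the same arguments is served from disk, and a student who suspects stale data can call expire_cache() to force a refetch. Cache expiry by age is left out to keep the sketch short.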

If this seems worthwhile, I'm willing to turn it into a Pull Request. But first I wanted to get a sense of whether this is a direction you'd want to go in.

@AlecGoncharow
Contributor

This is a good idea, thank you. We will need to do a bit of exploration before we can say it is something we can use without unintended side effects.

As it stands we are already caching some of the OSM data internally. Can you elaborate on which ones were slow on your end so we can investigate a bit further?

@krs-world
Contributor

Cory, we are indeed caching some of the larger datasets, like OpenStreetMap and the NOAA elevation map. Is there something more or better we should be doing?

@acbart
Contributor Author

acbart commented Mar 24, 2020

It's a little tough to tell exactly what I was working on then, but I believe it was the WikiData dataset.

My perspective was that all datasets should be cached, with some clever mechanism for easily letting students clear out that local cache. I'm a little less worried about speed than I am about internet stability: students shouldn't have to worry about being connected. I was expecting something like what Sinbad does. There are headaches and issues, but it seemed like a worthwhile fight to me.
