SimpleHTTPResolver is getting 404 for every non-cached IIIF request #142

Open
regisrob opened this issue Jan 12, 2015 · 8 comments

Comments

@regisrob
Contributor

The Loris log file shows that the SimpleHTTPResolver by @scande3 is sending requests to the remote server for every single IIIF request that has not been previously cached.
Considering the docstring in resolver.py, I assumed that the resolver made only one HTTP request to retrieve the source image, copied it into the local cache, and used that local copy for every subsequent IIIF request sharing the same identifier.

But if you look at the log file below (in this example the image identifier is "B452346101_C102/ecran/B452346101_C102_0005.jpg"), a new request returning 404 is sent every time you change a IIIF parameter (the response will always be 404 since the remote server is, by definition, not IIIF-enabled):

2015-01-12 14:11:17,453 (loris.resolver) [INFO]: Copied http://www.example.fr/B452346101_C102/ecran/B452346101_C102_0005.jpg to /path/to/local/cache/loris_cache.jpg

2015-01-12 14:14:50,942 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:14:51,035 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/full/,150/0/default.jpg HTTP/1.1" 404 None

2015-01-12 14:18:27,049 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:18:27,078 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/full/full/0/bitonal.jpg HTTP/1.1" 404 None

2015-01-12 14:40:49,003 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:40:49,037 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/50,80,250,250/full/0/default.png HTTP/1.1" 404 None

@jpstroop @scande3: is this normal behaviour? Isn't there a risk of server overload, especially if you intend to use a viewer like OpenSeadragon, which sends dozens of requests for each image?

@scande3

scande3 commented Jan 12, 2015

@regisrob - The extra requests come out of a thread here: #98. The discussion starts with the comment titled "scande3 commented on Nov 6, 2014" (near the middle) and goes on for a few replies about the issue. According to the @jpstroop response, it is set up that way because of the difficulty of resolving the URL in the IIIF specification.

The best I was able to do was check the local cache first before attempting a request against the remote server. This is because that call occurs in the "dissect_uri" method of resolver here: https://github.com/pulibrary/loris/blob/development/loris/webapp.py#L342 before any attempt to work out whether there are valid URL parameters. As such, regardless of provider, it first tries to see whether the full URL is itself an identifier (which obviously isn't in the cache).
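To make the cache-first behaviour described above concrete, here is a minimal Python sketch. It is not the Loris implementation: `cache_path_for`, `resolve` and the `fetch` callable are hypothetical names, and the MD5-based cache layout is an assumption made purely for illustration.

```python
import hashlib
import os


def cache_path_for(cache_root, ident):
    """Map an identifier to a file path in the local cache (assumed layout)."""
    digest = hashlib.md5(ident.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, digest)


def resolve(ident, cache_root, fetch):
    """Return a local path for ident, or None if it cannot be resolved.

    fetch is a stand-in for the remote HTTP request: it takes the
    identifier and returns the image bytes, or None on a 404.
    """
    local = cache_path_for(cache_root, ident)
    if os.path.isfile(local):
        # Cache hit: no request is sent to the remote server.
        return local
    data = fetch(ident)
    if data is None:
        # Misses are not remembered, so the same miss is fetched again later.
        return None
    os.makedirs(cache_root, exist_ok=True)
    with open(local, "wb") as fh:
        fh.write(data)
    return local
```

In this sketch only successful fetches are written to the cache, so a full path such as `<identifier>/full/,150/0/default.jpg` misses the cache on every call and reaches `fetch`, which would be consistent with the repeated 404s in the log above.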

However... if someone wanted to rewrite the "dissect_uri" logic, then ideally it would try the full URL only as a last resort. The only extra HTTP request would then come from a malformed request, to verify that the entire request wasn't a valid URI. The reason I didn't tackle this is that it would require some significant re-engineering of how the application handles errors within a resolver.
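One way to picture "full URL as a last resort" is to peel the four trailing IIIF parameter segments off the right-hand side of the path first, and only treat the whole path as an identifier when they do not parse. The sketch below assumes IIIF Image API 2.0 style paths; `split_iiif_path` and the regular expressions are illustrative, deliberately loose, and not taken from Loris.

```python
import re

# Loose patterns for the trailing IIIF Image API 2.0 path segments.
REGION = re.compile(r"^(full|(pct:)?\d+(\.\d+)?(,\d+(\.\d+)?){3})$")
SIZE = re.compile(r"^(full|\d+,|,\d+|pct:\d+(\.\d+)?|!?\d+,\d+)$")
ROTATION = re.compile(r"^!?\d+(\.\d+)?$")
QUALITY_FORMAT = re.compile(
    r"^(default|color|gray|bitonal)\.(jpg|png|gif|tif|jp2|pdf|webp)$"
)


def split_iiif_path(path):
    """Return (identifier, params).

    params is a tuple of the IIIF segments when they parse, and None when
    the whole path has to be treated as a bare identifier (the last resort).
    """
    path = path.strip("/")
    if path.endswith("/info.json"):
        return path[: -len("/info.json")], ("info.json",)
    # Split from the right so identifiers containing "/" stay intact.
    parts = path.rsplit("/", 4)
    if len(parts) == 5:
        ident, region, size, rotation, quality = parts
        if (REGION.match(region) and SIZE.match(size)
                and ROTATION.match(rotation)
                and QUALITY_FORMAT.match(quality)):
            return ident, (region, size, rotation, quality)
    return path, None
```

With something along these lines, a well-formed request would be resolved by its parsed identifier, which can be served from the cache, and only a path that fails to parse would be sent to the remote server as a whole.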

In terms of server overload, the extra "does it exist" requests aren't causing any issues for us. You can try it out with OpenSeadragon at http://www.digitalcommonwealth.org.

What is a server overload risk is a situation we had happen to us the other week: for some reason, Biblioboards decided to request the full JPG image of every single one of our 100,000+ objects that actually exist in our system (so not including objects we only have harvested metadata for). As only the JP2 exists, the image server needs to convert those in real time, and that quickly ate up hard drive space and slowed things down significantly for other users. We haven't yet figured out a way to prevent this from occurring again or how to handle such a situation gracefully.

@jpstroop
Member

@scande3 See #141, re: full size images. I forget who it was that asked about this in the past, but, FWIW, it's definitely come up before.

I'd love to get to it, but my time is severely limited for about the next 6 months. Do you have any thoughts about where to implement it? Maybe in the resolver logic?

@regisrob
Contributor Author

@scande3 Thanks a lot for your comprehensive answer. I think I now understand the logic behind this extra request.
So it should not be an issue for us either in terms of server overload (and we only have to deal with static JPEGs).

@scande3

scande3 commented Jan 13, 2015

@jpstroop - Unsure where to implement it just yet. In theory, the solution is the same for all of the resolvers (try the full request as a URI if it appears to be a bad request). Whether that should be part of the base Resolver class or done above the Resolver level, by catching errors and trying the full URI in certain cases, I am not sure. I may be able to work on it in a few weeks.

It isn't a significant issue, in that the performance hit of those extra requests is fairly minor, which leaves it as a low priority. But it is not optimal and does add to one's logs.

@jpstroop
Member

Sorry, I was talking about the problem you had of someone requesting all of your full-size images, and about somehow adding a config option that restricts sizes to an upper boundary, n% of the long dimension or something like that.

The reason for putting the logic in the resolver would be that a fancier resolver implementation might want to change this behavior by image or even based on user credentials.
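As a rough illustration of the "n% of the long dimension" idea, a check of this kind could look like the sketch below. The function name `exceeds_size_limit` and the `max_pct` setting are hypothetical; nothing here reflects an existing Loris option.

```python
def exceeds_size_limit(req_w, req_h, full_w, full_h, max_pct):
    """Return True if the requested long side is above the allowed cap.

    The cap is max_pct percent of the source image's long dimension.
    """
    limit = max(full_w, full_h) * max_pct / 100.0
    return max(req_w, req_h) > limit
```

A resolver-level hook could then pick `max_pct` per image or per user before the server decides whether to serve the request, which is the flexibility mentioned above.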

@scande3

scande3 commented Jan 13, 2015

@jpstroop - Ah, I misunderstood. As that is a different issue, I will reply in #141 to keep the threads cleaner.

@bcail
Contributor

bcail commented Nov 17, 2016

@regisrob are you able to give this a try again? I think PRs 251 and 255 may have helped with the extra hits to the remote server.

@regisrob
Contributor Author

Thank you @bcail, I will do my best to give it a try asap
