SimpleHTTPResolver is getting 404 for every non-cached IIIF request #142

regisrob · 2015-01-12T14:56:04Z

The Loris log file shows that the SimpleHTTPResolver by @scande3 is sending requests to the remote server for every single IIIF request that has not been previously cached.
Going by the docstring in resolver.py, I assumed that the resolver made only one HTTP request to retrieve the source image, copied it into the local cache, and used that local copy for every subsequent IIIF request sharing the same identifier.

But if you look at the log file below (in this example the image identifier is "B452346101_C102/ecran/B452346101_C102_0005.jpg"), a new request returning 404 is sent every time you change a IIIF parameter (the response will always be 404 since the remote server is not IIIF-enabled, by definition).

2015-01-12 14:11:17,453 (loris.resolver) [INFO]: Copied http://www.example.fr/B452346101_C102/ecran/B452346101_C102_0005.jpg to /path/to/local/cache/loris_cache.jpg

2015-01-12 14:14:50,942 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:14:51,035 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/full/,150/0/default.jpg HTTP/1.1" 404 None

2015-01-12 14:18:27,049 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:18:27,078 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/full/full/0/bitonal.jpg HTTP/1.1" 404 None

2015-01-12 14:40:49,003 (requests.packages.urllib3.connectionpool) [INFO]: Starting new HTTP connection (1): www.example.fr
2015-01-12 14:40:49,037 (requests.packages.urllib3.connectionpool) [DEBUG]: "GET /B452346101_C102/ecran/B452346101_C102_0005.jpg/50,80,250,250/full/0/default.png HTTP/1.1" 404 None

@jpstroop @scande3: is this normal behaviour? Isn't there a risk of server overload? (Especially if you intend to use a viewer like OpenSeadragon, which sends dozens of requests for each image.)

scande3 · 2015-01-12T21:09:22Z

@regisrob - The extra requests are explained in a thread here: #98. It starts with the comment titled "scande3 commented on Nov 6, 2014" (near the middle) and goes on for a few replies about the issue. From @jpstroop's response, it is set up that way because of the difficulty of resolving the URL under the IIIF specification.

The best I was able to do was check the local cache first before attempting a request against the remote server. This is because that call occurs in the "dissect_uri" method of resolver here: https://github.com/pulibrary/loris/blob/development/loris/webapp.py#L342 before any attempt to work out whether there are valid URL parameters. As such, regardless of provider, it first tries to see if the full URL is itself an identifier (which obviously isn't in the cache).
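To make the cache-first behaviour concrete, here is a minimal sketch. It is not the Loris code: `cache_path_for`, `resolve`, the md5-based cache layout, and the `fetch` callable are all hypothetical names chosen for illustration. It shows why a cached identifier costs no remote request while each distinct full URL that the remote server rejects still costs one.

```python
import os
from hashlib import md5


def cache_path_for(ident, cache_root):
    """Map an identifier to a file under cache_root (hypothetical layout)."""
    digest = md5(ident.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, digest)


def resolve(ident, cache_root, fetch):
    """Return a local path for ident, calling fetch(ident) only on a cache miss.

    fetch is any callable that returns the image bytes, or None when the
    remote server has nothing at that identifier (for example a 404).
    """
    path = cache_path_for(ident, cache_root)
    if os.path.isfile(path):
        return path  # cache hit: no remote request is made
    data = fetch(ident)  # cache miss: exactly one remote request
    if data is None:
        return None  # nothing is cached for a miss, so it repeats next time
    with open(path, "wb") as fh:
        fh.write(data)
    return path
```

Because a failed lookup leaves nothing in the cache, trying the full request URL as an identifier first will hit the remote server again for every new combination of IIIF parameters, which matches the log above.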

However... if someone wanted to rewrite the "dissect_uri" logic, then ideally it would try the full URL only as a last resort. The only extra HTTP request would then come from a malformed request, to verify that the entire request wasn't itself a valid URI. The reason I didn't tackle this is that it would require some significant re-engineering of how the application handles errors within a resolver to be feasible.
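One way such a rewrite could look, as a rough sketch only: peel the IIIF parameters off the right-hand end of the path first, and treat the whole path as an identifier only when that fails. The function name `split_iiif_path` and the regular expressions are assumptions of mine; the patterns approximate the IIIF Image API 2.0 URI syntax and are not taken from Loris.

```python
import re

# Approximate patterns for IIIF Image API 2.0 path segments (assumed, not exhaustive).
NUM = r"\d+(\.\d+)?"
REGION = re.compile(r"^(full|(pct:)?%s,%s,%s,%s)$" % (NUM, NUM, NUM, NUM))
SIZE = re.compile(r"^(full|\d+,|,\d+|pct:%s|!?\d+,\d+)$" % NUM)
ROTATION = re.compile(r"^!?%s$" % NUM)
QUALITY_FORMAT = re.compile(
    r"^(default|color|gray|bitonal)\.(jpg|png|gif|tif|jp2|pdf|webp)$"
)


def split_iiif_path(path):
    """Split a request path into (identifier, params).

    params is a tuple of IIIF segments, or None when the path does not end
    in recognisable IIIF parameters and must be treated as a bare identifier.
    """
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[-1] == "info.json":
        return "/".join(parts[:-1]), ("info.json",)
    if len(parts) >= 5:
        region, size, rotation, quality_format = parts[-4:]
        if (REGION.match(region) and SIZE.match(size)
                and ROTATION.match(rotation)
                and QUALITY_FORMAT.match(quality_format)):
            return "/".join(parts[:-4]), (region, size, rotation, quality_format)
    # Last resort: the whole path is the identifier.
    return "/".join(parts), None
```

With this ordering, a request such as `B452346101_C102/ecran/B452346101_C102_0005.jpg/full/,150/0/default.jpg` yields the image identifier directly, so the resolver would only be asked about the full URL when the trailing segments do not parse.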

In terms of server overload, the extra "does it exist" requests aren't causing any issues for us. You can try it out with OpenSeadragon at http://www.digitalcommonwealth.org.

What is a server overload risk is a situation we had happen to us the other week: for some reason, Biblioboards decided to request the full JPG image of every single one of our 100,000+ objects that actually exist in our system (so not including objects we only have harvested metadata for). As only the JP2 exists, the image server needs to convert those in real time, and that quickly ate up hard drive space and slowed things down significantly for other users. We haven't yet figured out a way to prevent this from occurring again or how to handle such a situation gracefully.

jpstroop · 2015-01-13T14:45:12Z

@scande3 See #141, re: full size images. I forget who it was that asked about this in the past, but, FWIW, it's definitely come up before.

I'd love to get to it, but my time is severely limited for about the next 6 months. Do you have any thoughts about where to implement it? Maybe in the resolver logic?

regisrob · 2015-01-13T14:51:18Z

@scande3 Thanks a lot for your comprehensive answer. I think I now understand the logic behind this extra request.
So it should not be an issue for us either in terms of server overload (and we only have to deal with static JPEGs).

scande3 · 2015-01-13T15:29:44Z

@jpstroop - Unsure where to implement it just yet. In theory, the solution is the same for all of the resolvers (try the full request as a URI only if it appears to be a bad request). Whether that belongs in the base Resolver class or is done above the Resolver level, by catching errors and trying the full URI in certain cases, I am not sure. I may be able to work on it in a few weeks.

It isn't a significant issue, in that the performance hit of those extra requests is fairly minor, which leaves it as a low priority. But it is not optimal, and it does add to one's logs.

jpstroop · 2015-01-13T15:33:08Z

Sorry, I was talking about the problem you had of someone requesting all of your full-size images, and somehow adding a config option that restricts sizes to an upper-boundary, n% of the long dimension or something like that.

The reason for putting the logic in the resolver would be that a fancier resolver implementation might want to change this behavior by image or even based on user credentials.
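A minimal sketch of what such a cap might compute, assuming the limit is expressed as a percentage of the source image's long side. The function name `clamp_request` and its parameters are hypothetical, not an existing Loris option; a resolver could supply a different `max_pct` per image or per user.

```python
def clamp_request(req_w, req_h, src_w, src_h, max_pct):
    """Return (w, h) whose long side is at most max_pct % of the source's.

    req_w, req_h: the pixel size the client asked for.
    src_w, src_h: the pixel size of the source image.
    max_pct: hypothetical config value, e.g. 50 for "half the long side".
    """
    limit = max(src_w, src_h) * max_pct / 100.0
    long_side = max(req_w, req_h)
    if long_side <= limit:
        return req_w, req_h  # already within the allowed bound
    scale = limit / long_side
    # Scale both dimensions together so the aspect ratio is preserved.
    return max(1, int(req_w * scale)), max(1, int(req_h * scale))
```

Whether an over-limit request should be scaled down like this or rejected with an error is a separate design choice; the sketch only shows the arithmetic.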

scande3 · 2015-01-13T15:37:41Z

@jpstroop - Ah, I misunderstood. As that is a different issue, I will reply on #141 to keep the threads cleaner.

bcail · 2016-11-17T16:50:56Z

@regisrob are you able to give this a try again? I think PRs 251 and 255 may have helped with the extra hits to the remote server.

regisrob · 2016-11-27T18:02:13Z

Thank you @bcail, I will do my best to give it a try asap

scande3 mentioned this issue Jan 13, 2015

Should be able to set a size threshold in config #141
