Better User Agent and reduce transfer amounts for large booking systems #1
Comments
This might not be this repo, but it definitely feels related to some OpenActive crawler somewhere.
Hi Nathan,

Sorry to hear that you're experiencing some overly keen data harvesting. The status dashboard in this repo is a once-daily job that just checks the first page of each feed to assess presence and accessibility, so there's no heavy-duty data harvesting going on here.

Developers building harvesting tools for the first time may not know the full implications of their processing decisions. There are notes about this in the developer docs and elsewhere, but mistakes may still happen. We appreciate the inconvenience such issues can cause, and we can help you determine the source and then work with the developers on adjustments as necessary. We can also send some messages across our channels to flag such cases and highlight best practices, such as providing a User Agent with prescribed formatting.

Do you have any further details about the requester? You mentioned that there's no "reasonable UA", but what User Agent information do you have? Is there anything else we could draw on, such as other request headers or even IP addresses? Is the crawl happening on the hour 24/7, or only at certain times? From what date did this start happening? What are the feed URLs in question? Anything at all would be useful.

Finally, you mentioned that you have 10 million items in your feed(s) but only 3 million of them are currently published. Do the other 7 million have a state of "deleted"? If so then, as per the retention recommendations, they can be removed from the feed after a short period. This would limit the data transfer that occurs on resync, as and when a data consumer does legitimately need to do so. Also, have you considered using a CDN?

Outside of this thread, feel free to contact the ODI OpenActive team directly via [email protected]
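For illustration, the retention idea could look something like this on the publisher side (assuming a SQL-backed feed; the `feed_items` table, the column names and the millisecond `modified` values below are placeholders, not anyone's real schema):

```python
import sqlite3
import time

RETENTION_DAYS = 7  # the "short period" is a publisher choice; 7 days is only an example

def purge_deleted_items(db_path: str) -> int:
    """Remove items that have sat in the "deleted" state for longer than the
    retention window, so they stop being re-served on every full resync."""
    cutoff_ms = int((time.time() - RETENTION_DAYS * 86400) * 1000)
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "DELETE FROM feed_items WHERE state = 'deleted' AND modified < ?",
            (cutoff_ms,),
        )
        return cur.rowcount  # how many stale deleted items were purged
```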
Hi Reikyo,

It's good to know that this isn't the source of the crawler behaviour, but it's interesting that several agents are making similar mistakes. We only noticed this issue when we accidentally introduced a performance problem in our feeds, but it's worth pointing out that many data providers may have the same bandwidth problem. We have plans to introduce a cache layer for this, but picking a sensible cache time is tricky for this type of usage.

Looking into it, these requests all come from AWS eu-west-1, in Ireland. They come from over 2,000 different IPs, which makes me think this is either a Lambda or an automated job that spins up an instance, runs the feed crawler and then shuts down. The client appears to be a Ruby client running each hour and pulling the entire feed. Here's a breakdown of the requests for the start of our feeds:

We don't have a load of deleted slots in the feed (those are removed after 7 days), but the un-publishing is at a higher level. I think this possibly isn't the best place to discuss this; @howaskew, do you know of a better place we could move this issue to?
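One option for that cache layer (the values below are illustrative only, nothing here is decided or taken from this thread) is to vary the Cache-Control lifetime by page: the tail of the feed, which consumers poll constantly, gets a short TTL, while earlier pages, which are only re-read on a full resync, can be cached by a CDN for much longer. A rough sketch:

```python
def cache_control_for_page(is_last_page: bool) -> str:
    """Pick a Cache-Control header for an RPDE page.

    The TTLs are example values: a short max-age on the last page keeps
    consumers reasonably fresh, while earlier pages can safely be served
    from a CDN cache for longer.
    """
    if is_last_page:
        return "public, max-age=8"
    return "public, max-age=3600"

# e.g. in whatever framework serves the feed (hypothetical handler):
# response.headers["Cache-Control"] = cache_control_for_page(page.is_last)
```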
Hi all,
We've had to unpublish a lot of our feeds because we've been experiencing an issue that I believe is related to this dashboard. Firstly, it doesn't appear to send a reasonable UA, so this is all conjecture. I'd suggest using `Openactive (https://github.com/openactive/status-dashboard)` as the UA so people can quickly report issues if this crawler is causing problems or not responding to `robots.txt`, etc.
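For example, if the crawler is a Python script using `requests` (I don't know what it's actually written in, so this is just an illustration), setting a descriptive UA is a one-liner:

```python
import requests

# Descriptive UA so feed publishers can identify the crawler and report problems.
USER_AGENT = "Openactive (https://github.com/openactive/status-dashboard)"

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

def fetch_first_page(feed_url: str) -> dict:
    """Fetch just the first page of a feed, identifying ourselves clearly."""
    response = session.get(feed_url, timeout=30)
    response.raise_for_status()
    return response.json()
```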
The second issue is that this dashboard doesn't appear to save any of the data it crawls. Because it runs hourly, it pulls all availability from the beginning of time on every run. For smaller systems I imagine this isn't an issue, but it has been causing us some problems. One of the main issues is that we have about 10 million slots in our system (although only about 3 million of those are published at the moment). When this runs every hour it downloads all of those slots and then discards them. This is using an enormous amount of bandwidth, and in the last 30 days it has consumed over 160GB of data.
I'd suggest a simple SQLite or Redis DB to back this dashboard and save the `next` URL, to avoid hammering booking systems by pulling their whole availability every hour, which was the entire point of the RPDE feeds in the first place.
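As a rough sketch of what I mean (assuming a Python harvester using `requests`, the usual RPDE last-page convention of an empty `items` array with `next` equal to the current URL, and illustrative table and function names):

```python
import sqlite3
import requests

def load_resume_url(conn: sqlite3.Connection, feed_url: str) -> str:
    """Return the saved `next` URL for this feed, or the feed start on first run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feed_state (feed_url TEXT PRIMARY KEY, next_url TEXT)"
    )
    row = conn.execute(
        "SELECT next_url FROM feed_state WHERE feed_url = ?", (feed_url,)
    ).fetchone()
    return row[0] if row else feed_url

def save_resume_url(conn: sqlite3.Connection, feed_url: str, next_url: str) -> None:
    """Persist the position so the next run resumes instead of resyncing from scratch."""
    conn.execute(
        "INSERT OR REPLACE INTO feed_state (feed_url, next_url) VALUES (?, ?)",
        (feed_url, next_url),
    )
    conn.commit()

def poll_feed(feed_url: str, db_path: str = "harvester.db") -> None:
    """Follow the RPDE `next` links from the last saved position."""
    conn = sqlite3.connect(db_path)
    url = load_resume_url(conn, feed_url)
    while True:
        page = requests.get(url, timeout=30).json()
        # RPDE last-page convention: no items and next == current URL.
        if not page.get("items") and page.get("next") == url:
            break
        # ... process page["items"] here (check status, availability, etc.) ...
        url = page["next"]
        save_resume_url(conn, feed_url, url)
    conn.close()
```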