Better User Agent and reduce transfer amounts for large booking systems #1
Comments
This might not be this repo, but it definitely feels related to some OpenActive crawler somewhere.
Hi Nathan,

Sorry to hear that you're experiencing some overly keen data harvesting. The status dashboard in this repo is a once-daily job that just checks the first page of each feed to assess presence and accessibility, so there's no heavy-duty data harvesting going on here.

Developers building harvesting tools for the first time may not know the full implications of their processing decisions. There are notes about this in the developer docs and elsewhere, but mistakes may still happen. We appreciate the inconvenience such issues can cause, and we can help you determine the source and then work with the developers on adjustments as necessary. We can also send some messages across our channels to flag such cases and highlight best practices, such as providing a User Agent with prescribed formatting.

Do you have any further details about the requester? You mentioned that there's no "reasonable UA", but what User Agent information do you have? Is there anything else we could draw on, such as other request headers or even IP addresses? Is the crawl happening on the hour 24/7, or only at certain times? From what date did this start happening? What are the feed URLs in question? Anything at all would be useful.

Finally, you mentioned that you have 10 million items in your feed(s) but only 3 million of them are currently published. Do the other 7 million have a state of "deleted"? If so then, as per the retention recommendations, they can be removed from the feed after a short period. This would limit the data transfer that occurs on resync, as and when a data consumer does legitimately need to do so. Also, have you considered using a CDN?

Outside of this thread, feel free to contact the ODI OpenActive team directly via [email protected]
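For illustration, the retention idea could look something like this on the publisher side (assuming a SQL-backed feed; the `feed_items` table, the column names and the millisecond `modified` values below are placeholders, not anyone's real schema):

```python
import sqlite3
import time

RETENTION_DAYS = 7  # the "short period" is a publisher choice; 7 days is only an example

def purge_deleted_items(db_path: str) -> int:
    """Remove items that have sat in the "deleted" state for longer than the
    retention window, so they stop being re-served on every full resync."""
    cutoff_ms = int((time.time() - RETENTION_DAYS * 86400) * 1000)
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "DELETE FROM feed_items WHERE state = 'deleted' AND modified < ?",
            (cutoff_ms,),
        )
        return cur.rowcount  # how many stale deleted items were purged
```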
Hi Reikyo,

It's good to know that this isn't the source of the crawler behaviour, but it's interesting that several agents are making similar mistakes. We only noticed this issue when we accidentally introduced a performance problem in our feeds, but it's worth pointing out that many data providers may have the same bandwidth problem. We have plans to introduce a cache layer for this, but picking a sensible cache time is tricky for this type of usage.

Looking into it, these requests all come from AWS eu-west-1, in Ireland. They come from over 2,000 different IPs, which makes me think this is either a Lambda or an automated job that spins up an instance, runs the feed crawler and then shuts down. The client appears to be a Ruby client running each hour and pulling the entire feed. Here's a breakdown of the requests for the start of our feeds:

We don't have a load of deleted slots in the feed (those are removed after 7 days), but the un-publishing is at a higher level. I think this possibly isn't the best place to discuss this; @howaskew, do you know of a better place we could move this issue to?
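One option for that cache layer (the values below are illustrative only, nothing here is decided or taken from this thread) is to vary the Cache-Control lifetime by page: the tail of the feed, which consumers poll constantly, gets a short TTL, while earlier pages, which are only re-read on a full resync, can be cached by a CDN for much longer. A rough sketch:

```python
def cache_control_for_page(is_last_page: bool) -> str:
    """Pick a Cache-Control header for an RPDE page.

    The TTLs are example values: a short max-age on the last page keeps
    consumers reasonably fresh, while earlier pages can safely be served
    from a CDN cache for longer.
    """
    if is_last_page:
        return "public, max-age=8"
    return "public, max-age=3600"

# e.g. in whatever framework serves the feed (hypothetical handler):
# response.headers["Cache-Control"] = cache_control_for_page(page.is_last)
```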
Hi all,
We've had to unpublish a lot of our feeds because we've been experiencing an issue that I believe is related to this dashboard. Firstly, it doesn't appear to send a reasonable UA, so this is all conjecture. I'd suggest using `Openactive (https://github.com/openactive/status-dashboard)` as the UA so people can quickly report issues if this crawler is causing problems or not responding to `robots.txt`, etc.
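For example, if the crawler is a Python script using `requests` (I don't know what it's actually written in, so this is just an illustration), setting a descriptive UA is a one-liner:

```python
import requests

# Descriptive UA so feed publishers can identify the crawler and report problems.
USER_AGENT = "Openactive (https://github.com/openactive/status-dashboard)"

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

def fetch_first_page(feed_url: str) -> dict:
    """Fetch just the first page of a feed, identifying ourselves clearly."""
    response = session.get(feed_url, timeout=30)
    response.raise_for_status()
    return response.json()
```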
The second issue is that this dashboard doesn't appear to save any of the data it crawls. Because it runs hourly, it pulls all availability from the beginning of time on every run. For smaller systems I imagine this isn't an issue, but it has been causing us some problems. One of the main issues is that we have about 10 million slots in our system (although only about 3 million of those are published at the moment). When this runs every hour it downloads all of those slots and then discards them. This is using an enormous amount of bandwidth, and in the last 30 days it has consumed over 160GB of data.
I'd suggest a simple SQLite or Redis DB to back this dashboard and save the `next` URL, to avoid hammering booking systems by pulling their whole availability every hour, which was the entire point of the RPDE feeds in the first place.
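As a rough sketch of what I mean (assuming a Python harvester using `requests`, the usual RPDE last-page convention of an empty `items` array with `next` equal to the current URL, and illustrative table and function names):

```python
import sqlite3
import requests

def load_resume_url(conn: sqlite3.Connection, feed_url: str) -> str:
    """Return the saved `next` URL for this feed, or the feed start on first run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feed_state (feed_url TEXT PRIMARY KEY, next_url TEXT)"
    )
    row = conn.execute(
        "SELECT next_url FROM feed_state WHERE feed_url = ?", (feed_url,)
    ).fetchone()
    return row[0] if row else feed_url

def save_resume_url(conn: sqlite3.Connection, feed_url: str, next_url: str) -> None:
    """Persist the position so the next run resumes instead of resyncing from scratch."""
    conn.execute(
        "INSERT OR REPLACE INTO feed_state (feed_url, next_url) VALUES (?, ?)",
        (feed_url, next_url),
    )
    conn.commit()

def poll_feed(feed_url: str, db_path: str = "harvester.db") -> None:
    """Follow the RPDE `next` links from the last saved position."""
    conn = sqlite3.connect(db_path)
    url = load_resume_url(conn, feed_url)
    while True:
        page = requests.get(url, timeout=30).json()
        # RPDE last-page convention: no items and next == current URL.
        if not page.get("items") and page.get("next") == url:
            break
        # ... process page["items"] here (check status, availability, etc.) ...
        url = page["next"]
        save_resume_url(conn, feed_url, url)
    conn.close()
```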