threaded queue fetcher limitations #281
(One of a no doubt continuing series of things I forgot.) Half-baked idea: if a large number of requests is seen for a particular site (a large number of requests couldn't be delayed in the last minute), consider lowering the minimum interval for that site (requires "fair share" so it doesn't hog the prefetch). This seems reasonable for sites that generate large numbers of URLs per day. There is code that attempts to back off when a 429 response is seen, but it's not well tested.
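A hedged sketch of how that might look (all names, thresholds, and the once-a-minute hook are assumptions, not existing tqfetcher code):

```python
# Hedged sketch: lower a site's minimum request interval when many of its
# requests had to be delayed in the last minute, and back off when a
# 429 (Too Many Requests) response is seen.  Thresholds are assumptions.

DEFAULT_MIN_INTERVAL = 5.0   # current tqfetcher minimum, in seconds
FLOOR_INTERVAL = 1.0         # assumed lower bound
BACKOFF_FACTOR = 2.0         # assumed multiplier applied on a 429

class SiteInterval:
    def __init__(self) -> None:
        self.min_interval = DEFAULT_MIN_INTERVAL
        self.delayed_last_minute = 0  # bumped whenever a request is delayed

    def periodic_adjust(self, busy_threshold: int = 100) -> None:
        """Called from the once-a-minute periodic task: if a large number
        of this site's requests had to be delayed, shrink the interval
        (but never below FLOOR_INTERVAL)."""
        if self.delayed_last_minute > busy_threshold:
            self.min_interval = max(FLOOR_INTERVAL, self.min_interval / 2)
        self.delayed_last_minute = 0

    def on_429(self) -> None:
        """Back off when the site says we're fetching too fast."""
        self.min_interval = min(self.min_interval * BACKOFF_FACTOR, 600.0)
```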
Another issue/thought (which in theory applies to all queue workers, but in practice only really affects tqfetcher): when tqfetcher sees that a site/domain is "fully booked" with fetches for the next two minutes, it shunts any additional input requests to the "fetcher-fast" queue, where they hang out for two minutes and are then appended to the regular "fetcher-input" queue. In normal operation (hourly batches from rss-queuer), the new requests are run thru in the first part of the hour, shunting unreachable and other "soft" errors to the -retry queue (after which they'll end up at the end of the -input queue).

Something I coded for from the very start is to allow any queue worker to take input from multiple queues (so that retries can be processed as soon as their delay period ends, rather than going to the back of the line). The sticky bit is that ISTR a Pika/RabbitMQ channel can only get messages from one queue, so I made sure that worker objects and channels are not wired into the code as 1:1.

The place this MAY be visible right now: the fetches from old rss/csv files are taking over a day. My guess is that the first pass thru the queue is taking less than a day, but that:
The bottom line:
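A hedged aside on the "fetcher-fast" shunt described above: the "park for two minutes, then get appended to fetcher-input" behavior maps onto RabbitMQ's per-queue TTL plus dead-letter-exchange pattern. A minimal pika sketch (the exchange name and TTL here are illustrative, not necessarily what the pipeline actually declares):

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
chan = conn.channel()

# Regular input queue, bound to a direct exchange (name assumed).
chan.exchange_declare("fetcher-input-x", exchange_type="direct")
chan.queue_declare("fetcher-input")
chan.queue_bind("fetcher-input", "fetcher-input-x", routing_key="fetcher-input")

# "Fast" delay queue: messages expire after two minutes and are then
# dead-lettered back onto the regular input queue.
chan.queue_declare(
    "fetcher-fast",
    arguments={
        "x-message-ttl": 120_000,                       # 2 minutes, in ms
        "x-dead-letter-exchange": "fetcher-input-x",
        "x-dead-letter-routing-key": "fetcher-input",
    },
)
```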
A brain dump of where I left off.
tqfetcher behaves reasonably (fetches 10-15 stories/second) given a well mixed input queue. But if the workload is not well mixed (i.e., historical URLs from a single site dropped into the input queue all at once), the thruput can drop to 1 story every 5 seconds.
A RabbitMQ consumer can specify the maximum number of unacknowledged messages that will be delivered to it before one of the pending messages is acknowledged (and removed from the input queue). This number is called the prefetch.
tqfetcher keeps a moving average of past request times, but does not schedule fetches based on the available threads; rather, it optimistically delays requests based on the past request time and the time the last request was sent. When a past estimate is not available (average is zero), requests are not delayed.
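A rough sketch of that "optimistic delay" logic (the structure and the moving-average weight are illustrative, not the actual tqfetcher code):

```python
import time

class SiteTimer:
    """Per-site optimistic delay: space requests by a moving average of past
    request durations, measured from when the previous request was issued."""

    ALPHA = 0.25  # exponential moving-average weight (assumed)

    def __init__(self) -> None:
        self.avg_duration = 0.0   # zero means "no estimate yet"
        self.last_issued = 0.0    # monotonic time the last request was issued

    def note_issued(self) -> None:
        self.last_issued = time.monotonic()

    def note_duration(self, duration: float) -> None:
        if self.avg_duration == 0.0:
            self.avg_duration = duration
        else:
            self.avg_duration += self.ALPHA * (duration - self.avg_duration)

    def delay_needed(self, min_interval: float = 5.0) -> float:
        """Seconds to wait before the next request to this site; zero when
        there is no past estimate (average is zero)."""
        if self.avg_duration == 0.0:
            return 0.0
        interval = max(min_interval, self.avg_duration)
        return max(0.0, self.last_issued + interval - time.monotonic())
```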
After the delay (or no delay), stories are placed into a work queue for distribution to worker threads. If the work queue accumulates ANY significant number of ready requests, the ability to control delay between requests to a site is SEVERELY compromised. Because of this, the prefetch is kept very low (two messages per available worker thread).
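For reference, capping deliveries at "two messages per available worker thread" is just a basic_qos call; a minimal pika sketch (the thread count is illustrative):

```python
import pika

WORKER_THREADS = 16  # illustrative; whatever the worker is configured with

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# RabbitMQ will not deliver more than this many unacknowledged messages,
# so the local work queue can never build up a large backlog of ready
# requests and compromise per-site spacing.
channel.basic_qos(prefetch_count=2 * WORKER_THREADS)
```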
Thoughts for improvements:
Right now, tqfetcher's MINIMUM delay between starting requests to any one site is 5 seconds. THIS IS EXCEEDINGLY CONSERVATIVE! Scrapy may not have had ANY minimum delay!!! HOWEVER: very low delay numbers would allow VERY large numbers of requests to a site to be delayed (eating up the prefetch). A fix for this might be:

- Limit the number of stories that can be delayed for a particular site to a "fair share" of the prefetch (perhaps `prefetch / active_sites` or `prefetch / (active_sites + 1)`, where `active_sites` is the number of sites successfully fetched from in the last minute, counted in the once-a-minute periodic task); see the sketch after this list.
- Enforce a delay between first requests to any previously uncontacted site?
- and/or check the work queue length before adding a request (more than one request per thread is excessive).
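A minimal sketch of the "fair share" cap from the first item (the function names and threshold handling are illustrative, not existing tqfetcher code):

```python
def fair_share(prefetch: int, active_sites: int) -> int:
    """Max requests one site may have sitting in the delay stage.
    active_sites = sites successfully fetched from in the last minute,
    as counted by the once-a-minute periodic task."""
    return max(1, prefetch // (active_sites + 1))

def may_delay(delayed_for_site: int, prefetch: int, active_sites: int) -> bool:
    """True if another request for this site may be delayed locally;
    otherwise it would be shunted to the -fast (delay) queue instead."""
    return delayed_for_site < fair_share(prefetch, active_sites)
```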
The grandest idea/wish is to plan out all activity (within the next fast-delay-queue interval) so that the work queue never gets large, and requests are delivered to the work queue at a measured rate that never exceeds available capacity. Initial requests to uncontacted sites should be treated as taking a long time (MAX_CONNECT + MAX_READ seconds?).
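A hedged sketch of that capacity-planned admission (window length, timeouts, and names are assumptions):

```python
MAX_CONNECT = 30.0   # assumed connect timeout, seconds
MAX_READ = 30.0      # assumed read timeout, seconds

def estimated_duration(avg_duration: float) -> float:
    """Initial requests to uncontacted sites (no average yet) are assumed
    to take the worst case."""
    return avg_duration if avg_duration > 0.0 else MAX_CONNECT + MAX_READ

def can_admit(committed_seconds: float, request_avg: float,
              worker_threads: int, planning_window: float = 120.0) -> bool:
    """Release a request to the work queue only if the work already committed
    for the planning window (e.g. the fast-delay-queue period) plus this
    request still fits within worker-thread capacity."""
    return (committed_seconds + estimated_duration(request_avg)
            <= worker_threads * planning_window)
```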
Consider whether it's possible to detect if ALL requests are failing due to an Internet outage. The worst case is that if the outage lasts more than 10 hours, stuff gets dumped into the fetcher-quar(antine) queue and has to be manually moved.
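One hedged way to detect that: treat "every recent attempt failed, across many distinct sites" as a probable outage (window size and thresholds are assumptions):

```python
from collections import deque

class OutageDetector:
    def __init__(self, window: int = 100) -> None:
        self.recent = deque(maxlen=window)  # (site, succeeded) pairs

    def record(self, site: str, succeeded: bool) -> None:
        self.recent.append((site, succeeded))

    def looks_like_outage(self, min_sites: int = 10) -> bool:
        """True when the window is full, every attempt in it failed, and the
        failures span several sites (so it's probably not one bad site)."""
        if len(self.recent) < (self.recent.maxlen or 0):
            return False
        if any(ok for _, ok in self.recent):
            return False
        return len({site for site, _ in self.recent}) >= min_sites
```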