Reduce crawler load on servers #1581
Conversation
The crawler has a hard time crawling all specs nowadays due to more stringent restrictions on servers that lead to network timeouts and errors. See: w3c/webref#1244

The goal of this update is to reduce the load of the crawler on servers. Two changes:

1. The list of specs to crawl gets sorted to distribute origins. This should help dilute the requests sent to a specific server at once. The notion of "origin" used in the code is loose: it is meant to identify the server that serves the resource rather than the actual origin.
2. Requests sent to a given origin are serialized, and sent at least 2 seconds after the previous request was sent (and processed). The crawler otherwise still processes the list 4 specs at a time (provided the specs are to be retrieved from different origins).

A consequence of 1. is that specs are no longer processed in order, so the logs make the crawler look a bit drunk, processing specs seemingly at random, as in:

```
1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
8/610 - https://compat.spec.whatwg.org/ - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
8/610 - https://compat.spec.whatwg.org/ - done
66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```
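For illustration, the two mechanisms could look roughly like the sketch below. This is not the actual implementation: apart from `interleave` and `specsByOrigin`, which appear in the diff excerpts further down, all names are hypothetical.

```js
// Sketch only, not the actual Reffy code.
const ORIGIN_PAUSE_MS = 2000;

// Loose notion of "origin": identify the server rather than the true origin.
function getOrigin(url) {
  return new URL(url).host;
}

// Round-robin interleaving of several lists, e.g.
// interleave([a1, a2], [b1], [c1, c2, c3]) => [a1, b1, c1, a2, c2, c3]
function interleave(...lists) {
  const result = [];
  for (let i = 0; lists.some(list => i < list.length); i++) {
    for (const list of lists) {
      if (i < list.length) {
        result.push(list[i]);
      }
    }
  }
  return result;
}

// Serialize requests to the same origin, with a 2s pause after the previous
// request to that origin was sent and processed.
const lastRequestPerOrigin = new Map();
async function throttledFetch(url) {
  const origin = getOrigin(url);
  const previous = lastRequestPerOrigin.get(origin);
  const request = (async () => {
    if (previous) {
      await previous.catch(() => {});  // ignore failures of the previous request
      await new Promise(resolve => setTimeout(resolve, ORIGIN_PAUSE_MS));
    }
    return fetch(url);
  })();
  lastRequestPerOrigin.set(origin, request);
  return request;
}
```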
```js
mockAgent
  .get("https://www.w3.org")
  .intercept({ method: "GET", path: "/StyleSheets/TR/2021/dark.css" })
  .reply(200, '')
  .persist();
```
Forgot to add: that's unrelated but now needed (I suspect the style sheet was added recently) to avoid a test complaint that the request for this resource is not being mocked.
LGTM, although I'm not sure the interleaving/systematic pacing is necessarily optimal in terms of reducing processing time (see embedded comment); in any case, I think we should merge this as is and look for possible optimizations separately
```js
  }
  specsByOrigin[origin].push(spec);
}
const spreadList = interleave(...Object.values(specsByOrigin));
```
I wonder how much the interleaving buys us vs managing a proper per-"origin" queue of requests that would let us process requests in parallel without pauses as long as there is no lock on the relevant origins.
I think e.g. https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 accomplishes this for the CG monitor
Optimization-wise, I don't know if either approach really trumps the other. From a readability and maintenance perspective, a proper per-origin queue would be cleaner though! I was more trying to patch the existing code than to rewrite it.
One constraint I think we need in Reffy is that parallelism needs to be kept to a reasonable level, say 4-5 specs in parallel maximum, because loading too many specs at once in Puppeteer may use a lot of RAM. If we just enqueue everything on a per-origin queue without throttling, we may end up in a situation where 10 specs from 10 different origins get processed at once.
Right now that throttling is done at the spec level, which is why I chose to shuffle the list instead, to reduce the likelihood that the 4 specs that get processed in parallel target the same origin server. We'd need to apply that throttling to the list of origins instead (process 4 origins in parallel among the list of origins). That can also very likely be done "properly" ;)
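For the record, a minimal sketch of what that origin-level throttling could look like, assuming a `crawlSpec` function and the same `specsByOrigin` mapping as in the diff above (names and structure are hypothetical, not Reffy's actual code):

```js
// Hypothetical sketch: process at most 4 origins at a time, and crawl the
// specs of each origin sequentially with a pause between requests.
const MAX_PARALLEL_ORIGINS = 4;
const ORIGIN_PAUSE_MS = 2000;

async function crawlByOrigin(specsByOrigin, crawlSpec) {
  const origins = Object.keys(specsByOrigin);
  let next = 0;

  // Each worker repeatedly picks the next unprocessed origin and drains its queue.
  async function worker() {
    while (next < origins.length) {
      const origin = origins[next++];
      for (const spec of specsByOrigin[origin]) {
        await crawlSpec(spec);
        await new Promise(resolve => setTimeout(resolve, ORIGIN_PAUSE_MS));
      }
    }
  }

  await Promise.all(
    Array.from({ length: MAX_PARALLEL_ORIGINS }, () => worker())
  );
}
```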
For info, I merged the PR but haven't managed to run a successful crawl locally so far because the drafts CSS server keeps crashing. That may signal that the crawl needs to be slowed down further for CSS drafts. I'm not releasing a new version of Reffy right away as a result.
Breaking change:
- Bump required version of Node.js to >=20.12.1 (#1590). Required by the dependency on ReSpec, although note this dependency is only used for tests (`devDependencies` in `package.json`).

New features:
- Reduce crawler load on servers (#1587, #1581)
- Include data-xref-type as another hint for autolinks (#1585)

Feature patches:
- Skip requests on SVG and CSS resources (#1588)
- Fix Puppeteer instance teardown (#1586)

Dependency bumps:
- Bump respec from 35.0.2 to 35.1.0 (#1583)
- Bump ajv from 8.15.0 to 8.16.0 (#1580)
[Note: not fully tested for now because the drafts CSS server is currently down]