Reduce crawler load on servers #1581
Conversation
The crawler has a hard time crawling all specs nowadays due to more stringent restrictions on servers that lead to network timeouts and errors. See: w3c/webref#1244

The goal of this update is to reduce the load of the crawler on servers. Two changes:

1. The list of specs to crawl gets sorted to distribute origins. This should help dilute the requests sent to a specific server at once. The notion of "origin" used in the code is loose: it is meant to identify the server that serves the resource rather than the actual origin.
2. Requests sent to a given origin are serialized, and sent at least 2 seconds after the previous request was sent (and processed). The crawler otherwise still processes the list 4 specs at a time (provided the specs are to be retrieved from different origins).

A consequence of 1. is that specs are no longer processed in order, so the logs make the crawler look a bit drunk, processing specs seemingly at random, as in:

```
1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
8/610 - https://compat.spec.whatwg.org/ - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
8/610 - https://compat.spec.whatwg.org/ - done
66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```
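For illustration, the two mechanisms could look roughly like the sketch below. This is not the actual implementation: apart from `interleave` and `specsByOrigin`, which appear in the diff excerpts further down, all names are hypothetical.

```js
// Sketch only, not the actual Reffy code.
const ORIGIN_PAUSE_MS = 2000;

// Loose notion of "origin": identify the server rather than the true origin.
function getOrigin(url) {
  return new URL(url).host;
}

// Round-robin interleaving of several lists, e.g.
// interleave([a1, a2], [b1], [c1, c2, c3]) => [a1, b1, c1, a2, c2, c3]
function interleave(...lists) {
  const result = [];
  for (let i = 0; lists.some(list => i < list.length); i++) {
    for (const list of lists) {
      if (i < list.length) {
        result.push(list[i]);
      }
    }
  }
  return result;
}

// Serialize requests to the same origin, with a 2s pause after the previous
// request to that origin was sent and processed.
const lastRequestPerOrigin = new Map();
async function throttledFetch(url) {
  const origin = getOrigin(url);
  const previous = lastRequestPerOrigin.get(origin);
  const request = (async () => {
    if (previous) {
      await previous.catch(() => {});  // ignore failures of the previous request
      await new Promise(resolve => setTimeout(resolve, ORIGIN_PAUSE_MS));
    }
    return fetch(url);
  })();
  lastRequestPerOrigin.set(origin, request);
  return request;
}
```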
```js
mockAgent
  .get("https://www.w3.org")
  .intercept({ method: "GET", path: "/StyleSheets/TR/2021/dark.css" })
  .reply(200, '')
  .persist();
```
Forgot to add: that's unrelated but now needed (I suspect the style sheet was added recently) to avoid a test complaint that the request for this resource is not being mocked.
LGTM, although I'm not sure the interleaving/systematic pacing is necessarily optimal in terms of reducing processing time (see embedded comment); in any case, I think we should merge this as is and look for possible optimizations separately
```js
  }
  specsByOrigin[origin].push(spec);
}
const spreadList = interleave(...Object.values(specsByOrigin));
```
I wonder how much the interleaving buys us vs managing a proper per-"origin" queue of requests that would let us process requests in parallel without pauses as long as there is no lock on the relevant origins.
I think e.g. https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 accomplishes this for the CG monitor
Optimization-wise, I don't know if either approach really trumps the other. From a readability and maintenance perspective, a proper per-origin queue would be cleaner though! I was more trying to patch the existing code than to rewrite it.
One constraint I think we need in Reffy is that parallelism needs to be kept to a reasonable level, say 4-5 specs in parallel maximum, because loading too many specs at once in Puppeteer may use a lot of RAM. If we just enqueue everything on a per-origin queue without throttling, we may end up in a situation where 10 specs from 10 different origins get processed at once.
Right now that throttling is done at the spec level, which is why I chose to shuffle the list instead, to reduce the likelihood that the 4 specs that get processed in parallel target the same origin server. We'd need to apply that throttling to the list of origins instead (process 4 origins in parallel among the list of origins). That can also very likely be done "properly" ;)
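For the record, a minimal sketch of what that origin-level throttling could look like, assuming a `crawlSpec` function and the same `specsByOrigin` mapping as in the diff above (names and structure are hypothetical, not Reffy's actual code):

```js
// Hypothetical sketch: process at most 4 origins at a time, and crawl the
// specs of each origin sequentially with a pause between requests.
const MAX_PARALLEL_ORIGINS = 4;
const ORIGIN_PAUSE_MS = 2000;

async function crawlByOrigin(specsByOrigin, crawlSpec) {
  const origins = Object.keys(specsByOrigin);
  let next = 0;

  // Each worker repeatedly picks the next unprocessed origin and drains its queue.
  async function worker() {
    while (next < origins.length) {
      const origin = origins[next++];
      for (const spec of specsByOrigin[origin]) {
        await crawlSpec(spec);
        await new Promise(resolve => setTimeout(resolve, ORIGIN_PAUSE_MS));
      }
    }
  }

  await Promise.all(
    Array.from({ length: MAX_PARALLEL_ORIGINS }, () => worker())
  );
}
```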
For info, I merged the PR but haven't managed to run a successful crawl locally so far because the drafts CSS server keeps crashing. That may signal that the crawl needs to be slowed down further for CSS drafts. I'm not releasing a new version of Reffy right away as a result.
Breaking change:
- Bump required version of Node.js to >=20.12.1 (#1590). Required by the dependency on ReSpec, although note this dependency is only used for tests (`devDependencies` in `package.json`).

New features:
- Reduce crawler load on servers (#1587, #1581)
- Include data-xref-type as another hint for autolinks (#1585)

Feature patches:
- Skip requests on SVG and CSS resources (#1588)
- Fix Puppeteer instance teardown (#1586)

Dependency bumps:
- Bump respec from 35.0.2 to 35.1.0 (#1583)
- Bump ajv from 8.15.0 to 8.16.0 (#1580)
[Note: not fully tested for now because the drafts CSS server is currently down]