♻️ Investigate improvements to HEAD/GET handling in link scanner #732

zacharykeeping · 2023-10-24T05:26:13Z

Cc: @tombui99 @william-liebenberg @BrookJeynes

Hi team

Pain

As discussed with @BrookJeynes earlier, our current approach of using either HEAD or GET requests to check for broken links is becoming complicated and hard to manage. Since we recently updated the code to treat every non-successful status code as a broken link, the broken link counts on our sites have blown out, and many links are reported inaccurately because they were checked with HEAD requests. Currently we work around this by adding those links to our Unscannable Links list, but with so many more URLs to deal with, this solution does not scale well.

One possible solution we discussed is to keep using HEAD requests by default, but drop the Unscannable Links list and instead fall back to a GET request only when the HEAD request fails. This seems like a decent middle ground between running GET on every request (and retrieving much more data per request) and maintaining a list of problematic domains. Brook also thinks this could be a good way to tackle the issue, so we should discuss this option and investigate whether it resolves the issue without too much negative impact on our scans.
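
To make the proposal concrete, here is a minimal sketch of the HEAD-then-GET fallback in Python. It is not the scanner's actual code: `check_link`, `is_success`, and the `send` callback are hypothetical names, and treating only 2xx responses as successful is an assumption based on the "non-successful status code" wording above.

```python
from typing import Callable, Tuple

# Hypothetical transport: takes (method, url) and returns an HTTP status code.
Send = Callable[[str, str], int]


def is_success(status: int) -> bool:
    # Assumption: only 2xx responses count as a working link.
    return 200 <= status < 300


def check_link(url: str, send: Send) -> Tuple[str, int]:
    """Try HEAD first and fall back to GET only when HEAD is not successful.

    Returns the method that produced the final answer and its status code.
    """
    status = send("HEAD", url)
    if is_success(status):
        return "HEAD", status
    # HEAD failed (for example 405 from a server that rejects HEAD),
    # so confirm with a GET before reporting the link as broken.
    return "GET", send("GET", url)
```

With this shape, a link is reported as broken only when the GET fallback also returns a non-successful status, so no per-domain list is needed.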

Tasks

Investigate this option
Implement solution if it seems to solve our issue

Thanks!

zacharykeeping · 2023-11-03T01:29:37Z

Done - with #742 we have moved to GET by default.

After testing, we noted that since our GET requests do not handle the response body, we're not actually fetching any more data than we would with a HEAD request. With that change, we're now getting more accurate results and cutting down on the number of false positives.

Figure: The number of broken links has dropped dramatically with more accurate results
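
The point about the response body can be illustrated with a small Python sketch. It is an illustration under assumptions, not the code from #742: `get_status` is a hypothetical helper built on the standard `http.client` module, which parses the status line and headers in `getresponse()` and leaves the body unread until `read()` is called. Unlike a full scanner, this sketch does not follow redirects.

```python
import http.client
from urllib.parse import urlsplit


def get_status(url: str, timeout: float = 10.0) -> int:
    """Send a GET and return the status code without reading the body."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        # getresponse() reads only the status line and headers.
        response = conn.getresponse()
        return response.status
    finally:
        # Closing here discards the body instead of downloading it in full.
        conn.close()
```

Some body bytes may still arrive in the socket buffer before the connection closes, so the saving is approximate rather than exact.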

zacharykeeping added the type: refactor label Oct 24, 2023

zacharykeeping mentioned this issue Oct 27, 2023

Improve link checking #742

Merged

zacharykeeping self-assigned this Oct 27, 2023

zacharykeeping closed this as completed Nov 3, 2023
