♻️ Investigate improvements to HEAD/GET handling in link scanner #732

Closed
2 tasks done
zacharykeeping opened this issue Oct 24, 2023 · 1 comment
zacharykeeping commented Oct 24, 2023

Cc: @tombui99 @william-liebenberg @BrookJeynes

Hi team

Pain

As per our conversation with @BrookJeynes earlier, our current approach of using either HEAD or GET requests to check for broken links is becoming complicated and hard to manage. Since we recently updated the code to treat every non-successful status code as a broken link, the broken-link counts on our sites have blown out massively, and many links are reported inaccurately because they were checked with HEAD requests. Currently we get around this by adding these links to our Unscannable Links list, but with so many more URLs to deal with, this solution does not scale well.

I suggested a possible solution: continue to use HEAD requests by default, but remove the Unscannable Links list from the equation and only fall back to GET if the HEAD request fails. This seems like a decent middle ground between running GET on every request (and retrieving much more data per request) and maintaining a list of problematic domains. Brook also thinks this could be a good way to tackle the issue, so we should discuss this option and investigate whether it resolves the problem without too much of a negative impact on our scans.
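To make the proposal concrete, here is a minimal sketch of the HEAD-first, GET-fallback check in Python. It is not the scanner's actual code: the names `fetch_status` and `check_link`, the 10-second timeout, and the convention of returning `0` when no response arrives are all assumptions made for illustration.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, method, timeout=10.0):
    """Return the HTTP status code for url using method (0 = no response)."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        err.close()
        return err.code
    except (OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed response, ...
        return 0


def check_link(url, fetch=fetch_status):
    """Try HEAD first; retry with GET only when HEAD is not successful.

    Returns (status, method), where method is the request that produced
    the reported status. `fetch` is injectable so the logic can be
    exercised without network access.
    """
    status = fetch(url, "HEAD")
    if 200 <= status < 300:
        return status, "HEAD"
    # HEAD failed (or is unsupported by this server): fall back to GET.
    return fetch(url, "GET"), "GET"
```

With this shape, a link would only be counted as broken if the GET fallback also returns a non-successful status, so no list of problematic domains has to be maintained.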

Tasks

  • Investigate this option
  • Implement solution if it seems to solve our issue

Thanks!

@zacharykeeping
Member Author

Done - with #742 we have moved to GET by default.

After testing, we found that since our GET requests do not read the response body, we're not actually fetching any more data than we would with a HEAD request. With that change, we're now getting more accurate results and cutting down on the number of false positives.
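As an illustration of that observation, the sketch below issues a GET and returns the status code without ever reading the body. It uses Python's standard library rather than the scanner's real code; the function name `get_status_without_body`, the `opener` parameter, and the `0` return value for "no response" are hypothetical. Note that a server may still start sending body bytes before the connection is closed, so this limits what the client consumes rather than guaranteeing that nothing is sent.

```python
import http.client
import urllib.error
import urllib.request


def get_status_without_body(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send a GET request and return only its status code (0 = no response).

    `opener` defaults to urllib.request.urlopen and is a parameter only so
    the function can be tested with a stand-in.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request, timeout=timeout) as response:
            # urlopen returns once the status line and headers are parsed.
            # We never call response.read(), and leaving the `with` block
            # closes the response, so the body is not consumed.
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses surface as exceptions; the code is still useful.
        err.close()
        return err.code
    except (OSError, http.client.HTTPException):
        return 0
```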

[Screenshot, 2023-11-03 at 12:29 pm]

Figure: The number of broken links has dropped dramatically, with more accurate results
