-
-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't treat sites with 403 status codes as broken links? #1157
Comments
In my opinion, this behaviour should be configurable. There already is the option to set accepted status codes with accept = ["100..=103", "200..=299", "403"] With the default being something like: accept = ["100..=399"] |
Why not use accept = ["100..=103", "200..=299", "403"] as the default? |
Oh yeah. That could also work. It might even be better than my suggestion! I will implement the range selectors in the accept config field / CLI flag if we agree. |
Yup, I think that sounds reasonable. Go ahead if you like. |
* Switch Link Checker to Lychee It’s fast and can handle more links * Udemy reports 403 lycheeverse/lychee#1157 * trigger
…1167) Adds support for accept ranges discussed in #1157. This allows the user to specify custom HTTP status codes accepted during checking and thus will report as valid (not broken). The accept option only supports specifying status codes as a comma-separated list. With this PR, the option will accept a list of status code ranges formatted like this: ```toml accept = ["100..=103", "200..=299", "403"] ``` These combinations will be supported: `..<end>`, ` ..=<end>`, `<start>..<end>` and `<start>..=<end>`. The behavior is copied from the Rust Range like concepts: ``` ..<end>, includes 0 to <end> (exclusive) ..=<end>, includes 0 to <end> (inclusive) <start>..<end>, includes <start> to <end> (exclusive) <start>..=<end>, includes <start> to <end> (inclusive) ``` - Foundation and enhancements for accept ranges, including support for comma-separated strings and integration into the CLI. - Implementations and updates for AcceptSelector, including Default, Display, and serde defaults. - Address and fix various errors: clippy, cargo fmt, and tests. - Add more tests, address edge cases, and enhance error messaging, especially for TOML config parsing. - Update dependencies.
@mre, I think it was a mistake. I've been using lychee for years and today, because of this change, I missed many 403 links in production. I am using lychee to check link on our website, and some of the links were internal because of page author mistake, returning 403 to our external users. But lychee said that all links are SUCCESS. lychee is a link checker, it verifies that links are working and accessible, which is wrong in 403 case, because they are not, and a real user can't access the target resource. Such links must be marked as failed. We need to exclude 403 from default success codes. |
For reference, the code is here: lychee/lychee-lib/src/types/accept/selector.rs Lines 43 to 51 in 13f4339
As a workaround, you can overwrite the status codes, of course: lychee --accept 200 ... Seems like a lot of sites fail because of browser fingerprinting (Cloudflare protection), as described here: SurveyI took a brief look into how other link checkers handle the situation:
SummarySo I think we should revert the change. I'm willing to accept a pull request. |
Thank you, I'll do the change tonight. |
As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default configuration. With this change, the default configuration in the docs will follow what is described at [README#Commandline Parameters](https://github.com/lycheeverse/lychee?tab=readme-ov-file#commandline-parametersy).
As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default configuration. With this change, the default configuration in the docs will follow what is described at Commandline Parameters in the README of lycheeverse/lychee.
As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default configuration. With this change, the default configuration in the docs will follow what is described at Commandline Parameters in the README of lycheeverse/lychee.
In a recent test of the top 1000 websites, I found the following 403 status errors:
These are websites, which block lychee for one reason or another. For example, they use browser fingerprinting, bot detection, block unknown user agents etc. The list is long.
Given that most of these links work, I vote for treating websites with 403 status codes as non-failing links to reduce the number of false positives. Any objections?
The text was updated successfully, but these errors were encountered: