Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't treat sites with 403 status codes as broken links? #1157

Open
mre opened this issue Jul 13, 2023 · 8 comments
Open

Don't treat sites with 403 status codes as broken links? #1157

mre opened this issue Jul 13, 2023 · 8 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mre
Copy link
Member

mre commented Jul 13, 2023

In a recent test of the top 1000 websites, I found the following 403 status errors:

✗ [403] https://cell.com/ | Failed: Network error: Forbidden
✗ [403] https://ticketmaster.com/ | Failed: Network error: Forbidden
✗ [403] https://kraken.com/ | Failed: Network error: Forbidden
✗ [403] https://media.giphy.com/ | Failed: Network error: Forbidden
✗ [403] https://yummly.com/ | Failed: Network error: Forbidden
✗ [403] https://bhphotovideo.com/ | Failed: Network error: Forbidden
✗ [403] https://inc.com/ | Failed: Network error: Forbidden
✗ [403] https://science.sciencemag.org/ | Failed: Network error: Forbidden
✗ [403] https://support.cloudflare.com/ | Failed: Network error: Forbidden
✗ [403] https://docs.wixstatic.com/ | Failed: Network error: Forbidden
✗ [403] https://deviantart.com/ | Failed: Network error: Forbidden
✗ [403] https://use.fontawesome.com/ | Failed: Network error: Forbidden
✗ [403] https://yoursite.com/ | Failed: Network error: Forbidden
✗ [403] https://upwork.com/ | Failed: Network error: Forbidden
✗ [403] https://gleam.io/ | Failed: Network error: Forbidden
✗ [403] https://kickstarter.com/ | Failed: Network error: Forbidden
✗ [403] https://static.wixstatic.com/ | Failed: Network error: Forbidden
✗ [403] https://onlinelibrary.wiley.com/ | Failed: Network error: Forbidden
✗ [403] https://ericsson.com/ | Failed: Network error: Forbidden
✗ [403] https://journals.sagepub.com/ | Failed: Network error: Forbidden
✗ [403] https://kstatic.googleusercontent.com/ | Failed: Network error: Forbidden
✗ [403] https://coinbase.com/ | Failed: Network error: Forbidden
✗ [403] https://zazzle.com/ | Failed: Network error: Forbidden
✗ [403] https://thelancet.com/ | Failed: Network error: Forbidden
✗ [403] https://bluehost.com/ | Failed: Network error: Forbidden
✗ [403] https://tandfonline.com/ | Failed: Network error: Forbidden
✗ [403] https://event.on24.com/ | Failed: Network error: Forbidden
✗ [403] https://hostgator.com/ | Failed: Network error: Forbidden
✗ [403] https://pixabay.com/ | Failed: Network error: Forbidden
✗ [403] https://avvo.com/ | Failed: Network error: Forbidden
✗ [403] https://pexels.com/ | Failed: Network error: Forbidden
✗ [403] https://canva.com/ | Failed: Network error: Forbidden
✗ [403] https://sciencemag.org/ | Failed: Network error: Forbidden
✗ [403] https://codepen.io/ | Failed: Network error: Forbidden
✗ [403] https://fastcompany.com/ | Failed: Network error: Forbidden
✗ [403] https://homedepot.com/ | Failed: Network error: Forbidden
✗ [403] https://udemy.com/ | Failed: Network error: Forbidden
✗ [403] https://oecd.org/ | Failed: Network error: Forbidden
✗ [403] https://capterra.com/ | Failed: Network error: Forbidden
✗ [403] https://nejm.org/ | Failed: Network error: Forbidden
✗ [403] https://creativemarket.com/ | Failed: Network error: Forbidden
✗ [403] https://s-media-cache-ak0.pinimg.com/ | Failed: Network error: Forbidden

These are websites, which block lychee for one reason or another. For example, they use browser fingerprinting, bot detection, block unknown user agents etc. The list is long.

Given that most of these links work, I vote for treating websites with 403 status codes as non-failing links to reduce the number of false positives. Any objections?

@mre mre added question Further information is requested request-for-comments labels Jul 13, 2023
@mre mre changed the title Don't treat sites with 403 status codes as broken links Don't treat sites with 403 status codes as broken links? Jul 13, 2023
@Techassi
Copy link
Contributor

In my opinion, this behaviour should be configurable. There already is the option to set accepted status codes with --accept. Maybe this feature can be improved with something like this:

accept = ["100..=103", "200..=299", "403"]

With the default being something like:

accept = ["100..=399"]

@mre
Copy link
Member Author

mre commented Jul 15, 2023

Why not use

accept = ["100..=103", "200..=299", "403"]

as the default?

@Techassi
Copy link
Contributor

Techassi commented Jul 15, 2023

Oh yeah. That could also work. It might even be better than my suggestion! I will implement the range selectors in the accept config field / CLI flag if we agree.

@mre
Copy link
Member Author

mre commented Jul 17, 2023

Yup, I think that sounds reasonable. Go ahead if you like.

nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023
nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023
nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023
* Switch Link Checker to Lychee

It’s fast and can handle more links

* Udemy reports 403

lycheeverse/lychee#1157

* trigger
nathany added a commit to dariubs/GoBooks that referenced this issue Sep 3, 2023
mre pushed a commit that referenced this issue Sep 17, 2023
…1167)

Adds support for accept ranges discussed in #1157. This allows the user to specify custom HTTP status codes accepted during checking and thus will report as valid (not broken). The accept option only supports specifying status codes as a comma-separated list. With this PR, the option will accept a list of status code ranges formatted like this:

```toml
accept = ["100..=103", "200..=299", "403"]
```

These combinations will be supported: `..<end>`, ` ..=<end>`, `<start>..<end>` and `<start>..=<end>`.
The behavior is copied from the Rust Range like concepts:

```
    ..<end>, includes 0 to <end> (exclusive)
    ..=<end>, includes 0 to <end> (inclusive)
    <start>..<end>, includes <start> to <end> (exclusive)
    <start>..=<end>, includes <start> to <end> (inclusive)
```


- Foundation and enhancements for accept ranges, including support for comma-separated strings and integration into the CLI.
- Implementations and updates for AcceptSelector, including Default, Display, and serde defaults.
- Address and fix various errors: clippy, cargo fmt, and tests.
- Add more tests, address edge cases, and enhance error messaging, especially for TOML config parsing.
- Update dependencies.
@mre
Copy link
Member Author

mre commented Sep 17, 2023

This is fixed in a way more general and extensive way by @Techassi in #1167. 🥳
Now we also support ranges of status codes, e.g. --accept 200..=206,403 would accept 200 until (including) 206 and 403. Thanks!

@mre mre closed this as completed Sep 17, 2023
@denis-domanskii
Copy link

denis-domanskii commented Feb 13, 2024

@mre, I think it was a mistake.

I've been using lychee for years and today, because of this change, I missed many 403 links in production. I am using lychee to check link on our website, and some of the links were internal because of page author mistake, returning 403 to our external users. But lychee said that all links are SUCCESS.

lychee is a link checker, it verifies that links are working and accessible, which is wrong in 403 case, because they are not, and a real user can't access the target resource. Such links must be marked as failed.

We need to exclude 403 from default success codes.

@mre
Copy link
Member Author

mre commented Feb 13, 2024

For reference, the code is here:

impl Default for AcceptSelector {
fn default() -> Self {
Self::new_from(vec![
AcceptRange::new(100, 103),
AcceptRange::new(200, 299),
AcceptRange::new(403, 403),
])
}
}

As a workaround, you can overwrite the status codes, of course:

lychee --accept 200 ...

Seems like a lot of sites fail because of browser fingerprinting (Cloudflare protection), as described here:
raviqqe/muffet#189

Survey

I took a brief look into how other link checkers handle the situation:

Summary

So I think we should revert the change. I'm willing to accept a pull request.

@mre mre reopened this Feb 13, 2024
@denis-domanskii
Copy link

Thank you, I'll do the change tonight.

@mre mre added good first issue Good for newcomers help wanted Extra attention is needed and removed question Further information is requested request-for-comments labels May 15, 2024
leakedmemory added a commit to leakedmemory/lycheeverse.github.io that referenced this issue Nov 6, 2024
As suggested in
[#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default
configuration.

With this change, the default configuration in the docs will follow what
is described at [README#Commandline Parameters](https://github.com/lycheeverse/lychee?tab=readme-ov-file#commandline-parametersy).
leakedmemory added a commit to leakedmemory/lycheeverse.github.io that referenced this issue Nov 6, 2024
As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default
configuration.

With this change, the default configuration in the docs will follow what
is described at Commandline Parameters in the README of lycheeverse/lychee.
mre pushed a commit to lycheeverse/lycheeverse.github.io that referenced this issue Nov 7, 2024
As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default
configuration.

With this change, the default configuration in the docs will follow what
is described at Commandline Parameters in the README of lycheeverse/lychee.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants