Don't treat sites with 403 status codes as broken links? #1157

mre · 2023-07-13T18:22:05Z

In a recent test of the top 1000 websites, I found the following 403 status errors:

✗ [403] https://cell.com/ | Failed: Network error: Forbidden
✗ [403] https://ticketmaster.com/ | Failed: Network error: Forbidden
✗ [403] https://kraken.com/ | Failed: Network error: Forbidden
✗ [403] https://media.giphy.com/ | Failed: Network error: Forbidden
✗ [403] https://yummly.com/ | Failed: Network error: Forbidden
✗ [403] https://bhphotovideo.com/ | Failed: Network error: Forbidden
✗ [403] https://inc.com/ | Failed: Network error: Forbidden
✗ [403] https://science.sciencemag.org/ | Failed: Network error: Forbidden
✗ [403] https://support.cloudflare.com/ | Failed: Network error: Forbidden
✗ [403] https://docs.wixstatic.com/ | Failed: Network error: Forbidden
✗ [403] https://deviantart.com/ | Failed: Network error: Forbidden
✗ [403] https://use.fontawesome.com/ | Failed: Network error: Forbidden
✗ [403] https://yoursite.com/ | Failed: Network error: Forbidden
✗ [403] https://upwork.com/ | Failed: Network error: Forbidden
✗ [403] https://gleam.io/ | Failed: Network error: Forbidden
✗ [403] https://kickstarter.com/ | Failed: Network error: Forbidden
✗ [403] https://static.wixstatic.com/ | Failed: Network error: Forbidden
✗ [403] https://onlinelibrary.wiley.com/ | Failed: Network error: Forbidden
✗ [403] https://ericsson.com/ | Failed: Network error: Forbidden
✗ [403] https://journals.sagepub.com/ | Failed: Network error: Forbidden
✗ [403] https://kstatic.googleusercontent.com/ | Failed: Network error: Forbidden
✗ [403] https://coinbase.com/ | Failed: Network error: Forbidden
✗ [403] https://zazzle.com/ | Failed: Network error: Forbidden
✗ [403] https://thelancet.com/ | Failed: Network error: Forbidden
✗ [403] https://bluehost.com/ | Failed: Network error: Forbidden
✗ [403] https://tandfonline.com/ | Failed: Network error: Forbidden
✗ [403] https://event.on24.com/ | Failed: Network error: Forbidden
✗ [403] https://hostgator.com/ | Failed: Network error: Forbidden
✗ [403] https://pixabay.com/ | Failed: Network error: Forbidden
✗ [403] https://avvo.com/ | Failed: Network error: Forbidden
✗ [403] https://pexels.com/ | Failed: Network error: Forbidden
✗ [403] https://canva.com/ | Failed: Network error: Forbidden
✗ [403] https://sciencemag.org/ | Failed: Network error: Forbidden
✗ [403] https://codepen.io/ | Failed: Network error: Forbidden
✗ [403] https://fastcompany.com/ | Failed: Network error: Forbidden
✗ [403] https://homedepot.com/ | Failed: Network error: Forbidden
✗ [403] https://udemy.com/ | Failed: Network error: Forbidden
✗ [403] https://oecd.org/ | Failed: Network error: Forbidden
✗ [403] https://capterra.com/ | Failed: Network error: Forbidden
✗ [403] https://nejm.org/ | Failed: Network error: Forbidden
✗ [403] https://creativemarket.com/ | Failed: Network error: Forbidden
✗ [403] https://s-media-cache-ak0.pinimg.com/ | Failed: Network error: Forbidden

These are websites, which block lychee for one reason or another. For example, they use browser fingerprinting, bot detection, block unknown user agents etc. The list is long.

Given that most of these links work, I vote for treating websites with 403 status codes as non-failing links to reduce the number of false positives. Any objections?

The text was updated successfully, but these errors were encountered:

Techassi · 2023-07-15T12:04:14Z

In my opinion, this behaviour should be configurable. There already is the option to set accepted status codes with --accept. Maybe this feature can be improved with something like this:

accept = ["100..=103", "200..=299", "403"]

With the default being something like:

accept = ["100..=399"]

mre · 2023-07-15T14:22:51Z

Why not use

accept = ["100..=103", "200..=299", "403"]

as the default?

Techassi · 2023-07-15T20:51:49Z

Oh yeah. That could also work. It might even be better than my suggestion! I will implement the range selectors in the accept config field / CLI flag if we agree.

mre · 2023-07-17T08:40:48Z

Yup, I think that sounds reasonable. Go ahead if you like.

lycheeverse/lychee#1157

* Switch Link Checker to Lychee It’s fast and can handle more links * Udemy reports 403 lycheeverse/lychee#1157 * trigger

lycheeverse/lychee#1157

…1167) Adds support for accept ranges discussed in #1157. This allows the user to specify custom HTTP status codes accepted during checking and thus will report as valid (not broken). The accept option only supports specifying status codes as a comma-separated list. With this PR, the option will accept a list of status code ranges formatted like this: ```toml accept = ["100..=103", "200..=299", "403"] ``` These combinations will be supported: `..<end>`, ` ..=<end>`, `<start>..<end>` and `<start>..=<end>`. The behavior is copied from the Rust Range like concepts: ``` ..<end>, includes 0 to <end> (exclusive) ..=<end>, includes 0 to <end> (inclusive) <start>..<end>, includes <start> to <end> (exclusive) <start>..=<end>, includes <start> to <end> (inclusive) ``` - Foundation and enhancements for accept ranges, including support for comma-separated strings and integration into the CLI. - Implementations and updates for AcceptSelector, including Default, Display, and serde defaults. - Address and fix various errors: clippy, cargo fmt, and tests. - Add more tests, address edge cases, and enhance error messaging, especially for TOML config parsing. - Update dependencies.

mre · 2023-09-17T19:42:02Z

This is fixed in a way more general and extensive way by @Techassi in #1167. 🥳
Now we also support ranges of status codes, e.g. --accept 200..=206,403 would accept 200 until (including) 206 and 403. Thanks!

denis-domanskii · 2024-02-13T11:34:55Z

@mre, I think it was a mistake.

I've been using lychee for years and today, because of this change, I missed many 403 links in production. I am using lychee to check link on our website, and some of the links were internal because of page author mistake, returning 403 to our external users. But lychee said that all links are SUCCESS.

lychee is a link checker, it verifies that links are working and accessible, which is wrong in 403 case, because they are not, and a real user can't access the target resource. Such links must be marked as failed.

We need to exclude 403 from default success codes.

mre · 2024-02-13T15:04:55Z

For reference, the code is here:

lychee/lychee-lib/src/types/accept/selector.rs

Lines 43 to 51 in 13f4339

    
           impl Default for AcceptSelector { 
        
               fn default() -> Self { 
        
                   Self::new_from(vec![ 
        
                       AcceptRange::new(100, 103), 
        
                       AcceptRange::new(200, 299), 
        
                       AcceptRange::new(403, 403), 
        
                   ]) 
        
               } 
        
           }

As a workaround, you can overwrite the status codes, of course:

lychee --accept 200 ...

Seems like a lot of sites fail because of browser fingerprinting (Cloudflare protection), as described here:
raviqqe/muffet#189

Survey

I took a brief look into how other link checkers handle the situation:

muffet: ❌ error, planning to add override
broken-link-checker: ❌ error, workaround for override
linkinator: ❌, no override
linkchecker: ❌, no override
fink: ❌, no override
markdown-link-check: ❌, override (--aliveStatusCodes)

Summary

So I think we should revert the change. I'm willing to accept a pull request.

denis-domanskii · 2024-02-13T15:40:42Z

Thank you, I'll do the change tonight.

As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default configuration. With this change, the default configuration in the docs will follow what is described at [README#Commandline Parameters](https://github.com/lycheeverse/lychee?tab=readme-ov-file#commandline-parametersy).

As suggested in [#1157](lycheeverse/lychee#1157) and merged in [#1377](lycheeverse/lychee#1377), the HTTP status code 403 was removed from the default configuration. With this change, the default configuration in the docs will follow what is described at Commandline Parameters in the README of lycheeverse/lychee.

mre added question Further information is requested request-for-comments labels Jul 13, 2023

mre changed the title ~~Don't treat sites with 403 status codes as broken links~~ Don't treat sites with 403 status codes as broken links? Jul 13, 2023

Techassi mentioned this issue Jul 17, 2023

feat: Add support for ranges in the --accept option / config field #1167

Merged

nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023

Udemy reports 403

850f6b9

lycheeverse/lychee#1157

nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023

Udemy reports 403

a3ea761

lycheeverse/lychee#1157

nathany added a commit to dariubs/GoBooks that referenced this issue Sep 2, 2023

Switch Link Checker to Lychee (#137)

9457b93

* Switch Link Checker to Lychee It’s fast and can handle more links * Udemy reports 403 lycheeverse/lychee#1157 * trigger

nathany added a commit to dariubs/GoBooks that referenced this issue Sep 3, 2023

lychee: accept 401

9c0798c

lycheeverse/lychee#1157

mre closed this as completed Sep 17, 2023

mre reopened this Feb 13, 2024

DeniDoman mentioned this issue Feb 13, 2024

fix: Treat sites with 403 status codes as broken links #1377

Merged

Kurt-von-Laven mentioned this issue Feb 20, 2024

chore(deps): lock file maintenance ScribeMD/.github#364

Merged

1 task

mre added good first issue Good for newcomers help wanted Extra attention is needed and removed question Further information is requested request-for-comments labels May 15, 2024

soredake mentioned this issue Jun 10, 2024

Feature request: add FlareSolverr solver to bypass CloudFlare protection #1439

Open

kemingy mentioned this issue Oct 16, 2024

chore: upgrade lychee action and accept 403 forbidden mosecorg/mosec#580

Merged

leakedmemory mentioned this issue Nov 6, 2024

fix: remove 403 status code from default config lycheeverse/lycheeverse.github.io#82

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't treat sites with 403 status codes as broken links? #1157

Don't treat sites with 403 status codes as broken links? #1157

mre commented Jul 13, 2023

Techassi commented Jul 15, 2023

mre commented Jul 15, 2023

Techassi commented Jul 15, 2023 •

edited

Loading

mre commented Jul 17, 2023

mre commented Sep 17, 2023

denis-domanskii commented Feb 13, 2024 •

edited

Loading

mre commented Feb 13, 2024

denis-domanskii commented Feb 13, 2024

Don't treat sites with 403 status codes as broken links? #1157

Don't treat sites with 403 status codes as broken links? #1157

Comments

mre commented Jul 13, 2023

Techassi commented Jul 15, 2023

mre commented Jul 15, 2023

Techassi commented Jul 15, 2023 • edited Loading

mre commented Jul 17, 2023

mre commented Sep 17, 2023

denis-domanskii commented Feb 13, 2024 • edited Loading

mre commented Feb 13, 2024

Survey

Summary

denis-domanskii commented Feb 13, 2024

Techassi commented Jul 15, 2023 •

edited

Loading

denis-domanskii commented Feb 13, 2024 •

edited

Loading