
muffet generates 403 on pixabay.com #189

Open
c33s opened this issue Nov 9, 2021 · 2 comments
Labels
question Further information is requested

Comments


c33s commented Nov 9, 2021

My site links to https://pixabay.com/, and when I check it with muffet, the link results in a 403. I tried setting a custom header, but I still get a 403:

--header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"

I don't want to scrape pixabay; I just want to check all external links on my site. My current workaround is to exclude the site.

The same problem occurs with pexels.com.

It looks like this is not really a problem with muffet, since wget also produces a 403, but maybe muffet can do something about it.
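
For context, a full muffet invocation with that header would look something like the following; the site URL here is just a placeholder for the reporter's own site.

# Check all links on the site, sending a browser-like User-Agent (placeholder URL).
muffet --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0" https://example.com/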

@raviqqe raviqqe added the question Further information is requested label Nov 10, 2021
@ruzickap

It seems pixabay.com returns this by default.
If you use curl, you get a 403 as well:

curl -v https://pixabay.com/
...
> GET / HTTP/2
> Host: pixabay.com
> user-agent: curl/7.54.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403
< date: Sat, 26 Feb 2022 07:12:19 GMT
...

Same results with

curl -v --header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0' https://pixabay.com/

My workaround is: --exclude=(linkedin.com|pixabay.com|html5up.net)
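
For reference, passing that pattern on the command line might look roughly like this; the site URL is a placeholder, and the pattern is quoted so the shell does not interpret the parentheses.

# Skip links to the listed hosts while checking everything else (placeholder URL).
muffet --exclude='(linkedin.com|pixabay.com|html5up.net)' https://example.com/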


maxmeyer commented Feb 27, 2022

curl https://www.cyberciti.biz/

has the same problem. Both sites, cyberciti.biz and pixabay.com, protect their pages with Cloudflare.

If you check the response body, you can see that Cloudflare runs a browser check:

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>

            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

This post about curl and Cloudflare might shed some more light on the issue: https://community.cloudflare.com/t/curl-command-getting-403-in-my-subdomain/299578/10.

raviqqe pushed a commit that referenced this issue Feb 28, 2024
This pull request adds a new optional argument `--status-codes` to make
the accepted HTTP response status codes configurable and solves #189 and
#291.

I use muffet to check all links on https://tinylog.org/. However, some
websites (e.g. https://stackoverflow.com, https://www.baeldung.com, and
https://mkyong.com/) respond to muffet with status code 403 instead of
200. Therefore, I would like to accept 403 as a valid HTTP response
status code.
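
With that option, accepting 403 responses could look roughly like the sketch below. The value syntax (ranges, comma separation) is an assumption, not taken from the pull request, so check `muffet --help` for the exact format.

# Treat 2xx and 403 as acceptable response status codes (value syntax assumed).
muffet --status-codes='200..299,403' https://tinylog.org/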