
muffet generates 403 on pixabay.com #189

Open
c33s opened this issue Nov 9, 2021 · 2 comments
Labels
question Further information is requested

Comments


c33s commented Nov 9, 2021

My site links to https://pixabay.com/, and when I check it with muffet, the link results in a 403. I tried setting a custom header, but I still get a 403:

--header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"

I don't want to scrape pixabay; I just want to check all external links on my site. My current workaround is to exclude the site.

The same problem occurs with pexels.com.

It looks like this is not really a problem with muffet, since wget also produces a 403, but maybe muffet can do something about it.
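
For context, a full muffet invocation with that header would look something like the following; the site URL here is just a placeholder for the reporter's own site.

# Check all links on the site, sending a browser-like User-Agent (placeholder URL).
muffet --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0" https://example.com/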

@raviqqe raviqqe added the question Further information is requested label Nov 10, 2021
@ruzickap

It seems pixabay.com returns this by default.
If you use curl, you get a 403 as well:

curl -v https://pixabay.com/
...
> GET / HTTP/2
> Host: pixabay.com
> user-agent: curl/7.54.0
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403
< date: Sat, 26 Feb 2022 07:12:19 GMT
...

Same results with

curl -v --header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0' https://pixabay.com/

My workaround is: --exclude=(linkedin.com|pixabay.com|html5up.net)
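
For reference, passing that pattern on the command line might look roughly like this; the site URL is a placeholder, and the pattern is quoted so the shell does not interpret the parentheses.

# Skip links to the listed hosts while checking everything else (placeholder URL).
muffet --exclude='(linkedin.com|pixabay.com|html5up.net)' https://example.com/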


maxmeyer commented Feb 27, 2022

curl https://www.cyberciti.biz/

has the same problem. Both sites, cyberciti.biz and pixabay.com, protect their pages with Cloudflare.

If you check the response body, you can see that Cloudflare runs a browser check:

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>

            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

This post about curl and Cloudflare might shed some more light on the issue: https://community.cloudflare.com/t/curl-command-getting-403-in-my-subdomain/299578/10.

raviqqe pushed a commit that referenced this issue Feb 28, 2024
This pull request adds a new optional argument `--status-codes` to make
the accepted HTTP response status codes configurable and solves #189 and
#291.

I use muffet to check all links on https://tinylog.org/. However, some
websites (e.g. https://stackoverflow.com, https://www.baeldung.com, and
https://mkyong.com/) respond to muffet with status code 403 instead of
200. Therefore, I would like to accept 403 as a valid HTTP response
status code.
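
With that option, accepting 403 responses could look roughly like the sketch below. The value syntax (ranges, comma separation) is an assumption, not taken from the pull request, so check `muffet --help` for the exact format.

# Treat 2xx and 403 as acceptable response status codes (value syntax assumed).
muffet --status-codes='200..299,403' https://tinylog.org/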