Stuck in Cloudflare hCaptcha loop. #31

GermanEngineering · 2020-12-14T16:29:01Z

Hello and first of all thank you very much for your work!

It looks like this is exactly the code I was looking for, but unfortunately I can't get it running: I get stuck in an endless Cloudflare hCaptcha loop on https://www.blinkist.com/en/nc/login when I try to run it for the first time.
The "One more step - Please complete the security check to access - I am human" page appears before I can enter the login information, and no matter how often I solve it, I always end up at the next captcha (I tried at least 9 times in a row).

My system:

Win 10
Chrome 87
Python 3.8
Venv with all requirements.txt modules installed.

I've already tried:

Running it on another Win 10 Laptop --> same problem
Different commands: python blinkistscraper email password / python main.py email password
Downloaded and specified ChromeDriver 87.0.4280.88 as argument
Downloaded Chrome 88 Beta and used ChromeDriver 88.0.4324.27
pip install --upgrade for all outdated modules
Different locations via VPN (Germany, Portugal and US)
Different Networks (DSL and Hotspot from Mobile Phone)
Ubuntu VM --> also getting stuck with the same problem

Unfortunately I don't have any other ideas at the moment and feel pretty lost/stupid.
Did you encounter this problem before and have an idea how to solve it?
Or are there some logfiles or something I can collect that might help in this case?

Thank you very much in advance!
Peter

bckncook · 2020-12-15T23:20:58Z

Same issue here. Looking forward to solution. Thank you!!!

GermanEngineering · 2020-12-16T13:24:37Z

Hello again,

I tested two more things:

Tried to use cookies from chrome

logged in to blinkist in chrome
added the chrome_options.add_argument(r"user-data-dir=C:\Users\Win10x64\AppData\Local\Google\Chrome\User Data") argument to chromedriver to use the settings from chrome in chromedriver
executed get_login_cookies() to get cookies.pkl
started initial code with login cookies
gui mode is running into Captcha loop again
headless mode is running into timeout
[1608123763.485][INFO]: Waiting for pending navigations...
[1608123763.486][INFO]: Done waiting for pending navigations. Status: ok
[1608123763.493][INFO]: Waiting for pending navigations...
[1608123763.494][INFO]: Done waiting for pending navigations. Status: ok
[1608123763.494][INFO]: [6319a21f140a99f67240dc6507ddab98] RESPONSE FindElement ERROR no such element: Unable to locate element: {"method":"class name","selector":"main-banner-headline-v2"}
(Session info: headless chrome=87.0.4280.88)
[1608123764.001][INFO]: [6319a21f140a99f67240dc6507ddab98] COMMAND FindElement {
"sessionId": "6319a21f140a99f67240dc6507ddab98",
"using": "class name",
"value": "main-banner-headline-v2"
}

Tried selenium with Firefox

with driver = selenium.webdriver.Firefox()
--> also running into the same Captcha loop
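
For reference, the cookie round trip from the first test can be sketched roughly as below. This is a minimal sketch, not the project's actual code: `save_cookies` and `load_cookies` are hypothetical helper names, and only the `cookies.pkl` file name comes from the steps above. It relies on Selenium's `get_cookies()` and `add_cookie()` driver methods.

```python
import pickle
from pathlib import Path


def save_cookies(driver, path):
    """Write the driver's current cookies to a pickle file."""
    Path(path).write_bytes(pickle.dumps(driver.get_cookies()))


def load_cookies(driver, path):
    """Add previously saved cookies to the driver and return how many were added."""
    cookies = pickle.loads(Path(path).read_bytes())
    for cookie in cookies:
        # Selenium only accepts a cookie while the browser is on a page
        # of the cookie's domain, so navigate there before calling this.
        driver.add_cookie(cookie)
    return len(cookies)
```

With a real driver this would be used as `save_cookies(driver, "cookies.pkl")` after a manual login and `load_cookies(driver, "cookies.pkl")` in a later session, once the driver has opened a page on the same domain.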

Unfortunately nothing was successful, but maybe it helps to narrow down the root cause of the problem.
Thank you very much, again!
Peter

leoncvlt · 2020-12-17T11:29:04Z

It seems like Blinkist / Cloudflare moved from Google's captchas (which worked fine) to hCaptcha, which causes this issue. From GermanEngineering's tests, it looks more like Cloudflare is detecting the ChromeDriver, since the loop persists even with legitimate cookies. Will need to look into it - any help welcome!

GermanEngineering · 2020-12-17T14:51:09Z

I found a solution that at least allows me to log in and download the text.
It doesn't seem to work in headless mode, though.
And with the --audio option I'm running into a json.decoder.JSONDecodeError exception.
I don't think this is related to the change I made, but on the other hand I don't know if/how it was working before.

I tried to do a pull request, but I'm not really familiar with the GitHub process, so please excuse me if this is not the correct way to propose a change.
In the end it was just adding:
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
to the Chrome options in the scraper.py
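
In context, the change might look roughly like the sketch below. Only the flag itself comes from this comment; `automation_args` and `build_chrome_options` are hypothetical helper names, and the real scraper.py may build its options differently.

```python
AUTOMATION_FLAG = "--disable-blink-features=AutomationControlled"


def automation_args(extra_args=()):
    """Return Chrome arguments with the flag first and duplicates removed."""
    args = [AUTOMATION_FLAG]
    for arg in extra_args:
        if arg not in args:
            args.append(arg)
    return args


def build_chrome_options(extra_args=()):
    """Create Selenium Chrome options that carry the arguments above."""
    # Imported lazily so the argument list can be inspected without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    for arg in automation_args(extra_args):
        chrome_options.add_argument(arg)
    return chrome_options
```

The resulting object would then be passed to the Chrome driver in place of the existing options.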

Hope this helps.

wywywywy · 2020-12-17T16:04:26Z

That's weird. I tried all these options and it still won't let me through the hcaptcha.

    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

wywywywy · 2020-12-17T17:08:23Z

It'd be much better to convert this from Selenium to Puppeteer.

I just tried Puppeteer and that works well, especially with the Stealth plugin.

rocketinventor · 2020-12-21T20:04:58Z

I think that there used to be a chrome extension from Cloudflare that bypasses their captcha page. Perhaps that would help? Has anyone tried it? wywywywy - Do you think that you could create a new branch with your changes and make a pull-request with the Puppeteer-based code? Thanks!

mikaelaatan · 2020-12-23T01:48:58Z

Hello, I'm not familiar with how GitHub works, but I'll just share what worked for me. I added chrome_options.add_argument("--disable-blink-features=AutomationControlled") from GermanEngineering's suggestion.

At first it worked, but in later sessions it started going back to the captcha again. The workaround: after logging in, when it goes to the Cloudflare page, redirect the browser back to the Blinkist.com homepage. This is the point where the log says "waiting for user to solve recaptcha and login". After that, the scraper will proceed as expected.
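
If someone wanted to automate that manual redirect, a rough sketch could look like this. It is untested against the real site and rests on two assumptions: that the challenge page can be recognised by the "Please complete the security check" wording quoted in the first post, and that https://www.blinkist.com is the homepage to return to. `leave_challenge_page` is a hypothetical helper, not part of the scraper.

```python
CHALLENGE_TEXT = "Please complete the security check"  # wording quoted in the first post
HOME_URL = "https://www.blinkist.com"  # assumed homepage URL


def leave_challenge_page(driver, home_url=HOME_URL):
    """Go back to the homepage if the current page looks like the challenge page.

    Returns True when a redirect was issued, False otherwise.
    """
    if CHALLENGE_TEXT in driver.page_source:
        driver.get(home_url)
        return True
    return False
```

It only uses Selenium's `page_source` property and `get()` method, so it should work with either the selenium or the seleniumwire driver.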

flowni · 2020-12-27T12:02:07Z

Hello, I encounter the same problem as you guys, getting stuck in the infinite captcha loop...

I think we definitely have to add this line: chrome_options.add_argument("--disable-blink-features=AutomationControlled"). I also added headers and a user-data-dir to always use the same profile every time, but that's not enough, as the loop still appears, as already mentioned.

As a first quick fix, it worked for me to change from the seleniumwire webdriver to the "normal" selenium webdriver. With this you can at least scrape the texts, but getting the audio files requires access to the browser's network requests, so audio scraping no longer works.
Does anyone have an idea how the website could tell it's a bot with the seleniumwire webdriver when it uses exactly the same settings as the selenium webdriver?

Edit: I think the problem has something to do with the certificate as selenium-wire issues its own certificate (selenium-wire manual). I already added the Selenium Wire CA to Chrome's Authorities section, but the problem remains.

usb4 · 2020-12-29T22:18:44Z

I also run into the hCaptcha loop but can get around it with the following arguments:

    # prevent Cloudflare from detecting ChromeDriver as bot
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Even without these arguments, I find that my first scrape attempt after 12+ hours usually avoids triggering the captcha.

However, audio scraping still doesn't work.

[13:09:39] WARNING Could not find audio url in request, aborting audio scrape...
[13:09:39] ERROR Error processing audio url, aborting audio scrape...

leoncvlt · 2020-12-30T17:41:50Z

In my tests, I had to override the user agent as well, on top of implementing @usb4's flags. Even then, it still asked for the captcha when requesting the blink's audio files.

Reading around, I found this discussion - https://stackoverflow.com/questions/32795460/loading-json-object-in-python-using-urllib-request-and-json-modules - and magically, yes, using urllib.request instead of requests doesn't seem to trigger the captcha. I tried implementing the other approach they suggested, where you connect to the IP address instead of the host, but was getting some SSL problems.
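
For anyone curious, the urllib.request approach can be sketched roughly as follows. This is a simplified illustration rather than the exact code in the commit: `build_audio_request` and `fetch_json` are hypothetical names, and it assumes the audio endpoint returns JSON and that the cookies are available as a name/value dict.

```python
import json
import urllib.request


def build_audio_request(url, cookies, user_agent):
    """Build a urllib request carrying browser cookies and a custom user agent."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the headers before anything goes over the network.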

I pushed my changes in f4cab05. I tested them (albeit only on the free daily book) and they seem to work fine on my end.

rocketinventor · 2020-12-31T12:44:17Z

Leonardo, which user agent did you use with requests? The default one is a scraper user-agent. That could be why 'urllib.request' "magically" works.

GermanEngineering · 2020-12-31T21:40:38Z

Thank you very much leoncvlt!

leoncvlt · 2021-01-01T15:35:03Z

> Leonardo, which user agent did you use with requests? The default one is a scraper user-agent. That could be why 'urllib.request' "magically" works. In my tests (Windows 10), it was enough to switch from 'seleniumwire.webdriver' to 'selenium.webdriver' (Flowni's "quick fix") and maybe also add in the "--disable-blink-features=AutomationControlled" argument (as per Peter's comment). However, it doesn't seem like any of the other arguments/lines, user-agents, data-dirs, etc, are needed at all. Perhaps those arguments could even prevent selenium-wire from accessing the audio URLs/requests properly. As far as the audio goes, it looks like there is a hard-coded URL now that points to the chapter audio... If so, it might be possible to completely ditch the chrome/selenium web-driver (except maybe to get the cookies). That should really get its own issue / pull-request, so I won't discuss the details much here.

In my case, the user agent was needed to access the actual library / books pages, not specifically for the audio files.

I'm using selenium wire to capture the original audio files request and re-use the cookies / auth information to request the rest of the audio blinks - if anyone can come up with an alternative way of accomplishing this, we could scrap the selenium wire requirements 😃
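
To make the idea concrete for anyone looking for an alternative, here is a rough sketch of the capture-and-reuse step. It is a simplification under stated assumptions: the captured requests are objects with `url` and `headers` attributes (as in Selenium Wire's `driver.requests`), the `"audio"` URL marker is a guess, and both function names are hypothetical.

```python
def find_audio_request(captured_requests, marker="audio"):
    """Return the first captured request whose URL contains `marker`, or None."""
    for request in captured_requests:
        if marker in request.url:
            return request
    return None


def reusable_headers(request, names=("Cookie", "Authorization", "User-Agent")):
    """Copy selected headers from a captured request, ignoring header-name case."""
    wanted = {name.lower() for name in names}
    return {k: v for k, v in request.headers.items() if k.lower() in wanted}
```

The returned headers could then be attached to plain HTTP requests for the remaining audio blinks, so anything that yields the same cookies and auth headers could stand in for Selenium Wire.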

GermanEngineering closed this as completed Dec 31, 2020

rocketinventor mentioned this issue May 6, 2021

Infinite captcha #51

Closed
