Feature request Yandex and Baidu #17

Open
LeoJavaAI opened this issue Apr 17, 2021 · 2 comments
Comments

@LeoJavaAI

Thanks for your work. Please consider adding Yandex and Baidu if possible.

@tasos-py
Owner

Sounds interesting, I'll see what I can do. I think Yandex is simple enough, but I don't know if we can scrape Baidu without Selenium, and I'd like to avoid that.

@tasos-py
Owner

After some research, I don't think I can add Yandex or Baidu. Yandex keeps giving me a captcha after a couple of requests. Maybe Selenium could help with that, but I want to keep this repo as simple as possible, so I'd rather not add browser automation or OCR dependencies.

Baidu doesn't require Selenium. The problem is that it doesn't give direct links; the result links look like this: www.baidu.com/link?url=kh39xCQVnS7frJSxGrpfLAXdudtflGhAhAK8YjhSgpwyf0Sl8L41EGODywKx6Vvqy8UbcOnNGkuEntr1m9KLmq. The url= parameter looks like a base64 string, but it doesn't decode to text, and I don't think the decoding/decryption is done on the client side; the server redirects to the final link. We could ask the server to resolve each link to its actual URL, but that means one extra request per result, which would be very inefficient and would probably result in bans.
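For anyone who wants to reproduce this, here is a minimal sketch of the check described above. The helper names (`extract_token`, `try_base64_text`, `resolve_via_server`) are hypothetical and not part of this repo. The last function assumes Baidu answers the link with an HTTP redirect carrying a `Location` header, which I haven't verified for every case; it is only there to show why resolving through the server costs one round trip per result.

```python
import base64
import binascii
import http.client
from urllib.parse import parse_qs, urlsplit

# The example link quoted in this comment.
SAMPLE = ("www.baidu.com/link?url=kh39xCQVnS7frJSxGrpfLAXdudtflGhAhAK8YjhSgpwyf0"
          "Sl8L41EGODywKx6Vvqy8UbcOnNGkuEntr1m9KLmq")


def extract_token(link):
    """Return the value of the url= query parameter, or None if absent."""
    if "://" not in link:
        link = "https://" + link  # assumed scheme, so urlsplit sees a host
    values = parse_qs(urlsplit(link).query).get("url")
    return values[0] if values else None


def try_base64_text(token):
    """Decode token as URL-safe base64; return the text, or None if unreadable."""
    padded = token + "=" * (-len(token) % 4)  # restore stripped padding
    try:
        text = base64.urlsafe_b64decode(padded).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return text if text.isprintable() else None


def resolve_via_server(link, timeout=10):
    """Ask the server for the target URL: one network round trip per link.

    ASSUMPTION: the server replies with a redirect and a Location header.
    Returns that header's value, or None if there is no such header.
    """
    if "://" not in link:
        link = "https://" + link
    parts = urlsplit(link)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path + "?" + parts.query)
        return conn.getresponse().getheader("Location")
    finally:
        conn.close()
```

`try_base64_text(extract_token(SAMPLE))` returns `None`, which matches what I saw: the token is not plain base64-encoded text. `resolve_via_server` is not called here on purpose, since running it for every result on a page is exactly the pattern that would likely get us banned.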

So, I don't know how to proceed further. If you have any ideas, I'd love to hear them.

2 participants