This repository has been archived by the owner on Jul 21, 2022. It is now read-only.

Does the Zhihu server currently limit the number of requests as an anti-crawler measure? #42

Open
tzhao0311 opened this issue Nov 21, 2015 · 11 comments

Comments

@tzhao0311

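Hello! I'm a beginner. I recently wrote a crawler using the API you developed, but every time it has crawled a certain number of items it stops. Is that because the Zhihu server imposes access limits? Is there a concrete way around it?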

@7sDream
Owner

7sDream commented Nov 22, 2015

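Hmm, please describe exactly what happens when the "crawler stops".
Zhihu does take anti-crawler measures, but they would normally make the code raise an error outright rather than just "stop".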

@7sDream 7sDream added the HelpMe label Nov 22, 2015
@tzhao0311
Author

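I'm currently crawling the URLs of one user's followers. Each time I reach a certain number, I get an error like the one below. It may differ a little from run to run; this time it appeared after about 120,000 followers. What is the cause?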
Traceback (most recent call last):
File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 53, in
for follower in author.followers:
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 405, in _follow_ee_ers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/common.py", line 103, in wrapper
ValueError: Invalid URL

@7sDream
Owner

7sDream commented Nov 22, 2015

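This looks like a bug rather than an access limit. (That said, 120,000 is rather a lot, so do be careful.)
Please send the profile URL of the user you're crawling; I'll test it when I have time.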

@tzhao0311
Author

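Which anti-crawler strategy does Zhihu use at the moment: limits on IP, cookies, or request rate, or something else? This is the profile URL of the user I'm crawling: http://www.zhihu.com/people/zhang-jia-wei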

@7sDream
Owner

7sDream commented Nov 22, 2015

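Requesting too fast gets your IP banned, and your account may be banned as well, so I recommend crawling with a throwaway account behind a proxy. ZhihuClient has an interface for setting an HTTP proxy.
I'll test it tomorrow morning.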

@tzhao0311
Author

OK, thanks a lot!

@tzhao0311
Author

This time the following error appeared after a little over 30,000 followers. I'm not sure whether it's a bug.
Traceback (most recent call last):
File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 54, in
for follower in author.followers:
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 392, in _follow_ee_ers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/models.py", line 805, in json
return complexjson.loads(self.text, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

@7sDream
Owner

7sDream commented Nov 24, 2015

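It looks like this isn't a bug: Zhihu decided your requests were coming too fast and sent back some error responses, which is why the JSON couldn't be parsed.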
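To illustrate why the traceback above ends in `Expecting value: line 1 column 1 (char 0)`, here is a minimal sketch using only the standard `json` module. The helper name `try_parse_json` and the sample bodies are hypothetical; this is not code from zhihu-py3, and the actual error responses Zhihu sent are not shown in this thread.

```python
import json
from typing import Any, Optional, Tuple


def try_parse_json(body: str) -> Tuple[Optional[Any], Optional[str]]:
    """Parse `body` as JSON; return (data, None) or (None, diagnostic).

    Hypothetical helper for illustration only.
    """
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        # Keep a short preview so the log shows what the server really sent.
        preview = body[:80]
        return None, f"{exc.msg} at char {exc.pos}; body starts with {preview!r}"


# A body that is not JSON at all (for example an HTML error page or an
# empty string) fails on the very first character.
data, error = try_parse_json("<html>blocked</html>")
print(data, error)
```

Logging the start of the body this way would show whether the server returned an error page instead of the expected JSON.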

I suggest throttling it in your code: crawl 1,000 people, then pause for 10 seconds or so. In other words, slow the request rate down by hand for now.
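The "crawl 1,000, pause 10 seconds" idea could be sketched as a small generator wrapper. The name `throttled` and its parameters are assumptions made for illustration; the batch size and pause are the rough figures from the comment above, not values known to satisfy Zhihu's limits.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    batch_size: int = 1000,
    pause_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items unchanged, pausing after every `batch_size` items.

    Hypothetical helper; `sleep` is injectable so the pause can be tested.
    """
    for count, item in enumerate(items, start=1):
        yield item
        if count % batch_size == 0:
            # Runs when the caller asks for the item after a full batch.
            sleep(pause_seconds)


# Usage with the loop from the tracebacks in this thread (not run here):
#     for follower in throttled(author.followers):
#         ...
```

Because the wrapper only counts items, it slows the loop down without needing any change inside the library.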

We'll deal with this properly later, for example by adding automatic retries to the network requests. (That's still some way off, though.)
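As a rough idea of what an automatic retry could look like, here is a sketch with exponential backoff around a single call. It is not the library's implementation (none exists in this thread); `call_with_retry` and its defaults are hypothetical. The default exception types are chosen because `requests` exceptions derive from `OSError` and `json.JSONDecodeError` derives from `ValueError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (OSError, ValueError),
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying with exponential backoff on `retry_on` errors.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error when attempts run out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Note that this retries one call at a time. Wrapping a whole `for follower in author.followers` loop would restart the iteration from the beginning, so a real fix would belong inside the library's request code.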

@7sDream 7sDream added Bug Report and removed HelpMe labels Nov 24, 2015
@tzhao0311
Author

Thanks, I'll give it a try and ask again if I run into problems.

@tzhao0311
Author

Now it fails with the following error every time after only 300 to 400 followers. Could my account have been restricted by Zhihu?
Traceback (most recent call last):
File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 55, in
for follower in author.followers:
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 391, in _follow_ee_ers
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 511, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/adapters.py", line 412, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

Process finished with exit code 1

@7sDream
Owner

7sDream commented Nov 25, 2015

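The last line

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

means that Zhihu reset the connection. I can't tell whether your account has been restricted, but this error is definitely caused by the site's behaviour, not by a problem in the code.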
