JDSpider

Goal: distributed crawling of JD.com product details, comments, and comment summaries

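Features

  1. The framework is split into four parts (one master controller, three spiders) that can be allocated flexibly
  2. Every spider is deployed in a distributed fashion, which addresses bandwidth and performance bottlenecks
  3. proxy_pool works around IP bans
  4. Cookies are disabled so the site cannot remember the crawler
  5. MySQL is the underlying data store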
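
Features 3 and 4 correspond to standard Scrapy mechanisms. Cookies are turned off with the built-in setting `COOKIES_ENABLED = False`, and a proxy is applied per request by setting `request.meta["proxy"]` in a downloader middleware. The following is a minimal sketch of such a middleware, not the repository's actual code: the `PROXY_LIST` setting and the static list of proxies are assumptions made for illustration, and the project's own ProxyMiddleware may obtain proxies from proxy_pool in a different way.

```python
import random


class ProxyMiddleware:
    """Sketch of a downloader middleware that routes requests via a proxy."""

    def __init__(self, proxies):
        # `proxies` is a hypothetical list of "host:port" strings.
        # A real proxy_pool would more likely be queried at request time.
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed, project-specific setting name.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware honours this meta key.
            request.meta["proxy"] = "http://" + random.choice(self.proxies)
        # Returning None lets Scrapy continue handling the request.
        return None


# Matching settings.py fragment (shown as comments):
# COOKIES_ENABLED = False
# PROXY_LIST = ["10.0.0.1:8080"]   # hypothetical
```

Picking a proxy at random for each request spreads traffic across the pool, so a single banned address only affects a fraction of requests.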

Powered by:

  1. Python 3.6
  2. Scrapy 1.4
  3. pymysql
  4. json
  5. redis

How to use?

git clone https://github.com/Dengqlbq/JDSpider.git

Edit the following files to match your environment

  1. ProjectStart/Test.py (redis configuration, keywords, page_count)
  2. JDUrlsSpider/settings.py (redis configuration)
  3. JDDetailSpider/settings.py (redis configuration, mysql configuration, DOWNLOAD_DELAY)
  4. JDCommentSpider/settings.py (redis configuration, mysql configuration, DOWNLOAD_DELAY)
cd ProjectStart
python Test.py
cd JDUrlsSpider
scrapy crawl JDUrlsSpider
cd JDDetailSpider
scrapy crawl JDDetailSpider
(This is a distributed crawler, so you can run more than one JDDetailSpider instance.)
cd JDCommentSpider
scrapy crawl JDCommentSpider
(This is a distributed crawler, so you can run more than one JDCommentSpider instance.)
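
The first step, running Test.py, takes a redis configuration, keywords, and a page_count, which suggests it seeds a shared redis queue that the spiders then consume. A rough sketch of that idea follows. The search URL pattern, the function names, and the queue key are all assumptions for illustration and are not taken from the repository.

```python
from urllib.parse import urlencode


def build_search_urls(keywords, page_count):
    """Build one search URL per (keyword, page) pair."""
    urls = []
    for keyword in keywords:
        for page in range(1, page_count + 1):
            query = urlencode({"keyword": keyword, "page": page})
            # Assumed URL pattern; the real project may use another one.
            urls.append("https://search.jd.com/Search?" + query)
    return urls


def seed_queue(client, key, urls):
    """Push every URL onto a redis list so any spider instance can pop it.

    `client` is any object with a redis-style `lpush(key, value)` method,
    for example a redis-py client.
    """
    for url in urls:
        client.lpush(key, url)
    return len(urls)
```

With redis-py, `client` could be created as `redis.Redis(host=..., port=...)` using the same connection details configured in the spiders' settings.py files. Because every spider instance reads from the same list, adding more instances shares the work without extra coordination.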

Note:

  1. Before you run the project, make sure you have created the MySQL tables that the spiders require.
  2. If you have not set up a proxy_pool, disable the "ProxyMiddleware" in JDCommentSpider/settings.py
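
In Scrapy, a downloader middleware is disabled by mapping it to `None` in `DOWNLOADER_MIDDLEWARES` (or by removing its entry). A sketch of what that could look like in settings.py is below. The dotted path is a guess at where the class lives, so check the existing entry in the file and reuse its exact key.

```python
# settings.py fragment. The module path below is hypothetical;
# use the key that is already present in the project's settings.
DOWNLOADER_MIDDLEWARES = {
    "JDCommentSpider.middlewares.ProxyMiddleware": None,  # None disables it
}
```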

Achievement

Product details and comment summaries
Some comments
Full comments (every comment is stored in full)