hproxy aims to make getting a proxy as convenient as possible.
- Demo: https://hproxy.htmlhelper.org/api
- Introduction: Chinese | English
hproxy requires Python 3.6+. It uses Sanic to build the asynchronous HTTP service and aiohttp to crawl proxy data asynchronously.
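For orientation, the web layer is just an ordinary async HTTP service. The snippet below is only an illustrative sketch of a Sanic endpoint, not hproxy's actual server code; the route and response shape are borrowed from the API section later in this README.

```python
from sanic import Sanic
from sanic.response import json

app = Sanic("hproxy_sketch")

@app.route("/api/get")
async def get_proxy(request):
    # The real service would pull a proxy from the configured storage backend;
    # this sketch just returns a hard-coded placeholder.
    return json({"status": 1, "info": {"proxy": "127.0.0.1:8080"}, "msg": "success"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8001)
```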
git clone https://github.com/howie6879/hproxy.git
cd hproxy
pip install pipenv
# Note: Python 3.6+ is required inside the virtual environment
# Install dependencies
pipenv install
# Start the web service
cd hproxy
python server.py
# Start crawling
python spider/spider_console.py
Redis must be installed before you use hproxy, because hproxy uses Redis as its default data storage backend; the specific configuration lives in the config directory.
import os

# Database config
REDIS_DICT = dict(
    REDIS_ENDPOINT=os.getenv('REDIS_ENDPOINT', "localhost"),
    REDIS_PORT=os.getenv('REDIS_PORT', 6379),
    REDIS_DB=os.getenv('REDIS_DB', 0),
    REDIS_PASSWORD=os.getenv('REDIS_PASSWORD', None)
)

DB_TYPE = 'redis'
If you want to keep the data in memory instead, just change the value of DB_TYPE from redis to memory in the config, as shown below.
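For example, assuming the same config module shown above, the change is a single line:

```python
DB_TYPE = 'memory'
```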
Note that in memory mode all saved data is lost when the service terminates, so the redis mode is recommended if you want to keep the data.
If you want to use another storage mode, you just need to extend it following the coding conventions in BaseDatabase; a rough sketch is given below.
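The sketch below only illustrates the idea with a plain dict-backed store; the class and method names here are assumptions made for illustration, and the authoritative interface is whatever BaseDatabase actually declares in the hproxy source.

```python
class DictDatabase:
    """A toy in-process store mimicking the memory mode (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def get_list(self):
        return list(self._data.values())
```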
- Multiple data storage modes, easy to extend
- Customizable crawling components with a unified code style, easy to extend
- An API for getting proxies; visit 127.0.0.1:8001/api
  - 'delete/:proxy': delete a proxy
  - 'get': get a random proxy
  - 'list': list all proxies
  - ...
- An HTML-fetching service that crawls a page's source through a random proxy from the pool
- Regularly scheduled crawling, updating, and automatic verification
- Accurate proxy information, such as proxy type, protocol, and location
The spider scripts all live in the spider directory. spider/proxy_spider contains spiders for various proxy websites, and you can easily extend it by following the coding conventions in /spider/base_spider/proxy_spider.py. Executing spider_console.py starts all the spider scripts; to add a new spider you only need to add a function that follows the naming convention, not a new startup script. A rough, standalone sketch of the crawling pattern is shown below.
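As an illustration of the crawling pattern only (the class name, URL, and parsing below are made up for this example and are not part of hproxy), a spider essentially fetches a listing page with aiohttp and extracts ip:port pairs:

```python
import asyncio
import re

import aiohttp

PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})")

class ExampleProxySpider:
    """Illustrative spider: fetch one page and pull out ip:port pairs."""

    start_url = "https://example.com/free-proxy-list"  # hypothetical URL

    async def crawl(self):
        timeout = aiohttp.ClientTimeout(total=10)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(self.start_url) as resp:
                html = await resp.text()
        # In hproxy the parsed proxies would be handed to the storage backend;
        # here they are simply returned.
        return ["%s:%s" % pair for pair in PROXY_RE.findall(html)]

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(ExampleProxySpider().crawl()))
```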
Run the following commands to run a specific spider script, such as xicidaili:
cd hproxy/hproxy/spider/proxy_spider/
python xicidaili_spider.py
# Verifying 100 proxies asynchronously finishes in about 5 seconds, because the per-proxy timeout is 5 seconds.
# Done synchronously, the total time would be unpredictable.
# 2018/04/14 13:42:32 [Crawling finished ] OK. Crawling xicidaili finished,get 100 proxies - Invalid proxy num:8,cost 5.384464740753174 seconds
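To make the "asynchronous in ~5 seconds" point concrete, here is a standalone sketch (not hproxy's actual verification code) that checks many proxies concurrently with aiohttp and a 5-second per-request timeout; the total wall-clock time is roughly bounded by the slowest or timed-out request rather than by the sum of all requests:

```python
import asyncio

import aiohttp

TEST_URL = "http://httpbin.org/ip"  # any stable endpoint works for the check

async def check(session, proxy):
    """Return (proxy, ok) after trying one request through the proxy."""
    try:
        async with session.get(TEST_URL, proxy="http://" + proxy,
                               timeout=aiohttp.ClientTimeout(total=5)) as resp:
            return proxy, resp.status == 200
    except Exception:
        return proxy, False

async def check_all(proxies):
    async with aiohttp.ClientSession() as session:
        # gather() runs all checks concurrently, so ~100 proxies still finish
        # within roughly one timeout window.
        return await asyncio.gather(*(check(session, p) for p in proxies))

if __name__ == "__main__":
    sample = ["101.37.79.125:3128", "180.168.184.179:53128"]  # sample values
    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(check_all(sample)))
```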
You can run valid_proxy.py to check whether the proxies are still available, either automatically or manually. In automatic mode, hproxy verifies all proxies once an hour, and any proxy that fails verification more than 5 times is discarded. To verify manually, run the following commands.
cd hproxy/hproxy/scheduler/
python valid_proxy.py
| Route | Description |
|---|---|
| delete/:proxy | Delete a proxy |
| get | Return a random proxy; with param valid=1 (the default), keep verifying until a valid proxy is returned |
| list | List all proxies, without verification |
| valid/:proxy | Verify a proxy |
| html?url=''&ajax=0&foreign=0 | Pick a random proxy and request the given url through it |
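As a usage sketch (assuming the service is running locally on port 8001, and using the requests library, which is not bundled with hproxy), a client can call these routes like this:

```python
import requests

BASE = "http://127.0.0.1:8001/api"

# Get one verified proxy (valid=1 keeps verifying until a working proxy is returned).
proxy = requests.get(BASE + "/get", params={"valid": 1}).json()["info"]["proxy"]

# List everything currently in the pool, without verification.
pool = requests.get(BASE + "/list").json()["info"]

print(proxy, len(pool))
```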
// URL: http://127.0.0.1:8001/api/get?valid=1
// Response: success. The param 'valid' defaults to 1; when it is 1, the response also includes the 'speed' field.
// types  1: Elite  2: Anonymous  3: Transparent
{
    "status": 1,
    "info": {
        "proxy": "101.37.79.125:3128",
        "types": 3
    },
    "msg": "success",
    "speed": 2.4909408092
}
// URL: http://127.0.0.1:8001/api/list
// List all proxies without verification
{
    "status": 1,
    "info": {
        "180.168.184.179:53128": {
            "proxy": "180.168.184.179:53128",
            "types": 3
        },
        "101.37.79.125:3128": {
            "proxy": "101.37.79.125:3128",
            "types": 3
        }
    },
    "msg": "success"
}
// URL: http://127.0.0.1:8001/api/delete/171.39.45.6:8123
{
    "status": 1,
    "msg": "success"
}
// URL: http://127.0.0.1:8001/api/valid/183.159.91.75:18118
{
    "status": 1,
    "msg": "success",
    "speed": 0.3361008167
}
// URL: http://127.0.0.1:8001/api/html?url=https://www.v2ex.com
// Crawl v2ex through a random proxy
{
    "status": 1,
    "info": {
        "html": "html source code",
        "proxy": "120.77.254.116:3128"
    },
    "msg": "success"
}
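If you prefer to make the proxied request yourself instead of going through the html route, you can plug a proxy returned by /api/get into an ordinary HTTP client. Here is a sketch with the requests library (whether a given proxy also handles HTTPS depends on the proxy itself):

```python
import requests

# Fetch a verified proxy from the pool.
resp = requests.get("http://127.0.0.1:8001/api/get", params={"valid": 1}).json()
proxy = resp["info"]["proxy"]

# Route an ordinary request through it (plain HTTP here; HTTPS support varies by proxy).
page = requests.get("http://httpbin.org/ip",
                    proxies={"http": "http://" + proxy},
                    timeout=5)
print(page.text)
```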
Q1: Why does it only crawl the IP and port?
A1: Because the proxy information published by the source websites is not completely accurate, it needs further verification. hproxy verifies whether a proxy is valid, along with its other attributes, before returning it.
Q2: How do I add a new data storage mode?
A2: Refer to BaseDatabase, which defines the functions a subclass must implement.
Q3: How do I add a new proxy spider?
A3: Same as above: follow the spider coding conventions in the spider directory, or refer directly to the code of an existing spider script.
hproxy is offered under the MIT license.
Thanks to the following projects:
Thanks to the following proxy websites. If you know of a high-quality proxy website, please submit it via #3 ^_^