This is a collection of useful resources from our community.
Many free resources exist to help you crawl and scrape the vast amount of information on the internet. No matter which language you are most comfortable with, there are options for you.
Each programmatic request you make to the target website is called a network or HTTP request. Libraries exist in virtually every popular programming language to help you send these requests and work with the responses (see the sketch after the list below).
- Axios (Javascript and Node.js)
- Guzzle (PHP)
- Requests (Python)
- Requests-HTML (Python) - Builds on the Requests module linked directly above and adds full Javascript support, so that you can render Javascript-heavy web pages and extract the data.
- BeautifulSoup (Python)
- MechanicalSoup (Python)
- Scrapy (Python)
- Cheerio (Javascript and Node.js)
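As a minimal sketch of what one of these requests looks like, here is the Requests library fetching a page (the URL is just a placeholder, not a real target):

import requests

# Fetch a page over HTTP; example.com stands in for your target site.
response = requests.get('https://example.com', timeout=10)
print(response.status_code)  # 200 means the request succeeded
print(response.text[:200])   # first 200 characters of the HTML body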
Sometimes, in order to scrape successfully without being IP-banned or rate-limited, you need to rotate through a number of different IP addresses, switching them out automatically with each request to the target server. These intermediary addresses are called proxies; a rotation sketch follows the proxy listings below.
- Pastebin
- StormProxies [PAID]
- ProxyMesh [PAID]
Unverified Sites
These are sites that provide seemingly free proxies, but their quality and effectiveness are unknown. Be aware that these sites might be tracking your use of the proxies. Use at your own risk!
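Once you have gathered a pool of proxies, the rotation itself can be quite simple. Here is a minimal sketch (the addresses use the 203.0.113.0/24 documentation range and are placeholders, not real proxies):

import itertools
import requests

# Placeholder proxy addresses; replace these with your own pool.
PROXIES = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url):
    # Each call pulls the next proxy from the pool, so consecutive
    # requests reach the target server from different IP addresses.
    proxy = next(proxy_pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)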
In your script, you are going to want to verify that each of your proxies is working before using it in a request to the target server. The easiest way to do that is to make a single request through each proxy in your list to a reliable site such as Google's primary domain.
Sample Code:
import requests

def check_proxy(proxy):
    # Requests expects plain scheme names as the proxy keys
    # ('http'/'https', no trailing colon); SOCKS proxies are specified
    # with a socks5:// URL scheme rather than a separate key.
    proxy_dictionary = {
        'http': proxy,
        'https': proxy,
    }
    try:
        response = requests.get('https://google.com', timeout=4, proxies=proxy_dictionary)
        if response.status_code == 200:
            return True
        else:
            raise Exception("Bad Proxy!")
    except Exception as error:
        print(error)
        return False
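With check_proxy returning True or False, you could then filter a candidate list down to only the proxies that respond, for example:

# Placeholder addresses; substitute the proxies you actually collected.
proxy_list = ['http://203.0.113.10:8080', 'http://203.0.113.11:3128']
working_proxies = [p for p in proxy_list if check_proxy(p)]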