This is a collection of useful resources from our community.
Many free resources exist to help you crawl and scrape the vast amount of information on the internet. No matter which language you are most comfortable with, there are options for you.
Each programmatic request you make to the target website is called a network or HTTP request. Libraries exist in virtually every popular programming language to help you send these requests and work with the responses (see the sketch after the list below).
- Axios (Javascript and Node.js)
- Guzzle (PHP)
- Requests (Python)
- Requests-HTML (Python) - Builds on the Requests module linked directly above and adds full Javascript support, so that you can render Javascript-heavy web pages and extract the data.
- BeautifulSoup (Python)
- MechanicalSoup (Python)
- Scrapy (Python)
- Cheerio (Javascript and Node.js)
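As a minimal sketch of what one of these requests looks like, here is the Requests library fetching a page (the URL is just a placeholder, not a real target):

import requests

# Fetch a page over HTTP; example.com stands in for your target site.
response = requests.get('https://example.com', timeout=10)
print(response.status_code)  # 200 means the request succeeded
print(response.text[:200])   # first 200 characters of the HTML body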
Sometimes, in order to scrape successfully without being IP-banned or rate-limited, you need to rotate through a number of different IP addresses, switching them out automatically with each request to the target server. These intermediary addresses are called proxies; a rotation sketch follows the proxy listings below.
- Pastebin
- StormProxies [PAID]
- ProxyMesh [PAID]
Unverified Sites
These are sites that provide seemingly free proxies, but their quality and effectiveness are unknown. Be aware that these sites might be tracking your use of the proxies. Use at your own risk!
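Once you have gathered a pool of proxies, the rotation itself can be quite simple. Here is a minimal sketch (the addresses use the 203.0.113.0/24 documentation range and are placeholders, not real proxies):

import itertools
import requests

# Placeholder proxy addresses; replace these with your own pool.
PROXIES = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url):
    # Each call pulls the next proxy from the pool, so consecutive
    # requests reach the target server from different IP addresses.
    proxy = next(proxy_pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)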
In your script, you are going to want to verify that each of your proxies is working before using it in a request to the target server. The easiest way to do that is to make a single request through each proxy in your list to a reliable site such as Google's primary domain.
Sample Code:
import requests

def check_proxy(proxy):
    # Requests expects plain scheme names as the proxy keys
    # ('http'/'https', no trailing colon); SOCKS proxies are specified
    # with a socks5:// URL scheme rather than a separate key.
    proxy_dictionary = {
        'http': proxy,
        'https': proxy,
    }
    try:
        response = requests.get('https://google.com', timeout=4, proxies=proxy_dictionary)
        if response.status_code == 200:
            return True
        else:
            raise Exception("Bad Proxy!")
    except Exception as error:
        print(error)
        return False
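With check_proxy returning True or False, you could then filter a candidate list down to only the proxies that respond, for example:

# Placeholder addresses; substitute the proxies you actually collected.
proxy_list = ['http://203.0.113.10:8080', 'http://203.0.113.11:3128']
working_proxies = [p for p in proxy_list if check_proxy(p)]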