Community Resources

This is a collection of useful resources from our community.

Scraping 101

Many free resources exist to help you crawl and scrape the vast amount of information on the internet. Whatever language you are most comfortable with, there are options for you.

Network/HTTP Requests

Each programmatic request you make to the target website is called a network or HTTP request. Libraries for sending these requests exist in virtually every programming language.

  • Axios (JavaScript and Node.js)
  • Guzzle (PHP)
  • Requests (Python)
  • Requests-HTML (Python) - A fork of the Requests module linked directly above that adds full JavaScript support, so you can render JavaScript-heavy web pages and extract their data.
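With the Requests library, a scraping request usually boils down to a single call. A minimal sketch (the URL, query parameter, and User-Agent string here are placeholder examples, not from any real target):

```python
import requests

# Build the request up front so its final URL and headers can be inspected
# before anything is sent over the network.
session = requests.Session()
request = requests.Request(
    'GET',
    'https://example.com/page',          # placeholder target URL
    params={'q': 'books'},               # encoded into the query string
    headers={'User-Agent': 'my-scraper/0.1'},
)
prepared = session.prepare_request(request)
print(prepared.url)  # https://example.com/page?q=books

# Uncomment to actually send the request:
# response = session.send(prepared, timeout=10)
# print(response.status_code, len(response.text))
```

In the common case you would simply call `requests.get(url, params=..., headers=..., timeout=...)`; preparing the request separately is handy when debugging what your scraper actually sends.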

Data Extraction
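Once a page has been fetched, the next step is pulling structured data out of the raw HTML. Libraries such as BeautifulSoup and lxml are the usual heavier-duty choices; as a minimal sketch, Python's built-in html.parser is enough to collect every link on a page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<html><body><a href="/page1">One</a><a href="/page2">Two</a></body></html>')
print(extractor.links)  # ['/page1', '/page2']
```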

Proxies

Sometimes, to scrape successfully without being IP-banned or rate-limited, you need a pool of different IP addresses that you rotate automatically with each request to the target server. These IP addresses are called proxies.
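Rotating through a pool can be as simple as cycling over the list and handing each request the next address. A minimal sketch (the proxy addresses below are placeholders; substitute proxies you have verified):

```python
from itertools import cycle

# Placeholder addresses; replace with your own verified proxies.
proxy_pool = cycle(['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'])

def next_proxy_config():
    """Return a Requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}

# Each call hands back a different proxy, wrapping around automatically.
print(next_proxy_config())  # {'http': '10.0.0.1:8080', 'https': '10.0.0.1:8080'}
print(next_proxy_config())  # {'http': '10.0.0.2:8080', 'https': '10.0.0.2:8080'}
```

The returned dict plugs straight into the `proxies=` argument of `requests.get`.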

Finding Proxies

Unverified Sites

These sites provide seemingly free proxies, but their quality and effectiveness are unknown. Be aware that these sites might track your use of the proxies. Use at your own risk!

Programmatic Proxy Verification

In your script, you will want to verify that each of your proxies works before using it in a request to the target server. The easiest way to do that is to make a single request through each proxy in your list to a reliable site such as Google's primary domain.

Sample Code:

import requests

def check_proxy(proxy):
    # Route both HTTP and HTTPS traffic through the proxy under test.
    proxy_dictionary = {
        'http': proxy,
        'https': proxy
    }

    try:
        # A 200 response within the timeout means the proxy is usable.
        response = requests.get('https://google.com', timeout=4, proxies=proxy_dictionary)
        return response.status_code == 200
    except requests.RequestException as error:
        print(error)
        return False
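A checker like the one above is typically used to filter a candidate list down to the usable proxies. A hypothetical helper sketching that step (it takes the checker as a parameter, e.g. check_proxy above):

```python
def filter_working_proxies(proxies, check):
    """Keep only the proxies for which `check` returns True."""
    working = [proxy for proxy in proxies if check(proxy)]
    print(f'{len(working)} of {len(proxies)} proxies are usable')
    return working
```

Checking proxies one at a time is slow for large lists; running the checks concurrently with a thread pool (e.g. `concurrent.futures.ThreadPoolExecutor`) speeds this up considerably.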
