FullerHua/scrapers

 
 

Scrapers

A list of scrapers from around the web.

Find your way around with the Table of Contents. It lists every scraper, with easy navigation to each one's pros and cons and a link to its website.

Please contribute by adding links, adding pros/cons, titles, or anything else you think would be helpful! Please help maintain alphabetical order.

Table Of Contents

Description: Cloud-based scraper for JavaScript.

Pros

Cons

Applicable Language(s)

  • JavaScript

Description: A Python library for navigating and parsing results from the Web. It allows searching the HTML tree to find various tags.

Pros

Cons

Applicable Language(s)

  • Python
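
To make "searching the HTML tree to find various tags" concrete, here is a minimal sketch. It assumes the library described above is Beautiful Soup, installed as the `bs4` package. The HTML snippet, the URLs, and the `extract_links` helper are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
HTML = """
<html><body>
  <h1>Scrapers</h1>
  <ul>
    <li><a href="https://example.com/a">Tool A</a></li>
    <li><a href="https://example.com/b">Tool B</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    # "html.parser" is Python's built-in parser, so no extra dependency is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(HTML)
# find() returns the first matching tag in the tree.
title = BeautifulSoup(HTML, "html.parser").find("h1").get_text()
```

`find_all` walks the whole tree and returns every match, while `find` stops at the first one; both accept a tag name plus attribute filters.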

Description: Service for looking up company and people information.

Pros

Cons

Applicable Language(s)


Description: Open dataset of crawled websites.

Pros

Cons

Applicable Language(s)


Description: Automatic service that turns a website into structured data in the form of JSON or CSV.

Pros

Cons

Applicable Language(s)

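As a rough illustration of what turning a page into structured JSON or CSV involves, the sketch below converts an HTML table into records using only the Python standard library. It is not how the service above works internally; the `TableParser` class, the helper functions, and the sample page are all assumptions made for the example.

```python
import csv
import io
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html_text):
    """Treat the first row as the header and return one dict per body row."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical page content used only for illustration.
PAGE = (
    "<table><tr><th>name</th><th>language</th></tr>"
    "<tr><td>lxml</td><td>Python</td></tr>"
    "<tr><td>rvest</td><td>R</td></tr></table>"
)
records = table_to_records(PAGE)
as_json = json.dumps(records)
as_csv = records_to_csv(records)
```

Real pages are rarely this regular, which is the gap that automatic extraction services aim to fill.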

Description: Website data extraction using a visual programming language.

Pros

Cons

Applicable Language(s)


Description: Automated tool for extracting structured information from pages, crawling websites, and turning a website into an API.

Pros

Cons

Applicable Language(s)


Description: Tool to mine LinkedIn profiles based on keywords.

Pros

Cons

Applicable Language(s)


Description: Local software that can download a proxy list and let users choose which one to use.

Pros

Cons

Applicable Language(s)


Description: API to find e-mail addresses for a given domain name.

Pros

Cons

Applicable Language(s)


Description: Provides various website extraction and transformation tools, such as Full-Text RSS and Term Extraction, as services.

Pros

Cons

Applicable Language(s)


Description: Local software for web scraping using a recorder and a visual programming language.

Pros

Cons

Applicable Language(s)


Description: API to retrieve more information on a person.

Pros

Cons

Applicable Language(s)


Description: After a web page is annotated, an extractor is generated automatically, all in WYSIWYG mode. The extractor can then be fed into the GooSeeker crawler cluster to harvest the web in bulk, or published through the API to third-party crawlers such as Scrapy and other Python crawlers. Mobile apps can also import the extractor to mash up web content in real time. This largely frees web-harvesting programmers from debugging complicated extractors.

Pros: The middleware abstracts away the operating system, so GooSeeker runs on Windows, Mac OS X, and Linux. It is written in C++ for performance.

Cons: GooSeeker is a heavyweight scraper because it embeds a full-featured browser, which consumes more computing resources.

Applicable Language(s): C++, JavaScript, Python


Description: Service that searches a website for e-mails.

Pros

Cons

Applicable Language(s)


Description: Automated tool to extract structured information from websites.

Pros

Cons

Applicable Language(s)


Description: Kimono was acquired by Palantir. This was a cloud-based service for turning websites into structured APIs. Now they offer a desktop-based alternative for continuing to use their tools.

Pros

Cons

Applicable Language(s)


Description: lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Pros

Cons

Applicable Language(s)

  • Python
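
A short sketch of HTML processing with lxml, assuming the `lxml` package is installed. It parses a page with `lxml.html.fromstring` and selects table rows with XPath. The sample page, the `tools` table id, and the `extract_rows` helper are hypothetical.

```python
from lxml import html

# Hypothetical page content used only for illustration.
PAGE = """
<html><body>
  <table id="tools">
    <tr><th>Name</th><th>Language</th></tr>
    <tr><td>lxml</td><td>Python</td></tr>
    <tr><td>rvest</td><td>R</td></tr>
  </table>
</body></html>
"""


def extract_rows(page):
    """Return the cell texts of every data row in the table with id 'tools'."""
    tree = html.fromstring(page)
    rows = []
    # tr[td] keeps only rows that contain <td> cells, which skips the header row.
    for tr in tree.xpath("//table[@id='tools']//tr[td]"):
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return rows


rows = extract_rows(PAGE)
```

XPath expressions keep the selection logic in one string, which is convenient when the page layout changes and only the path needs updating.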

Description: Extract structured information from HTML, PDF, Excel, and Word by clicking on document elements.

Pros

Cons

Applicable Language(s)


Description: Web crawler that can be combined with the Hadoop ecosystem to run in a cluster.

Pros

Cons

Applicable Language(s)


Description: Application that can extract information from a website and turn it into structured data (CSV, Excel, etc.).

Pros

Cons

Applicable Language(s)


Description: A free web scraping tool that extracts web page data into several structured file formats.

Pros

Cons

Applicable Language(s)


Description: R package to scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, and is inspired by libraries like Beautiful Soup.

Pros

Cons

Applicable Language(s)

  • R

Description: A Node.js scraper for humans.

Pros

Cons

Applicable Language(s)

  • JavaScript (Node.js)

Description: Write a scraper in the browser and run it on their cloud-based service. It is used by many news organisations.

Pros

Cons

Applicable Language(s)


Description: Scraper cloud hosting as a service. Allows developers to deploy their own scrapers on the platform and benefit from its existing infrastructure.

Pros

Cons

Applicable Language(s)


Description: Local tool for scraping websites.

Pros

Cons

Applicable Language(s)


Description: Service for looking up business e-mails.

Pros

Cons

Applicable Language(s)


Description: Web automation software using a visual programming language and recorder.

Pros

Cons

Applicable Language(s)


Description: Visual tool for GUI automation by recording.

Pros

Cons

Applicable Language(s)


Description: Data as a Service platform for web scraping.

Pros

  • Scraping of dynamic, JavaScript-heavy websites
  • Login and form filling on websites
  • Data normalization and validation
  • Data uploads

Cons

  • Currently in beta
  • Possible payment model in the future

Applicable Language(s)


Description: Extension that downloads websites and turns them into structured data. Data is selected by element or by specialised selectors (e.g., for tables).

Pros

Cons

Applicable Language(s)


Description: Turn a website into an API. The structure of the data is defined by clicking elements or regular expressions.

Pros

Cons

Applicable Language(s)


Description: NPM module for scraping structured data via jQuery-like selectors.

Pros

Cons

Applicable Language(s)

  • JavaScript (Node.js)
