Simple web crawling tool
- Crawl an entire website with just a start URL
- Output the result as a CSV file
- Can stop and resume the crawl from the CSV file (needs documentation; see the resume sketch after the usage example below)
- Fast
npm install git+https://github.com/hugopoi/webcrawler.git
const Webcrawler = require('webcrawler');
const Url = require('url');
let seedUrl = 'http://blog.hugopoi.net/';
let parsedUrl = Url.parse(seedUrl);
let webCrawl = new Webcrawler({
  hostname: parsedUrl.hostname, // Limit crawl to this domain name
  includeSubdomain: false, // Also crawl subdomains
  limit: 10000, // Max URLs to crawl
  concurrency: 20, // Concurrent HTTP calls
  priorityRegExp: false, // String to prioritize during the crawl, if found in the title, URL or link text
  nofollow: false, // Ignore URLs found on pages marked nofollow
  noindex: false // Ignore URLs found on pages marked noindex
}, [ { url: seedUrl } ]);
webCrawl.promise.then(urls => {
  console.log('Crawled %d urls.', urls.length);
});
webCrawl.emitter.on('url.done', urlData => {
  console.log(urlData.url);
});
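A minimal sketch for writing the crawl result to a CSV file once the promise resolves (assuming each result entry carries at least a url property, like the seed objects and the url.done payload above):

const fs = require('fs');

webCrawl.promise.then(urls => {
  // Assumption: every result entry exposes a `url` property
  const csv = ['url'].concat(urls.map(urlData => urlData.url)).join('\n');
  fs.writeFileSync('crawl-result.csv', csv);
});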
You can display debug messages with DEBUG="webcrawler" node .
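Resuming a crawl from the CSV file is not documented yet; a rough, hedged sketch of how it could look, assuming a single url column (as in the export sketch above) and that previously found URLs can simply be passed back as the seed list:

const fs = require('fs');
const Webcrawler = require('webcrawler');

// Hypothetical resume: rebuild the seed list from a previously exported CSV file
const seeds = fs.readFileSync('crawl-result.csv', 'utf8')
  .split('\n')
  .slice(1) // skip the header line
  .filter(line => line.length > 0)
  .map(url => ({ url: url }));

const resumedCrawl = new Webcrawler({ hostname: 'blog.hugopoi.net' }, seeds);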
- Export all found URLs to a CSV file.
- Visit only URLs with a text/html content type.
Option --include-subdomain : Include all subdomains in the crawl
Option --limit 10 : Limit the crawl to 10 URLs
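These flags presumably map to the includeSubdomain and limit constructor options shown in the usage example above; for library usage the equivalent would look something like this (the mapping is an assumption, not confirmed by this README):

const limitedCrawl = new Webcrawler({
  hostname: parsedUrl.hostname,
  includeSubdomain: true, // same effect as --include-subdomain
  limit: 10 // same effect as --limit 10
}, [ { url: seedUrl } ]);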
- Add options & documentation
- Update package.json for command line tool usage
- Publish the package on npmjs.org
- Add unit tests
- Make the limit option account for already-checked URLs