# Web Crawler

Author: Mihalis Plelis

This concurrent web crawler crawls URLs and, for each page, determines the URLs of every static asset (images, JavaScript, stylesheets) on that page. By default, a thread pool of 10 threads is used. The crawler writes JSON to STDOUT, listing the URLs of every static asset, grouped by page.

Example:

```json
[
  {
    "url": "http://www.example.org",
    "assets": [
      "http://www.example.org/image.jpg",
      "http://www.example.org/script.js"
    ]
  },
  {
    "url": "http://www.example.org/about",
    "assets": [
      "http://www.example.org/company_photo.jpg",
      "http://www.example.org/script.js"
    ]
  }
]
```
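The asset extraction itself is not spelled out in this README. Purely as an illustration, the sketch below shows one way a page's static assets could be collected; the use of jsoup, the class name, and the CSS selectors are assumptions for this sketch, not the project's actual code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not taken from this project.
public class AssetExtractor {

    // Collects the absolute URLs of images, scripts and stylesheets on a page.
    public static List<String> extractAssets(String pageUrl) throws IOException {
        Document doc = Jsoup.connect(pageUrl).get();
        List<String> assets = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            assets.add(img.absUrl("src"));
        }
        for (Element script : doc.select("script[src]")) {
            assets.add(script.absUrl("src"));
        }
        for (Element css : doc.select("link[rel=stylesheet][href]")) {
            assets.add(css.absUrl("href"));
        }
        return assets;
    }
}
```

The README's default of 10 threads would correspond to something like `Executors.newFixedThreadPool(10)` driving calls to such a helper, one page per task.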

## Running the web crawler

1. Build the project with Maven (`mvn clean package`); this generates the file `webCrawler.jar` under the project's `target` directory.
2. Execute that jar from the command line with the `java -jar` command.
3. The program can receive two arguments: the flag `-cl` followed by a number. `-cl` stands for "crawl limit"; when it is used, the number of pages to crawl is capped at the number parsed from the second argument. If no arguments are given, the web crawler crawls all URLs until there are none left (see the sketch after this list).

   Examples:

   ```
   java -jar webCrawler.jar -cl 10
   java -jar webCrawler.jar
   ```

4. When the application is executed, it asks for a valid URL from which to start crawling.
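For illustration, here is a minimal sketch of how an entry point could handle the `-cl` flag and the URL prompt. The class name, variable names, and structure are hypothetical; only the `-cl` flag, its numeric second argument, and the prompt behaviour come from this README.

```java
import java.util.Scanner;

// Hypothetical entry point; names and structure are illustrative only.
public class WebCrawlerMain {

    public static void main(String[] args) {
        // No arguments: crawl with no page limit (the README's default behaviour).
        int crawlLimit = Integer.MAX_VALUE;

        // "-cl <number>" caps the crawl at the number given as the second argument.
        if (args.length == 2 && "-cl".equals(args[0])) {
            try {
                crawlLimit = Integer.parseInt(args[1]);
            } catch (NumberFormatException e) {
                System.err.println("The crawl limit must be a number.");
                return;
            }
        }

        // The application prompts for a valid starting URL before crawling.
        Scanner scanner = new Scanner(System.in);
        System.out.print("Enter a valid URL to start crawling: ");
        String startUrl = scanner.nextLine();

        // A real implementation would now hand startUrl and crawlLimit
        // to the thread-pooled crawler described above.
        System.out.println("Would crawl " + startUrl + " with limit " + crawlLimit);
    }
}
```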