Web Crawler

Author : Mihalis Plelis

The concurrent web crawler crawls URLs and, for each page, determines the URLs of every static asset (images, JavaScript, stylesheets) on that page. By default, it uses a pool of 10 threads. It writes JSON to STDOUT, listing the URLs of every static asset, grouped by page.
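As a rough illustration of the idea, the sketch below pulls asset URLs out of HTML that has already been downloaded and processes several pages on a fixed pool of 10 threads. It is not the project's actual code: the class name AssetExtractor, its methods, and the regex-based tag matching are assumptions made for this example, and a real crawler would more likely use an HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
public class AssetExtractor {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE;
    // Opening tags that can reference a static asset.
    private static final Pattern TAG = Pattern.compile("<(img|script|link)\\b[^>]*>", FLAGS);
    private static final Pattern SRC = Pattern.compile("\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']", FLAGS);
    private static final Pattern HREF = Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", FLAGS);
    private static final Pattern STYLESHEET = Pattern.compile("\\brel\\s*=\\s*[\"']stylesheet[\"']", FLAGS);

    // Returns the absolute asset URLs found in one page, without duplicates.
    public static List<String> extract(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        if (base.getPath().isEmpty()) {
            base = base.resolve("/"); // "http://host" -> "http://host/" so relative paths resolve
        }
        Set<String> assets = new LinkedHashSet<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String text = tag.group();
            boolean isLink = tag.group(1).equalsIgnoreCase("link");
            if (isLink && !STYLESHEET.matcher(text).find()) {
                continue; // only <link rel="stylesheet"> counts as an asset here
            }
            Matcher attr = (isLink ? HREF : SRC).matcher(text);
            if (attr.find()) {
                try {
                    assets.add(base.resolve(attr.group(1)).toString());
                } catch (IllegalArgumentException e) {
                    // Skip references that are not valid URIs.
                }
            }
        }
        return new ArrayList<>(assets);
    }

    // Processes every page on a fixed pool of 10 threads, keeping input order.
    public static Map<String, List<String>> extractAll(Map<String, String> htmlByUrl)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : htmlByUrl.entrySet()) {
                Callable<List<String>> task = () -> extract(e.getKey(), e.getValue());
                pending.put(e.getKey(), pool.submit(task));
            }
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch deliberately leaves out fetching and link discovery, so it runs without network access.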

Example:

[
  {
    "url": "http://www.example.org",
    "assets": [
      "http://www.example.org/image.jpg",
      "http://www.example.org/script.js"
    ]
  },
  {
    "url": "http://www.example.org/about",
    "assets": [
      "http://www.example.org/company_photo.jpg",
      "http://www.example.org/script.js"
    ]
  }
]
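Output of this shape could be produced along the following lines. This is only a sketch: the class AssetReport and its methods are hypothetical, the project may well use a JSON library instead, and the escaping covers only quotes and backslashes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original project.
public class AssetReport {
    // Wraps a string in double quotes, escaping backslashes and quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Renders one JSON object per page, in the map's iteration order.
    public static String toJson(Map<String, List<String>> assetsByPage) {
        StringBuilder out = new StringBuilder("[\n");
        int page = 0;
        for (Map.Entry<String, List<String>> e : assetsByPage.entrySet()) {
            out.append("  {\n");
            out.append("    \"url\": ").append(quote(e.getKey())).append(",\n");
            out.append("    \"assets\": [\n");
            List<String> assets = e.getValue();
            for (int i = 0; i < assets.size(); i++) {
                out.append("      ").append(quote(assets.get(i)));
                out.append(i < assets.size() - 1 ? ",\n" : "\n");
            }
            out.append("    ]\n");
            out.append(++page < assetsByPage.size() ? "  },\n" : "  }\n");
        }
        return out.append("]").toString();
    }
}
```

Passing a LinkedHashMap keeps the pages in the order they were crawled; System.out.println(AssetReport.toJson(map)) would then write the report to STDOUT.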

Running the web crawler

Building the project with Maven (mvn clean package) generates the file webCrawler.jar in the project's target directory.
Run the jar from the command line with the java -jar command.
It accepts two optional arguments: the flag -cl, followed by a number.
-cl stands for crawl limit; the number that follows it sets the maximum number of pages to crawl.
If no arguments are given, the crawler keeps crawling until no URLs are left.

Examples:
- java -jar webCrawler.jar -cl 10
- java -jar webCrawler.jar
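The two invocations above could be told apart with argument handling roughly like the following. The class CrawlLimitArgs, the NO_LIMIT sentinel, and the way invalid input is rejected are assumptions for illustration; the README does not say how the real program reacts to malformed arguments.

```java
// Hypothetical helper, not part of the original project.
public class CrawlLimitArgs {
    // Sentinel meaning "crawl until no URLs are left" (an assumption).
    public static final int NO_LIMIT = -1;

    // Returns the crawl limit from "-cl <number>", or NO_LIMIT when no arguments are given.
    public static int parse(String[] args) {
        if (args.length == 0) {
            return NO_LIMIT;
        }
        if (args.length != 2 || !"-cl".equals(args[0])) {
            throw new IllegalArgumentException("Usage: java -jar webCrawler.jar [-cl <number>]");
        }
        int limit;
        try {
            limit = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Crawl limit is not a number: " + args[1]);
        }
        if (limit <= 0) {
            // Requiring a positive limit is an assumption of this sketch.
            throw new IllegalArgumentException("Crawl limit must be positive: " + args[1]);
        }
        return limit;
    }
}
```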

When the application starts, it asks for a valid URL from which to start crawling.
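What counts as a valid URL is not spelled out here. A minimal sketch, assuming that valid means an absolute http or https URL with a host, might look like this; the class StartUrl, its methods, and the prompt wording are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the original project.
public class StartUrl {
    // Accepts only absolute http/https URLs that have a host (an assumption).
    public static boolean isValid(String input) {
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Keeps asking until a valid URL is entered or the input ends.
    public static String prompt(BufferedReader in, PrintStream out) throws IOException {
        out.print("Enter a URL to start crawling: ");
        String line;
        while ((line = in.readLine()) != null) {
            if (isValid(line)) {
                return line.trim();
            }
            out.print("Invalid URL, try again: ");
        }
        throw new IOException("Input ended before a valid URL was given");
    }
}
```

In a main method this could be called as StartUrl.prompt(new BufferedReader(new InputStreamReader(System.in)), System.out).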
