Skip to content

AkashV420/distributed-web-crawler-pagerank

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Distributed Web Crawler with Page Rank

Introduction

This distributed web crawler uses multithreading to construct the web graph of linking relationships between websites starting at certain root site provided by the user. The crawler will store this graph in a distributed manner and will be able to respond to several kinds of queries about this graph.

Page rank is also calculated which is a value assigned to a web page as a measure of its popularity or importance, used to determine the order in which search engine results are presented.

Running

Code link: https://github.com/dhruvarya/distributed-web-crawler-pagerank Download the crawler from the link

Go to webcrawlermain.go and update siteToCrawl and depth variable

	siteToCrawl := "https://www.linkedin.com"
	depth := 5
	

Use commands

=> go build
=> go run .

Working

  • It creates multiple threads as workers that send requests to root website
  • Find each website is created as a node, with page rank, incoming nodes and outgoing nodes as attributes
// Node is a node in the graph
type Node struct {
	Id       NodeID
	Rank     float64
	Outgoing map[EdgeID]*Edge
	Incoming map[EdgeID]*Edge
}
  • extractLinksFromToken function is called from linksextractor.go file that take out the links from a website and return them.
  • Clients can initiate the crawling, specifying an initial web page, and can also query the crawled dataset.
  • Page rank is calculated using the graph of websites created.

Architecture

alt text

Page Rank Calculation

In the algorithm, we used the formula

alt text

Calculate the new rank based on the current rank and the out-degree, out-degree is the number of neighbors of pr node

pr := PageRank{
		Alpha:     0.85,
		MaxIter:   1000,
		Tolerance: 1e-12,
	}

Here, Alpha is the damping parameter for PageRank, default=0.85. MaxIter is the max amount of iterations Tolerance, (i.e., if you get a new page-rank value that differs from the prior iteration by less than this amount, then stop).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages