./run.sh
One should think about the traffic their crawler will produce. "Politeness" means being polite to the server owners, and there are some conventions we can follow.
robots.txt is a file that should live at the root of the domain, e.g. baskent.edu.tr/robots.txt. It contains rules like the ones below. The file below says that no User-Agent may crawl the given path. We respect this convention in gocraw.
User-Agent: *
Disallow: /foo
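
As a minimal sketch of the idea (not gocraw's actual parser), a crawler can fetch robots.txt and test a path against the Disallow rules that apply to every User-Agent roughly like this. It ignores Allow rules and wildcards:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

// disallowedPaths fetches robots.txt for a host and returns the Disallow
// prefixes that apply to every user agent ("*"). Simplified on purpose.
func disallowedPaths(host string) ([]string, error) {
    resp, err := http.Get("http://" + host + "/robots.txt")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var paths []string
    applies := false
    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        switch {
        case strings.HasPrefix(line, "User-Agent:") || strings.HasPrefix(line, "User-agent:"):
            applies = strings.TrimSpace(strings.SplitN(line, ":", 2)[1]) == "*"
        case applies && strings.HasPrefix(line, "Disallow:"):
            if p := strings.TrimSpace(strings.SplitN(line, ":", 2)[1]); p != "" {
                paths = append(paths, p)
            }
        }
    }
    return paths, sc.Err()
}

// allowed reports whether a path may be crawled under the given rules.
func allowed(path string, disallowed []string) bool {
    for _, p := range disallowed {
        if strings.HasPrefix(path, p) {
            return false
        }
    }
    return true
}

func main() {
    rules, err := disallowedPaths("baskent.edu.tr")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println("/foo allowed?", allowed("/foo", rules))
}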
Since we go BFS all the way down, we'll encounter lots of different domains. Resolving these domains can be expensive in terms of time.
There are two aspects to this problem. The first is the ISP's DNS servers. These servers are generally terribly slow, so we never use them on our PCs, smartphones, etc.; I even set the default DNS to Google's 8.8.8.8 and 8.8.4.4 in my modem. ISP DNS resolvers also tend to produce buggy behaviour. For example, this week (13 December) Superonline botched something on their new modems and our company suffered from that failure. Another big problem, especially for crawling, is that ISPs may randomly serve their own static HTML pages for some domains. This can feed false pages to our crawler, and it may break some REST API calls too.
The second is simply setting 8.8.8.8 as the proper DNS server. This has a distance problem: even though the OS caches results, our program still has to travel all the way to a distant Google machine on every cache miss.
We solved this problem by creating a local DNS server, used only as a cache, and pre-filling it with a set of popular domains.
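
The local server itself is a separate process; purely as an illustration of the same idea, here is a minimal in-process sketch of a read-through DNS cache with pre-warming, built on Go's net.DefaultResolver (the seed domain list is hypothetical):

package main

import (
    "context"
    "fmt"
    "net"
    "sync"
)

// dnsCache is a read-through cache in front of the system resolver.
// The real setup is a separate caching-only DNS server; this in-process
// version only illustrates the idea.
type dnsCache struct {
    mu    sync.RWMutex
    addrs map[string][]string
}

func newDNSCache() *dnsCache {
    return &dnsCache{addrs: make(map[string][]string)}
}

// Lookup returns cached addresses for host, resolving and caching on a miss.
func (c *dnsCache) Lookup(ctx context.Context, host string) ([]string, error) {
    c.mu.RLock()
    if a, ok := c.addrs[host]; ok {
        c.mu.RUnlock()
        return a, nil
    }
    c.mu.RUnlock()

    a, err := net.DefaultResolver.LookupHost(ctx, host)
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.addrs[host] = a
    c.mu.Unlock()
    return a, nil
}

// Prewarm resolves a list of domains up front so the first real crawl
// requests do not pay the resolution latency.
func (c *dnsCache) Prewarm(ctx context.Context, domains []string) {
    var wg sync.WaitGroup
    for _, d := range domains {
        wg.Add(1)
        go func(d string) {
            defer wg.Done()
            c.Lookup(ctx, d) // errors ignored: pre-warming is best effort
        }(d)
    }
    wg.Wait()
}

func main() {
    cache := newDNSCache()
    // Hypothetical seed list; the real one would be much larger.
    cache.Prewarm(context.Background(), []string{"wikipedia.org", "github.com"})
    addrs, _ := cache.Lookup(context.Background(), "wikipedia.org")
    fmt.Println(addrs)
}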
Duplicate URLs are very hard to eliminate inside the worker nodes alone. We solved this problem by centralizing URL scheduling with a key-value store.
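
As a hedged sketch of that idea, assuming Redis as the key-value store and the go-redis client (the store, client, and key names here are illustrative, not necessarily what gocraw uses), the scheduler can claim each URL atomically with SETNX so that only the first worker to submit it gets to crawl it:

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// claimURL atomically marks a URL as seen. SETNX succeeds only for the
// first caller, so two workers can never both schedule the same URL.
func claimURL(ctx context.Context, rdb *redis.Client, url string) (bool, error) {
    // Expiration 0 keeps the key forever; a TTL would allow re-crawling later.
    return rdb.SetNX(ctx, "seen:"+url, 1, 0).Result()
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // assumed address

    for _, u := range []string{"http://example.com/a", "http://example.com/a"} {
        fresh, err := claimURL(ctx, rdb, u)
        if err != nil {
            fmt.Println("store error:", err)
            return
        }
        if fresh {
            fmt.Println("scheduling", u)
        } else {
            fmt.Println("duplicate, skipping", u)
        }
    }
}

Because the set-if-not-exists check runs atomically on the store side, the check and the insert cannot race even when many workers submit the same URL at the same time.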