# Recursively crawl the popular blogging website https://medium.com using Node.js, harvest every hyperlink that belongs to medium.com, and store them in a database of your choice
What do you need to store?
- Every unique URL you encounter.
- The total reference count of every URL.
- A complete, unique list of parameter names associated with each URL (one possible record shape is sketched right after this list).
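For illustration only, one record per unique URL could look like the sketch below, assuming a document store such as MongoDB. The field names are placeholders; any database and schema of your choosing is fine.

```js
// Illustrative record shape for a single unique URL; not a required format.
{
  url: "https://medium.com/some/thing",    // unique URL, query string stripped
  referenceCount: 4,                       // how many times this URL was referenced
  params: ["param1", "param2", "param3"]   // unique parameter names, values ignored
}
```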
Things you should keep in mind:
- Your solution needs to be asynchronous in nature.
- Maintain a concurrency of 5 requests at all times, and take care not to get blocked (a minimal limiter sketch follows this list).
- If you are using request.js, you are not allowed to use its connection pool.
- You are not allowed to use any external scraping or async library.
- Refrain from using the throttled-request package to limit concurrency.
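To make these constraints concrete, the sketch below shows one way to keep at most 5 requests in flight using only Node.js built-ins (no request.js pooling, no async or throttling packages). It is illustrative only: MAX_CONCURRENCY, fetchPage, and pump are placeholder names, and persistence is left as a TODO.

```js
// Minimal sketch: a fixed pool of 5 in-flight requests using only Node built-ins.
const https = require('https');

const MAX_CONCURRENCY = 5;
const queue = ['https://medium.com/'];
const seen = new Set(queue);
let inFlight = 0;

// Fetch a page body with the built-in https module (no external HTTP client).
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Pull URLs off the queue while keeping at most 5 requests in flight.
function pump() {
  while (inFlight < MAX_CONCURRENCY && queue.length > 0) {
    const url = queue.shift();
    inFlight++;
    fetchPage(url)
      .then((html) => {
        // Naive href extraction; a real solution would parse more carefully.
        const hrefs = html.match(/href="(https:\/\/medium\.com[^"]*)"/g) || [];
        for (const match of hrefs) {
          const link = match.slice(6, -1); // strip leading href=" and trailing quote
          if (!seen.has(link)) {
            seen.add(link);
            queue.push(link);
          }
        }
        // TODO: persist the URL, its reference count and its parameter names here.
      })
      .catch(() => { /* log and move on; do not crash the crawl */ })
      .finally(() => {
        inFlight--;
        pump(); // refill the pool as soon as a slot frees up
      });
  }
}

pump();
```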
Things that we love (highly encouraged, but not required):
- A well-baked README file
- Project setup with a simple command
- A concise project structure with configurations
- Good commit history with meaningful and atomic commits
- A dockerized solution
Expected behavior of your solution
Assume your recursive scraper parses 4 URLs:
- https://medium.com/some/thing
- https://medium.com/some/thing?param1=abc
- https://medium.com/some/thing?param2=xyz
- https://medium.com/some/thing?param1=def&param3=xxx
Your chosen database must contain the URL https://medium.com/some/thing with a reference count of 4 and a unique list of parameters containing param1, param2, and param3. Don't worry about parameter values.
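As a non-binding sketch of how that normalization could work, the snippet below uses Node's built-in URL class to split each harvested link into its bare URL and its parameter names; the names normalize and record are placeholders, not part of the assignment.

```js
// Sketch: split a harvested link into its bare URL and its parameter names
// using the built-in URL class.
const { URL } = require('url');

function normalize(link) {
  const parsed = new URL(link);
  // Bare URL: origin + path, query string dropped.
  const bareUrl = parsed.origin + parsed.pathname;
  // Parameter names only; values are deliberately ignored.
  const params = [...parsed.searchParams.keys()];
  return { bareUrl, params };
}

// Example: all four URLs above collapse onto the same record.
const record = { referenceCount: 0, params: new Set() };
for (const link of [
  'https://medium.com/some/thing',
  'https://medium.com/some/thing?param1=abc',
  'https://medium.com/some/thing?param2=xyz',
  'https://medium.com/some/thing?param1=def&param3=xxx',
]) {
  const { params } = normalize(link);
  record.referenceCount++;
  params.forEach((p) => record.params.add(p));
}
// record.referenceCount === 4, record.params === Set { 'param1', 'param2', 'param3' }
```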