
# Recursively crawl the popular blogging website https://medium.com using Node.js, harvest all hyperlinks that belong to medium.com, and store them in a database of your choice

## What do you need to store?

- Every unique URL you encounter.
- The total reference count of every URL.
- The complete list of unique parameters associated with each URL.
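
The choice of database is left open. As one possibility (not prescribed by the assignment), a single MongoDB upsert per encountered URL can keep all three fields current; the connection string, collection, and field names below are purely illustrative:

```js
const { MongoClient } = require('mongodb');

// Sketch only: MongoDB is just one possible choice of database.
async function getCollection() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  return client.db('crawler').collection('urls');
}

// `canonicalUrl` is the URL without its query string; `paramNames` are the
// query-parameter keys found on this occurrence of the URL.
async function recordUrl(collection, canonicalUrl, paramNames) {
  await collection.updateOne(
    { url: canonicalUrl },                              // every unique URL
    {
      $inc: { referenceCount: 1 },                      // total reference count
      $addToSet: { parameters: { $each: paramNames } }, // unique parameter names
    },
    { upsert: true }
  );
}
```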

## Things you should keep in mind

- Your solution needs to be asynchronous in nature.
- Maintain a concurrency of 5 requests at all times, and do not end up getting blocked.
- If you are using request.js, you are not allowed to use its connection pool.
- You are not allowed to use any external scraping or async library.
- Refrain from using the throttled-request package to limit concurrency.
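
One way to satisfy the asynchronous and concurrency constraints without any helper library is to start exactly five worker loops over a shared queue of URLs. The sketch below assumes Node 18+ for the built-in fetch; handlePage is a hypothetical callback that stores the URL (e.g. via recordUrl above) and returns newly discovered medium.com links.

```js
// Sketch only: a fixed pool of 5 concurrent workers built from plain
// Promises; no external scraping, async, or throttling packages.
const CONCURRENCY = 5;

async function crawl(startUrl, handlePage) {
  const queue = [startUrl];    // URLs waiting to be fetched
  const seen = new Set(queue); // every URL enqueued so far

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        const res = await fetch(url);  // built-in fetch (Node 18+)
        const html = await res.text();
        // handlePage stores the URL and returns newly found medium.com links
        for (const link of await handlePage(url, html)) {
          if (!seen.has(link)) {
            seen.add(link);
            queue.push(link);
          }
        }
      } catch (err) {
        console.error(`failed to fetch ${url}: ${err.message}`);
      }
    }
  }

  // Exactly 5 workers run at a time; each loops until the queue is drained.
  await Promise.all(Array.from({ length: CONCURRENCY }, worker));
}
```

A real crawler would also keep idle workers alive while other workers may still be adding links to the queue; that bookkeeping is omitted here for brevity.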
## Things that we love (highly encouraged, but not required)

- A well-baked README file
- Project setup with a simple command
- A concise project structure with configurations
- Good commit history with meaningful and atomic commits
- A dockerized solution

## Expected behavior of your solution

Assume your recursive scraper parses 4 URLs:

1. https://medium.com/some/thing
2. https://medium.com/some/thing?param1=abc
3. https://medium.com/some/thing?param2=xyz
4. https://medium.com/some/thing?param1=def&param3=xxx

Your chosen database must contain the URL https://medium.com/some/thing with a reference count of 4 and a unique parameter list containing param1, param2, param3. Don't worry about parameter values.
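
The canonical URL and the parameter names can be derived with Node's built-in URL class; a minimal sketch (parameter values are ignored, as required):

```js
// Split a raw link into its canonical form and its query-parameter names,
// using only Node's built-in URL class.
function canonicalize(rawUrl) {
  const u = new URL(rawUrl);
  return {
    url: u.origin + u.pathname,         // e.g. https://medium.com/some/thing
    params: [...u.searchParams.keys()], // e.g. ['param1', 'param3']
  };
}

canonicalize('https://medium.com/some/thing?param1=def&param3=xxx');
// => { url: 'https://medium.com/some/thing', params: ['param1', 'param3'] }
```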
