# Recursively crawl the popular blogging website https://medium.com using Node.js, harvest every hyperlink that belongs to medium.com, and store them in a database of your choice
What do you need to store?
- Every unique URL you encounter.
- The total reference count of every URL.
- A complete, unique list of parameter names associated with each URL (one possible record shape is sketched right after this list).
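For illustration only, one record per unique URL could look like the sketch below, assuming a document store such as MongoDB. The field names are placeholders; any database and schema of your choosing is fine.

```js
// Illustrative record shape for a single unique URL; not a required format.
{
  url: "https://medium.com/some/thing",    // unique URL, query string stripped
  referenceCount: 4,                       // how many times this URL was referenced
  params: ["param1", "param2", "param3"]   // unique parameter names, values ignored
}
```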
Things you should keep in mind:
- Your solution needs to be asynchronous in nature.
- Maintain a concurrency of 5 requests at all times, and take care not to get blocked (a minimal limiter sketch follows this list).
- If you are using request.js, you are not allowed to use its connection pool.
- You are not allowed to use any external scraping or async library.
- Refrain from using the throttled-request package to limit concurrency.
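To make these constraints concrete, the sketch below shows one way to keep at most 5 requests in flight using only Node.js built-ins (no request.js pooling, no async or throttling packages). It is illustrative only: MAX_CONCURRENCY, fetchPage, and pump are placeholder names, and persistence is left as a TODO.

```js
// Minimal sketch: a fixed pool of 5 in-flight requests using only Node built-ins.
const https = require('https');

const MAX_CONCURRENCY = 5;
const queue = ['https://medium.com/'];
const seen = new Set(queue);
let inFlight = 0;

// Fetch a page body with the built-in https module (no external HTTP client).
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Pull URLs off the queue while keeping at most 5 requests in flight.
function pump() {
  while (inFlight < MAX_CONCURRENCY && queue.length > 0) {
    const url = queue.shift();
    inFlight++;
    fetchPage(url)
      .then((html) => {
        // Naive href extraction; a real solution would parse more carefully.
        const hrefs = html.match(/href="(https:\/\/medium\.com[^"]*)"/g) || [];
        for (const match of hrefs) {
          const link = match.slice(6, -1); // strip leading href=" and trailing quote
          if (!seen.has(link)) {
            seen.add(link);
            queue.push(link);
          }
        }
        // TODO: persist the URL, its reference count and its parameter names here.
      })
      .catch(() => { /* log and move on; do not crash the crawl */ })
      .finally(() => {
        inFlight--;
        pump(); // refill the pool as soon as a slot frees up
      });
  }
}

pump();
```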
Things that we love (highly encouraged, but not required):
- A well-baked README file
- Project setup with a simple command
- A concise project structure with configurations
- Good commit history with meaningful and atomic commits
- A dockerized solution
Expected behavior of your solution
Assume your recursive scraper parses 4 URLs:
- https://medium.com/some/thing
- https://medium.com/some/thing?param1=abc
- https://medium.com/some/thing?param2=xyz
- https://medium.com/some/thing?param1=def&param3=xxx
Your chosen database must contain the URL https://medium.com/some/thing with a reference count of 4 and a unique list of parameters containing param1, param2, and param3. Don't worry about parameter values.
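As a non-binding sketch of how that normalization could work, the snippet below uses Node's built-in URL class to split each harvested link into its bare URL and its parameter names; the names normalize and record are placeholders, not part of the assignment.

```js
// Sketch: split a harvested link into its bare URL and its parameter names
// using the built-in URL class.
const { URL } = require('url');

function normalize(link) {
  const parsed = new URL(link);
  // Bare URL: origin + path, query string dropped.
  const bareUrl = parsed.origin + parsed.pathname;
  // Parameter names only; values are deliberately ignored.
  const params = [...parsed.searchParams.keys()];
  return { bareUrl, params };
}

// Example: all four URLs above collapse onto the same record.
const record = { referenceCount: 0, params: new Set() };
for (const link of [
  'https://medium.com/some/thing',
  'https://medium.com/some/thing?param1=abc',
  'https://medium.com/some/thing?param2=xyz',
  'https://medium.com/some/thing?param1=def&param3=xxx',
]) {
  const { params } = normalize(link);
  record.referenceCount++;
  params.forEach((p) => record.params.add(p));
}
// record.referenceCount === 4, record.params === Set { 'param1', 'param2', 'param3' }
```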