This is designed to scrape the data from the Github dependency graph page into a JSON file
- Install `yarn`
- Clone the repo
- Run `yarn`
  - Installs dependencies
- Run `npx tsc`
  - Compiles `index.ts`
- Run the scraper
  - `node index.js repoOwner/repo dependents.json`
- The command line arguments for the scraper are as follows:
  - (`githubOwnerAndRepo`) `repoOwner/repo`
    - This is what's displayed in the Github URL when on the repo page, e.g. for this repo it would be `spacesailor24/github-dependents-scraper`
  - (`dependentsFile`) `anything.json`
    - This file can be named anything, but it needs to be a valid JSON file ending with the `.json` file extension
  - (`resumeCrawl`) `true` or `false`
    - Eventually this crawler will get rate limited by Github; this flag lets you resume the crawl from where it left off before receiving the rate limit page
    - So if the crawler dies because of rate limiting, you'd start it up again with: `node index.js repoOwner/repo dependents.json true`
    - NOTE: Starting it with `false` will override the `dependentsFile` and start scraping from the first dependents page
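The three positional arguments above could be read in `index.ts` roughly like this. This is only a sketch of the argument handling the README describes; `parseArgs` and the `CliArgs` shape are hypothetical names, and the repo's actual parsing may differ:

```typescript
// Hypothetical sketch of reading the scraper's three positional arguments.
interface CliArgs {
  githubOwnerAndRepo: string; // e.g. "spacesailor24/github-dependents-scraper"
  dependentsFile: string;     // any name, but must end with .json
  resumeCrawl: boolean;       // omitted or "false" starts from the first page
}

function parseArgs(argv: string[]): CliArgs {
  // argv is expected to be process.argv.slice(2)
  const [githubOwnerAndRepo, dependentsFile, resumeFlag] = argv;
  if (!githubOwnerAndRepo || !githubOwnerAndRepo.includes("/")) {
    throw new Error("Expected first argument in the form repoOwner/repo");
  }
  if (!dependentsFile || !dependentsFile.endsWith(".json")) {
    throw new Error("Expected a dependents file ending with .json");
  }
  return {
    githubOwnerAndRepo,
    dependentsFile,
    resumeCrawl: resumeFlag === "true",
  };
}
```

Called as `parseArgs(process.argv.slice(2))`, anything other than a literal `"true"` third argument leaves `resumeCrawl` false, matching the override behavior noted above.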
Maybe I'll extend the crawler to be able to sort the data, but for now, there's a nifty online sorter that'll do the trick!