add option to limit number of pages crawled? #71
Comments
Hmm, this is an interesting suggestion! One concern is that the set of crawled pages would be nondeterministic: if you ran lighthouse-parade twice with the same flags, it could crawl a different set of pages because pages load at different speeds, get throttled, etc. The "first n pages" are not necessarily a representative sample of all the pages on the site. Do you have a suggestion for how to make the crawled pages more deterministic and representative of the whole site?
Interesting challenge, that didn't occur to me. For my use, I'd only ever want the page I point it at, plus some (and ideally yes, always the same) set of linked pages from that page. So like, index page plus 20 pages linked off it. In that case, it seems like it would always be deterministic to the extent the page itself isn't changing. Given
I could see the complication increasing if this were used with a max crawl depth of more than 1, so maybe they're just mutually exclusive options? One would either specify to crawl some fixed depth entirely, or use this "index page plus N children" mode. On the other hand, if it is going to add too much complexity, perhaps it's not necessary just for my use case. (Maybe if I could just run lighthouse-parade multiple times, specifying a single page each time, and have those results all lumped into the same CSV/report, that could do the trick?)
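To make the "index page plus N children" idea concrete, here is a minimal sketch of one way the child pages could be chosen so that the same page always yields the same set. `pickChildPages` is a hypothetical helper, not part of lighthouse-parade, and sorting by URL is only an assumption about what "deterministic" could mean here; it makes the selection stable but not necessarily representative.

```typescript
// Hypothetical helper, not part of lighthouse-parade: choose a stable set of
// child pages from the links found on the starting page.
function pickChildPages(
  startUrl: string,
  links: string[],
  maxPages: number,
): string[] {
  const start = new URL(startUrl);
  start.hash = '';
  const children = new Set<string>();

  for (const link of links) {
    let url: URL;
    try {
      url = new URL(link, start); // resolve relative links against the start page
    } catch (err) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    url.hash = ''; // fragments point at the same page
    if (url.origin !== start.origin) continue; // stay on the same site
    if (url.href === start.href) continue; // the start page is not its own child
    children.add(url.href);
  }

  // Sorting removes any dependence on the order in which links were discovered.
  return Array.from(children).sort().slice(0, maxPages);
}
```

As long as the index page's links don't change, two runs would pick the same pages, regardless of load order or throttling.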
Just wanted to provide support for this. On one site, --max-crawl-depth 2 gives me 50 pages; with --max-crawl-depth 3, I stopped the crawl somewhere after 2k pages. There should be something in between.
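A page cap that sits between those two depths could look roughly like the sketch below: a breadth-first crawl that stops at the depth limit or the page limit, whichever comes first. This is an illustration only, not how lighthouse-parade is implemented. `crawlWithLimits` and `getLinks` are hypothetical names, `getLinks` stands in for whatever fetches a page and extracts its links, and counting the start page as depth 0 is an assumption.

```typescript
// Hypothetical sketch: breadth-first crawl bounded by both depth and page count.
// `getLinks` is a stand-in for fetching a page and returning the URLs it links to.
function crawlWithLimits(
  startUrl: string,
  getLinks: (url: string) => string[],
  maxDepth: number,
  maxPages: number,
): string[] {
  const visited: string[] = [];
  const seen = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  // Assumption: the start page is depth 0, pages it links to are depth 1, etc.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.length >= maxPages) return visited; // page cap reached
      visited.push(url);
      if (depth === maxDepth) continue; // do not expand pages at the depth limit
      for (const link of getLinks(url)) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return visited;
}
```

With a synchronous `getLinks` the visit order is fixed by the link order on each page; a real crawler fetching pages concurrently would need extra care to keep that property.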
@mgifford the new version (on the |
Excellent. Happy to hear this.
@calebeby what is the best way to test with the next branch? I'm currently running with: |
Hi @mgifford! I published a beta of it on the Let me know if you run into anything else! |
That's great. Might want to just add that to https://github.com/cloudfour/lighthouse-parade. Thanks!
Hey this is a cool tool. Here's a feature idea assuming you're open to it.
Currently one can use:
to get, say, index page + 1st linked pages.
But maybe there are a lot of linked pages, and you just need a representative sample which is more than 1 page but not tons of pages.
So maybe another option like:
and the crawler stops after it exhausts all pages allowed by other options, or N, whichever is smaller.
One might then use it like: