
add option to limit number of pages crawled? #71

Open
techieshark opened this issue Mar 31, 2021 · 8 comments
techieshark commented Mar 31, 2021

Hey, this is a cool tool. Here's a feature idea, assuming you're open to it.

Currently one can use:

--max-crawl-depth 2

to get, say, the index page plus the pages it links to directly.

But maybe there are a lot of linked pages, and you just need a representative sample: more than one page, but not tons of pages.

So maybe another option like:

--max-crawled-pages N # or just --max-pages ?

and the crawler stops once it has crawled either every page allowed by the other options or N pages, whichever comes first.

One might then use it like:

lighthouse-parade --max-crawl-depth 2 --max-crawled-pages 20 example.com
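The stopping rule being proposed could be sketched roughly like this. This is a hypothetical illustration, not lighthouse-parade's actual crawler: `crawl`, `fetchLinks`, and the option names are stand-ins.

```typescript
// Hypothetical sketch of the proposed --max-crawled-pages stop condition.
// fetchLinks is an illustrative stand-in for whatever fetches a page and
// extracts its links; it is not part of lighthouse-parade's API.
interface CrawlOptions {
  maxCrawlDepth: number; // existing flag: depth 0 is the start page
  maxCrawledPages: number; // the proposed new flag
}

function crawl(
  startUrl: string,
  opts: CrawlOptions,
  fetchLinks: (url: string) => string[]
): string[] {
  const visited = new Set<string>();
  const queue: Array<{ url: string; depth: number }> = [
    { url: startUrl, depth: 0 },
  ];
  while (queue.length > 0) {
    // Stop as soon as either limit is hit, whichever comes first.
    if (visited.size >= opts.maxCrawledPages) break;
    const { url, depth } = queue.shift()!;
    if (visited.has(url) || depth >= opts.maxCrawlDepth) continue;
    visited.add(url);
    for (const link of fetchLinks(url)) {
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return [...visited];
}
```

With `maxCrawlDepth: 2` and `maxCrawledPages: 3`, a start page linking to ten children would yield the start page plus its first two links, matching the "whichever comes first" semantics.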
calebeby commented Apr 1, 2021

Hmm, this is an interesting suggestion! One concern is that the pages that are crawled would be nondeterministic, i.e. if you ran lighthouse-parade twice with the same flags it could crawl a different set of pages because of pages loading at different speeds, throttling, etc. The "first n pages" is not necessarily a representative sample of all the pages on the site. Do you have a suggestion of how to make the crawled pages more deterministic & representative of the whole site?

techieshark (Author) commented:

Interesting challenge; that hadn't occurred to me.

For my use, I'd only ever want the page I point it at, plus some (and ideally yes, always the same) set of linked pages from that page. So like, index page plus 20 pages linked off it. In that case, it seems like it would always be deterministic to the extent the page itself isn't changing.

Given linked_page_limit = --max-linked-pages N (or --max-leaf/outer-pages N):

  1. Fetch the index page.
  2. Let all_index_links be an array of all links on the index page (with duplicates removed).
  3. Let index_links be the first linked_page_limit items in all_index_links.
  4. Fetch all pages in index_links.
  5. Run Lighthouse on the array [index_page, ...index_links].
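The steps above could be sketched as follows. This is a minimal illustration of the "index page plus first N links" idea, not lighthouse-parade code: `extractLinks`, `pagesToAudit`, and `linkedPageLimit` are made-up names, and the regex-based link extraction stands in for a real HTML parser.

```typescript
// Hypothetical sketch: collect the index page's links deterministically,
// dedupe in document order, and keep only the first N.
function extractLinks(html: string, baseUrl: string): string[] {
  const links: string[] = [];
  // Crude stand-in for real HTML parsing; good enough for a sketch.
  const hrefPattern = /<a\s[^>]*href="([^"#]+)"/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    // Resolve relative URLs against the index page so duplicates collapse.
    links.push(new URL(match[1], baseUrl).href);
  }
  return links;
}

function pagesToAudit(
  indexUrl: string,
  indexHtml: string,
  linkedPageLimit: number
): string[] {
  // Set preserves insertion order, so "first N" is deterministic for a
  // given page, regardless of fetch timing.
  const unique = [...new Set(extractLinks(indexHtml, indexUrl))];
  const firstN = unique.slice(0, linkedPageLimit);
  // The index page itself is always audited.
  return [indexUrl, ...firstN];
}
```

Because the selection depends only on the order links appear in the HTML, two runs against an unchanged page pick the same sample.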

I could see the complexity increasing if this were used with a max crawl depth greater than 1, so maybe they should just be mutually exclusive options? One would either crawl some fixed depth entirely, or use this "index page plus N children" mode.

OTOH, if it's going to add too much complexity, perhaps it's not necessary just for my use case. (Maybe if I could run lighthouse-parade multiple times, specifying a single page each time, and have those results all lumped into the same CSV/report, that could do the trick?)
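The "lump results from several runs into one CSV" workaround could look something like the sketch below. The `run-N` directory layout and `report.csv` filename are assumptions for illustration, standing in for separate lighthouse-parade output folders; they are not the tool's documented layout.

```typescript
// Hypothetical sketch of merging CSVs from several single-page runs.
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";

function mergeCsvReports(inputFiles: string[], outputFile: string): void {
  const merged: string[] = [];
  inputFiles.forEach((file, i) => {
    const lines = readFileSync(file, "utf8").trim().split("\n");
    // Keep the header row from the first file only; append data rows
    // from every file.
    merged.push(...(i === 0 ? lines : lines.slice(1)));
  });
  writeFileSync(outputFile, merged.join("\n") + "\n");
}

// Fake outputs from two separate runs, purely for demonstration.
mkdirSync("run-0", { recursive: true });
mkdirSync("run-1", { recursive: true });
writeFileSync("run-0/report.csv", "url,performance\nhttps://example.com/,0.91\n");
writeFileSync("run-1/report.csv", "url,performance\nhttps://example.com/about,0.87\n");

mergeCsvReports(["run-0/report.csv", "run-1/report.csv"], "merged-report.csv");
```

This only works if every run produces the same column layout, which should hold when the runs use the same version and flags.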

mgifford commented:

Just wanted to voice support for this. --max-crawl-depth 2 on one site gives me 50 pages; with --max-crawl-depth 3, I stopped it somewhere past 2,000 pages.

There should be somewhere in between.

calebeby (Member) commented:

@mgifford the new version (currently on the next branch) supports stopping the command with Ctrl-C when you have enough output; the results up to that point are all saved, so you can stop it at any point you want.

mgifford commented:

Excellent. Happy to hear this.

mgifford commented:

@calebeby what is the best way to test with the next branch?

I'm currently running with:
npx lighthouse-parade https://www.example.com ./lighthouse-parade-data --max-crawl-depth 3

calebeby (Member) commented:

Hi @mgifford! I published a beta on the next tag on npm: https://www.npmjs.com/package/lighthouse-parade?activeTab=versions. You can install it with npm i -g lighthouse-parade@next, or use it through npx like this: npx lighthouse-parade@next https://www.example.com/ ./lighthouse-parade-data --max-crawl-depth 3. I have been meaning to finalize the release for quite a while now, but have been super busy with school.

Let me know if you run into anything else!

mgifford commented May 2, 2023

That's great. Might be worth adding that to the README at https://github.com/cloudfour/lighthouse-parade

Thanks!
