
Add support for multiple URLs #212

Open
cipriancraciun opened this issue Mar 9, 2022 · 2 comments
Labels
enhancement (New feature or request)

Comments

@cipriancraciun commented Mar 9, 2022

Assuming #38 is solved (i.e. muffet doesn't fetch the same URL twice), it would be useful to allow muffet to take multiple URLs.

For example, say one has both a www and a blog site that happen to share some resources. If one could list both sites in the same muffet invocation, the shared URLs would be checked only once.

A different use case would be in conjunction with --one-page-only (i.e. with recursion turned off), listing all known URLs on the command line.


Complementary to multiple URLs, a separate option to read these URLs from a file would allow even more flexibility.

For example, one could take the sitemap.xml, process it to extract the URLs that search engines would actually crawl, put those URLs in a file, one per line, and instruct muffet to check only those URLs.
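
A rough sketch of that extraction step, assuming a plain XML sitemap (the site address and file name below are placeholders, and a proper XML parser would be more robust than grep/sed):

# Sketch only: extract the <loc> entries from a sitemap and write one URL per line.
curl -s https://www.example.com/sitemap.xml |
  grep -o '<loc>[^<]*</loc>' |
  sed -e 's|<loc>||' -e 's|</loc>||' > sitemap.txt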

For example, muffet --one-page-only --urls ./sitemap.txt would check every link listed in sitemap.txt without recursing.

Meanwhile, muffet --urls ./sitemap.txt would check all the links listed, recursing from each link but not crossing outside that link's domain.

@raviqqe (Owner) commented Mar 26, 2022

Is it possible to simply use a shell script or another scripting language for this use case? It would be just one line of code, I guess.

For example,

for url in $(cat urls.txt); do muffet "$url"; done

How big is your URL list in your use case?
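
A slightly more defensive variant of that one-liner, reading the list line by line and keeping a non-zero exit status if any run fails (urls.txt is a placeholder name, as above):

# Sketch of the same workaround as a small script: one muffet run per URL.
status=0
while read -r url; do
  muffet "$url" || status=1
done < urls.txt
exit "$status"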

@cipriancraciun (Author)

As said previously, the main use case is when a set of different domains shares a lot of common URLs; checking each domain independently would just generate needless traffic and could get one throttled, especially by GitHub or CloudFlare.

The proposal to read the list of URLs from a file would provide an alternative way of supporting multiple domains, and would also make it possible to check only a smaller part of a larger site without resorting to complex exclusion regular expressions.

Or, for example, one could use muffet to crawl an entire site, extract the pages that have broken links, fix only those, and re-run muffet with --one-page-only only on those pages, again with shared resources fetched only once.
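
A sketch of that last workflow, assuming a JSON output mode (--format=json or --json, depending on the muffet version) that reports an array of page objects with a url field; the exact flag and field names should be verified against the installed version before relying on this:

# Sketch only: crawl the whole site once, collect the pages that reported
# broken links, then re-check only those pages without recursing.
muffet --format=json https://www.example.com/ | jq -r '.[].url' > broken-pages.txt
while read -r url; do
  muffet --one-page-only "$url"
done < broken-pages.txt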

@raviqqe added the enhancement (New feature or request) label on May 26, 2022