Assuming #38 is solved (i.e. `muffet` doesn't fetch the same URL twice), it would be useful to allow `muffet` to take multiple URLs.
For example, say one has both a `www` and a `blog` site that happen to share some resources. If one were able to list both sites in the same `muffet` invocation, the shared URLs would be checked only once (see the sketch below).
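As a rough sketch of what such an invocation might look like (the domains are placeholders, and multi-URL support is the feature being proposed here, not something `muffet` accepts today):

```sh
# Proposed usage (not an existing feature): crawl both sites in one run,
# so resources shared between them are fetched and checked only once.
muffet https://www.example.com https://blog.example.com
```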
A different use case would be in conjunction with `--one-page-only` (i.e. turning recursion off) and listing all known URLs on the command line.
Complementing the multiple URLs, a separate option to read these URLs from a file would allow even more flexibility.
For example, one could take the `sitemap.xml`, process it to extract the URLs that search engines would actually crawl, put those URLs in a file, one per line, and instruct `muffet` to check only those URLs (see the extraction sketch below).
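A minimal sketch of that extraction step, assuming a flat `sitemap.xml` whose page URLs sit in `<loc>` elements (a sitemap index would need one extra pass to fetch the child sitemaps first):

```sh
# Pull every <loc> entry out of sitemap.xml, one URL per line.
# Requires GNU grep for -P (PCRE lookbehind); adapt with sed or xmllint otherwise.
grep -oP '(?<=<loc>)[^<]+' sitemap.xml > sitemap.txt
```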
For example, `muffet --one-page-only --urls ./sitemap.txt` would check all URLs listed in `sitemap.txt` without recursing.
Meanwhile, `muffet --urls ./sitemap.txt` would check all URLs listed and recurse from each one, but without crossing outside that URL's domain.
As said previously, the main use case is when a set of different domains share a lot of common URLs; checking each of them independently just generates needless traffic and could get one throttled, especially by GitHub or CloudFlare.
The proposal of reading the list of URLs from a file would provide an alternative way of supporting multiple domains, but it would also support checking only a smaller part of a larger site, without having to resort to complex exclusion regular expressions.
Or, for example, one could use `muffet` to crawl an entire site, extract the pages that have broken links, fix only those, and re-run `muffet` with `--one-page-only` on just those pages, again fetching shared resources only once.
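A rough sketch of that workflow, assuming `muffet`'s plain-text output prints each failing page URL unindented with its broken links indented beneath it (worth verifying against the output of the `muffet` version in use), and assuming the `--urls` flag proposed in this issue existed:

```sh
# First pass: full crawl; keep only the unindented lines, i.e. the pages
# that actually had broken links on them (assumes the output format above).
muffet https://www.example.com | grep -v '^[[:space:]]' > broken-pages.txt

# After fixing those pages, re-check just them without recursing.
# (--urls is the option proposed in this issue, not an existing flag.)
muffet --one-page-only --urls ./broken-pages.txt
```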