Blacklist by Keyword #1243
Replies: 3 comments 1 reply
-
In this example, bbot is crawling 7k+ links in this format: https://www.bluehost.com/cdn-cgi/challenge-platform/h/b/jsd/r/879e95894fc60a61. It would help a lot if there were a feature to stop crawling these during the scan, same as ...
-
Agreed this would be a good feature to have. Converting to issue.
-
Raw idea: a module in bbot that was responsible for blacklisting, and that could also prevent HTTPX from crawling similar links, would help a lot with crawl duration. For example, something like urless could check each URL before HTTPX crawls it, decide (based on some configuration options) whether HTTPX has already crawled a similar link, and then allow or skip the crawl. It could have features like these (see the sketch below):

- We define keywords to blacklist during crawling, for example /cdn-cgi/challenge-platform/.
- It only allows crawling one language and skips the others.
- It skips similar links for posts, articles, and products.
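To make the idea concrete, here is a minimal sketch of such a pre-crawl filter. This is plain Python, not bbot's actual API: `should_crawl`, `url_template`, and `BLACKLIST_KEYWORDS` are all hypothetical names, and the urless-style deduplication is reduced to two regexes.

```python
import re
from urllib.parse import urlparse

# Hypothetical, user-configured keyword blacklist (illustrative only).
BLACKLIST_KEYWORDS = ["/cdn-cgi/challenge-platform/"]

# Templates of URLs that have already been crawled once.
seen_templates = set()

def url_template(url: str) -> str:
    """Collapse IDs and hashes so /product/123 and /product/456 look alike."""
    parsed = urlparse(url)
    path = re.sub(r"/\d+(?=/|$)", "/<id>", parsed.path)      # numeric IDs
    path = re.sub(r"/[0-9a-f]{8,}(?=/|$)", "/<hash>", path)  # hex tokens
    return parsed.netloc + path

def should_crawl(url: str) -> bool:
    """Decide whether the crawler should visit this URL or skip it."""
    if any(keyword in url for keyword in BLACKLIST_KEYWORDS):
        return False                   # keyword-blacklisted
    template = url_template(url)
    if template in seen_templates:
        return False                   # a similar link was already crawled
    seen_templates.add(template)       # first link of this shape: crawl it
    return True
```

With a filter like this, the 7k+ /cdn-cgi/challenge-platform/ links from the first comment are all skipped by the keyword check, and only one URL per structural template (one product, one post) ever reaches the crawler.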
-
In some programs, we need to blacklist a specific path such as https://www.example.com/blog/.
However, it seems this is not possible with bbot, so I wanted to suggest adding a blacklist based on keywords.
Then, if I add blog, it won't scan or crawl any links that contain blog.
Thanks 🙏
Update: I was also thinking about a way to limit the crawling of similar links. For example, a site can have 100k products, and I want to crawl only one of them, because the others are similar to it. Or a site can have 50k posts, but I want to crawl only one of them. It would be great if this could be implemented.
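Purely as an illustration of the keyword part of this request (not an existing bbot option), the desired behavior is just a substring filter over candidate URLs. The `filter_urls` helper below is made up for this sketch:

```python
# Illustrative only: a keyword blacklist as requested above.
# Any URL containing a blacklisted substring is never scanned or crawled.
def filter_urls(urls, blacklist):
    return [url for url in urls if not any(kw in url for kw in blacklist)]

urls = [
    "https://www.example.com/blog/some-post",
    "https://www.example.com/contact",
]
print(filter_urls(urls, blacklist=["blog"]))
# -> ['https://www.example.com/contact']
```

The similar-links part of the update would additionally need URL normalization on top of this, along the lines of the deduplication sketch in the earlier comment.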