
Whitelist start urls? #43

Open · janpieper opened this issue Jun 26, 2014 · 1 comment

@janpieper (Contributor) commented Jun 26, 2014

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = ["http://www.example.com/foo/bar"]

Polipus.crawler("dummy", start_urls, options) do |crawler|
  # the start URL's path ("/foo/bar") does not match this pattern
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page match the given regexp; only the start URL itself does not, so nothing ever gets crawled.
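
To make the mismatch concrete, a quick check in plain Ruby (URLs taken from the example above):

require 'uri'

pattern = /\/bar\/foo/

# The seed URL itself fails the pattern...
URI("http://www.example.com/foo/bar").path =~ pattern  # => nil
# ...while a link found on that page would match.
URI("http://www.example.com/bar/foo").path =~ pattern  # => 0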

@tmaier (Contributor) commented Jun 26, 2014

At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check #should_be_visited?. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.

#should_be_visited? (https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351) returns false when the link does not match the pattern.

#page_exists? already checks page.user_data.p_seeded. Maybe we need to check this value in the case above as well, along the lines of the sketch below.
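
A minimal sketch of that idea as standalone Ruby, not the actual Polipus internals (the LinkPolicy class and method shape are hypothetical; only follow_links_like and the p_seeded flag come from this thread):

require 'uri'

# Hypothetical stand-in for the crawler's link policy: seeded start
# URLs are whitelisted before the follow_links_like patterns apply.
class LinkPolicy
  def initialize(follow_patterns, seed_urls)
    @follow_patterns = follow_patterns
    @seeds = seed_urls.map { |u| URI(u).to_s }
  end

  def should_be_visited?(url)
    # p_seeded-style whitelist: never filter out a start URL
    return true if @seeds.include?(URI(url).to_s)
    @follow_patterns.any? { |p| URI(url).path =~ p }
  end
end

policy = LinkPolicy.new([/\/bar\/foo/], ["http://www.example.com/foo/bar"])
policy.should_be_visited?("http://www.example.com/foo/bar")    # => true (seeded)
policy.should_be_visited?("http://www.example.com/bar/foo/x")  # => true (pattern match)
policy.should_be_visited?("http://www.example.com/other")      # => false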
