
Whitelist start urls? #43

Open · janpieper opened this issue Jun 26, 2014 · 1 comment

@janpieper (Contributor) commented Jun 26, 2014

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = ["http://www.example.com/foo/bar"]

Polipus.crawler("dummy", start_urls, options) do |crawler|
  # the start URL's path ("/foo/bar") does not match this pattern
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page match the given regexp; only the start URL itself does not, so nothing ever gets crawled.
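
To make the mismatch concrete, a quick check in plain Ruby (URLs taken from the example above):

require 'uri'

pattern = /\/bar\/foo/

# The seed URL itself fails the pattern...
URI("http://www.example.com/foo/bar").path =~ pattern  # => nil
# ...while a link found on that page would match.
URI("http://www.example.com/bar/foo").path =~ pattern  # => 0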

@tmaier (Contributor) commented Jun 26, 2014

At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check #should_be_visited?. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.

#should_be_visited? (https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351) returns false when the link does not match the pattern.

#page_exists? already checks page.user_data.p_seeded. Maybe we need to check this value in the case above as well, along the lines of the sketch below.
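
A minimal sketch of that idea as standalone Ruby, not the actual Polipus internals (the LinkPolicy class and method shape are hypothetical; only follow_links_like and the p_seeded flag come from this thread):

require 'uri'

# Hypothetical stand-in for the crawler's link policy: seeded start
# URLs are whitelisted before the follow_links_like patterns apply.
class LinkPolicy
  def initialize(follow_patterns, seed_urls)
    @follow_patterns = follow_patterns
    @seeds = seed_urls.map { |u| URI(u).to_s }
  end

  def should_be_visited?(url)
    # p_seeded-style whitelist: never filter out a start URL
    return true if @seeds.include?(URI(url).to_s)
    @follow_patterns.any? { |p| URI(url).path =~ p }
  end
end

policy = LinkPolicy.new([/\/bar\/foo/], ["http://www.example.com/foo/bar"])
policy.should_be_visited?("http://www.example.com/foo/bar")    # => true (seeded)
policy.should_be_visited?("http://www.example.com/bar/foo/x")  # => true (pattern match)
policy.should_be_visited?("http://www.example.com/other")      # => false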
