
Track url added by #add_url #33

Open · wants to merge 2 commits into master
Conversation

@tmaier (Contributor) commented Jun 6, 2014

I found that URLs added by #add_url (manually or at the beginning of #takeover) are not tracked by url_tracker.

This commit adds url_tracker.visit to #add_url.

My question is: after this change, what is the difference between #add_url and the private method #enqueue?

I have a second question regarding #enqueue.

def enqueue(url_to_visit, current_page, queue)

Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?
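The proposed change can be illustrated with a minimal, self-contained sketch. MiniCrawler and its Set-backed tracker are illustrative stand-ins, not Polipus' real (Redis-backed) implementation; the point is only that #add_url now records the URL in the tracker as well as pushing it onto the queue.

```ruby
require 'set'

# Minimal sketch of the proposed change; names stand in for Polipus' internals.
class MiniCrawler
  def initialize
    @internal_queue = []       # stands in for the Redis-backed queue
    @url_tracker    = Set.new  # stands in for url_tracker
  end

  # Before the patch, #add_url only pushed onto the queue; the proposed
  # change also marks the URL in the tracker (url_tracker.visit).
  def add_url(url)
    @url_tracker.add(url)      # the newly added tracking step
    @internal_queue << url
  end

  def tracked?(url)
    @url_tracker.include?(url)
  end
end

crawler = MiniCrawler.new
crawler.add_url('http://example.com/')
puts crawler.tracked?('http://example.com/')  # => true
```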

@coveralls

Coverage Status

Coverage decreased (-7.16%) when pulling 5180dac on tmaier:patch-1 into 7de6421 on taganaka:master.

@coveralls

Coverage Status

Coverage decreased (-0.21%) when pulling ad02f75 on tmaier:patch-1 into 7de6421 on taganaka:master.

@taganaka (Owner) commented Jun 7, 2014

Answering in random order 😛

Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?

It should be absolutely safe to use @internal_queue here, since the Redis client is thread-safe.
Also, since #enqueue is a private method, the change should not break any code outside the class.

We could also get rid of the use of @http_pool and @queues_pool, since each thread will use its own local instance [1]
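The point about dropping the queue parameter can be sketched as follows. This is a hypothetical stand-in, not Polipus' actual code: Ruby's thread-safe stdlib Queue plays the role of the Redis-backed queue, so #enqueue can reference the shared @internal_queue directly instead of receiving a queue argument.

```ruby
# Sketch: because the underlying queue is thread-safe (like the Redis
# client), #enqueue needs no `queue` parameter and can use @internal_queue.
class MiniCrawler
  def initialize
    @internal_queue = Queue.new  # thread-safe shared queue
  end

  def add_url(url)
    enqueue(url)
  end

  def queued_count
    @internal_queue.size
  end

  private

  # No `queue` argument: every caller implicitly uses @internal_queue,
  # which is safe to touch from any worker thread.
  def enqueue(url_to_visit)
    @internal_queue << url_to_visit
  end
end

crawler = MiniCrawler.new
threads = 4.times.map { |i| Thread.new { crawler.add_url("http://example.com/#{i}") } }
threads.each(&:join)
puts crawler.queued_count  # => 4
```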

I found that URLs added by #add_url (manually or at the beginning of #takeover) are not tracked by url_tracker.

I added the add_url method as a shorthand to enqueue a URL regardless of its state in the tracker.

The tracker is mainly used as a fast way to determine whether a URL is already enqueued, or is about to be enqueued by another thread, since there is no easy, fast, and memory-efficient way to enumerate the elements stored in a Redis linked list (it is an O(N) operation).

I ran into some cases where it was useful to have the queue pre-seeded with a large number of URLs by some external process, not necessarily part of a crawling session.
For instance, I found it useful to re-seed the queue with some buckets of URLs just to keep the stored data selectively up to date. Hence I would keep add_url as is, as an easy way to enqueue a URL while skipping the tracking part.

You are right, though: seeded URLs should be tracked by the tracker!
The p_seeded hack guarantees that pages used as the initial seed will be re-evaluated, but yes, those pages should be added to the tracker.
We may add yield(page) if block_given? to the enqueue method and change add_url to call enqueue.
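That suggestion can be sketched like this. Again, MiniCrawler and its page hash are illustrative stand-ins for Polipus' internals (the real enqueue takes more arguments and builds a Polipus::Page): add_url becomes a thin wrapper over enqueue, which tracks the URL and yields the page when a block is given.

```ruby
require 'set'

# Sketch of the suggestion: #enqueue yields the page when a block is given,
# and #add_url simply delegates to it, so seeded URLs get tracked too.
class MiniCrawler
  def initialize
    @internal_queue = []
    @url_tracker    = Set.new
  end

  def add_url(url, &block)
    enqueue(url, &block)
  end

  def tracked?(url)
    @url_tracker.include?(url)
  end

  private

  def enqueue(url_to_visit)
    page = { url: url_to_visit }    # stands in for a Polipus::Page
    yield(page) if block_given?     # the proposed hook
    @url_tracker.add(url_to_visit)  # seeded URLs are now tracked as well
    @internal_queue << page
  end
end

crawler = MiniCrawler.new
crawler.add_url('http://example.com/') { |page| page[:user_data] = :seeded }
puts crawler.tracked?('http://example.com/')  # => true
```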

[1]https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L108-L110

@tmaier mentioned this pull request Jun 10, 2014