
Track url added by #add_url #33

Open · wants to merge 2 commits into master
Conversation

@tmaier (Contributor) commented Jun 6, 2014

I found that URLs added by #add_url (manually or at the beginning of #takeover) are not tracked by url_tracker.

This commit adds url_tracker.visit to #add_url.

My question is: after this change, what is the difference between #add_url and the private method #enqueue?

I have a second question regarding #enqueue.

def enqueue(url_to_visit, current_page, queue)

Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?
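The proposed change can be illustrated with a minimal, self-contained sketch. MiniCrawler and its Set-backed tracker are illustrative stand-ins, not Polipus' real (Redis-backed) implementation; the point is only that #add_url now records the URL in the tracker as well as pushing it onto the queue.

```ruby
require 'set'

# Minimal sketch of the proposed change; names stand in for Polipus' internals.
class MiniCrawler
  def initialize
    @internal_queue = []       # stands in for the Redis-backed queue
    @url_tracker    = Set.new  # stands in for url_tracker
  end

  # Before the patch, #add_url only pushed onto the queue; the proposed
  # change also marks the URL in the tracker (url_tracker.visit).
  def add_url(url)
    @url_tracker.add(url)      # the newly added tracking step
    @internal_queue << url
  end

  def tracked?(url)
    @url_tracker.include?(url)
  end
end

crawler = MiniCrawler.new
crawler.add_url('http://example.com/')
puts crawler.tracked?('http://example.com/')  # => true
```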

@coveralls

Coverage Status

Coverage decreased (-7.16%) when pulling 5180dac on tmaier:patch-1 into 7de6421 on taganaka:master.

@coveralls

Coverage Status

Coverage decreased (-0.21%) when pulling ad02f75 on tmaier:patch-1 into 7de6421 on taganaka:master.

@taganaka (Owner) commented Jun 7, 2014

Answering in random order 😛

Why does #enqueue require a queue as an argument? Would it not be possible to use @internal_queue?

It should be absolutely safe to use @internal_queue here, since the Redis client is thread-safe.
Also, since #enqueue is a private method, the change should not break any code outside the class.

We could also get rid of the use of @http_pool and @queues_pool, since each thread will use its own local instance [1]
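The point about dropping the queue parameter can be sketched as follows. This is a hypothetical stand-in, not Polipus' actual code: Ruby's thread-safe stdlib Queue plays the role of the Redis-backed queue, so #enqueue can reference the shared @internal_queue directly instead of receiving a queue argument.

```ruby
# Sketch: because the underlying queue is thread-safe (like the Redis
# client), #enqueue needs no `queue` parameter and can use @internal_queue.
class MiniCrawler
  def initialize
    @internal_queue = Queue.new  # thread-safe shared queue
  end

  def add_url(url)
    enqueue(url)
  end

  def queued_count
    @internal_queue.size
  end

  private

  # No `queue` argument: every caller implicitly uses @internal_queue,
  # which is safe to touch from any worker thread.
  def enqueue(url_to_visit)
    @internal_queue << url_to_visit
  end
end

crawler = MiniCrawler.new
threads = 4.times.map { |i| Thread.new { crawler.add_url("http://example.com/#{i}") } }
threads.each(&:join)
puts crawler.queued_count  # => 4
```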

I found that URLs added by #add_url (manually or at the beginning of #takeover) are not tracked by url_tracker.

I added the add_url method as a shorthand to enqueue a URL regardless of its state in the tracker.

The tracker is mainly used as a fast way to determine whether a URL is already enqueued, or is about to be enqueued by another thread, since there is no easy, fast, and memory-efficient way to enumerate the elements stored in a Redis linked list (it is an O(N) operation).

I ran into some cases where it was useful to have the queue pre-seeded with a large number of URLs by some external process, not necessarily part of a crawling session.
For instance, I found it useful to re-seed the queue with some buckets of URLs just to keep the stored data selectively up to date. Hence I would keep add_url as is, as an easy way to enqueue a URL while skipping the tracking part.

You are right, though: seeded URLs should be tracked by the tracker!
The p_seeded hack guarantees that pages used as the initial seed will be re-evaluated, but yes, those pages should be added to the tracker.
We may add yield(page) if block_given? to the enqueue method and change add_url to call enqueue.
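That suggestion can be sketched like this. Again, MiniCrawler and its page hash are illustrative stand-ins for Polipus' internals (the real enqueue takes more arguments and builds a Polipus::Page): add_url becomes a thin wrapper over enqueue, which tracks the URL and yields the page when a block is given.

```ruby
require 'set'

# Sketch of the suggestion: #enqueue yields the page when a block is given,
# and #add_url simply delegates to it, so seeded URLs get tracked too.
class MiniCrawler
  def initialize
    @internal_queue = []
    @url_tracker    = Set.new
  end

  def add_url(url, &block)
    enqueue(url, &block)
  end

  def tracked?(url)
    @url_tracker.include?(url)
  end

  private

  def enqueue(url_to_visit)
    page = { url: url_to_visit }    # stands in for a Polipus::Page
    yield(page) if block_given?     # the proposed hook
    @url_tracker.add(url_to_visit)  # seeded URLs are now tracked as well
    @internal_queue << page
  end
end

crawler = MiniCrawler.new
crawler.add_url('http://example.com/') { |page| page[:user_data] = :seeded }
puts crawler.tracked?('http://example.com/')  # => true
```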

[1]https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L108-L110

@tmaier mentioned this pull request Jun 10, 2014