hist-fetcher issues #328

Open
philbudne opened this issue Aug 17, 2024 · 0 comments

For "historical" data (where we have CSV files dumped from the legacy system with URL and downloads_id for objects on S3) the process is:

hist-queuer: reads the CSV file, keeping a set of URLs previously seen (the legacy system did not de-duplicate by URLs seen on different feeds), and queues a primordial Story object for the hist-fetcher.
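
To make the shape of the pipeline concrete, here is a minimal sketch of the queuer side; the CSV column names, the queue interface, and the contents of the queued payload are assumptions for illustration, not the actual hist-queuer code:

```python
# Illustrative sketch only: CSV column names, the queue interface, and the
# queued payload are assumptions, not the real hist-queuer implementation.
import csv

def queue_history_csv(csv_path, queue):
    seen_urls = set()            # legacy system did not de-dup across feeds
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["url"]
            if url in seen_urls:
                continue         # skip URLs already queued from this file
            seen_urls.add(url)
            # queue a "primordial" Story: just enough for hist-fetcher
            queue.put({"url": url, "downloads_id": row["downloads_id"]})
```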

hist-fetcher: tries to fetch the S3 object named by the downloads_id. In a staging run of September 2021, 1100 initial fetches failed (2.2% of the 50K queued).

Observations/questions:

  1. hist-fetcher lets all S3 download errors be retried (good if the failure was a transient networking issue, a waste of RPC ops if the object simply does not exist)
  2. I wonder: are the objects missing for ALL downloads_ids for that URL?
  3. hist-fetcher does not log the downloads_id before fetching, so we can't see in real time what is failing (a sketch addressing items 1 and 3 follows this list)
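
A minimal sketch of what items 1 and 3 suggest (log the downloads_id before the fetch, and treat a missing key as non-retryable), assuming a boto3 S3 client; the bucket name, key layout, and quarantine/retry signalling are hypothetical:

```python
# Sketch only: bucket name, key layout, and the retry/quarantine convention
# (None = quarantine, raise = let the queue retry) are assumptions.
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger("hist-fetcher")
s3 = boto3.client("s3")

def fetch_download(downloads_id: int):
    key = f"downloads/{downloads_id}"      # hypothetical key layout
    logger.info("fetching downloads_id %s (s3 key %s)", downloads_id, key)
    try:
        resp = s3.get_object(Bucket="legacy-downloads", Key=key)
        return resp["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            # object is simply missing: quarantine, don't burn RPC ops retrying
            logger.warning("downloads_id %s: no such object", downloads_id)
            return None
        raise                              # network/throttling etc.: retryable
```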

I accidentally cleared out the hist-prod stack fetcher-quar queue of 38K entries from January 2021 (thinking I was clearing the hist-staging stack queues), so we can't use that to extract data to try to answer item 2 (see item 3 for why we have to wait 10 hours and then examine the quarantine results).

The reasons the work is split into queuer/fetcher:

  1. the latency for AWS retrievals is between 100 and 110ms, so a single fetcher that read a CSV file and fetched inline would be limited to roughly 9-10 objects per second (see the estimate after this list)
  2. queuing the work allows the slow (network-bound) fetching to be done by multiple workers
  3. if fetches fail due to network (or other) interruption, the state of work done is retained, and failures can be retried.
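
A back-of-the-envelope estimate for point 1, using the 50K objects from the staging run above; 105ms is just the midpoint of the observed latency, and the worker counts are arbitrary examples:

```python
# Rough wall-clock estimate at ~105 ms per S3 GET (midpoint of 100-110 ms).
LATENCY_S = 0.105

def hours(n_objects: int, workers: int) -> float:
    return n_objects * LATENCY_S / workers / 3600

for w in (1, 4, 16):
    print(f"{w:2d} worker(s): 50,000 objects in ~{hours(50_000, w):.2f} hours")
# 1 worker: ~1.46 h, 4 workers: ~0.36 h, 16 workers: ~0.09 h
```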

IF it turns out that HTML is available for some downloads_ids and not others for the same URL, the simple solution is to remove the dedup in the queuer. That could also mean duplicating MANY fetches (if HTML is often available for more than one dl_id) and generally slower progress.

Since we only have one URL for historical stories, it's trivial to take a CSV file and test whether each article has been indexed (look it up by the hash of the normalized URL), so it might be possible to do a backfill of the backfill.
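
A minimal sketch of that check, assuming an Elasticsearch index whose document ids are the SHA-1 hash of the normalized URL; the index name, the id scheme, and normalize_url() here are placeholders, not necessarily the indexer's actual conventions:

```python
# Sketch only: index name, id scheme, and URL normalization are assumptions.
import csv
import hashlib
from elasticsearch import Elasticsearch

def normalize_url(url: str) -> str:
    # stand-in for whatever canonicalization the indexer actually applies
    return url.strip().lower()

def unindexed_rows(csv_path: str, es: Elasticsearch, index: str = "mc_search"):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            doc_id = hashlib.sha1(normalize_url(row["url"]).encode()).hexdigest()
            if not es.exists(index=index, id=doc_id):
                yield row        # candidate for a backfill of the backfill
```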
