hist-fetcher issues #328

Open
philbudne opened this issue Aug 17, 2024 · 0 comments

For "historical" data (where we have CSV files dumped from the legacy system with URL and downloads_id for objects on S3) the process is:

hist-queuer: reads the CSV file, keeping a set of URLs previously seen (the legacy system did not de-duplicate by URLs seen on different feeds), and queues a primordial Story object for the hist-fetcher.
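
To make the shape of the pipeline concrete, here is a minimal sketch of the queuer side; the CSV column names, the queue interface, and the contents of the queued payload are assumptions for illustration, not the actual hist-queuer code:

```python
# Illustrative sketch only: CSV column names, the queue interface, and the
# queued payload are assumptions, not the real hist-queuer implementation.
import csv

def queue_history_csv(csv_path, queue):
    seen_urls = set()            # legacy system did not de-dup across feeds
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["url"]
            if url in seen_urls:
                continue         # skip URLs already queued from this file
            seen_urls.add(url)
            # queue a "primordial" Story: just enough for hist-fetcher
            queue.put({"url": url, "downloads_id": row["downloads_id"]})
```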

hist-fetcher: tries to fetch the S3 object named by the downloads_id. In a staging run of September 2021, 1100 initial fetches failed (2.2% of the 50K queued).

Observations/questions:

  1. hist-fetcher lets all S3 download errors be retried (good if the failure was a transient networking issue, a waste of RPC ops if the object simply does not exist)
  2. I wonder: are the objects missing for ALL downloads_ids for that URL?
  3. hist-fetcher does not log the downloads_id before fetching, so we can't see in real time what is failing (a sketch addressing items 1 and 3 follows this list)
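
A minimal sketch of what items 1 and 3 suggest (log the downloads_id before the fetch, and treat a missing key as non-retryable), assuming a boto3 S3 client; the bucket name, key layout, and quarantine/retry signalling are hypothetical:

```python
# Sketch only: bucket name, key layout, and the retry/quarantine convention
# (None = quarantine, raise = let the queue retry) are assumptions.
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger("hist-fetcher")
s3 = boto3.client("s3")

def fetch_download(downloads_id: int):
    key = f"downloads/{downloads_id}"      # hypothetical key layout
    logger.info("fetching downloads_id %s (s3 key %s)", downloads_id, key)
    try:
        resp = s3.get_object(Bucket="legacy-downloads", Key=key)
        return resp["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            # object is simply missing: quarantine, don't burn RPC ops retrying
            logger.warning("downloads_id %s: no such object", downloads_id)
            return None
        raise                              # network/throttling etc.: retryable
```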

I accidentally cleared out the hist-prod stack fetcher-quar queue of 38K entries from January 2021 (thinking I was clearing the hist-staging stack queues), so we can't use that to extract data to try to answer item 2 (see item 3 for why we have to wait 10 hours and then examine the quarantine results).

The reasons the work is split into queuer/fetcher:

  1. the latency for AWS retrievals is between 100 and 110ms, so a single fetcher that read a CSV file and fetched inline would be limited to roughly 9-10 objects per second (see the estimate after this list)
  2. queuing the work allows the slow (network-bound) fetching to be done by multiple workers
  3. if fetches fail due to network (or other) interruption, the state of work done is retained, and failures can be retried.
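
A back-of-the-envelope estimate for point 1, using the 50K objects from the staging run above; 105ms is just the midpoint of the observed latency, and the worker counts are arbitrary examples:

```python
# Rough wall-clock estimate at ~105 ms per S3 GET (midpoint of 100-110 ms).
LATENCY_S = 0.105

def hours(n_objects: int, workers: int) -> float:
    return n_objects * LATENCY_S / workers / 3600

for w in (1, 4, 16):
    print(f"{w:2d} worker(s): 50,000 objects in ~{hours(50_000, w):.2f} hours")
# 1 worker: ~1.46 h, 4 workers: ~0.36 h, 16 workers: ~0.09 h
```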

IF it turns out that HTML is available for some downloads_ids and not others for the same URL, the simple solution is to remove the dedup in the queuer. That could also mean duplicating MANY fetches (if HTML is often available for more than one dl_id) and generally slower progress.

Since we only have one URL for historical stories, it's trivial to take a CSV file and test whether each article has been indexed (look it up by the hash of the normalized URL), so it might be possible to do a backfill of the backfill.
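
A minimal sketch of that check, assuming an Elasticsearch index whose document ids are the SHA-1 hash of the normalized URL; the index name, the id scheme, and normalize_url() here are placeholders, not necessarily the indexer's actual conventions:

```python
# Sketch only: index name, id scheme, and URL normalization are assumptions.
import csv
import hashlib
from elasticsearch import Elasticsearch

def normalize_url(url: str) -> str:
    # stand-in for whatever canonicalization the indexer actually applies
    return url.strip().lower()

def unindexed_rows(csv_path: str, es: Elasticsearch, index: str = "mc_search"):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            doc_id = hashlib.sha1(normalize_url(row["url"]).encode()).hexdigest()
            if not es.exists(index=index, id=doc_id):
                yield row        # candidate for a backfill of the backfill
```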
