For "historical" data (where we have CSV files dumped from the legacy system with URL and downloads_id for objects on S3) the process is:
hist-queuer: reads the CSV file, keeping a set of URLs previously seen (the legacy system did not de-duplicate by URLs seen on different feeds), and queues a primordial Story object for the hist-fetcher.
hist-fetcher: tries to fetch the S3 object named by the downloads_id. In a staging run of September 2021, 1100 initial fetches failed (2.2% of the 50K queued).
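A minimal sketch of the queuer's dedup-and-queue loop (not the real code), assuming the CSV has "url" and "downloads_id" columns; queue_story() stands in for whatever actually publishes the primordial Story object to the hist-fetcher's queue:

```python
import csv

# Minimal sketch of the hist-queuer loop: dedup by URL, then queue a
# primordial Story for the hist-fetcher.  Assumes "url" and "downloads_id"
# CSV columns; queue_story() is a stand-in for the real queue publisher.
def hist_queue(csv_path: str, queue_story) -> None:
    seen_urls: set[str] = set()   # the legacy system did not dedup across feeds
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row["url"]
            if url in seen_urls:
                continue          # only the first downloads_id seen for a URL is queued
            seen_urls.add(url)
            queue_story({"url": url, "downloads_id": row["downloads_id"]})
```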
Observations/questions:
hist-fetcher lets all S3 download errors be retried (good if this was a networking issue, a waste of RPC ops if not; see the sketch after this list for one way to classify the errors)
I wonder: are the objects missing for ALL downloads_ids for that URL?
hist-fetcher does not log the downloads_id before fetching, so we can't see in real time what is failing
I accidentally cleared out the hist-prod stack fetcher-quar queue of 38K entries from January 2021 (thinking I was clearing the hist-staging stack queues), so we can't use that to extract data to try to answer item 2 (see item 3 for why we have to wait 10 hours, and then examine the quarantine results).
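On items 1 and 3, a hedged sketch of how the fetcher could log the downloads_id before fetching and retry only transient S3 errors; get_s3_object() and the exception classes are stand-ins for illustration, not the real hist-fetcher code:

```python
import logging

import botocore.exceptions

class PermanentFetchError(Exception):
    """Object will never appear; quarantine instead of retrying."""

class RetryableFetchError(Exception):
    """Transient failure; safe to retry."""

logger = logging.getLogger("hist-fetcher")

# Log the downloads_id up front (item 3) and only let transient S3 errors
# be retried (item 1).  get_s3_object() is assumed to wrap boto3's
# get_object() for the key derived from downloads_id.
def fetch_once(get_s3_object, downloads_id: str) -> bytes:
    logger.info("fetching downloads_id %s", downloads_id)
    try:
        return get_s3_object(downloads_id)
    except botocore.exceptions.ClientError as e:
        code = e.response.get("Error", {}).get("Code", "")
        if code in ("NoSuchKey", "404"):
            raise PermanentFetchError(downloads_id)  # missing object: retrying wastes RPC ops
        raise RetryableFetchError(downloads_id)      # throttling, 5xx, etc.
    except botocore.exceptions.EndpointConnectionError:
        raise RetryableFetchError(downloads_id)      # network problem: retry later
```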
The reasons the work is split into queuer/fetcher:
the latency for AWS S3 retrievals is between 100 and 110ms, so the fetch rate of a single fetcher that read a CSV file would be limited to roughly 10 objects/second (see the rough math after this list)
queuing the work allows the slow (network bound) fetching to be done by multiple workers
if fetches fail due to network (or other) interruption, the state of work done is retained, and failures can be retried.
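Rough, back-of-the-envelope math behind the split, using the latency and queue-size figures above (scaling assumed near-linear):

```python
# Why a single CSV-reading fetcher would be slow, and how workers help.
S3_LATENCY_SEC = 0.105        # ~100-110 ms per S3 GET
QUEUED_OBJECTS = 50_000       # size of the September 2021 staging run

serial_rate = 1 / S3_LATENCY_SEC                 # ~9.5 fetches/second for one worker
for workers in (1, 4, 16):
    rate = workers * serial_rate                 # assumes near-linear scaling
    hours = QUEUED_OBJECTS / rate / 3600
    print(f"{workers:2d} worker(s): ~{rate:5.1f} fetch/s, ~{hours:.1f} h for {QUEUED_OBJECTS} objects")
```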
IF it turns out that HTML is available for some downloads_ids and not others for the same URL, the simple solution is to remove the dedup in the queuer, but that could also mean duplicating MANY fetches (if HTML is often available for more than one dl_id) and generally slower progress.
Since we only have one URL for historical stories, it's trivial to take a CSV file and test whether an article has been indexed (try looking it up by the hash of the normalized URL), and it might be possible to do a backfill on the backfill.
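A sketch of that check; normalize_url() and lookup_by_url_hash() are assumptions standing in for the indexer's real URL normalization and index lookup (e.g. an Elasticsearch term query on the stored hash field), and sha256 is just an illustrative hash:

```python
import csv
import hashlib

# Walk the CSV and yield rows whose normalized-URL hash is not in the index,
# i.e. candidates for a second pass ("backfill on the backfill").
def unindexed_rows(csv_path: str, normalize_url, lookup_by_url_hash):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url_hash = hashlib.sha256(normalize_url(row["url"]).encode()).hexdigest()
            if not lookup_by_url_hash(url_hash):
                yield row     # not indexed yet
```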
For "historical" data (where we have CSV files dumped from the legacy system with URL and
downloads_id
for objects on S3) the process is:hist-queuer: reads the CSV file, keeping a set of URLs previously seen (the legacy system did not de-duplicate by URLs seen on different feeds), and queues a primordial Story object for the hist-fetcher.
hist-fetcher: tries to fetch the S3 object named by the downloads_id. In a staging run of September 2021, 1100 initial fetches failed (2.2% of the 50K queued).
Observations/questions:
downloads_id
s for that URL?downloads_id
before fetching, so we can't see in real time what is failingI accidentally cleared out the hist-prod stack fetcher-quar queue of 38K entries from January 2021 (thinking I was clearing the hist-staging stack queues), so we can't use that to extract data to try to answer item 2 (see item 3 for why we have to wait 10 hours, and then examine the quarantine results).
The reasons the work is split into queuer/fetcher:
IF it turns out that HTML is available for some downloads_ids and not others for the same URL, the simple solution is to remove the dedup in the queuer, which could also mean duplicating MANY fetches, if HTML is often available for more than one dl_id, and generally slower progress.
Since we only have one URL for historical stories, it's trivial to take a CSV file and test whether an article has been indexed (try looking it up by the hash of the normalized URL, and it might be possible to do backfill on the backfill.
The text was updated successfully, but these errors were encountered: