This is tarbell:/space/tmp/2021/fetcher/00README
This directory contains 89570 stories dumped from hist-indexer
fetcher-quar queue while processing early 2021 (outside the DB B/D
overlap dates).
Presumably all CSV entries that are missing S3 objects (failed
repeated fetch attempts).
My question:
historical CSV files contain many entries for the same URL
(fetched for different sources). Are the S3 objects missing
for all downloads_ids with a given URL?
One estimate I made was that 2.2% of stories could not be found on S3.
Phil
P.S.
The ....warc.gz files can be read with "zmore":
The initial record is a header for the entire file,
followed by pairs of "response" and "metadata" records.
WARC-Target-URI header (in both WARC records) shows the original URL.
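Pulling just the WARC-Target-URI values out of the gzipped files can be done with a zgrep pipeline — a sketch run here against a tiny stand-in file (in the real directory you would point it at the actual *.warc.gz files):

```shell
# Build a minimal WARC-ish gzipped stand-in to demonstrate the extraction;
# the real files have full response/metadata record pairs.
printf 'WARC/1.0\nWARC-Type: response\nWARC-Target-URI: http://example.com/story\n' | gzip > /tmp/demo.warc.gz

# Extract the original URLs: grab the header line, keep only the URI field,
# and deduplicate.
zgrep -h 'WARC-Target-URI' /tmp/demo.warc.gz | awk '{print $2}' | sort -u
```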
For one entry:
The "response" sections all contain "HTTP/1.0 None HUH?" which
indicates the HTTP response value in the Story object was invalid
(expected for failed fetches).
metadata:
"via" indicates the full CSV file path on S3
"link" contains the downloads_id from the CSV file
"source_feed_id" and "source_source_id" are from the CSV file feeds_id and media_id columns
I've downloaded s3://mediacloud-database-files/2021/stories_2021-01-06.csv
and found two rows with the above article URL:
In this case neither object seems to exist:
pbudne@tarbell:/space/tmp/2021/fetcher$ aws s3 ls mediacloud-downloads-backup/downloads/2865757686
pbudne@tarbell:/space/tmp/2021/fetcher$ aws s3 ls mediacloud-downloads-backup/downloads/2865757039
Looking at the above csv file, I see 12140 URLs that appear more than
once, so plenty more examples to look at!
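The duplicate count can be reproduced with a cut/sort/uniq pipeline — sketched here on a tiny stand-in CSV with the URL in column 4 (the actual column position in the stories_*.csv files is an assumption to adjust):

```shell
# Stand-in for stories_2021-01-06.csv: downloads_id,feeds_id,media_id,url
printf '1,f1,m1,http://a\n2,f1,m1,http://b\n3,f2,m2,http://a\n' > /tmp/stories.csv

# Count distinct URLs that appear more than once: uniq -d emits each
# duplicated value once, wc -l counts them.
cut -d, -f4 /tmp/stories.csv | sort | uniq -d | wc -l
```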
URLs from the CSV with the maximum dups:
Looking at all URLs in the WARC files:
I expect all URLs to appear at least twice (once for each of the two
WARC records per Story), but some URLs appear 4 or 6 times. This could
be because the URLs appeared in different CSV files and the duplicates
could not be filtered out, or because of a program error (the system
might be processing messages "at least once" rather than "exactly once").
Closer examination of the WARC files (since the metadata records
contain the CSV file name) could be used to check the first theory.
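The check could pair each URL with its "via" value and count distinct CSV files per URL — sketched here on stand-in (URL, via) pairs; extracting the real pairs from the metadata records would need a JSON-aware step first:

```shell
# Stand-in pairs: one line per metadata record, "URL via-CSV-path".
printf 'http://x s3://bucket/a.csv\nhttp://x s3://bucket/b.csv\nhttp://y s3://bucket/a.csv\n' > /tmp/url_via.txt

# sort -u collapses exact repeats, so each remaining line is a distinct
# (URL, CSV) combination; URLs left with count > 1 were fetched from
# more than one CSV file, supporting the first theory.
sort -u /tmp/url_via.txt | awk '{print $1}' | uniq -c | awk '$1 > 1 {print $2}'
```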
Finally, the WARCs contain stories from 56 CSV files:
root@tarbell:/space/tmp/2021/fetcher# zgrep -h '"via"' *.gz | sort -u | wc -l
56
That includes 7 that had a _vN in the object name.
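The _vN variants can be picked out of the "via" list with a grep on the suffix — sketched on stand-in paths (the sample names below are placeholders, not the real CSV paths):

```shell
# Stand-in list of distinct "via" values, one with a _vN version suffix.
printf 's3://b/stories_2021-01-06.csv\ns3://b/stories_2021-01-07_v2.csv\n' > /tmp/vias.txt

# Count entries whose object name carries a _vN suffix.
grep -c '_v[0-9]' /tmp/vias.txt
```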