A quick look at some quarantined (missing) stories from 2021 historical CSVs #330

Open
philbudne opened this issue Aug 23, 2024 · 0 comments


This is tarbell:/space/tmp/2021/fetcher/00README

This directory contains 89,570 stories dumped from the hist-indexer
fetcher-quar queue while processing early 2021 (outside the DB B/D
overlap dates).

Presumably these are all the CSV entries whose S3 objects are missing
(i.e., repeated fetch attempts failed).

My question:

The historical CSV files contain many entries for the same URL
(fetched for different sources). Are the S3 objects missing
for all downloads_ids with a given URL?

One estimate I made was that 2.2% of stories were failing to be found on S3.
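The question above amounts to grouping the CSV rows by URL and intersecting each group with the set of downloads_ids that do have S3 objects. A minimal Python sketch of that check (the function name and data shapes are mine; the set of existing keys would come from something like repeated `aws s3 ls` calls):

```python
from collections import defaultdict

def urls_fully_missing(rows, existing_ids):
    """rows: (url, downloads_id) pairs from a historical CSV.
    existing_ids: set of downloads_ids that do have an S3 object.
    Returns (URLs where every downloads_id lacks an object,
             URLs where only some of them do)."""
    by_url = defaultdict(list)
    for url, did in rows:
        by_url[url].append(did)
    all_missing = [u for u, ids in by_url.items()
                   if not any(d in existing_ids for d in ids)]
    partly_missing = [u for u, ids in by_url.items()
                      if 0 < sum(d in existing_ids for d in ids) < len(ids)]
    return all_missing, partly_missing
```

If `partly_missing` comes back empty, that would support the theory that S3 objects are missing for all downloads_ids of a given URL at once.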

Phil

P.S.
The *.warc.gz files can be read with "zmore":

The initial record is a header for the entire file,
followed by pairs of "response" and "metadata" records.

The WARC-Target-URI header (present in both WARC records) shows the original URL.
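If zmore is too manual, the Target-URI headers can also be pulled out programmatically. A minimal Python sketch using only the standard library (no warcio), assuming the records are plain text inside the (typically multi-member) gzip stream:

```python
import gzip

def target_uris(path):
    """Return the WARC-Target-URI value of every record in a .warc.gz file.

    WARC files are usually multi-member gzip streams (one member per
    record); Python's gzip module transparently reads the members
    back to back as one text stream."""
    uris = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("WARC-Target-URI:"):
                uris.append(line.split(":", 1)[1].strip())
    return uris
```

A naive line scan like this could in principle match a body line that happens to start with the header name, but for a quick look it matches what the zgrep one-liner later in this note does.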

For one entry:

https://www.stuff.co.nz/entertainment/bravo/the-dish/123884771/porsha-williams-rings-in-the-new-year-dancing-on-a-yacht-in-the-pursuit-of-porsha.html

The "response" sections all contain "HTTP/1.0 None HUH?" which
indicates the HTTP response value in the Story object was invalid
(expected for failed fetches).

metadata:

  • "via" indicates the full CSV file path on S3
  • "link" contains the downloads_id from the CSV file
  • "source_feed_id" and "source_source_id" come from the CSV file's feeds_id and media_id columns

  "rss_entry": {
    "link": "2865757039",
    "title": null,
    "domain": null,
    "pub_date": null,
    "fetch_date": null,
    "source_url": null,
    "source_feed_id": 1805781,
    "source_source_id": 656943,
    "via": "s3://mediacloud-database-files/2021/stories_2021-01-06.csv"
  },
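As a sanity check, the metadata fields can be mapped back to the CSV columns in code. A small Python sketch using the record quoted above (the `csv_row_key` helper name is mine, not from the codebase):

```python
import json

# The "rss_entry" metadata record quoted above.
record = json.loads("""{
  "link": "2865757039",
  "title": null,
  "domain": null,
  "pub_date": null,
  "fetch_date": null,
  "source_url": null,
  "source_feed_id": 1805781,
  "source_source_id": 656943,
  "via": "s3://mediacloud-database-files/2021/stories_2021-01-06.csv"
}""")

def csv_row_key(entry):
    """Map the metadata fields back to the historical-CSV columns."""
    return {
        "downloads_id": int(entry["link"]),
        "feeds_id": entry["source_feed_id"],
        "media_id": entry["source_source_id"],
        "csv_path": entry["via"],
    }
```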

I've downloaded s3://mediacloud-database-files/2021/stories_2021-01-06.csv
and found two rows with the above article URL:

collect_date,stories_id,media_id,downloads_id,feeds_id,url
2021-01-06 00:06:17.320858,1816021696,656943,2865757039,1805781,https://www.stuff.co.nz/entertainment/bravo/the-dish/123884771/porsha-williams-rings-in-the-new-year-dancing-on-a-yacht-in-the-pursuit-of-porsha.html
2021-01-06 00:11:46.868282,1816022343,622809,2865757686,1805757,https://www.stuff.co.nz/entertainment/bravo/the-dish/123884771/porsha-williams-rings-in-the-new-year-dancing-on-a-yacht-in-the-pursuit-of-porsha.html

In this case neither object seems to exist:

pbudne@tarbell:/space/tmp/2021/fetcher$ aws s3 ls mediacloud-downloads-backup/downloads/2865757686
pbudne@tarbell:/space/tmp/2021/fetcher$ aws s3 ls mediacloud-downloads-backup/downloads/2865757039

Looking at the above CSV file, I see 12,140 URLs that appear more than
once, so there are plenty more examples to look at!

root@tarbell:/space/tmp/2021/fetcher# awk -F, '{print $6}' stories_2021-01-06.csv | sort | uniq -d | wc -l
12140
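One caveat with the awk pipeline: a plain comma split miscounts quoted URL fields that themselves contain commas. A CSV-aware recount in Python (assuming the header row shown above) would avoid that:

```python
import csv
from collections import Counter

def duplicate_url_count(path):
    """Count URLs that appear in more than one row of a historical CSV.

    Uses the csv module rather than a plain comma split, so quoted
    URL fields containing commas are parsed as one field."""
    with open(path, newline="") as fh:
        counts = Counter(row["url"] for row in csv.DictReader(fh))
    return sum(1 for n in counts.values() if n > 1)
```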

URLs from the CSV with the most duplicates:

root@tarbell:/space/tmp/2021/fetcher# awk -F, '{print $6}' stories_2021-01-06.csv | sort | uniq -c | sort -rn | head
     29 "https://www.morgenweb.de/mannheimer-morgen_artikel
     27 https://www.mk.ru/social/2021/01/06/ushyol-iz-zhizni-eksgubernator-kamchatki-vladimir-biryukov.html
     27 https://aif.ru/sport/hockey/kargo-kult_larionova_pochemu_porazhenie_sbornoy_rossii_eto_zakonomernost
     27 https://aif.ru/society/safety/moshenniki_poluchayut_dannye_bankovskih_kart_rassylaya_pisma_ot_imeni_uber
     27 https://aif.ru/society/army/skolko_v_rossii_boevyh_samoletov
     27 https://aif.ru/health/dietolog_razveyal_mif_o_vrede_chipsov
     26 https://www.mitti.se/nyheter/pistol-som-avlossats-pa-malmvagen-lag-i-kvinnas-tvattmaskin/repuaf!d06gACR9QY7FIlIkxtQRnQ/
     26 https://www.mitti.se/nyheter/hon-slog-larm-om-hemtjanst-utan-skydd/reptlr!sUrHxXS9Tan7u8ah1WY7lg/
     26 https://aif.ru/society/socopros_pokazal_chto_rossiyane_zhdut_povysheniya_zarplat_v_2021_godu
     26 https://aif.ru/society/science/v_bolgarii_razrabotali_toplivo_dlya_raket_iz_chereshni

Looking at all URLs in the WARC files:

root@tarbell:/space/tmp/2021/fetcher# zgrep -h WARC-Target-URI *.warc.gz | sed 's/WARC-Target-URI: //' | sort > all-missing-urls

I expect every URL to appear at least twice (once for each of the two
WARC records per Story), but some URLs appear 4 or 6 times. This could
be because the URLs appeared in different CSV files and the duplicates
could not be filtered out, or a program error (the system might be
processing messages "at least once" rather than "exactly once").
Closer examination of the files (since each metadata record contains
the CSV file name) could be used to check the first theory.
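Since each metadata record carries the CSV file name in "via" and the downloads_id in "link", the two theories can be told apart mechanically. A sketch (function name and data shape are mine) that takes one (url, via, downloads_id) triple per metadata record:

```python
from collections import Counter, defaultdict

def check_duplicates(records):
    """records: (url, via, downloads_id) triples, one per WARC
    *metadata* record.  A (via, downloads_id) pair that repeats for a
    URL means the same CSV row was processed more than once
    (at-least-once delivery); distinct pairs mean the URL genuinely
    appeared in several CSV rows."""
    reprocessed, legit = [], []
    by_url = defaultdict(Counter)
    for url, via, did in records:
        by_url[url][(via, did)] += 1
    for url, pairs in by_url.items():
        if any(n > 1 for n in pairs.values()):
            reprocessed.append(url)
        elif len(pairs) > 1:
            legit.append(url)
    return reprocessed, legit
```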

root@tarbell:/space/tmp/2021/fetcher# uniq -c !$ | sort -rn | head
uniq -c all-missing-urls | sort -rn | head
      6 https://www.bignewsnetwork.com/news/268201607/jamieson-fined-15-pc-match-fees?utm_source=feeds.bignewsnetwork.com&utm_medium=referral
      6 https://www.bignewsnetwork.com/news/268175435/students-racist-slurs-outrage-community?utm_source=feeds.bignewsnetwork.com&utm_medium=referral
      6 https://www.bignewsnetwork.com/news/268157313/chiefs-sign-guard-joe-thuney?utm_source=feeds.bignewsnetwork.com&utm_medium=referral
      6 https://www.bignewsnetwork.com/news/268157102/haryana-win-hockey-india-women-national-championship-2021?utm_source=feeds.bignewsnetwork.com&utm_medium=referral
      6 https://www.bignewsnetwork.com/news/268155905/asked-and-answered-march-18?utm_source=feeds.bignewsnetwork.com&utm_medium=referral
      6 http://ici.radio-canada.ca/Medianet/2021/cbxft/2021-01-02_18_00_00_tjalb_0000_01_500.asx
      6 http://ici.radio-canada.ca/Medianet/2021/RDI/2021-0319-2047_500.asx
      4 https://www.zz.lv/vietejas-zinas/valsts-lielako-dalu-parklajusi-sniega-sega-256437
      4 https://www.ysusports.com/sports/fball/2020-21/releases/und-preview
      4 https://www.westislandblog.com/a-st-leonard-church-defies-quebecs-public-health-measure/

Finally, the WARCs contain stories from 56 CSV files:
root@tarbell:/space/tmp/2021/fetcher# zgrep -h '"via"' *.gz | sort -u | wc -l
56

That includes 7 CSVs that had a _vN suffix in the object name:

root@tarbell:/space/tmp/2021/fetcher# zgrep -h '"via"' *.gz | sort -u | grep _v     
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-04_v3.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-07_v2.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-18_v2.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-19_v2.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-22_v3.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-02-26_v2.csv"
    "via": "s3://mediacloud-database-files/2021/stories_2021-03-27_v3.csv"