Re-filling the feb-may 2022 "dip" using canonical URL extraction #353

Open · philbudne opened this issue Nov 24, 2024 · 3 comments

@philbudne (Contributor)

Given the successful recovery done by "blind fetching" S3 objects and indexing them under their extracted canonical URLs, I suggested we might want to do the same thing for the period in 2022 (approx 2022-01-25 thru 2022-05-05?) where Xavier fetched all the S3 objects but looked only at the RSS files, extracting URLs that we then tried to fetch again (and found significant "link rot").

It looks like the researchers would prefer filling out 2022 to working on prior years (2019 and earlier).

To see whether it might save money (assuming my memory is correct that access to S3 is free from EC2 instances) to scan the S3 objects from an EC2 instance and pack them into WARC files (without ANY further processing, not even checking whether the file has a canonical URL), I threw together a packer.py script with bits cribbed from hist-fetcher and the parser (for RSS detection).
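
A minimal sketch of the idea (not the actual packer.py; the bucket name, key format, size cutoff, RSS check, and the warcio/boto3 library choices here are my assumptions):

```python
# Sketch of a "packer": blind-fetch S3 objects by ID and append the raw bytes
# to a WARC file with no parsing at all.  Bucket name, key format, and the
# minimum-size cutoff below are illustrative assumptions, not packer.py's values.
from io import BytesIO

import boto3
from botocore.exceptions import ClientError
from warcio.warcwriter import WARCWriter

BUCKET = "example-downloads-bucket"    # assumption
MIN_SIZE = 37                          # <= 36 bytes: likely a "duplicate feed download" note

def pack(first_id: int, last_id: int, warc_path: str) -> None:
    s3 = boto3.client("s3")
    with open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for obj_id in range(first_id, last_id + 1):
            key = str(obj_id)          # assumption: object key is the numeric ID
            try:
                body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            except ClientError:
                continue               # non-existent object ID
            if len(body) < MIN_SIZE:
                continue               # almost certainly a duplicate-download marker
            if body.lstrip()[:5] in (b"<?xml", b"<rss "):
                continue               # crude RSS detection; the real check is cribbed from parser
            record = writer.create_warc_record(
                f"s3://{BUCKET}/{key}", "resource", payload=BytesIO(body)
            )
            writer.write_record(record)
```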

To get 100 HTML files it scanned 767 S3 objects (some non-existent, some 36 bytes or smaller, which almost certainly means they contain a message saying the feed download was a duplicate), downloading a total of 5676225 bytes (avg 7400 bytes/obj scanned) and writing a WARC file that's 2726394 bytes (48% of the size), so it might be worthwhile (with more consideration and math).

Then I created a t4g.nano instance ($0.0042/hr) to see how much faster downloads are from inside AWS, and it took about half the time (23 seconds vs 48 seconds from ifill). That doesn't include additional time for the EC2 instance to copy the WARC file to S3.

Further data points:

My initial estimate (working only from the previous run over a one-month period, and factoring in running at half speed to avoid hogging UMass bandwidth) was that it could take 56 days.

Poking around in the S3 bucket, it looks like the object ID range covers about 113 million objects to be scanned; at 50 downloads/second (the current historical ingest rate with 6 fetchers), that looks like it could take only 26 days.

So six "packers" (each given a share of the object ID range) running in EC2 at 33 obj/second is 200 obj/second, and 113Mobj divided by 200 obj/sec looks to be about a week of EC2 time.

A t3a.xlarge instance (4 AMD CPUs) is $0.15/hr, which would be $25 for a week (not counting EBS costs for the root disk).

Amazon pricing usually doubles for a doubling in resources, so the total price might be the same for different instance sizes; the instance size just determines the speed (assuming there isn't some other bottleneck).

With the 7400 bytes/obj number from above, at 113M objects that's 836GB of download to transfer the raw objects; the WARC file came in at 3555 bytes/object, or 402GB to download.

Processing the packed WARC files should be much like any other historical ingest (although it will require a different stack flavor), and I'd expect we would be able to process at the same rate (a month every 4 days at 50 stories/second), so 12 days. The arch-queuer shouldn't need any changes, and it can scan an S3 bucket for new additions, so the pipeline could run at the same time as the EC2 processing.

It looks like we transferred 2TB/mo out of AWS in Sept and October, which puts us in the $0.09/GB bracket, so avoiding 434GB of egress would save $39; with at LEAST $25 of EC2 cost, that means a net savings of at most $14.
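
Spelling that out with the figures quoted above:

```python
# Back-of-envelope for the transfer volumes and the cost comparison above.
OBJECTS = 113_000_000
GB = 1_000_000_000

raw_gb  = 7400 * OBJECTS / GB              # ~836 GB pulling raw objects out of AWS
warc_gb = 3555 * OBJECTS / GB              # ~402 GB pulling packed WARC files instead

egress_saved = (raw_gb - warc_gb) * 0.09   # ~434 GB at $0.09/GB -> ~$39
ec2_cost = 0.15 * 24 * 7                   # t3a.xlarge for a week -> ~$25
print(egress_saved - ec2_cost)             # net savings of roughly $14 at best
```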

Running ad-hoc packers (as opposed to a rabbitmq-based pipeline/stack) has the disadvantage that if the packer processes quit, they wouldn't be able to pick up where they left off without some record keeping. To get the RSS filtering capability we'd need a worker that does just that, or a parser option that says to do ONLY that!
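
That record keeping could be as simple as each packer periodically checkpointing the last object ID it finished; a minimal sketch (the file name and approach are assumptions, not an existing mechanism):

```python
import os

CHECKPOINT = "packer-0.ckpt"   # one checkpoint file per packer; name is arbitrary

def last_done() -> int:
    """Return the last object ID this packer finished, or 0 on a fresh start."""
    try:
        with open(CHECKPOINT) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def mark_done(obj_id: int) -> None:
    """Record obj_id so a restarted packer can resume after it."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(obj_id))
    os.replace(tmp, CHECKPOINT)   # atomic rename, so a crash can't leave a torn file
```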

One thing I haven't examined is how many duplicate stories we might end up with (cases where the canonical URL differs from the final URL we get when downloading using the RSS file URL); nor have I looked at whether we could delete the stories previously fetched using Xavier's CSV files. One way would be to look at the WARC files written when the CSVs were processed, but there might be other ways (looking at indexed_date and published_date?).

I used an ARM64 instance, initially with IPv6 only, running Ubuntu 24.04, and had some "fun":

  • github does not speak IPv6!
  • cchardet didn't want to build on Python 3.12
  • probably some other things I'm forgetting
@philbudne (Contributor, Author)

Copied from a Slack message thread:

To see how many objects fetched "blind" (by object ID) from Feb-April yield canonical URLs that are rejected as duplicates, I ran a batch of 15206 object IDs in the production hist-indexer stack. Looking at the log file: 13182 were fetched from S3, were HTML, and had a canonical URL; of those, 8443 were rejected as dups (they had likely been fetched when we fetched via the CSV files Xavier had created by scooping RSS files from the same bucket), and 4723 were "new". Anyone have any thoughts on how to detect/quantify whether we'd be getting dups?

Then I did a run of 50K S3 object IDs:

Diffing before and after counts:

 pbudne@tarbell:~/query$ diff -y 2022.[14] | grep '|' | sed 's/                                  //'
2022-01-20 607548      |	2022-01-20 607721
2022-01-21 580131      |	2022-01-21 580322
2022-01-22 389058      |	2022-01-22 389107
2022-01-23 378421      |	2022-01-23 378530
2022-01-24 576418      |	2022-01-24 576932
2022-01-25 426326      |	2022-01-25 433157
2022-01-26 242361      |	2022-01-26 243268
2022-02-08 313719      |	2022-02-08 313721
2022-04-01 392906      |	2022-04-01 392907
2022-05-01 168855      |	2022-05-01 168856
2022-05-04 255443      |	2022-05-04 255444
total 34921806	      |	total 34930585

And looking at the logs, 16939 created, 34370 rejected as duplicate, so 33% non-duplicate. ISTR the estimate was that the "dip" was about 40% of expected levels, so if this adds 50% to that, we'd still be below expected levels, which adds some comfort that it isn't massively duplicative. I'm going to restart 2020 on bernstein (about a week of processing left) to give time for thought & comment.

@philbudne (Contributor, Author)

I took the WARC files generated above (16939 stories NOT rejected as duplicate URLs) and, for each one, searched Elastic for stories with the same canonical_domain and article_title and with indexed_date:[2024-05-01 TO 2024-06-30] (the CSV-based 2020 back-fill).

Attached is the program:
read.py.txt

and the output
2022.txt
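
The attached read.py is the authoritative version; purely as an illustration, its per-story query is roughly of this shape (the endpoint and index pattern are placeholders; only the three field names and the date range come from the text above):

```python
# Sketch of a per-story duplicate lookup in Elasticsearch: same canonical_domain
# and article_title, limited to the indexed_date window of the CSV-based back-fill.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")        # placeholder endpoint

def find_possible_dups(canonical_domain: str, article_title: str) -> list:
    resp = es.search(
        index="mediacloud-search-*",               # placeholder index pattern
        query={
            "bool": {
                "must": [
                    {"term": {"canonical_domain": canonical_domain}},
                    {"match_phrase": {"article_title": article_title}},
                    {"range": {"indexed_date": {
                        "gte": "2024-05-01", "lte": "2024-06-30"}}},
                ]
            }
        },
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```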

@philbudne (Contributor, Author)

@kilemensi observed that the first two new URLs (canonical URLs extracted from S3 documents) resulted in redirects to an "old" URL (the URL of a previously fetched story with the same title and canonical domain). Following on this, I ran the above output through a program that took the canonical URL, tried to open it, took the final URL, processed it with normalize_url, and then compared the result with normalize_url applied to each of the old articles' URLs.

Of the 2903 articles in 2022.txt, 1056 matched using the above test.
Removing the trailing / from both normalized URLs, the number of matches went up to 1413.

An example of a Story that didn't come up as a match until removing a terminal slash:

canonical https://www.tvc.ru/news/show/id/231331
canonical normalized http://tvc.ru/news/show/id/231331
canonical final https://www.tvc.ru/news/show/id/231331
canonical final normalized http://tvc.ru/news/show/id/231331
old https://www.tvc.ru/news/show/id/231331/?utm_source=news.yandex.ru&utm_content=RSS&utm_campaign=yandex
old normalized http://tvc.ru/news/show/id/231331/

So 8% of the articles inserted using their canonical URL (those NOT immediately rejected as duplicates) look like they're dups, which isn't lovely (I have all the URLs from the test, and can remove them if need be).
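
For illustration, a rough sketch of that comparison (normalize_url here is only a stand-in for the pipeline's real normalizer, and the requests-based redirect following is my assumption, not necessarily what the test program did):

```python
# Follow the canonical URL's redirects, normalize the final URL, and compare it
# against each old article's normalized URL, ignoring a trailing slash.
import requests

def normalize_url(url: str) -> str:
    # stand-in: the real normalizer also strips www., tracking params, https vs http, etc.
    return url

def is_probable_dup(canonical_url: str, old_urls: list[str]) -> bool:
    final = requests.get(canonical_url, timeout=30, allow_redirects=True).url
    new_norm = normalize_url(final).rstrip("/")
    return any(normalize_url(old).rstrip("/") == new_norm for old in old_urls)
```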
