re-index 2021 data #300
My recollection is that there are no holes in the 2021 record, HOWEVER:
1. There's the "overlap" period, where two instances of the system
used the same range(s) of database ids.
There is code in hist-fetcher.py to handle this (and copious
comments), but I HAVE NOT TESTED IT. My recollection is that it looks
at the date in the CSV file, determines which database epoch that
date corresponds to (B or D), then looks at all versions of the S3
object for that downloads_id and picks (the?) one that was written
in the same epoch/time period, WITHOUT checking that the dates are
close/sane (see the sketch at the end of this comment).
In other words, it requires some examination before letting it rip.
2. There are 589 daily CSV files under s3://mediacloud-database-files/2021/
It looks like there are (up to?) three versions of each day for
dates between 2021-01-01 and 2021-05-15.
Except for the dates where all three files are trivial (59
bytes, presumably just a column/header line), the three files
seem to have different sizes. Here is a snip of "aws s3 ls"
output:
2022-12-28 11:43:52 16234828 stories_2021-04-05.csv
2022-12-28 11:43:53 80451582 stories_2021-04-05_v2.csv
2022-12-28 11:43:53 72063717 stories_2021-04-05_v3.csv
2022-12-28 11:43:54 16485868 stories_2021-04-06.csv
2022-12-28 11:43:56 96556995 stories_2021-04-06_v2.csv
2022-12-28 11:43:57 78333686 stories_2021-04-06_v3.csv
2022-12-28 11:43:57 16470124 stories_2021-04-07.csv
2022-12-28 11:43:59 97247724 stories_2021-04-07_v2.csv
2022-12-28 11:43:59 77504728 stories_2021-04-07_v3.csv
2022-12-28 11:43:59 16496389 stories_2021-04-08.csv
2022-12-28 11:44:02 79591431 stories_2021-04-08_v2.csv
2022-12-28 11:44:04 59 stories_2021-04-08_v3.csv
2022-11-23 01:14:05 59 stories_2021-04-09.csv
2022-12-28 11:44:05 59 stories_2021-04-09_v2.csv
2022-12-28 11:44:05 59 stories_2021-04-09_v3.csv
2022-11-23 01:14:05 59 stories_2021-04-10.csv
2022-12-28 11:44:05 59 stories_2021-04-10_v2.csv
2022-12-28 11:44:05 59 stories_2021-04-10_v3.csv
2022-11-23 01:14:06 59 stories_2021-04-11.csv
2022-12-28 11:44:05 59 stories_2021-04-11_v2.csv
2022-12-28 11:44:06 59 stories_2021-04-11_v3.csv
2022-11-23 01:14:06 59 stories_2021-04-12.csv
2022-12-28 11:44:06 59 stories_2021-04-12_v2.csv
2022-12-28 11:44:06 59 stories_2021-04-12_v3.csv
2022-12-28 11:44:06 16910341 stories_2021-04-13.csv
2022-12-28 11:44:06 59 stories_2021-04-13_v2.csv
2022-12-28 11:44:06 154798430 stories_2021-04-13_v3.csv
2022-12-28 11:44:06 17830103 stories_2021-04-14.csv
2022-12-28 11:44:06 117418301 stories_2021-04-14_v2.csv
2022-12-28 11:44:08 76491043 stories_2021-04-14_v3.csv
The "v2" file seems to be the largest in MOST cases, but
see 2021-04-13 above for an exception.
If we process more than one file for each date, it seems
possible/likely that we could download each HTML file as many as
three times.
hist-queuer.py avoids downloading S3 objects for the same remote URL
more than once within a file (the old system downloaded a story each
time it appeared in a different feed), but it cannot look across input
CSV files.
Does anyone remember how the different versions came about?
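For illustration of the version-picking logic in item 1, here is a minimal Python sketch. It is NOT the actual hist-fetcher.py code: the epoch windows (taken from the overlap note later in this thread), the use of downloads_id as the object key in the mediacloud-downloads-backup bucket (mentioned below), and all names are assumptions.

```python
from datetime import datetime, timezone

import boto3

BUCKET = "mediacloud-downloads-backup"

# HYPOTHETICAL epoch windows, taken from the overlap note further down
# this thread; the real boundaries (and copious comments) live in
# hist-fetcher.py.
EPOCHS = {
    "B": (datetime(2021, 9, 15, tzinfo=timezone.utc),
          datetime(2021, 11, 11, tzinfo=timezone.utc)),
    "D": (datetime(2021, 12, 26, tzinfo=timezone.utc),
          datetime(2022, 1, 25, tzinfo=timezone.utc)),
}

def pick_version(s3, downloads_id: str, story_date: datetime):
    """Pick the S3 VersionId written in the same epoch as the story date.

    Mirrors the described logic: match on epoch/time period only,
    WITHOUT checking that the dates are close/sane.
    """
    for lo, hi in EPOCHS.values():
        if lo <= story_date <= hi:
            # Pagination ignored for brevity; a single object key should
            # only have a handful of versions.
            resp = s3.list_object_versions(Bucket=BUCKET, Prefix=downloads_id)
            for v in resp.get("Versions", []):
                if v["Key"] == downloads_id and lo <= v["LastModified"] <= hi:
                    return v["VersionId"]
            return None  # no version written during that epoch
    return None  # date outside both overlap windows: any/latest version is fine

# e.g. pick_version(boto3.client("s3"), "123456789",
#                   datetime(2021, 10, 1, tzinfo=timezone.utc))
```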
|
The versions were from batching the CSV generation, e.g. 00:00-12:00 and 12:00-23:59, to avoid Postgres query timeouts.
|
A script to combine the CSVs into a single version should fit the bill.
Not strictly necessary: the queuer doesn't check file suffixes (.csv).
The only advantage would be eliminating duplicates (the legacy system
downloaded a story multiple times if it appeared in multiple RSS feeds).
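A combine-and-dedup pass could look something like this sketch (assumptions: the daily CSVs have a header row and a url column; the real schema may differ, and hist-queuer.py's actual dedup logic is not shown here):

```python
import csv
from pathlib import Path

def combine_day(paths, out_path, url_column="url"):
    """Merge one day's CSV versions into a single file,
    keeping only the first row seen for each URL."""
    seen = set()
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.DictReader(f)
                for row in reader:
                    url = row.get(url_column, "")
                    if not url or url in seen:
                        continue  # duplicate across (or within) files
                    seen.add(url)
                    if writer is None:
                        # Take the column layout from the file providing the
                        # first row; ignore extra columns in later versions.
                        writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                                extrasaction="ignore")
                        writer.writeheader()
                    writer.writerow(row)

combine_day(sorted(Path(".").glob("stories_2021-04-05*.csv")),
            "stories_2021-04-05_combined.csv")
```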
|
The hist- stack processing the "Database D" CSV files (for 2022 and 2021) has completed 2021/12/31 back to 2021/12/27, BUT hist-fetcher was unhappy with 2021/12/26 (it looks like it tossed everything from that day into quarantine).
|
The queuer processes files in reverse lexicographic (character set) order.
NOTE: the DB D/B overlap periods are 2021-09-15 through 2021-11-11 (DB B) and 2021-12-26 through 2022-01-25 (DB D). Files for 1/31 and 4/8 through 4/12 are empty; 9/15 through 10/13 are missing (??). See #329 for other ranges that need downloads_ids (December 25 back to October 14).
|
I tested 10/1/2021 (epoch B) in my dev stack.
I also removed the production hist-indexer stack (my normal step when deploying ANY stack). After a few minutes:
In this case, grafana showed fetcher activity but no parser activity: the hist-fetcher had reported all stories as "bad-dlid".
|
The errors look like:
I downloaded the CSV file:
and I don't see a downloads_id:
Same for the 24th:
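(Not the elided output above, but for reference, a quick header check of my own — a sketch assuming the CSVs have a header row:)

```python
import csv
import sys

# Print each file's header and whether a downloads_id column is present.
for path in sys.argv[1:]:
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    marker = "has downloads_id" if "downloads_id" in header else "NO downloads_id"
    print(f"{path}: {marker} {header}")
```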
|
Looking at the other end of mediacloud-database-files/2021, at 2021-01-01: there are three files, and all of them have downloads_id:
Across the three files, only 86% of the downloads_ids are unique:
Looks like the overlap is between, not within, the files:
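The numbers above came from commands elided here; a sketch of how the measurement could be reproduced, assuming the files are local and the column is named downloads_id:

```python
import csv
from collections import Counter
from pathlib import Path

def ids(path):
    with open(path, newline="") as f:
        return [row["downloads_id"] for row in csv.DictReader(f)]

files = sorted(Path(".").glob("stories_2021-01-01*.csv"))
per_file = [ids(p) for p in files]

# Duplicates WITHIN each file.
for path, id_list in zip(files, per_file):
    dups = sum(n - 1 for n in Counter(id_list).values() if n > 1)
    print(f"{path.name}: {len(id_list)} rows, {dups} duplicated ids within the file")

# Overlap BETWEEN files.
total = sum(len(id_list) for id_list in per_file)
unique = len(set().union(*(set(id_list) for id_list in per_file)))
print(f"across files: {unique}/{total} = {unique / total:.0%} unique")
```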
|
Looks like downloads_id starts being present in mid-November:
Generated at different times (from different databases?):
|
What does that downloads_id represent? Is it just an index value from the old system? I assume some change to the indexer will be necessary in order to re-index without that value, but am I right to say that it's not really necessary in the new index?
|
@pgulley downloads_id is the key (object/file name) for the saved HTML in the mediacloud-downloads-backup S3 bucket (necessary to retrieve the HTML; otherwise not of use to the new system).
|
Oh interesting: does that mean we don't have HTML saved for December 2021, then?
|
> Oh interesting: does that mean we don't have HTML saved for December 2021, then?
The HTML is on S3; we just don't have the keys to retrieve it ready to
use. The question is whether we still have backups of (one of) the PG
database(s) that can link the downloads_id, URL and date range.
|
Going back to my goal of testing my recent hist-fetcher fixes, I've launched a staging stack on bernstein for just October CSV files:
|
And now running a staging stack (50K stories) for a date inside the DB B downloads_id overlap range:
|
January 2021 has finished. Just merged main to staging, and launched a staging stack on bernstein:
Logs show it starting from mid-September:
|
Created #328 with some observations about hist ingest.
|
Once 2022 re-indexing is done (#271), we should start on 2021, continuing to work backwards chronologically. For all these dates I think we can ingest stories from previously-generated CSV files that refer to HTML files in the giant S3 bucket. Is this right?
This should include:
Ideally this would be done by July 1, but that depends on when 2022 finishes.