re-index 2021 data #300

Open
rahulbot opened this issue Jun 5, 2024 · 18 comments

rahulbot (Contributor) commented Jun 5, 2024

Once 2022 re-indexing is done (#271) we should start on 2021, continuing to work backwards chronologically. For all these dates I think we can ingest stories from previously-generated CSV files that refer to HTML files in the giant S3 bucket. Is this right?

This should include:

  • 2021-01-01 to 2021-11-21: Database B
  • 2021-11-21 to 2021-12-25: Database C (or E or F?)
  • 2021-12-25 to 2021-12-31: Database D

Ideally this would be done by July 1, but that depends on when 2022 finishes.

philbudne (Contributor) commented Jun 5, 2024 via email

thepsalmist (Contributor) commented:

> Does anyone remember how the different versions came about?

The versions came from batching the CSV exports (e.g. 00:00 to 12:00 and 12:00 to 23:59) to avoid Postgres query timeouts. A script to combine the CSVs into a single version should do the job (a rough sketch follows).
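
A minimal sketch of such a merge, assuming all versions of a given day share the same header line and have been downloaded locally (file names are illustrative):

```sh
day=2021-01-01
out="merged_${day}.csv"
# keep a single header line from the first file
head -1 "stories_${day}.csv" > "$out"
# append all data rows from every version, de-duplicated
tail -q -n +2 stories_${day}*.csv | sort -u >> "$out"
```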

philbudne (Contributor) commented Jun 12, 2024 via email

philbudne (Contributor) commented:

The hist- stack processing the "Database D" CSV files (for 2022 and 2021) has completed processing of 2021/12/31 back to 2021/12/27, but hist-fetcher was unhappy with 2021/12/26 (it looks like it tossed everything from that day into quarantine).

pgulley modified the milestones: 4 - September, 3 - August (Aug 7, 2024)
philbudne (Contributor) commented Aug 9, 2024

The queuer processes files in reverse lexicographic (character-set) order; a small illustration of that ordering follows the tables below.
Here is my analysis of the chunks to process (top to bottom this time) to cover the year:

| status | bucket | prefix | object name format | start | end | notes |
| --- | --- | --- | --- | --- | --- | --- |
| done | mediacloud-database-d-files | | stories_2021_mm_dd | 2021/12/26 | 2021/12/31 | *️ |
| see below | mediacloud-database-files | /2021/stories_2021-1 | stories_2021-mm-dd | 2021/10/14 | 2021/12/25 | † *️ |
| see below | mediacloud-database-c-files | | 2021_mm_dd.csv | 2021/11/12 | 2021/11/21 | |
| see below | mediacloud-database-b-files | /stories_2021- | stories_2021-mm-dd | 2021/10/14 | 2021/11/11 | |
| see below | mediacloud-database-b-files | /stories_2021_ | stories_2021_mm_dd | 2021/09/15 | 2021/10/13 | *️ |
| done | mediacloud-database-files | /2021/stories_2021-0 | stories_2021-mm-dd | 2021/01/01 | 2021/09/14 | |

*️NOTE: DB D/B overlap periods are 2021-09-15 thru 2021-11-11 (DB B) and 2021-12-26 thru 2022-01-25 (DB D)

† empty files for 1/31, 4/8 thru 4/12, missing 9/15 thru 10/13 (??)

see #329 for other ranges that need downloads_ids

December 25 to October 14

| status | start | end | notes |
| --- | --- | --- | --- |
| X | 2021-11-12 | 2021-12-25 | needs downloads_ids (from DB F) |
| X | 2021-11-21 | 2021-11-12 | needs downloads_ids (from DB C) |
| running | 2021-11-11 | 2021-09-14 | available in mc-db-b-files AND mc-db-files? *️ |
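
For reference, here is a quick way (a sketch, not part of the deployment) to see the order the queuer will take under a given prefix, since it processes object names in reverse character-set order:

```sh
# list object names under the prefix, then sort them reverse-lexicographically,
# which is the order the hist-queuer will process them in
aws s3 ls s3://mediacloud-database-files/2021/ | awk '{print $4}' | sort -r | head
```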

philbudne (Contributor) commented Aug 9, 2024

I tested 10/1/2021 (epoch B) in my dev stack,
pulled main from upstream, merged main to staging, and launched a staging stack on bernstein:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-1

I also removed the production hist-indexer stack.

These are my normal steps when deploying ANY stack (a rough sketch of the first two checks follows the list). After a few minutes:

  1. check for containers that were recently launched and have recently exited with non-zero status
  2. tail the messages.log file to look for "normal" operation (fetching, parsing, importing)
  3. watch grafana for 10 minutes, looking for smooth, continuous operation and no bouncing in the number of containers
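
A rough sketch of the first two checks (the container filter and log path are illustrative; the exact log location depends on the deployment):

```sh
# 1. exited containers with a non-zero status
docker ps -a --filter status=exited --format '{{.Names}}\t{{.Status}}' | grep -v 'Exited (0)'

# 2. follow the stack's log for "normal" fetch/parse/import activity
tail -F messages.log | grep -E 'fetch|pars|import'
```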

In this case, grafana showed fetcher activity but no parser activity: the hist-fetcher had reported all stories as "bad-dlid".

philbudne (Contributor) commented:

The errors look like:

2024-08-09 19:55:08,237 82f8da64c4d1 hist-fetcher INFO: bad-dlid: EMPTY
2024-08-09 19:55:08,238 82f8da64c4d1 hist-fetcher INFO: quarantine: QuarantineException('bad-dlid')

I downloaded the CSV file:

aws s3 cp s3://mediacloud-database-files/2021/stories_2021-12-25.csv .

and I don't see a downloads_id:

collect_date,stories_id,media_id,url
2021-12-25 08:18:56.292171,2147483646,272136,https://observador.pt/2021/12/25/covid-19-coordenador-cientifico-italiano-considera-reforco-da-vacina-crucial-contra-omicron/
2021-12-25 08:18:56.291037,2147483645,375830,https://www.sudouest.fr/gironde/gujan-mestras/bassin-d-arcachon-une-cabane-en-feu-a-gujan-mestras-7452333.php

Same for the 24th:

pbudne@ifill:~$ head -3 stories_2021-12-24.csv 
collect_date,stories_id,media_id,url
2021-12-24 23:49:57.083076,2147254999,655701,https://ulan.mk.ru/video/2021/12/25/pervaya-godovshhina-podnyatiya-andreevskogo-flaga-na-korvete-geroy-rossiyskoy-federacii-aldar-cydenzhapov.html
2021-12-24 23:49:57.014398,2147254998,655701,https://kavkaz.mk.ru/social/2021/12/25/student-stavropolskogo-filiala-rankhigs-prinyal-uchastie-v-forume-studaktiva.html
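
A quick way to spot which daily CSVs are missing the column before queuing them (a sketch; assumes the files have been downloaded locally):

```sh
# report any CSV whose header line lacks a downloads_id column
for f in stories_2021-*.csv; do
  head -1 "$f" | grep -q downloads_id || echo "no downloads_id: $f"
done
```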

philbudne (Contributor) commented:

Looking at the other end of mediacloud-database-files/2021, at 2021-01-01:

There are three files, and all have a downloads_id column:

phil@p27:~$ head -1 stories_2021-01-01*
==> stories_2021-01-01.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-01-01_v2.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-01-01_v3.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

Across the three files, only about 86% of the rows are unique:

phil@p27:~$ wc -l stories_2021-01-01*
   100001 stories_2021-01-01.csv
   328389 stories_2021-01-01_v2.csv
   257678 stories_2021-01-01_v3.csv
   686068 total
phil@p27:~$ sort -u stories_2021-01-01* | wc -l
588464

Looks like the overlap is between, not within, the files:

phil@p27:~$ for x in stories_2021-01-01*; do
> echo $x; sort -u $x | wc -l
> done
stories_2021-01-01.csv
100001
stories_2021-01-01_v2.csv
328389
stories_2021-01-01_v3.csv
257678

philbudne (Contributor) commented Aug 10, 2024

Looks like downloads_id starts being present in mid-November:

phil@p27:~$ head -1 stories_2021-11-*
==> stories_2021-11-11.csv <==
collect_date,stories_id,media_id,downloads_id,feeds_id,url

==> stories_2021-11-23.csv <==
collect_date,stories_id,media_id,url

Generated at different times (from different databases?):

phil@p27:~$ aws s3 ls s3://mediacloud-database-files/2021/| grep 2021.11
2022-11-23 08:23:01  171607723 stories_2021-11-01.csv
2022-11-23 08:23:04  190802596 stories_2021-11-02.csv
2022-11-23 08:23:04  192421552 stories_2021-11-03.csv
2022-11-23 08:23:06  187801571 stories_2021-11-04.csv
2022-11-23 08:23:07  177020925 stories_2021-11-05.csv
2022-11-23 08:23:18  129040813 stories_2021-11-06.csv
2022-11-23 08:23:22  138231824 stories_2021-11-07.csv
2022-11-23 08:23:22  190744853 stories_2021-11-08.csv
2022-11-23 08:23:23  196410284 stories_2021-11-09.csv
2022-11-23 08:23:27  199264106 stories_2021-11-10.csv
2022-11-23 08:23:32  189706034 stories_2021-11-11.csv
2023-02-17 00:53:13  190362479 stories_2021-11-23.csv
2023-02-17 00:53:13  257671902 stories_2021-11-24.csv
2023-02-17 00:53:13  160786079 stories_2021-11-25.csv
2023-02-17 00:53:13  151860953 stories_2021-11-26.csv
2023-02-17 00:53:13  109919256 stories_2021-11-27.csv
2023-02-17 00:53:16  112239747 stories_2021-11-28.csv
2023-02-17 00:53:16  163218600 stories_2021-11-29.csv
2023-02-17 00:53:16  181842769 stories_2021-11-30.csv

pgulley (Member) commented Aug 12, 2024

What does that downloads_id represent? Is it just an index value from the old system? I assume some change to the indexer will be necessary in order to re-index without that value, but am I right to say that it's not really necessary in the new index?

philbudne (Contributor) commented:

@pgulley downloads_id is the key (object/file name) of the saved HTML in the mediacloud-downloads-backup S3 bucket; it's necessary to retrieve the HTML, but otherwise of no use to the new system.
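
For illustration, retrieving one story's saved HTML by its downloads_id might look like this (a sketch that assumes the object key in the backup bucket is simply the numeric downloads_id; the actual key layout may differ):

```sh
# hypothetical: copy the saved HTML for downloads_id 3306845687 out of the backup bucket
aws s3 cp s3://mediacloud-downloads-backup/3306845687 ./3306845687.html
```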

pgulley (Member) commented Aug 12, 2024

Oh interesting, does that mean we don't have HTML saved for December 2021 then?

philbudne (Contributor) commented Aug 12, 2024 via email

philbudne (Contributor) commented:

Going back to my goal of testing my recent hist-fetcher fixes, I've launched a staging stack on bernstein for just October CSV files:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-files/2021/stories_2021-10

philbudne (Contributor) commented:

And now running a staging stack (50K stories) for a date inside the DB-B downloads_id overlap range:

./docker/deploy.sh -T historical -Y 2021 -I s3://mediacloud-database-b-files/stories_2021_10_01.csv

philbudne (Contributor) commented:

January 2021 has finished. I just merged main to staging and launched a staging stack on bernstein:

root@bernstein:/nfs/ang/users/pbudne/story-indexer# ./docker/deploy.sh -T historical -Y 2021 -H /stories_2021-0
upstream staging branch up to date.
creating docker-compose.yml
cloning story-indexer-config repo
QUEUER_ARGS --force --sample-size 50000 s3://mediacloud-database-files/2021/stories_2021-0

Logs show it starting from mid-September:

2024-08-17 21:35:58,500 1d54803aa1ef hist-queuer INFO: process_file s3://mediacloud-database-files/2021/stories_2021-09-14.csv

philbudne (Contributor) commented:

Created #328 with some observations about hist ingest.

pgulley modified the milestones: 3 - August, 4 - September (Aug 28, 2024)
philbudne (Contributor) commented:

Looking at the remaining 2021 data hole:
[image: the remaining 2021 data hole]

(date range 2021-11-19 thru 2021-12-26). The HTML (and possibly RSS) files are on S3:

| downloads_id | isodate | unix_ts | S3 Version Id |
| --- | --- | --- | --- |
| 3306845683 | 2021-11-18T23:59:57+00:00 | 1637279997.0 | |
| 3306845686 | 2021-11-18T23:59:59+00:00 | 1637279999.0 | |
| 3306845687 | 2021-11-19T00:00:00+00:00 | 1637280000.0 | |
| 3306845689 | 2021-11-19T00:00:02+00:00 | 1637280002.0 | |
| 3360714567 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
| 3360714571 | 2021-12-26T10:32:56+00:00 | 1640514776.0 | |
| 3360714572 | 2021-12-26T10:32:55+00:00 | 1640514775.0 | 8wiLKAcAGi5E14BySGDraSCl6fSe00DW |
| 3360714573 | 2021-12-26T10:33:09+00:00 | 1640514789.0 | DhkyUneyXRbliInFgDRV8yKyh6ggM1dz |
| 3360714572 | 2022-01-29T00:21:20+00:00 | 1643415680.0 | hWCsDiDn6QouyLe89IS6cU2oMrqIWzZ8 |
| 3360714573 | 2022-01-29T00:21:28+00:00 | 1643415688.0 | vrbK1TSlTaYa5raJrLWpi6k0J0xh5uT2 |
| 3360714574 | 2022-01-29T00:21:21+00:00 | 1643415681.0 | |
| 3360714577 | 2022-01-29T00:21:20+00:00 | 1643415680.0 | |

With or without the RSS, it may be possible to recover a significant percentage of the HTML using "canonical link" tags...
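
A rough heuristic for that (a sketch, not what the indexer actually does), assuming the saved HTML declares a canonical link on a single line:

```sh
# print the canonical URL (if any) declared in a saved HTML file
grep -oiE '<link[^>]*rel="canonical"[^>]*>' saved.html | grep -oiE 'href="[^"]*"' | head -1
```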
