refactor/MTG-1254-json-downloader-service[mtg-144][mtg-526] #415

kstepanovdev · 2025-02-11T15:23:38Z

The draft of the architecture is here:

The core idea is to refactor the tasks table, removing redundant data from it and adding new data such as etag to reduce the database workload.
The new workflow first tries to get the etag or last-modified date. If such fields are present for a given asset and the etag hasn't changed, nothing is written to the db.
Another small feature is leveraging tokio::select without an outer loop.
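
The etag check described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Validators` struct, the `Action` enum, and the `decide` function are hypothetical names, and the fallback from etag to last-modified is an assumption about how the two fields would be combined.

```rust
/// Cache validators stored per task (hypothetical shape, not the PR's struct).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Validators {
    pub etag: Option<String>,
    pub last_modified: Option<String>,
}

/// What the worker should do after comparing stored and fetched validators.
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Validators match: skip the download and the db write.
    Skip,
    /// Validators are missing or differ: download and persist.
    DownloadAndPersist,
}

/// Decide whether a db write is needed (assumed logic, for illustration only).
pub fn decide(stored: &Validators, fetched: &Validators) -> Action {
    // Prefer the etag when both sides have one.
    if let (Some(old), Some(new)) = (&stored.etag, &fetched.etag) {
        return if old == new { Action::Skip } else { Action::DownloadAndPersist };
    }
    // Otherwise fall back to the last-modified value.
    if let (Some(old), Some(new)) = (&stored.last_modified, &fetched.last_modified) {
        return if old == new { Action::Skip } else { Action::DownloadAndPersist };
    }
    // No comparable validator on both sides: treat the content as changed.
    Action::DownloadAndPersist
}
```

With this shape, an asset whose server returns the same etag as the stored one results in `Action::Skip`, so no row is touched.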

nft_ingester/src/metadata_workers/downloader.rs

nft_ingester/src/metadata_workers/json_worker.rs

postgre-client/src/tasks.rs

postgre-client/tests/json_tasks_test.rs

nft_ingester/src/metadata_workers/json_worker.rs

nft_ingester/src/bin/ingester/main.rs

nft_ingester/src/metadata_workers/downloader.rs

nft_ingester/src/metadata_workers/json_worker.rs

nft_ingester/src/metadata_workers/persister.rs

nft_ingester/src/metadata_workers/streamer.rs

armyhaylenko

LRFBTM

StanChe

This is huge, thank you for the incredible work on it!
The PR should be refined, though. Some major concerns around the SQL are in the comments.
As an additional comment, I'd like to highlight that we're no longer persisting the error we got during downloading. It has some value, as it lets us distinguish some fixable errors (like a single/double quote as part of the uri) or some corner cases (local file path, dns errors, json directly embedded in the url).
Duplicating a comment from the ticket:
FYI, an analysis of the failure reasons:

  AND tsk_error NOT LIKE '%source: TimedOut%'
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("dns error%' 
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("tcp connect%'
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, Ssl(Error { code: ErrorCode(%'
  AND tsk_error NOT LIKE '%source: TooManyRedirects%'
  AND tsk_error NOT LIKE '%source: hyper::Error(IncompleteMessage)%'
  AND tsk_error != 'Failed to parse URL: RelativeUrlWithoutBase'
  AND tsk_metadata_url NOT LIKE 'ipfs://%'
  AND tsk_metadata_url NOT LIKE 'ifs://%'
  AND tsk_metadata_url NOT LIKE 'data:application/json%'
  AND tsk_metadata_url NOT LIKE 'https:://%'
  AND tsk_metadata_url NOT LIKE 'ttps://%'
  AND tsk_metadata_url NOT LIKE 'hhttps://%'
  AND tsk_metadata_url NOT LIKE 'Ihttps://%'
  AND tsk_metadata_url NOT LIKE 'https:// %'
  AND tsk_metadata_url NOT LIKE 'file:%'
  AND tsk_metadata_url NOT LIKE '%"'
  limit 250;

That covers all but several hundred errors.
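
The URL patterns from the query above could also be checked in code before a download is attempted. The following is a rough sketch under assumptions: the `UrlIssue` enum and `classify_url` function are hypothetical names that do not exist in the codebase, and the grouping of patterns into categories is my own reading of the list.

```rust
/// Known-bad URL shapes taken from the patterns in the query above.
/// The enum and its variant names are hypothetical.
#[derive(Debug, PartialEq)]
pub enum UrlIssue {
    /// `ipfs://` or the mistyped `ifs://`.
    Ipfs,
    /// JSON embedded directly in the url (`data:application/json...`).
    EmbeddedJson,
    /// Typos in the scheme such as `https:://` or `ttps://`.
    MalformedScheme,
    /// Local file path (`file:`).
    LocalFile,
    /// A stray double quote at the end of the url.
    TrailingQuote,
}

/// Return the first matching issue, or `None` if no known pattern applies.
pub fn classify_url(url: &str) -> Option<UrlIssue> {
    const MALFORMED: [&str; 5] = ["https:://", "ttps://", "hhttps://", "Ihttps://", "https:// "];

    if url.starts_with("ipfs://") || url.starts_with("ifs://") {
        Some(UrlIssue::Ipfs)
    } else if url.starts_with("data:application/json") {
        Some(UrlIssue::EmbeddedJson)
    } else if MALFORMED.iter().any(|prefix| url.starts_with(prefix)) {
        Some(UrlIssue::MalformedScheme)
    } else if url.starts_with("file:") {
        Some(UrlIssue::LocalFile)
    } else if url.ends_with('"') {
        Some(UrlIssue::TrailingQuote)
    } else {
        None
    }
}
```

Some of these categories (a trailing quote, a scheme typo) look fixable, which is one reason persisting the error or a classification of it may still be useful.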

interface/src/json_metadata.rs

migrations/13_refactor_tasks.sql

StanChe · 2025-02-17T15:52:57Z

migrations/13_refactor_tasks.sql

+    ADD COLUMN etag text DEFAULT NULL,
+    ADD COLUMN last_modified_at timestamptz DEFAULT NULL,
+    ADD COLUMN mutability mutability NOT NULL DEFAULT 'mutable',
+    ADD COLUMN task_status task_status NOT NULL DEFAULT 'pending';


Can/should we make the migration more "context-aware", for example by keeping the success state where it's present and setting last_modified_at to some old date? Or should this rather be run together with the JsonMigrator to restore the state in the DB? My main concern here is having to re-establish the state of all 55+M urls (through the initial download flow) while we're constantly receiving new ones as well.

nft_ingester/src/metadata_workers/downloader.rs

nft_ingester/src/metadata_workers/json_worker.rs

StanChe · 2025-02-17T18:32:45Z

postgre-client/src/tasks.rs

+                etag = tmp.etag, 
+                last_modified_at = tmp.last_modified_at, 
+                mutability = tmp.mutability, 
+                next_refresh_at = NOW + INTERVAL '1 day' 


Are we setting the next refresh/attempt time for immutable as well?

StanChe · 2025-02-17T18:34:56Z

postgre-client/src/tasks.rs


        let query = query_builder.build();
        query.execute(&self.pool).await?;

        Ok(())
    }

-    pub async fn get_pending_tasks(
+    pub async fn update_tasks_attempt_time(&self, data: Vec<String>) -> Result<(), IndexDbError> {


Probably a next ticket would be to make this more flexible, e.g. data: Vec<(String, Duration)>, to take the cache validity into account, or any other logic we want to apply for the next retry.
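
A per-task delay along those lines might look like the sketch below. It only shows the shape of the data, not a db call: `next_attempt_times` is a hypothetical helper, and treating the `String` as the task's metadata hash is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// For each (task id, delay) pair, compute when the task should be retried.
/// Hypothetical helper: the real method would bind these values into a query.
pub fn next_attempt_times(
    data: &[(String, Duration)],
    now: SystemTime,
) -> Vec<(String, SystemTime)> {
    data.iter()
        .map(|(task_id, delay)| (task_id.clone(), now + *delay))
        .collect()
}
```

The caller could then derive each `Duration` from the cache validity reported by the server, or from a backoff policy, without changing the signature again.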

StanChe · 2025-02-17T18:40:12Z

postgre-client/src/tasks.rs

        query_builder.push_bind(tasks_count);

-        // skip locked not to intersect with synchronizer work


This seems to be an important comment here. You may get a deadlock without skipping locked rows. Here is how this may happen:

this select selects 100 records for update; it acquires the lock on the first 50 and tries to acquire it on the 51st;

in a different process, the synchronizer batch-updates another 100 assets, 2 of which intersect with this select. In its batch it has already updated 99 assets, including the one this select tries to lock. Now it waits for the 100th, which is among the 50 already locked here.
Now we have a deadlock. I don't see this being addressed differently here.
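
To illustrate why skipping locked rows avoids this, here is a small in-memory model of the semantics of PostgreSQL's `FOR UPDATE SKIP LOCKED`. It is a conceptual sketch, not a db call: `select_skip_locked` is a hypothetical function, and row locks are modelled as a plain set of ids held by other transactions.

```rust
use std::collections::HashSet;

/// Model of `SELECT ... FOR UPDATE SKIP LOCKED LIMIT n`:
/// rows already locked by another transaction are skipped, never waited on,
/// so this selector cannot become part of a lock-wait cycle.
pub fn select_skip_locked(
    candidates: &[u64],
    locked_by_others: &HashSet<u64>,
    limit: usize,
) -> Vec<u64> {
    candidates
        .iter()
        .copied()
        .filter(|id| !locked_by_others.contains(id))
        .take(limit)
        .collect()
}
```

In the scenario above, the row the synchronizer has already updated would simply be left out of this batch and picked up by a later one.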

postgre-client/src/tasks.rs

StanChe · 2025-02-17T18:42:40Z

postgre-client/src/tasks.rs

+            "WITH selected_tasks AS (
+                                    SELECT t.metadata_hash FROM tasks AS t
+                                    WHERE t.task_status = 'success' AND NOW() > t.next_try_at AND t.mutability = 'mutable'
+                                    FOR UPDATE


same deadlocking potential here

kstepanovdev force-pushed the refactor/json-downloader-service branch from 6e85196 to ed41a1f Compare February 12, 2025 15:01

kstepanovdev requested review from StanChe, n00m4d and armyhaylenko February 12, 2025 16:40

armyhaylenko reviewed Feb 13, 2025

kstepanovdev requested a review from armyhaylenko February 13, 2025 15:16

kstepanovdev changed the title ~~Refactor/json downloader service~~ refactor/[mtg-1254]-json-downloader-service Feb 14, 2025

kstepanovdev changed the title ~~refactor/[mtg-1254]-json-downloader-service~~ refactor/MTG-1254-json-downloader-service Feb 14, 2025

kstepanovdev marked this pull request as ready for review February 14, 2025 11:53

kstepanovdev changed the title ~~refactor/MTG-1254-json-downloader-service~~ refactor/MTG-1254-json-downloader-service[mtg-144][mtg-526] Feb 14, 2025

n00m4d reviewed Feb 14, 2025

armyhaylenko reviewed Feb 17, 2025

kstepanovdev requested review from armyhaylenko and n00m4d February 17, 2025 10:32

armyhaylenko previously approved these changes Feb 17, 2025

kstepanovdev dismissed armyhaylenko’s stale review via 89b99d1 February 17, 2025 14:47

StanChe reviewed Feb 17, 2025

kstepanovdev marked this pull request as draft February 18, 2025 16:49

kstepanovdev added 12 commits February 25, 2025 18:29

Add pg migrations for new tasks structure

cc9eae2

Add Metadata Streamer

7ccd73f

Update tasks table & implement json downloader and streamer services

bc20c66

refactor persister

de2656c

add retry mechanism

40ae4f6

refactor persister

f3c7834

refactor the stuff

53f86b7

fix migrations and calls to new table structure

dbb6d11

fix types and pg requests

c9de2a4

add tests for json tasks selection && fix issues with db selection

89f0cad

fmt

ab0a248

fmt && clippy

8a15a16

kstepanovdev added 10 commits February 25, 2025 18:36

remove redundant channel in streamer

21921a5

fix tests && refactor

1b44b1e

Refactor & fix comments from the PR

68c690f

get rid of redundant spawning threads inside of functions

0d8f416

remove redundant comment

a482673

Fix comments from PR review

f3f3c86

work on comments

64da4d3

avoid deadlocks and force immutability

ca86e90

fix comments

530fe9c

rebase with develop

b85ffac

kstepanovdev force-pushed the refactor/json-downloader-service branch from ec82874 to b85ffac Compare February 25, 2025 16:58
