Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor/MTG-1254-json-downloader-service[mtg-144][mtg-526] #415

Draft
wants to merge 22 commits into
base: develop
Choose a base branch
from

Conversation

kstepanovdev
Copy link
Contributor

@kstepanovdev kstepanovdev commented Feb 11, 2025

The draft of the architecture is here:
 

jd-arch

The core idea is to refactor the tasks column, removing redundant data from it and adding new data such as etag to reduce the database workload.
New workflow at first tries to get etag or last-modified-date if such fields are present in a dedicated asset and etag hasn't been changed it will not be written to the db.
Another small feature is leveraging tokio::select without an outer loop.

@kstepanovdev kstepanovdev force-pushed the refactor/json-downloader-service branch from 6e85196 to ed41a1f Compare February 12, 2025 15:01
@kstepanovdev kstepanovdev changed the title Refactor/json downloader service refactor/[mtg-1254]-json-downloader-service Feb 14, 2025
@kstepanovdev kstepanovdev changed the title refactor/[mtg-1254]-json-downloader-service refactor/MTG-1254-json-downloader-service Feb 14, 2025
@kstepanovdev kstepanovdev marked this pull request as ready for review February 14, 2025 11:53
@kstepanovdev kstepanovdev changed the title refactor/MTG-1254-json-downloader-service refactor/MTG-1254-json-downloader-service[mtg-144][mtg-526] Feb 14, 2025
armyhaylenko
armyhaylenko previously approved these changes Feb 17, 2025
Copy link
Contributor

@armyhaylenko armyhaylenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LRFBTM

Copy link
Collaborator

@StanChe StanChe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge, thank you for the incredible work on it!
The PR should be refined though. Some major concerns around sql are in comments.
As an additional comment I'd like to highlight that we're no longer persisting the error we got during downloading, but it has some value, as we may distinguish some fixable errors (like a single/double quote as part of the uri), or some corner cases (local file path, dns errors, json directly embedded in the url)
Duplicating a comment from a ticket:
FYI, analyses of reasons of failure:

  AND tsk_error NOT LIKE '%source: TimedOut%'
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("dns error%' 
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("tcp connect%'
  AND tsk_error NOT LIKE '%source: hyper::Error(Connect, Ssl(Error { code: ErrorCode(%'
  AND tsk_error NOT LIKE '%source: TooManyRedirects%'
  AND tsk_error NOT LIKE '%source: hyper::Error(IncompleteMessage)%'
  AND tsk_error != 'Failed to parse URL: RelativeUrlWithoutBase'
  AND tsk_metadata_url NOT LIKE 'ipfs://%'
  AND tsk_metadata_url NOT LIKE 'ifs://%'
  AND tsk_metadata_url NOT LIKE 'data:application/json%'
  AND tsk_metadata_url NOT LIKE 'https:://%'
  AND tsk_metadata_url NOT LIKE 'ttps://%'
  AND tsk_metadata_url NOT LIKE 'hhttps://%'
  AND tsk_metadata_url NOT LIKE 'Ihttps://%'
  AND tsk_metadata_url NOT LIKE 'https:// %'
  AND tsk_metadata_url NOT LIKE 'file:%'
  AND tsk_metadata_url NOT LIKE '%"'
  limit 250;

That covers all but several hundred errors

ADD COLUMN etag text DEFAULT NULL,
ADD COLUMN last_modified_at timestamptz DEFAULT NULL,
ADD COLUMN mutability mutability NOT NULL DEFAULT 'mutable',
ADD COLUMN task_status task_status NOT NULL DEFAULT 'pending';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can/should we make the migration more "context-aware" for example, but keeping the success state, where it's present and keeping the last_modified_at to some old date? Or should this rather be run together with the JsonMigrator to restore the state in the DB? My main concern here is having to re-clarify (initial download flow) for the state of all 55+M urls while we're constantly receiving new ones as well

etag = tmp.etag,
last_modified_at = tmp.last_modified_at,
mutability = tmp.mutability,
next_refresh_at = NOW + INTERVAL '1 day'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we setting the next refresh/attempt time for immutable as well?


let query = query_builder.build();
query.execute(&self.pool).await?;

Ok(())
}

pub async fn get_pending_tasks(
pub async fn update_tasks_attempt_time(&self, data: Vec<String>) -> Result<(), IndexDbError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably as a next ticket would be to make it more flexible, like data:Vec<(String,Duration)> to consider the cache validity, or any other logic we want to put for the next retry

query_builder.push_bind(tasks_count);

// skip locked not to intersect with synchronizer work
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an important comment here. You may get a deadlock without skipping locked. Here is how this may happen:

  • this select selects for update 100 records, it acquires lock on the first 50 and tries to acquire it on the 51st;
  • in a different process the synchronizer batch updates another 100 assets, 2 of which intersect, in the batch it has already updated 99 assets including the one this select tries to lock. Now it waits for the 100s which is being locked among other 50 here.
    Now we'll have a deadlock. I don't see this being addressed differently here.

"WITH selected_tasks AS (
SELECT t.metadata_hash FROM tasks AS t
WHERE t.task_status = 'success' AND NOW() > t.next_try_at AND t.mutability = 'mutable'
FOR UPDATE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same deadlocking potential here

@kstepanovdev kstepanovdev marked this pull request as draft February 18, 2025 16:49
@kstepanovdev kstepanovdev force-pushed the refactor/json-downloader-service branch from ec82874 to b85ffac Compare February 25, 2025 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants