refactor/MTG-1254-json-downloader-service[mtg-144][mtg-526] #415
base: develop
Conversation
Force-pushed 6e85196 to ed41a1f
LRFBTM
This is huge, thank you for the incredible work on it!
The PR should be refined, though. Some major concerns around the SQL are in the comments.
As an additional comment, I'd like to highlight that we're no longer persisting the error we got during downloading. That error has some value, as it lets us distinguish fixable errors (like a single/double quote as part of the URI) from corner cases (local file paths, DNS errors, JSON embedded directly in the URL).
Duplicating a comment from a ticket:
FYI, an analysis of the failure reasons (the SELECT ... FROM ... prefix of the query was elided in the original comment):
AND tsk_error NOT LIKE '%source: TimedOut%'
AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("dns error%'
AND tsk_error NOT LIKE '%source: hyper::Error(Connect, ConnectError("tcp connect%'
AND tsk_error NOT LIKE '%source: hyper::Error(Connect, Ssl(Error { code: ErrorCode(%'
AND tsk_error NOT LIKE '%source: TooManyRedirects%'
AND tsk_error NOT LIKE '%source: hyper::Error(IncompleteMessage)%'
AND tsk_error != 'Failed to parse URL: RelativeUrlWithoutBase'
AND tsk_metadata_url NOT LIKE 'ipfs://%'
AND tsk_metadata_url NOT LIKE 'ifs://%'
AND tsk_metadata_url NOT LIKE 'data:application/json%'
AND tsk_metadata_url NOT LIKE 'https:://%'
AND tsk_metadata_url NOT LIKE 'ttps://%'
AND tsk_metadata_url NOT LIKE 'hhttps://%'
AND tsk_metadata_url NOT LIKE 'Ihttps://%'
AND tsk_metadata_url NOT LIKE 'https:// %'
AND tsk_metadata_url NOT LIKE 'file:%'
AND tsk_metadata_url NOT LIKE '%"'
LIMIT 250;
That covers all but several hundred errors.
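If we do keep persisting the error, a coarse classification could be derived from the filters above. A hypothetical sketch (the enum and its variants are illustrative, not code from this PR):

// Hypothetical grouping of the failure reasons filtered out above.
enum DownloadFailure {
    TimedOut,          // source: TimedOut
    Dns,               // hyper ConnectError("dns error...")
    TcpConnect,        // hyper ConnectError("tcp connect...")
    Ssl,               // hyper Ssl(Error { code: ErrorCode(...) })
    TooManyRedirects,  // source: TooManyRedirects
    IncompleteMessage, // hyper::Error(IncompleteMessage)
    RelativeUrl,       // Failed to parse URL: RelativeUrlWithoutBase
    FixableTypo,       // 'hhttps://', 'ttps://', 'https:://', stray quotes/spaces
    EmbeddedJson,      // data:application/json...
    NonHttpScheme,     // ipfs://, file:
}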
migrations/13_refactor_tasks.sql
Outdated
ADD COLUMN etag text DEFAULT NULL,
ADD COLUMN last_modified_at timestamptz DEFAULT NULL,
ADD COLUMN mutability mutability NOT NULL DEFAULT 'mutable',
ADD COLUMN task_status task_status NOT NULL DEFAULT 'pending';
Can/should we make the migration more "context-aware"? For example, keeping the success state where it's present and setting last_modified_at to some old date. Or should this rather be run together with the JsonMigrator to restore the state in the DB? My main concern here is having to re-run the initial download flow for all 55M+ URLs while we're constantly receiving new ones as well. See the sketch after this comment.
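A minimal sketch of the first option, assuming sqlx as in the rest of postgre-client, and assuming the pre-migration status lives in a tsk_status column (the name is a guess based on the tsk_ prefix seen elsewhere):

// Hypothetical backfill: keep already-downloaded tasks in 'success' and give
// them an epoch-old last_modified_at, so the refresh loop revisits them lazily
// instead of pushing all 55M+ rows back through the initial download flow.
async fn backfill_task_state(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    sqlx::query(
        "UPDATE tasks
         SET task_status = 'success',
             last_modified_at = TIMESTAMPTZ '2000-01-01 00:00:00+00'
         WHERE tsk_status = 'success'",
    )
    .execute(pool)
    .await?;
    Ok(())
}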
postgre-client/src/tasks.rs
Outdated
etag = tmp.etag,
last_modified_at = tmp.last_modified_at,
mutability = tmp.mutability,
next_refresh_at = NOW() + INTERVAL '1 day'
Are we setting the next refresh/attempt time for immutable tasks as well? One way to avoid that is sketched below.
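A sketch of the alternative (column names are taken from this PR; the ANY($1)-based update shape is an assumption, not the PR's code):

// Schedule the next refresh only for mutable tasks; immutable ones get NULL so
// the refresh query never picks them up again.
async fn set_next_refresh(pool: &sqlx::PgPool, keys: &[String]) -> Result<(), sqlx::Error> {
    sqlx::query(
        "UPDATE tasks
         SET next_refresh_at = CASE
                 WHEN mutability = 'mutable' THEN NOW() + INTERVAL '1 day'
                 ELSE NULL
             END
         WHERE metadata_hash = ANY($1)",
    )
    .bind(keys)
    .execute(pool)
    .await?;
    Ok(())
}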
let query = query_builder.build();
query.execute(&self.pool).await?;

Ok(())
}

pub async fn get_pending_tasks(
pub async fn update_tasks_attempt_time(&self, data: Vec<String>) -> Result<(), IndexDbError> {
Probably a next ticket would be to make this more flexible, e.g. data: Vec<(String, Duration)>, to take cache validity into account, or any other logic we want to apply to the next retry. A sketch follows below.
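What that follow-up could look like, assuming sqlx and the columns visible in this PR (the UNNEST pairing is illustrative):

use std::time::Duration;

// Hypothetical follow-up: a per-task delay lets the caller encode cache
// validity or backoff policy. Two parallel arrays are bound and paired up by
// Postgres via UNNEST.
pub async fn update_tasks_attempt_time(
    &self,
    data: Vec<(String, Duration)>,
) -> Result<(), IndexDbError> {
    let (keys, delay_secs): (Vec<String>, Vec<f64>) =
        data.into_iter().map(|(k, d)| (k, d.as_secs_f64())).unzip();
    sqlx::query(
        "UPDATE tasks AS t
         SET next_try_at = NOW() + make_interval(secs => u.delay_secs)
         FROM UNNEST($1::text[], $2::float8[]) AS u(key, delay_secs)
         WHERE t.metadata_hash = u.key",
    )
    .bind(&keys)
    .bind(&delay_secs)
    .execute(&self.pool)
    .await?;
    Ok(())
}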
query_builder.push_bind(tasks_count);

// skip locked not to intersect with synchronizer work
This seems to be an important comment. You may get a deadlock without SKIP LOCKED. Here is how that can happen:
- this SELECT ... FOR UPDATE targets 100 records; it has acquired locks on the first 50 and is trying to acquire the 51st;
- in a different process, the synchronizer batch-updates another 100 assets, 2 of which intersect with these; it has already updated 99 of them, including the one this SELECT is trying to lock, and is now waiting for its 100th, which is among the 50 already locked here.
Now we have a deadlock. I don't see this being addressed differently here.
postgre-client/src/tasks.rs
Outdated
"WITH selected_tasks AS ( | ||
SELECT t.metadata_hash FROM tasks AS t | ||
WHERE t.task_status = 'success' AND NOW() > t.next_try_at AND t.mutability = 'mutable' | ||
FOR UPDATE |
Same deadlocking potential here; the SKIP LOCKED fix is sketched below.
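For illustration, the quoted query with the fix applied (only the locking clause changes; the LIMIT bind is assumed from the surrounding query builder):

// SKIP LOCKED makes the SELECT pass over rows the synchronizer currently holds
// instead of waiting on them, so the circular wait described above cannot form.
const REFRESH_SELECT: &str = "WITH selected_tasks AS (
    SELECT t.metadata_hash FROM tasks AS t
    WHERE t.task_status = 'success' AND NOW() > t.next_try_at AND t.mutability = 'mutable'
    FOR UPDATE SKIP LOCKED
    LIMIT $1
)";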
Force-pushed ec82874 to b85ffac
The draft of the architecture is here:
The core idea is to refactor the tasks column, removing redundant data from it and adding new data such as the etag, to reduce the database workload. The new workflow first tries to get the etag or last-modified-date; if such fields are present on a dedicated asset and the etag hasn't changed, nothing is written to the DB. Another small feature is leveraging tokio::select without an outer loop.
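A minimal sketch of the conditional-download idea using reqwest (the function and its shape are illustrative, not the PR's exact code):

use reqwest::{header, Client, StatusCode};

// Send the stored etag back as If-None-Match; a 304 response means the body
// was not re-downloaded and nothing needs to be written to the DB. On 200,
// return the body together with the new etag (if any) so both can be stored.
async fn fetch_if_changed(
    client: &Client,
    url: &str,
    etag: Option<&str>,
) -> reqwest::Result<Option<(String, Option<String>)>> {
    let mut request = client.get(url);
    if let Some(tag) = etag {
        request = request.header(header::IF_NONE_MATCH, tag);
    }
    let response = request.send().await?;
    if response.status() == StatusCode::NOT_MODIFIED {
        return Ok(None); // unchanged: skip the DB write entirely
    }
    let new_etag = response
        .headers()
        .get(header::ETAG)
        .and_then(|v| v.to_str().ok())
        .map(str::to_owned);
    let body = response.text().await?;
    Ok(Some((body, new_etag)))
}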