-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use non-resumable downloading for actor bundles to avoid CI test flakyness #5286
Conversation
1e34baf
to
5020fce
Compare
Co-authored-by: Hubert <[email protected]>
Co-authored-by: Hubert <[email protected]>
@@ -87,6 +89,7 @@ pub async fn load_actor_bundles_from_server( | |||
network: &NetworkChain, | |||
bundles: &[ActorBundleInfo], | |||
) -> anyhow::Result<Vec<Cid>> { | |||
let semaphore = Arc::new(Semaphore::new(4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 4 in particular? And not 3 or 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think any value between 2-8 is fine, 4 is somewhere in the middle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 9 is not fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Larger number is faster but more error-prone. Previously, all tasks run in parallel and the test was quite flaky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the time cost of the test, 4 makes it more stable and still fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can try, but I'm not 💯 convinced. I'd rather we have a deterministically correct and working solution. This seems like reducing error probability from an unknown chance to another unknown (but likely smaller) chance, while always paying the price of reduced performance (slower startup for fresh nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been using the same logic for RPC snapshot test and it's much more stable, the only notable difference is, actor bundle uses Github assets primarily while RPC snapshot test uses DO space CDN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's a common practice to limit max concurrent downloads, e.g. the author uses 8 in https://patshaughnessy.net/2020/1/20/downloading-100000-files-using-async-rust
And I think the number really depends on the network environment between client and remote server
Summary of changes
To mitigate #5287
The previous effort did not fully fix the issue
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes
Other information and links
Change checklist