fix: use non-resumable downloading for actor bundles to avoid CI test flakyness #5286

Merged 8 commits on Feb 13, 2025

@hanabi1224 hanabi1224 commented Feb 13, 2025

Summary of changes

Mitigates #5287. The previous effort did not fully fix the issue.

Changes introduced in this pull request:

  • use non-resumable downloads for actor bundles to avoid CI test flakiness
  • limit the maximum number of concurrent download jobs
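
The second change can be pictured with a small, dependency-free sketch. The PR itself wraps an async `Semaphore` in an `Arc`; the version below swaps in std threads with a `Mutex`/`Condvar` counting semaphore so it runs without extra crates. The `Semaphore` type, `run_limited`, and the sleep that stands in for a download are all hypothetical names for illustration, not code from this repository.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from std primitives (illustrative only).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `jobs` fake downloads with at most `limit` in flight at once and
/// returns the highest concurrency actually observed.
fn run_limited(jobs: usize, limit: usize) -> usize {
    let semaphore = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (semaphore, in_flight, peak) =
                (semaphore.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                semaphore.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(10)); // stand-in for a download
                in_flight.fetch_sub(1, Ordering::SeqCst);
                semaphore.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    // 16 jobs and a limit of 4 are arbitrary example values.
    let peak = run_limited(16, 4);
    println!("peak concurrent downloads: {peak}");
}
```

Every job is still started, but a job only proceeds once it holds a permit, so the observed peak never exceeds the limit.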

Reference issue to close (if applicable)

Closes

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

@hanabi1224 hanabi1224 marked this pull request as ready for review February 13, 2025 11:18
@hanabi1224 hanabi1224 requested a review from a team as a code owner February 13, 2025 11:18
@hanabi1224 hanabi1224 requested review from lemmih and LesnyRumcajs and removed request for a team February 13, 2025 11:18
@hanabi1224 hanabi1224 force-pushed the hm/download-file-panic branch from 1e34baf to 5020fce Compare February 13, 2025 11:20
src/utils/net/download_file.rs (outdated, resolved)
@@ -87,6 +89,7 @@ pub async fn load_actor_bundles_from_server(
    network: &NetworkChain,
    bundles: &[ActorBundleInfo],
) -> anyhow::Result<Vec<Cid>> {
    let semaphore = Arc::new(Semaphore::new(4));
Member:
Why 4 in particular? And not 3 or 10?

Contributor Author:
I think any value between 2 and 8 is fine; 4 is somewhere in the middle.

Member:
Why is 9 not fine?

Contributor Author:
A larger number is faster but more error-prone. Previously, all tasks ran in parallel and the test was quite flaky.

Contributor Author (@hanabi1224, Feb 13, 2025):
Judging by the test's run time, 4 makes it more stable while keeping it fast.

Member:
We can try, but I'm not 💯 convinced. I'd rather we have a deterministically correct and working solution. This seems like reducing error probability from an unknown chance to another unknown (but likely smaller) chance, while always paying the price of reduced performance (slower startup for fresh nodes).

Contributor Author:
We've been using the same logic for the RPC snapshot test and it's much more stable. The only notable difference is that actor bundles primarily use GitHub assets, while the RPC snapshot test uses DO space CDN.

Contributor Author (@hanabi1224, Feb 13, 2025):
I believe it's common practice to limit the maximum number of concurrent downloads; e.g. the author uses 8 in https://patshaughnessy.net/2020/1/20/downloading-100000-files-using-async-rust
I also think the right number really depends on the network environment between the client and the remote server.
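
If the right limit depends on the network environment, one option is to make it configurable rather than fixed. The sketch below is only an illustration of that idea and is not part of this PR: the environment variable name `MAX_CONCURRENT_DOWNLOADS` and the helper `parse_download_limit` are hypothetical, and the default of 4 simply mirrors the value discussed above.

```rust
use std::env;

/// Default taken from the value used in this PR.
const DEFAULT_MAX_CONCURRENT_DOWNLOADS: usize = 4;

/// Parses an optional raw setting into a concurrency limit.
/// Missing, unparsable, or zero values fall back to the default.
fn parse_download_limit(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_CONCURRENT_DOWNLOADS)
}

fn main() {
    // `MAX_CONCURRENT_DOWNLOADS` is a hypothetical variable name.
    let raw = env::var("MAX_CONCURRENT_DOWNLOADS").ok();
    let limit = parse_download_limit(raw.as_deref());
    println!("using up to {limit} concurrent downloads");
}
```

Rejecting zero matters here because a semaphore created with zero permits would never let any download start.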

@hanabi1224 hanabi1224 added this pull request to the merge queue Feb 13, 2025
Merged via the queue into main with commit e28fe8f Feb 13, 2025
35 checks passed
@hanabi1224 hanabi1224 deleted the hm/download-file-panic branch February 13, 2025 14:09
3 participants