switch to async streaming download: #1982

Merged
merged 9 commits into from
Oct 3, 2024


ikreymer
Member

@ikreymer ikreymer commented Jul 30, 2024

  • download via presigned URLs using aiohttp instead of the boto APIs
  • use async methods from stream-zip to generate the zip (note that stream-zip still does a sync->async conversion under the hood)
  • follow-up to "Implement downloading archived item + QA runs as multi-WACZ" #1933 for streaming download improvements
  • fix datapackage.json in multi-WACZ so that each resources entry contains the same fields as in a single WACZ: name, path, hash, bytes
  • begin adding additional metadata to multi-WACZ files, including type (crawl, upload, collection, qaRun), id (unique id for the object), title / description if available (for crawl/upload/collection), and crawlId for qaRun
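
For readers unfamiliar with the sync->async conversion mentioned above, here is a minimal sketch of one common way to bridge a blocking generator into async iteration. It is not stream-zip's actual implementation: `to_async_iter` and `fake_zip_chunks` are hypothetical names used only for illustration, and the chunks are placeholder bytes rather than real ZIP output.

```python
import asyncio
from typing import AsyncIterator, Iterable, Iterator

_DONE = object()  # sentinel returned by next() once the sync iterator is exhausted


async def to_async_iter(sync_iterable: Iterable[bytes]) -> AsyncIterator[bytes]:
    """Yield items from a blocking iterable without blocking the event loop.

    Hypothetical helper: each next() call runs in a worker thread.
    """
    iterator = iter(sync_iterable)
    while True:
        # next() may block (for example on compression), so run it off-loop.
        chunk = await asyncio.to_thread(next, iterator, _DONE)
        if chunk is _DONE:
            break
        yield chunk


def fake_zip_chunks() -> Iterator[bytes]:
    # Stand-in for a synchronous zip generator (assumption, not real ZIP data).
    for part in (b"PK-header", b"payload", b"PK-footer"):
        yield part


async def collect() -> bytes:
    # Consume the bridged iterator the way an async response body would.
    return b"".join([chunk async for chunk in to_async_iter(fake_zip_chunks())])


result = asyncio.run(collect())
```

The trade-off of such a bridge is that the zip work itself still runs synchronously, one chunk at a time, in a thread; only the waiting is made cooperative.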

@ikreymer ikreymer requested a review from tw4l July 30, 2024 20:06
@tw4l
Member

tw4l commented Jul 30, 2024

When the multi-WACZs being produced in this branch are loaded into ReplayWeb.page, no seed pages or resources are listed. There may be something slightly off, investigating further.

@ikreymer ikreymer marked this pull request as draft August 6, 2024 16:44
- remove unused sync functions
- use async methods from stream-zip
- note that stream-zip still does a sync->async conversion under the hood
- follow-up to #1933 for streaming download improvements
fully remove boto
tests: update test to ensure multi-wacz resources name == path
only include name, path, hash, bytes in each resource entry!
@ikreymer
Member Author

ikreymer commented Oct 3, 2024

Should be fixed now! It turns out the datapackage.json was not quite valid: the path in each resources entry was incorrect (it was not equal to name), and the entry properties did not match those of a single WACZ.
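
As a rough illustration of the constraint described here, the following sketch builds a resource entry with only name, path, hash and bytes, and checks that path equals name. The helper names (`make_resource`, `check_resources`), the example filenames, and the `sha256:` hash prefix are assumptions for illustration, not code from this PR.

```python
import hashlib
from typing import Dict, List

# The only keys each multi-WACZ resource entry should carry, per this PR.
RESOURCE_KEYS = {"name", "path", "hash", "bytes"}


def make_resource(filename: str, data: bytes) -> Dict[str, object]:
    """Build one resources entry (hypothetical helper; hash format is assumed)."""
    return {
        "name": filename,
        "path": filename,  # path must equal name in the multi-WACZ
        "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }


def check_resources(datapackage: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; empty if the entries look valid."""
    problems: List[str] = []
    for index, resource in enumerate(datapackage.get("resources", [])):
        if set(resource) != RESOURCE_KEYS:
            problems.append(f"resource {index}: unexpected keys {sorted(resource)}")
        elif resource["name"] != resource["path"]:
            problems.append(f"resource {index}: path != name")
    return problems


good = {"resources": [make_resource("crawl-1.wacz", b"example bytes")]}
bad = {
    "resources": [
        dict(make_resource("crawl-2.wacz", b"x"), path="wrong/crawl-2.wacz")
    ]
}
```

Calling `check_resources(good)` returns an empty list, while `check_resources(bad)` reports the mismatched path.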

@ikreymer ikreymer marked this pull request as ready for review October 3, 2024 03:40
@tw4l
Member

tw4l commented Oct 3, 2024

Tested on dev and working well! Nice job

…age v1:

- include 'id', but also 'title', 'description' and 'organization' fields, as well as 'crawlId' when possible
- add 'type' indicating 'crawl', 'upload', 'collection', 'qa-run'
backend/btrixcloud/crawls.py (outdated review thread, resolved)
@ikreymer ikreymer merged commit 104ea09 into main Oct 3, 2024
4 checks passed
@ikreymer ikreymer deleted the async-streaming-download branch October 3, 2024 23:13