switch to async streaming download: #1982

ikreymer · 2024-07-30T20:06:01Z

- download via presigned URLs with aiohttp instead of boto APIs
- use async methods from stream-zip to generate the zip; note that stream-zip still does a sync->async conversion under the hood
- follow-up to #1933 (Implement downloading archived item + QA runs as multi-WACZ) for streaming download improvements
- fix datapackage.json in multi-WACZ so that each resources object contains name, path, hash, bytes, matching single WACZ
- begin adding additional metadata to multi-WACZ files, including type (crawl, upload, collection, qaRun), id (unique id for the object), title / description if available (for crawl/upload/collection), and crawlId for qaRun
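To make the note about a sync->async conversion more concrete, here is a minimal sketch of one common way to expose a blocking iterator as an async one. It is not stream-zip's actual implementation; `iterate_in_thread`, `sync_chunks` and `collect` are hypothetical names used only for illustration, and `asyncio.to_thread` assumes Python 3.9 or later.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, Iterator

_DONE = object()  # sentinel returned by next() once the sync iterator is exhausted


async def iterate_in_thread(
    make_iter: Callable[[], Iterable[bytes]]
) -> AsyncIterator[bytes]:
    """Yield chunks from a blocking iterable without blocking the event loop."""
    it: Iterator[bytes] = iter(make_iter())
    while True:
        # Each next() call runs in a worker thread; the event loop stays free.
        chunk = await asyncio.to_thread(next, it, _DONE)
        if chunk is _DONE:
            break
        yield chunk


def sync_chunks() -> Iterator[bytes]:
    # Stand-in for a blocking producer, such as a sync zip generator.
    yield b"first "
    yield b"second "
    yield b"third"


async def collect() -> bytes:
    # Consume the wrapped iterator with "async for", as an async response would.
    return b"".join([chunk async for chunk in iterate_in_thread(sync_chunks)])
```

Running `asyncio.run(collect())` drains the wrapped generator. The cost of this pattern is one thread hop per chunk, which is one reason such a conversion is worth noting even when the public API is async.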

tw4l · 2024-07-30T21:05:18Z

When the multi-WACZs produced in this branch are loaded into ReplayWeb.page, no seed pages or resources are listed. Something may be slightly off; investigating further.

ikreymer · 2024-10-03T03:40:23Z

Should be fixed now! It turns out the datapackage.json was not quite valid: the path in each resources entry was incorrect (it was not equal to name). The fix makes path equal to name and matches the resource properties to those of a single WACZ.
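The shape of the resources entries described above can be sketched as follows. This is an illustrative sketch only, not the code in backend/btrixcloud/crawls.py: `resource_entry` and `build_datapackage` are hypothetical helpers, and the `sha256:<hex>` hash format is an assumption.

```python
import hashlib
import json
from typing import Dict


def resource_entry(filename: str, data: bytes) -> Dict[str, object]:
    """Build one resources entry with exactly name, path, hash and bytes."""
    return {
        "name": filename,
        "path": filename,  # path is kept equal to name
        # Assumption: hash is written as "sha256:<hex digest>"
        "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }


def build_datapackage(files: Dict[str, bytes]) -> str:
    """Serialize a minimal datapackage.json body listing every WACZ file."""
    resources = [resource_entry(name, data) for name, data in files.items()]
    return json.dumps({"resources": resources}, indent=2)
```

For example, `build_datapackage({"crawl-1.wacz": b"..."})` returns a JSON string whose single resource has `name` and `path` both set to `crawl-1.wacz`, which is the property the updated test checks.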

tw4l · 2024-10-03T14:28:55Z

Tested on dev and working well! Nice job

ikreymer requested a review from tw4l July 30, 2024 20:06

ikreymer marked this pull request as draft August 6, 2024 16:44

ikreymer added 2 commits October 2, 2024 19:37

switch to async streaming download:

9df816b

- remove unused sync functions
- use async methods from stream-zip
- note that stream-zip still does a sync->async conversion under the hood
- follow-up to #1933 for streaming download improvements

fix path == name in datapackage.json, to match previous behavior

6036319

fully remove boto tests: update test to ensure multi-wacz resources name == path

ikreymer force-pushed the async-streaming-download branch from 630255c to 6036319 Compare October 3, 2024 03:13

ikreymer added 3 commits October 2, 2024 20:28

fix multi-wacz datapackage.json:

68deb3f

only include name, path, hash, bytes in each resource entry!

update test

62dd80b

fix missing import

cf6b7b0

ikreymer marked this pull request as ready for review October 3, 2024 03:40

tw4l approved these changes Oct 3, 2024

ikreymer added 3 commits October 3, 2024 09:38

switch back to mostly sync download with requests, but simplified to not require boto

c7babf9

add additional metadata for multi-wacz to be conformant with datapackage v1:

4e5c5ed

- include 'id', but also 'title', 'description' and 'organization' fields, as well as 'crawlId' when possible
- add 'type' indicating 'crawl', 'upload', 'collection', 'qa-run'
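The metadata fields listed in this commit could be assembled roughly as below. This is a hedged sketch: `build_metadata` and `ALLOWED_TYPES` are hypothetical names, where the fields end up inside datapackage.json is not shown, and the `qaRun` spelling follows the PR description and the later rename commit rather than the 'qa-run' spelling in this commit message.

```python
from typing import Dict, Optional

# Type values named in the PR; "qaRun" is the spelling after the later rename.
ALLOWED_TYPES = {"crawl", "upload", "collection", "qaRun"}


def build_metadata(
    type_: str,
    id_: str,
    title: Optional[str] = None,
    description: Optional[str] = None,
    organization: Optional[str] = None,
    crawl_id: Optional[str] = None,
) -> Dict[str, str]:
    """Return metadata with type and id, plus any optional fields provided."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {type_}")
    metadata = {"type": type_, "id": id_}
    optional = {
        "title": title,
        "description": description,
        "organization": organization,
        "crawlId": crawl_id,
    }
    # Only include optional fields that were actually supplied.
    metadata.update({key: value for key, value in optional.items() if value is not None})
    return metadata
```

Omitting unset fields, rather than writing them as null, keeps the output small; whether the actual implementation does the same is not stated in the PR.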

add 'software' to metadata

124f4c1

tw4l reviewed Oct 3, 2024

backend/btrixcloud/crawls.py (outdated, resolved)

qa_run -> qaRun

fceafdc

ikreymer merged commit 104ea09 into main Oct 3, 2024
4 checks passed

ikreymer deleted the async-streaming-download branch October 3, 2024 23:13
