Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement downloading archived item + QA runs as multi-WACZ #1933

Merged
merged 18 commits into from
Jul 25, 2024

Conversation

tw4l
Copy link
Member

@tw4l tw4l commented Jul 16, 2024

Fixes #1412

Changes

Backend

  • Adds all-crawls, crawls, and uploads API endpoints to download archived item as multi-WACZ
  • Adds backend tests for new endpoints

Frontend

Adds ability to download archived item from:

  • Button in archived item detail Files tab
  • Archived item details actions menu
  • Archived items list menu

Screenshots

Files tab

Screenshot 2024-07-16 at 1 38 12 PM

Detail actions menu

Screenshot 2024-07-16 at 1 40 06 PM

Archived items list menu

Screenshot 2024-07-16 at 1 37 32 PM

tw4l added 8 commits July 16, 2024 13:02
Not working yet - doesn't trigger download
This commit removes some safety checks on the btrix-dropdown
containing the menu items. In Brave, Chrome, and Firefox, the
behavior that the checks were supposed to guard against aren't
happening after the change.
Copy link
Member

@emma-sg emma-sg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! We could stand to remove the other preventDefault calls on click events on overflow dropdowns in list views too, since I think now that we've migrated to the new tables we're using links that don't cover the dropdown menu any more, but that could be a cleanup item I suppose.

@tw4l
Copy link
Member Author

tw4l commented Jul 16, 2024

Looks good to me! We could stand to remove the other preventDefault calls on click events on overflow dropdowns in list views too, since I think now that we've migrated to the new tables we're using links that don't cover the dropdown menu any more, but that could be a cleanup item I suppose.

Yeah I think maybe that's best as a separate cleanup task? Thank you!

@tw4l
Copy link
Member Author

tw4l commented Jul 17, 2024

@SuaYoo thanks for catching the typo!

@SuaYoo SuaYoo self-requested a review July 17, 2024 01:11
@ikreymer ikreymer changed the title Implement downloading archived item as multi-WACZ Implement downloading archived item + QA runs as multi-WACZ Jul 20, 2024
@ikreymer
Copy link
Member

Also added download of QA run data as multi-WACZ as well, as now we are downloading only the first WACZ

@tw4l
Copy link
Member Author

tw4l commented Jul 22, 2024

Also added download of QA run data as multi-WACZ as well, as now we are downloading only the first WACZ

Thanks for this!

@tw4l
Copy link
Member Author

tw4l commented Jul 23, 2024

@SuaYoo mind taking another look at this when you have a moment?

Copy link
Member

@ikreymer ikreymer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI and other changes look good - (and hopefully the QA downloads s well).

Investigating potential crc32 mismatches in the generated WACZs, though I think that may have to do with the streaming zip library we're using and not related to the changes in this PR itself.

Copy link
Member

@Shrinks99 Shrinks99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the UI, looks good, haven't tested the feature myself. Other issues are best dealt with by other folks, approving my end of things! :)

- update to a new version 'stream-zip' which does not require crc32 to produce a stream WACZ
- don't specify crc32 in streaming WACZ as it may be incorrect or missing, compute on the fly
- update types for latest 'stream-zip'
@ikreymer
Copy link
Member

Investigating potential crc32 mismatches in the generated WACZs, though I think that may have to do with the streaming zip library we're using and not related to the changes in this PR itself.

I believe this should now be fixed, with a new custom fork of stream-zip! Crawler was actually computing crc32 incorrectly, with latest fix, not requiring crc32 at all

@ikreymer
Copy link
Member

One other thing to consider: what do we want the default names of the WACZs to be? Should it just be the crawl id, or perhaps the workflow name to be more user-friendly? (Can also revisit later)

@tw4l
Copy link
Member Author

tw4l commented Jul 25, 2024

One other thing to consider: what do we want the default names of the WACZs to be? Should it just be the crawl id, or perhaps the workflow name to be more user-friendly? (Can also revisit later)

I think crawl id is okay for now? It's similar to the WACZ filenames in the Files list (with prefix and without the instance suffix). With the workflow name it might get a bit long?

@ikreymer ikreymer merged commit 27ee16d into main Jul 25, 2024
7 checks passed
@ikreymer ikreymer deleted the issue-1412-download-archived-item-btn branch July 25, 2024 17:29
ikreymer added a commit that referenced this pull request Jul 30, 2024
- remove unused sync functions
- use async methods from stream-zip
- note that stream-zip still does a sync->async conversion under the hood
- follow-up to #1933 for streaming download improvements
ikreymer added a commit that referenced this pull request Oct 3, 2024
- remove unused sync functions
- use async methods from stream-zip
- note that stream-zip still does a sync->async conversion under the hood
- follow-up to #1933 for streaming download improvements
ikreymer added a commit that referenced this pull request Oct 3, 2024
…s: (#1982)

- download via presigned URLs via requests instead of boto APIs, remove boto
- follow-up to #1933 for streaming download improvements
- fixes datapackage.json in multi-wacz to contain the same resources
objects with: `name`, `path`, `hash`, `bytes` to match single WACZ.
- Add additional metadata to multi-wacz datapackage.json, including `type`
(`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the
object), `title` / `description` if available (for
crawl/upload/collection), and `crawlId` for `qaRun`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Download Archived Item button
5 participants