Start archive downloads immediately #44

matrss · 2023-10-30T16:53:43Z

This is a follow up to #43.

Right now, Gitea waits until an archive has been fully generated before the download of that archive starts. For small repositories this is fine, but for larger repositories it can mean a rather long wait before anything happens at all.

This PR changes the download logic to immediately "stream" the archive to the requesting user while it is still being generated. This is done by creating an additional temporary copy of the archive while it is being created. This temporary copy is read from and served to the user until the temporary file is deleted, i.e. until archive generation is finished. If additional requests for the same archive arrive while it is still being generated, the same temporary file is reused.

Once archive generation has finished, any further requests for the archive are handled exactly as they were before this change.

This approach has a few downsides:

- Since the archive size is not known in advance, the downloading program (e.g. the browser) cannot display a meaningful progress bar.
- It does not reuse the temporary file that the LocalStorage backend already creates for the archive, so the required disk space doubles while the archival process is running.

Still WIP.

[git-annex](https://git-annex.branchable.com/) is a more complicated cousin to git-lfs, storing large files as optional-download side content. Unlike lfs, it allows mixing and matching storage remotes, so the content remote(s) don't need to be on the same server as the git remote, making it feasible to scatter a collection across cloud storage, old hard drives, or anywhere else storage can be scavenged. Since this can get complicated, fast, it has a content-tracking database (`git annex whereis`) to help find everything later.

The use-case we imagine for including it in Gitea is just the simple case, where we're primarily emulating git-lfs: each repo has its large content at the same URL.

Our motivation is so we can self-host https://www.datalad.org/ datasets, which currently are only hostable by fragilely scrounging together cloud storage -- and having to manage all the credentials associated with all the pieces -- or at https://openneuro.org which is fragile in its own ways.

Supporting git-annex also allows multiple Gitea instances to be annex remotes for each other, mirroring the content or otherwise collaborating to split up the hosting costs.

Enabling
--------

TODO

HTTP
----

TODO

Permission Checking
-------------------

This tweaks the API in routers/private/serv.go to expose the calling user's computed permission, instead of just returning HTTP 403.

This doesn't fit in super well. It's the opposite of how the git-lfs support is done, where there's a complete list of possible subcommands and their matching permission levels, and then the API compares the requested level with the actual level and returns HTTP 403 if the check fails.

But it's necessary. The main git-annex verbs, 'git-annex-shell configlist' and 'git-annex-shell p2pstdio', are both either read-only or read-write operations, depending on the state on disk on either end of the connection and what the user asked it to ask for, with no way to know before git-annex examines the situation. So tell it the level via GIT_ANNEX_READONLY and trust it to handle itself.

In the older Gogs version, the permission was directly read in cmd/serv.go:

```
mode, err = db.UserAccessMode(user.ID, repo)
```

- https://github.com/G-Node/gogs/blob/966e925cf320beff768b192276774d9265706df5/internal/cmd/serv.go#L334

but in Gitea permission enforcement has been centralized in the API layer. (perhaps so the cmd layer can avoid making direct DB connections?)

Deletion
--------

git-annex has this "lockdown" feature where it tries very hard to prevent you from deleting its data, to the point that even an `rm -rf` won't do it: each file in annex/objects/ is nested inside a folder with read-only permissions. The recommended workaround is to run `chmod -R +w` when you're sure you actually want to delete a repo. See https://git-annex.branchable.com/internals/lockdown

So we edit util.RemoveAll() to do just that, so now it's `chmod -R +w && rm -rf` instead of just `rm -rf`.

Fixes neuropoly#11

Tests:

* `git annex init`
* `git annex copy --from origin`
* `git annex copy --to origin`

over:

* ssh

for:

* the owner
* a collaborator
* a read-only collaborator
* a stranger

in a

* public repo
* private repo

And then confirms:

* Deletion of the remote repo (to ensure lockdown isn't messing with us: https://git-annex.branchable.com/internals/lockdown/#comment-0cc5225dc5abe8eddeb843bfd2fdc382)

------

To support all this:

* Add util.FileCmp()
* Patch withKeyFile() so it can be nested in other copies of itself

-------

Many thanks to Mathieu for giving style tips and catching several bugs, including a subtle one in util.FileCmp() which neutered it.

Co-authored-by: Mathieu Guay-Paquet <[email protected]>

Fixes neuropoly#8 Co-authored-by: Mathieu Guay-Paquet <[email protected]>

This makes HTTP symmetric with SSH clone URLs.

This gives us the fancy feature of _anonymous_ downloads, so people can access datasets without having to set up an account or manage ssh keys.

Previously, to access "open access" data shared this way, users would need to:

1. Create an account on gitea.example.com
2. Create ssh keys
3. Upload ssh keys (and make sure to find and upload the correct file)
4. `git clone [email protected]:user/dataset.git`
5. `cd dataset`
6. `git annex get`

This cuts that down to just the last three steps:

1. `git clone https://gitea.example.com/user/dataset.git`
2. `cd dataset`
3. `git annex get`

This is significantly simpler for downstream users, especially for those unfamiliar with the command line.

Unfortunately there's no uploading. While git-annex supports uploading over HTTP to S3 and some other special remotes, it seems to fail on a _plain_ HTTP remote. See neuropoly#7 and https://git-annex.branchable.com/forum/HTTP_uploads/#comment-ce28adc128fdefe4c4c49628174d9b92.

This is not a major loss since no one wants uploading to be anonymous anyway.

To support private repos, I had to hunt down and patch a secret extra security corner that Gitea only applies to HTTP for some reason (services/auth/basic.go).

This was guided by https://git-annex.branchable.com/tips/setup_a_public_repository_on_a_web_site/

Fixes neuropoly#3

Co-authored-by: Mathieu Guay-Paquet <[email protected]>

This moves the `annexObjectPath()` helper out of the tests and into a dedicated sub-package as `annex.ContentLocation()`, and expands it with `.Pointer()` (which validates using `git annex examinekey`), `.IsAnnexed()` and `.Content()` to make it a more useful module.

The tests retain their own wrapper version of `ContentLocation()` because I tried to stay close to the API that modules/lfs uses, which works in terms of abstract `git.Blob` and `git.TreeEntry` objects, not in terms of `repoPath string`s, which are more convenient for the tests.
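
For orientation, here is a rough sketch of how an annex key might be recovered from a symlink target or from the text of a pointer file. It assumes the layout documented by git-annex, where a pointer file starts with `/annex/objects/<key>` and a symlink target ends in `.git/annex/objects/.../<key>`. The function name `annexKey` is hypothetical, and unlike the `.Pointer()` described above it does not validate the key with `git annex examinekey`.

```go
package main

import (
	"path"
	"strings"
)

// annexKey extracts the git-annex key from the contents of a pointer file
// or from a symlink target. It returns false if the text does not look
// like either. Hypothetical helper; no validation of the key is done.
func annexKey(s string) (string, bool) {
	// Only the first line matters; pointer files end with a newline.
	line, _, _ := strings.Cut(s, "\n")
	line = strings.TrimSpace(line)

	isPointer := strings.HasPrefix(line, "/annex/objects/")
	isSymlink := strings.Contains(line, ".git/annex/objects/")
	if !isPointer && !isSymlink {
		return "", false
	}

	// In both forms the key is the final path component.
	key := path.Base(line)
	if key == "objects" || key == "." || key == "/" {
		return "", false
	}
	return key, true
}
```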

Previously, Gitea's LFS support allowed direct-downloads of LFS content, via http://$HOSTNAME:$PORT/$USER/$REPO/media/branch/$BRANCH/$FILE Expand that grace to git-annex too. Now /media should provide the relevant *content* from the .git/annex/objects/ folder. This adds tests too. And expands the tests to try symlink-based annexing, since /media implicitly supports both that and pointer-file-based annexing.

This updates the repo index/file view endpoints so annex files match the way LFS files are rendered, making annexed files accessible via the web instead of being black boxes only accessible by git clone.

This mostly just duplicates the existing LFS logic. It doesn't try to combine itself with the existing logic, to make merging with upstream easier. If upstream ever decides to accept this, I would like to try to merge the redundant logic.

The one bit that doesn't directly copy LFS is my choice to hide annex-symlinks. LFS files are always _pointer files_ and therefore always render with the "file" icon and no special label, but annex files come in two flavours: symlinks or pointer files. I've conflated both kinds to try to give a consistent experience.

The tests in here ensure the correct download link (/media, from the last PR) renders in both the toolbar and, if a binary file (like most annexed files will be), in the main pane. This also adds quite a bit of code to make sure text files that happen to be annexed are dug out and rendered inline like LFS files are.

Upstream can handle the full test suite; to avoid tedious waiting, we only test the code added in this fork.

This extends the archive creation logic to add annexed files to the created archives. The basic flow is this:

1. Create an archive using `git archive`
2. Read in that archive and write out a new one, replacing all annexed files with their annexed content; leaving the git-only files as-is

The file permissions with which the annexed files are put into the archive are decided based on what `git archive` does for other files as well:

- For tar.gz archives, executable files get permissions 0775 and regular files get 0664.
- For zip archives, executable files get permissions 0755 and regular files are archived with "FAT permissions" rw, instead of unix permissions.

If for a given archive request an annexed file is not present on the Gitea instance, then the content as tracked by git (i.e. a symlink or pointer file) is silently put into the resulting archive instead.

Co-authored-by: Nick Guenther <[email protected]>

Tests include:

- Compare the list of files in the resulting archive with the list of files as tracked in the archived git tree.
- Compare the content of each file with what it should be (git blob content or the annexed file, respectively).
- Check that the file mode matches the expected file mode for all archived files.
- Check that the resulting archive has the archived commitID set as a comment (as `git archive` does as well).

The tests are done for both the "web" endpoints at `/<user>/<repo>/archive/<git-ref>.{tar.gz,zip}` and the "api-v1" endpoints at `/api/v1/<user>/<repo>/archive/<git-ref>.{tar.gz,zip}`.

This commit can be dropped as soon as go-gitea#27563 is accepted.

kousu and others added 8 commits October 30, 2023 13:11

git-annex: add configuration setting [annex].ENABLED (neuropoly#18)

8312d54


git-annex: Only run git-annex tests.

2fccb6b


github-actions bot added modifies/frontend modifies/api labels Oct 30, 2023

matrss and others added 4 commits October 31, 2023 13:37

git-annex: do not block database in doArchive

3e18b6c


git-annex: start archive downloads immediately

5f4d0db

matrss force-pushed the enable-immediate-archive-downloads branch from b2a5c97 to 5f4d0db Compare October 31, 2023 12:55

gitea-sync bot force-pushed the git-annex branch 3 times, most recently from a113681 to ede984a Compare November 2, 2023 13:11

kousu force-pushed the git-annex branch from ede984a to af35424 Compare November 4, 2023 16:01

gitea-sync bot force-pushed the git-annex branch 7 times, most recently from 81f56b1 to d75a62f Compare November 11, 2023 13:09

gitea-sync bot force-pushed the git-annex branch 3 times, most recently from d952a4b to 1790aeb Compare November 14, 2023 13:12

kousu force-pushed the git-annex branch from 1790aeb to e5f2899 Compare November 20, 2023 12:41

kousu force-pushed the git-annex branch from e5f2899 to 956db63 Compare November 29, 2023 03:54

kousu requested a review from mguaypaq March 6, 2024 19:28
