Skip to content

Commit

Permalink
Refactor tidy_unzip()
Browse files Browse the repository at this point in the history
Fixes #1961, closes #1962.

Analyzing the above gave me a new/different understanding of common ZIP file organization structures. I think I've got a better heuristic for setting `exdir`.
  • Loading branch information
jennybc committed Jul 27, 2024
1 parent 2248463 commit 1f20512
Show file tree
Hide file tree
Showing 14 changed files with 494 additions and 165 deletions.
34 changes: 18 additions & 16 deletions R/course.R
Original file line number Diff line number Diff line change
Expand Up @@ -360,15 +360,15 @@ tidy_unzip <- function(zipfile, cleanup = FALSE) {
## DropBox ZIP files often include lots of hidden R, RStudio, and Git files
filenames <- filenames[keep_lgl(filenames)]

td <- top_directory(filenames)
loose_parts <- is.na(td)

if (loose_parts) {
parents <- path_before_slash(filenames)
unique_parents <- unique(parents)
if (length(unique_parents) == 1 && unique_parents != "") {
target <- path(base_path, unique_parents)
utils::unzip(zipfile, files = filenames, exdir = base_path)
} else {
# there is no parent; archive contains loose parts
target <- path_ext_remove(zipfile)
utils::unzip(zipfile, files = filenames, exdir = target)
} else {
target <- path(base_path, td)
utils::unzip(zipfile, files = filenames, exdir = base_path)
}
ui_bullets(c(
"v" = "Unpacking ZIP file into {.path {pth(target, base_path)}}
Expand Down Expand Up @@ -398,7 +398,7 @@ tidy_unzip <- function(zipfile, cleanup = FALSE) {
}
}

invisible(target)
invisible(unclass(target))
}

#' @rdname use_course_details
Expand Down Expand Up @@ -525,15 +525,17 @@ keep_lgl <- function(file,
!grepl(ignores, file, perl = TRUE)
}

top_directory <- function(filenames) {
in_top <- path_dir(filenames) == "."
unique_top <- unique(filenames[in_top])
is_directory <- grepl("/$", unique_top)
if (length(unique_top) > 1 || !is_directory) {
NA_character_
} else {
unique_top
path_before_slash <- function(filepath) {
f <- function(x) {
parts <- strsplit(x, "/", fixed = TRUE)[[1]]
if (length(parts) > 1 || grepl("/", x)) {
parts[1]
} else {
""
}
}
purrr::map_chr(filepath, f)

}

content_type <- function(h) {
Expand Down
192 changes: 151 additions & 41 deletions tests/testthat/ref/README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -14,66 +14,169 @@ library(fs)

## Different styles of ZIP file

Examples based on foo folder found here.
usethis has an unexported function `tidy_unzip()`, which is used under the hood in `use_course()` and `use_zip()`.
It is a wrapper around `utils::unzip()` that uses some heuristics to choose a good value for `exdir`, which is the "the directory to extract files to."
Why do we do this?
Because it's really easy to _not_ get the desired result when unpacking a ZIP archive.

Common aggravations:

* Instead of the unpacked files being corraled within a folder, they explode as "loose parts" into the current working directory. Too little nesting.
* The unpacked files are contained in a folder, but that folder itself is contained inside another folder. Too much nesting.

`tidy_unzip()` tries to get the nesting just right.

Why doesn't unzipping "just work"?
Because the people who make `.zip` files make lots of different choices when they actually create the archive and these details aren't baked in, i.e. a successful roundtrip isn't automatic.
It usually requires some peeking inside the archive and adjusting the unpack options.

This README documents specific `.zip` situations that we anticipate.

## Explicit parent folder

Consider the foo folder:

```{bash}
tree foo
```

### Not Loose Parts, a.k.a. GitHub style
Zip it up like so:

```{bash, eval = FALSE}
zip -r foo-explicit-parent.zip foo/
```

This is the type of ZIP file that we get from GitHub via links of the forms <https://github.com/r-lib/usethis/archive/main.zip> and <http://github.com/r-lib/usethis/zipball/main/>.

Inspect it in the shell:

```{bash}
unzip -Z1 foo-explicit-parent.zip
```

Or from R:

```{r}
foo_files <- unzip("foo-explicit-parent.zip", list = TRUE)
with(
foo_files,
data.frame(Name = Name, dirname = path_dir(Name), basename = path_file(Name))
)
```

Note that the folder `foo/` is explicitly included and all of the files are contained in it (in this case, just one file).

## Implicit parent folder

Consider the foo folder:

```{bash}
tree foo
```

This is the structure of ZIP files yielded by GitHub via links of the forms <https://github.com/r-lib/usethis/archive/master.zip> and <http://github.com/r-lib/usethis/zipball/master/>.
Zip it up like so:

```{bash, eval = FALSE}
zip -r foo-not-loose.zip foo/
zip -r foo-implicit-parent.zip foo/*
```

Notice that everything is packaged below one top-level directory.
Note the use of `foo/*`, as opposed to `foo` or `foo/`.
This type of ZIP file was reported in <https://github.com/r-lib/usethis/issues/1961>.
The example given there is <https://agdatacommons.nal.usda.gov/ndownloader/files/44576230>.

Inspect our small example in the shell:

```{bash}
unzip -Z1 foo-implicit-parent.zip
```

Or from R:

```{r}
foo_not_loose_files <- unzip("foo-not-loose.zip", list = TRUE)
foo_files <- unzip("foo-implicit-parent.zip", list = TRUE)
with(
foo_not_loose_files,
foo_files,
data.frame(Name = Name, dirname = path_dir(Name), basename = path_file(Name))
)
```

### Loose Parts, the Regular Way
Note that `foo/` is not included and its (original) existence is just implicit in the relative path to, e.g.,
`foo/file.txt`.

Here's a similar look at the example from issue #1961:

```bash
~/rrr/usethis/tests/testthat/ref % unzip -l ~/Downloads/Species\ v2.3.zip
Archive: /Users/jenny/Downloads/Species v2.3.zip
Length Date Time Name
--------- ---------- ----- ----
1241 04-27-2023 09:16 species_v2/label_encoder.txt
175187560 04-06-2023 15:13 species_v2/model_arch.pt
174953575 04-14-2023 12:28 species_v2/model_weights.pth
--------- -------
350142376 3 files
```

Note also that the implicit parent folder `species_v2` is not the base of the ZIP file name `Species v2.3.zip`.

This is the structure of many ZIP files I've seen, just in general.
## No parent

Consider the foo folder:

```{bash}
tree foo
```

Zip it up like so:

```{bash, eval = FALSE}
cd foo
zip ../foo-loose-regular.zip *
cd ..
(cd foo && zip -r ../foo-no-parent.zip .)
```

All the files are packaged in the ZIP archive as "loose parts", i.e. there is no explicit top-level directory.
Note that we are zipping everything in a folder from *inside* the folder.

Inspect our small example in the shell:

```{bash}
unzip -Z1 foo-no-parent.zip
```

Or from R:

```{r}
foo_loose_regular_files <- unzip("foo-loose-regular.zip", list = TRUE)
foo_files <- unzip("foo-no-parent.zip", list = TRUE)
with(
foo_loose_regular_files,
foo_files,
data.frame(Name = Name, dirname = path_dir(Name), basename = path_file(Name))
)
```

### Loose Parts, the DropBox Way
All the files are packaged in the ZIP archive as "loose parts", i.e. there is no explicit or implicit top-level directory.

## No parent, the DropBox Variation

This is the structure of ZIP files yielded by DropBox via links of this form <https://www.dropbox.com/sh/12345abcde/6789wxyz?dl=1>. I can't figure out how to even do this with zip locally, so I had to create an example on DropBox and download it. Jim Hester reports it is possible with `archive::archive_write_files()`.

<https://www.dropbox.com/sh/5qfvssimxf2ja58/AABz3zrpf-iPYgvQCgyjCVdKa?dl=1>

It's basically like the "loose parts" above, except it includes a spurious top-level directory `"/"`.
It's basically like the "no parent" example above, except it includes a spurious top-level directory `"/"`.

Inspect our small example in the shell:

```{bash}
unzip -Z1 foo-loose-dropbox.zip
```

Or from R:

```{r}
# curl::curl_download(
# "https://www.dropbox.com/sh/5qfvssimxf2ja58/AABz3zrpf-iPYgvQCgyjCVdKa?dl=1",
# destfile = "foo-loose-dropbox.zip"
# )
foo_loose_dropbox_files <- unzip("foo-loose-dropbox.zip", list = TRUE)
foo_files <- unzip("foo-loose-dropbox.zip", list = TRUE)
with(
foo_loose_dropbox_files,
foo_files,
data.frame(Name = Name, dirname = path_dir(Name), basename = path_file(Name))
)
```
Expand All @@ -87,43 +190,50 @@ mapname: conversion of failed
inflating: file.txt
```

So this is a pretty odd ZIP packing strategy. But we need to plan for it.
which indicates some tripping over the `/`.
Only `file.txt` is left behind.

## Subdirs only at top-level
This is a pretty odd ZIP packing strategy. But we need to plan for it.

Let's make sure we detect loose parts (or not) when the top-level has only directories, not files.
## Only subdirectories

Example based on the yo directory here:
For testing purposes, we also want an example where all the files are inside subdirectories.

Examples based on the yo directory here:

```{bash}
tree yo
```

```{bash, eval = FALSE}
zip -r yo-not-loose.zip yo/
```

```{r}
(yo_not_loose_files <- unzip("yo-not-loose.zip", list = TRUE))
top_directory(yo_not_loose_files$Name)
```
Zip it up, in all the usual ways:

```{bash, eval = FALSE}
cd yo
zip -r ../yo-loose-regular.zip *
cd ..
zip -r yo-explicit-parent.zip yo/
zip -r yo-implicit-parent.zip yo/*
(cd yo && zip -r ../yo-no-parent.zip .)
```

```{r}
(yo_loose_regular_files <- unzip("yo-loose-regular.zip", list = TRUE))
top_directory(yo_loose_regular_files$Name)
```
Again, I couldn't create the DropBox variant locally, so I did it by downloading from DropBox.

```{r}
```{r eval = FALSE}
# curl::curl_download(
# "https://www.dropbox.com/sh/afydxe6pkpz8v6m/AADHbMZAaW3IQ8zppH9mjNsga?dl=1",
# destfile = "yo-loose-dropbox.zip"
# )
(yo_loose_dropbox_files <- unzip("yo-loose-dropbox.zip", list = TRUE))
top_directory(yo_loose_dropbox_files$Name)
```

Inspect each in the shell:

```{bash}
unzip -Z1 yo-explicit-parent.zip
```

```{bash}
unzip -Z1 yo-implicit-parent.zip
```
```{bash}
unzip -Z1 yo-no-parent.zip
```
```{bash}
unzip -Z1 yo-loose-dropbox.zip
```
Loading

0 comments on commit 1f20512

Please sign in to comment.