Skip hidden directories when listing metadata files for indexing. #13

Merged 3 commits on Jun 8, 2024
65 changes: 40 additions & 25 deletions README.md
@@ -14,14 +14,18 @@ we do not impose schemas on the metadata format,
and we re-use the existing storage facilities on the HPC cluster.
SewerRat can be considered a much more relaxed version of the [Gobbler](https://github.com/ArtifactDB/gobbler) that federates the storage across users.

For convenience, we'll assume that the URL to the SewerRat API is present in an environment variable named `SEWER_RAT_URL`.
Readers should obtain an appropriate URL for their SewerRat deployment before trying the code examples below.
Alternatively, readers can spin up their own instance on `localhost` by running the binaries [here](https://github.com/ArtifactDB/SewerRat/releases)
or building the executable from source with the usual `go build .` command.

## Registering a directory

-### Step-by-step
+### Initialization

Any directory can be indexed as long as (i) the requesting user has write access to it and (ii) the account running the SewerRat service has read access to it.
To demonstrate, let's make a directory containing JSON-formatted metadata files.
Other files may be present, of course, but SewerRat only cares about the metadata.
Metadata files can be anywhere in this directory (including within subdirectories) and they can have any base name (here, `A.json` and `B.json`).

```shell
mkdir test
@@ -31,12 +35,6 @@ echo '{ "authors": { "first": "Aaron", "last": "Lun" } }' > test/sub/A.json
echo '{ "foo": "bar", "gunk": [ "stuff", "blah" ] }' > test/sub/B.json
```

-For convenience, we'll store the SewerRat API in an environment variable.
-
-```shell
-export SEWER_RAT_URL=<INSERT URL HERE> # get this from your SewerRat admin.
-```

To start the registration process, we make a POST request to the `/register/start` endpoint.
This should have a JSON-encoded request body that contains the `path`, the absolute path to our directory that we want to register.

@@ -53,9 +51,12 @@ curl -X POST -L ${SEWER_RAT_URL}/register/start \

On success, this returns a `PENDING` status with a verification code.
The caller is expected to verify that they have write access to the specified directory by creating a file with the same name as the verification code (i.e., `.sewer_XXX`) inside that directory.
-Once this is done, we call the `/register/finish` endpoint with a request body that contains the same directory `path`.
-The body may also contain `base`, an array of strings containing the names of the files to register within the directory -
-if this is not provided, only files named `metadata.json` will be registered.

### Verification

Once this is done, we call the `/register/finish` endpoint with a JSON-encoded request body that contains the same directory path in `path`.
The body may also contain `base`, an array of strings containing the names of the metadata files in the directory to be indexed.
If `base` is not provided, only files named `metadata.json` will be indexed.

```shell
curl -X POST -L ${SEWER_RAT_URL}/register/finish \
@@ -67,35 +68,34 @@ curl -X POST -L ${SEWER_RAT_URL}/register/finish \
## }
```
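
For callers who would rather not shell out to `curl`, the verification file and the `/register/finish` request body can be sketched in Go. This is a minimal illustration under stated assumptions, not SewerRat's own client code: `finishRequest`, `finishBody`, `writeVerificationFile` and `demoVerification` are hypothetical names, and `.sewer_example` stands in for whatever code `/register/start` actually returned.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// finishRequest mirrors the request body described above. `base` is left out
// of the JSON when empty, in which case only `metadata.json` files are indexed.
type finishRequest struct {
	Path string   `json:"path"`
	Base []string `json:"base,omitempty"`
}

// finishBody encodes the request body for /register/finish.
func finishBody(path string, base []string) (string, error) {
	encoded, err := json.Marshal(finishRequest{Path: path, Base: base})
	if err != nil {
		return "", fmt.Errorf("failed to encode request body; %w", err)
	}
	return string(encoded), nil
}

// writeVerificationFile creates an empty file named after the verification
// code (e.g., `.sewer_XXX`) inside the directory being registered.
func writeVerificationFile(dir, code string) error {
	return os.WriteFile(filepath.Join(dir, code), []byte{}, 0644)
}

// demoVerification exercises writeVerificationFile on a throwaway directory
// and reports whether the file exists afterwards.
func demoVerification() (bool, error) {
	dir, err := os.MkdirTemp("", "")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	// ".sewer_example" is a made-up code; use the one from /register/start.
	if err := writeVerificationFile(dir, ".sewer_example"); err != nil {
		return false, err
	}
	_, err = os.Stat(filepath.Join(dir, ".sewer_example"))
	return err == nil, nil
}

func main() {
	body, err := finishBody("/path/to/test", []string{"A.json", "B.json"})
	fmt.Println(body, err)

	ok, err := demoVerification()
	fmt.Println(ok, err)
}
```

The encoded string would be sent as the body of a POST request to `${SEWER_RAT_URL}/register/finish` with a JSON content type, in the same way as the `curl` call above.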

-On success, the files in the specified directory will be registered in the SQLite index.
-We can then [search on the contents of these files](#querying-the-index) or [fetch the contents of any file](#fetching-file-contents) in the registered directory.
-On error, the response usually has the `application-json` content type, where the body encodes a JSON object with an `ERROR` status and a `reason` string property explaining the reason for the failure.
-Note that some error types (e.g., 404, 405) may instead return a `text/plain` content type with the reason directly in the response body.
-In either case, the verification code file is no longer needed after a response is received and can be deleted from the directory to reduce clutter.
-
-We provide some small utility functions from [`scripts/functions.sh`](scripts/functions.sh) to perform the registration from the command line.
-The process should still be simple enough to implement equivalent functions in any language.

### Behind the scenes

-Once verified in `/register/finish`, SewerRat will walk recursively through the specified directory.
-It will identify all files with the specified `base` names (i.e., `A.json` and `B.json` in our example above), parsing them as JSON for indexing.
+Upon receiving a valid request, SewerRat will walk recursively through the directory specified in `path`.
+It will identify all metadata files with the specified `base` names (i.e., `A.json` and `B.json` in our example above), parsing them as JSON for indexing.
SewerRat will skip any problematic files that cannot be indexed due to, e.g., invalid JSON, insufficient permissions.
The causes of any failures are reported in the `comments` array in the HTTP response.

Subdirectories with names starting with `.` are skipped during the recursive walk, so any metadata files therein will be ignored.
This is generally a sensible choice as these directories usually do not contain any interesting (scientific) information.
If any such subdirectory is relevant, a user can force SewerRat to include it in the index by passing its path directly as `path`.
This is because leading dots are allowed in the components of the supplied `path`, just not in its subdirectories.
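
The skipping rule can be sketched with `filepath.WalkDir` and `fs.SkipDir`. This is a simplified illustration rather than the actual `listMetadata` function: `listVisible` and `demoHidden` are hypothetical names, only a single base name is matched, and the explicit `path != root` check is an assumption about how the supplied `path` itself is exempted from the rule.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// listVisible walks root and returns the paths (relative to root) of all
// files named base, skipping subdirectories whose names start with a dot.
// The root itself is never skipped, even if its own name starts with a dot.
func listVisible(root, base string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if path != root && strings.HasPrefix(d.Name(), ".") {
				return fs.SkipDir // do not descend into hidden subdirectories.
			}
			return nil
		}
		if d.Name() == base {
			rel, err := filepath.Rel(root, path)
			if err != nil {
				return err
			}
			found = append(found, filepath.ToSlash(rel))
		}
		return nil
	})
	sort.Strings(found)
	return found, err
}

// demoHidden builds a throwaway tree with A.json at the top level, in `sub`
// and in `.git`, then lists it from the top and from `.git` directly.
func demoHidden() ([]string, []string, error) {
	root, err := os.MkdirTemp("", "")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(root)

	hidden := filepath.Join(root, ".git")
	for _, dir := range []string{root, filepath.Join(root, "sub"), hidden} {
		if err := os.MkdirAll(dir, 0755); err != nil {
			return nil, nil, err
		}
		if err := os.WriteFile(filepath.Join(dir, "A.json"), []byte("{}"), 0644); err != nil {
			return nil, nil, err
		}
	}

	fromRoot, err := listVisible(root, "A.json")
	if err != nil {
		return nil, nil, err
	}
	fromHidden, err := listVisible(hidden, "A.json")
	return fromRoot, fromHidden, err
}

func main() {
	fromRoot, fromHidden, err := demoHidden()
	fmt.Println(fromRoot, fromHidden, err)
}
```

Listing from the top level finds `A.json` and `sub/A.json` but not the copy in `.git`, while passing `.git` itself as the root finds the file inside it.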

Symbolic links in the specified directory are treated differently depending on their target.
If the directory contains symbolic links to files, the contents of the target files can be indexed as long as the link has one of the `base` names.
All file information (e.g., modification time, owner) is taken from the link target, not the link itself;
SewerRat effectively treats the symbolic link as a proxy for the target file.
If the directory contains symbolic links to other directories, these will not be recursively traversed.
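
The symbolic link behaviour described above can be approximated as follows. This is a sketch under assumptions, not SewerRat's implementation: `listWithLinks` and `demoLinks` are hypothetical names, and unreadable or dangling links are silently skipped here instead of being reported in `comments`.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listWithLinks walks root and collects file information for every entry
// whose name is in names, resolving symbolic links to files.
func listWithLinks(root string, names map[string]bool) (map[string]fs.FileInfo, error) {
	found := map[string]fs.FileInfo{}
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		// WalkDir reports a symbolic link as a non-directory entry and never
		// descends into it, so linked directories are not traversed.
		if d.IsDir() || !names[d.Name()] {
			return nil
		}
		// os.Stat follows the link, so the size, modification time, etc.
		// describe the target rather than the link itself.
		info, err := os.Stat(path)
		if err != nil || info.IsDir() {
			return nil // dangling link or link to a directory: skip it.
		}
		found[path] = info
		return nil
	})
	return found, err
}

// demoLinks links a 5-byte file and its parent directory into a throwaway
// root, then returns the number of hits and the reported size of the link.
func demoLinks() (int, int64, error) {
	root, err := os.MkdirTemp("", "")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)
	host, err := os.MkdirTemp("", "")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(host)

	target := filepath.Join(host, "B.json")
	if err := os.WriteFile(target, []byte("hello"), 0644); err != nil {
		return 0, 0, err
	}
	if err := os.Symlink(target, filepath.Join(root, "foo.json")); err != nil {
		return 0, 0, err
	}
	if err := os.Symlink(host, filepath.Join(root, "symlinked")); err != nil {
		return 0, 0, err
	}

	found, err := listWithLinks(root, map[string]bool{"foo.json": true, "B.json": true})
	if err != nil {
		return 0, 0, err
	}
	info, ok := found[filepath.Join(root, "foo.json")]
	if !ok {
		return len(found), 0, fmt.Errorf("link to file was not listed")
	}
	return len(found), info.Size(), nil
}

func main() {
	count, size, err := demoLinks()
	fmt.Println(count, size, err)
}
```

Only `foo.json` is found, with the size of its target; `B.json` is reachable solely through the linked directory and is therefore never visited.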

On success, the metadata files in the specified directory will be incorporated into the SQLite index.
We can then [search on the contents of these files](#querying-the-index) or [fetch the contents of any file](#fetching-file-contents) in the registered directory.

### Automatic updates

SewerRat will periodically update the index by inspecting all of its registered directories for new content.
If we added or modified a file with one of the registered names (e.g., `A.json`), SewerRat will (re-)index that file.
Similarly, if we deleted a file, SewerRat will remove it from the index.
This ensures that the information in the index reflects the directory contents on the filesystem.
Users can also manually update a directory by repeating the process above to re-index the directory's contents.

-As an aside: updates and symbolic links can occasionally interact in strange ways.
+Updates and symbolic links can occasionally interact in strange ways.
Specifically, updates to the indexed information for symbolic links are based on the modification time of the link target.
One can imagine a pathological case where a symbolic link is changed to a different target with the same modification time as the previous target, which will not be captured by SewerRat.
Currently, this can only be resolved by deleting all affected symbolic links, re-registering the directory, and then restoring the links and re-registering again.
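
To make the pathological case concrete, here is a sketch of an update pass that relies only on modification times. It is an assumption about the general approach, not SewerRat's actual update code: `planUpdate` is a hypothetical function, and times are plain Unix seconds for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// planUpdate compares the modification times stored in the index against
// those currently on disk. It returns the paths to (re-)index and the paths
// to remove from the index, both sorted.
func planUpdate(indexed, current map[string]int64) (reindex, remove []string) {
	for path, mtime := range current {
		old, ok := indexed[path]
		if !ok || old != mtime {
			reindex = append(reindex, path) // new or modified file.
		}
	}
	for path := range indexed {
		if _, ok := current[path]; !ok {
			remove = append(remove, path) // file no longer exists.
		}
	}
	sort.Strings(reindex)
	sort.Strings(remove)
	return reindex, remove
}

func main() {
	indexed := map[string]int64{"A.json": 100, "B.json": 100, "link.json": 100, "old.json": 10}
	current := map[string]int64{"A.json": 100, "B.json": 200, "C.json": 50, "link.json": 100}
	reindex, remove := planUpdate(indexed, current)
	fmt.Println(reindex, remove)
}
```

In this example `link.json` is left alone because its reported modification time is unchanged. If it were a symbolic link that had been retargeted to a different file with the same modification time, a comparison like this one would not notice the change.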
@@ -106,6 +106,21 @@ To remove files from the index, we use the same procedure as above but replacing
The only potential difference is when the caller requests deregistration of a directory that does not exist.
In this case, `/deregister/start` may return a `SUCCESS` status instead of `PENDING`, after which `/deregister/finish` does not need to be called.

### Other comments

If an error is encountered in the `/register/*` or `/deregister/*` endpoints, the response usually has the `application/json` content type.
The body encodes a JSON object with an `ERROR` status and a `reason` string property explaining the reason for the failure.
That said, some error types (e.g., 404, 405) may instead return a `text/plain` content type with the reason directly in the response body.
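
A client can handle both kinds of error response with a small helper along these lines. This is a sketch, assuming that the JSON content type is reported as `application/json` (possibly with parameters such as a charset); `errorReason` is a hypothetical name and the example bodies in `main` are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"mime"
	"strings"
)

// errorReason extracts the failure reason from an error response, given its
// Content-Type header and raw body. JSON bodies are expected to carry a
// `reason` property; anything else is treated as plain text.
func errorReason(contentType string, body []byte) string {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err == nil && mediaType == "application/json" {
		var payload struct {
			Reason string `json:"reason"`
		}
		if json.Unmarshal(body, &payload) == nil && payload.Reason != "" {
			return payload.Reason
		}
	}
	// Fall back to the raw body, e.g., for 404 or 405 responses.
	return strings.TrimSpace(string(body))
}

func main() {
	fmt.Println(errorReason("application/json", []byte(`{"status":"ERROR","reason":"path is not a directory"}`)))
	fmt.Println(errorReason("text/plain; charset=utf-8", []byte("404 page not found\n")))
}
```

Falling back to the raw body also covers the case where a JSON response cannot be parsed, so the caller always gets some message to display.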

Any failure to parse specific JSON files is not considered an error and will only show up in the `comments` of a successful response from `/register/finish`.
This provides some robustness to partial writes or invalid files inside directories with complex internal structure.

Regardless of whether the registration is successful or not, the verification code file is no longer needed after a response is received.
This can be deleted from the directory to reduce clutter.

We provide some small utility functions from [`scripts/functions.sh`](scripts/functions.sh) to perform the registration from the command line.
The process should still be simple enough to implement equivalent functions in any language.

## Querying the index

### Making the request
8 changes: 7 additions & 1 deletion list.go
@@ -5,6 +5,7 @@ import (
"io/fs"
"os"
"path/filepath"
"strings"
)

func listFiles(dir string, recursive bool) ([]string, error) {
@@ -59,7 +60,12 @@ func listMetadata(dir string, base_names []string) (map[string]fs.FileInfo, []st
}

 	if d.IsDir() {
-		return nil
+		base := filepath.Base(path)
+		if strings.HasPrefix(base, ".") {
+			return fs.SkipDir
+		} else {
+			return nil
+		}
 	}

if _, ok := curnames[filepath.Base(path)]; !ok {
70 changes: 68 additions & 2 deletions list_test.go
@@ -181,20 +181,40 @@ func TestListMetadata(t *testing.T) {
t.Fatal("unexpected file")
}
})
}

func TestListMetadataSymlink(t *testing.T) {
dir, err := os.MkdirTemp("", "")
if (err != nil) {
t.Fatalf("failed to create a temporary directory; %v", err)
}

path := filepath.Join(dir, "A.json")
err = os.WriteFile(path, []byte(""), 0644)
if err != nil {
t.Fatalf("failed to create a mock file; %v", err)
}

hostdir, err := os.MkdirTemp("", "")
hostpath := filepath.Join(hostdir, "B.json")
err = os.WriteFile(hostpath, []byte(""), 0644)
if err != nil {
t.Fatalf("failed to create a mock file; %v", err)
}

// Throwing in some symbolic links.
err = os.Symlink(path, filepath.Join(dir, "foo.json"))
if err != nil {
t.Fatal(err)
}

-err = os.Symlink(subdir, filepath.Join(dir, "bar.json"))
+err = os.Symlink(hostdir, filepath.Join(dir, "symlinked"))
if err != nil {
t.Fatal(err)
}

t.Run("symlink", func(t *testing.T) {
-found, fails, err := listMetadata(dir, []string{ "foo.json", "bar.json" })
+found, fails, err := listMetadata(dir, []string{ "foo.json", "B.json" })
if err != nil {
t.Fatal(err)
}
@@ -203,6 +223,7 @@ func TestListMetadata(t *testing.T) {
t.Fatal("unexpected failures")
}

// B.json in the linked directory should be ignored as we don't recurse into them.
if len(found) != 1 {
t.Fatal("expected exactly one file")
}
@@ -216,3 +237,48 @@ func TestListMetadata(t *testing.T) {
}
})
}

func TestListMetadataDot(t *testing.T) {
dir, err := os.MkdirTemp("", "")
if (err != nil) {
t.Fatalf("failed to create a temporary directory; %v", err)
}

path := filepath.Join(dir, "A.json")
err = os.WriteFile(path, []byte(""), 0644)
if err != nil {
t.Fatalf("failed to create a mock file; %v", err)
}

// Throwing in a hidden directory.
subdir := filepath.Join(dir, ".git")
err = os.Mkdir(subdir, 0755)
if err != nil {
t.Fatalf("failed to create a temporary subdirectory; %v", err)
}

subpath1 := filepath.Join(subdir, "A.json")
err = os.WriteFile(subpath1, []byte(""), 0644)
if err != nil {
t.Fatalf("failed to create a mock file; %v", err)
}

t.Run("dot", func(t *testing.T) {
found, fails, err := listMetadata(dir, []string{ "A.json" })
if err != nil {
t.Fatal(err)
}
if len(fails) > 0 {
t.Fatal("unexpected failures")
}

// A.json in the subdirectory should be ignored as we don't recurse into dots.
if len(found) != 1 {
t.Fatal("expected exactly one file")
}
_, ok := found[filepath.Join(dir, "A.json")]
if !ok {
t.Fatal("missing file")
}
})
}