Skip to content

Commit

Permalink
Fleshed out the README.
Browse files Browse the repository at this point in the history
  • Loading branch information
LTLA committed Oct 21, 2024
1 parent 4188bd0 commit 94ee526
Showing 1 changed file with 23 additions and 6 deletions.
29 changes: 23 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,12 @@ It will identify all metadata files with the specified `base` names (i.e., `A.js
SewerRat will skip any problematic files that cannot be indexed due to, e.g., invalid JSON, insufficient permissions.
The causes of any failures are reported in the `comments` array in the HTTP response.

On success, the metadata files in the specified directory will be incorporated into the SQLite index.
We can then [search on the contents of these files](#querying-the-index) or [fetch the contents of any file](#fetching-file-contents) in the registered directory.

### Indexing in detail

As mentioned above, SewerRat will recurse through the specified directory to find metadata files with the listed `base` names.
Subdirectories with names starting with `.` are skipped during the recursive walk, so any metadata files therein will be ignored.
This is generally a sensible choice as these directories usually do not contain any interesting (scientific) information.
If any such subdirectory is relevant, a user can force SewerRat to include it in the index by passing its path directly as `path`.
Expand All @@ -86,8 +92,21 @@ All file information (e.g., modification time, owner) is taken from the link tar
SewerRat effectively treats the symbolic link as a proxy for the target file.
If the directory contains symbolic links to other directories, these will not be recursively traversed.

On success, the metadata files in the specified directory will be incorporated into the SQLite index.
We can then [search on the contents of these files](#querying-the-index) or [fetch the contents of any file](#fetching-file-contents) in the registered directory.
Each identified metadata document is parsed as JSON and converted into tokens.
For strings, we use an adaptation of the [FTS5 Unicode61 tokenizer](https://www.sqlite.org/fts5.html#unicode61_tokenizer) to break each string into tokens,
i.e., strings are split into tokens at any character that is not a Unicode letter/number or a dash.
For numbers and booleans, the string representation of the value is tokenized.
All tokens are stored in the index, associated with the JSON object properties in which it was found,
e.g., the value `"Chris"` is associated with the properties `"b.c"` in the document below.

```json
{
"a": "Aaron",
"b": {
"c": "Chris"
}
}
```

### Automatic updates

Expand Down Expand Up @@ -176,10 +195,8 @@ The nature of the search depends on the value of `type`:
- For `"text"`, SewerRat searches on the text (i.e., any string property) in the metadata file.
The search clause should contain the following additional properties:
- `text`, the search string.
We use an adaptation of the [FTS5 Unicode61 tokenizer](https://www.sqlite.org/fts5.html#unicode61_tokenizer) to process all strings in the metadata files,
i.e., strings are split into tokens at any character that is not a Unicode letter/number or a dash.
The same process is applied to the string in `text`.
All tokens in `text` must match to a token in the metadata file in order for that file to be considered a match.
The tokenization process described [above](#indexing-in-detail) is applied to this string to create tokens.
All tokens in `text` must be present in the metadata file in order for that file to be considered a match.
- (optional) `field`, the name of the metadata property to be matched.
Matches to tokens are only considered within the named property.
Properties of nested objects can be specified via `.`-delimited names, e.g., `authors.first`.
Expand Down

0 comments on commit 94ee526

Please sign in to comment.