feat(docs): Prepro docs for administrators (#2440)
* docs: add a list of existing preprocessing pipelines

* Update docs about nextclade prepro pipeline.

* Update the specification.md file.

---------

Co-authored-by: Chaoran Chen <[email protected]>
anna-parker and chaoran-chen authored Aug 16, 2024
1 parent 498b653 commit dcd5042
Showing 5 changed files with 83 additions and 21 deletions.
4 changes: 4 additions & 0 deletions docs/astro.config.mjs
@@ -45,6 +45,10 @@ export default defineConfig({
{ label: 'Getting started', link: '/for-administrators/getting-started/' },
{ label: 'Setup with Kubernetes', link: '/for-administrators/setup-with-kubernetes/' },
{ label: 'Schema designs', link: '/for-administrators/schema-designs/' },
{
    label: 'Existing preprocessing pipelines',
    link: '/for-administrators/existing-preprocessing-pipelines/',
},
{ label: 'User administration', link: '/for-administrators/user-administration/' },
],
},
@@ -0,0 +1,38 @@
---
title: Existing preprocessing pipelines
---

[Preprocessing pipelines](../../introduction/glossary/#preprocessing-pipeline) hold most of the organism- and domain-specific logic within a Loculus instance. They take the submitted input data and, at a minimum, validate them to ensure that the submitted data follow the defined format. Additionally, they can clean the data and enrich them by adding annotations and sequence alignments.

The Loculus team maintains a customizable processing pipeline that uses [Nextclade](../../introduction/glossary/#nextclade) to align sequences to a reference and generate statistics; it is discussed in more detail below.

Using an existing pipeline is the fastest way to get started with Loculus, but it is also easy to develop new pipelines that use custom tooling and logic. The [preprocessing pipeline specification](https://github.com/loculus-project/loculus/blob/main/preprocessing/specification.md) describes the interface between a pipeline and the [Loculus backend server](../introduction/glossary.md#backend). You can take a look at the code of the ["dummy pipeline"](https://github.com/loculus-project/loculus/tree/main/preprocessing/dummy) and the [Nextclade-based pipeline](https://github.com/loculus-project/loculus/tree/main/preprocessing/nextclade); both examples are written in Python, but preprocessing pipelines can be implemented in any programming language.

If you have developed a pipeline and would like it to be added to this list, please contact us!
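To make the interface concrete, here is a minimal sketch in the spirit of the dummy pipeline: it maps each unprocessed NDJSON entry to a processed one. The shape of the output (field names such as `errors` and `warnings`) is a simplified assumption for illustration; the specification linked above is authoritative, and the real exchange with the backend happens over HTTP, which is not shown here.

```python
import json


def process_entry(entry: dict) -> dict:
    """Minimal 'dummy pipeline' step: pass metadata through unchanged.

    A real pipeline would parse, validate, align, and annotate here.
    The output shape is a simplified assumption, not the exact backend format.
    """
    return {
        "accession": entry["accession"],
        "version": entry["version"],
        "data": {"metadata": entry["data"]["metadata"]},
        "errors": [],
        "warnings": [],
    }


def process_batch(ndjson_text: str) -> str:
    """Turn a batch of unprocessed NDJSON lines into processed NDJSON lines."""
    out = []
    for line in ndjson_text.splitlines():
        if line.strip():
            out.append(json.dumps(process_entry(json.loads(line))))
    return "\n".join(out)
```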

## Nextclade-based pipeline

_Maintained by the Loculus team_

This pipeline supports all [schemas](../introduction/glossary/#schema) where each segment has one unique reference that it should be aligned to, e.g. the [one organism, multi-segment schema](./schema-designs.md#one-organism-for-everything) and the [multi-organism schema](./schema-designs.md#multiple-clearly-separated-organisms-each-with-one-reference).

This pipeline uses [nextclade run](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli/reference.html#nextclade-run) for alignment, mutation calling, and quality checks. It relies on an existing [Nextclade dataset](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html) with the same reference genome as the one used by Loculus; `nextclade` will also perform clade assignment and phylogenetic placement if the dataset includes this information. To use this pipeline for a new pathogen, check whether an existing Nextclade dataset for that pathogen is available [here](https://github.com/nextstrain/nextclade_data/tree/master/data), or follow the steps in the [dataset creation guide](https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-creation-guide.md) to create a new one. For example, for mpox we use [nextstrain/mpox/all-clades](https://github.com/nextstrain/nextclade_data/tree/master/data/nextstrain/mpox/all-clades), defined in the `values.yaml` as:

```yaml
preprocessing:
  - configFile:
      nextclade_dataset_name: nextstrain/mpox/all-clades
```

Additionally, the pipeline performs checks on the metadata fields. The checks are defined by custom preprocessing functions in the `values.yaml` file. These checks can be applied to and customized for other metadata fields; see [Preprocessing Checks](https://github.com/loculus-project/loculus/blob/main/preprocessing/nextclade/README.md#preprocessing-checks) for more info.

In the default configuration, the pipeline performs:
* **type checks**: Checks that the type of each metadata field corresponds to the expected `type` value in the config (default is string).
* **required value checks**: Checks that fields marked as required in the config (`required: true`) are not null.
* **INSDC-accepted country checks**: Uses the `process_options` preprocessing function to check that the `geo_loc_country` field is set to an [INSDC-accepted country](https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011).

The pipeline also formats metadata fields:
* **process date**: Takes a date string and returns a date field in the "%Y-%m-%d" format.
* **parse timestamp**: Takes a timestamp, e.g. `2022-11-01T00:00:00Z`, and returns that field in the "%Y-%m-%d" format.
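The two date formatters can be sketched as follows. This is illustrative only: the set of input formats the real pipeline accepts may differ.

```python
from datetime import datetime


def process_date(value: str) -> str:
    """Normalize a date string to %Y-%m-%d.

    Sketch only: the real pipeline's accepted input formats may differ.
    """
    for fmt in ("%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {value}")


def parse_timestamp(value: str) -> str:
    """Turn a timestamp like 2022-11-01T00:00:00Z into %Y-%m-%d."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z").strftime("%Y-%m-%d")
```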

The code is available on [GitHub](https://github.com/loculus-project/loculus/tree/main/preprocessing/nextclade) under the [AGPL-3.0 license](https://github.com/loculus-project/loculus/blob/main/LICENSE).
4 changes: 4 additions & 0 deletions docs/src/content/docs/for-administrators/getting-started.md
@@ -13,6 +13,10 @@ Although Loculus is in principle stable and can be used in production, we plan t

The first step to setting up a new Loculus instance is to define its schema. You can read about [a few example schemas](../schema-designs).

## Choose a preprocessing pipeline

You can choose an [existing preprocessing pipeline](../existing-preprocessing-pipelines) or build a new one following the [preprocessing pipeline specification](https://github.com/loculus-project/loculus/blob/main/preprocessing/specification.md).

## Installation

As presented in the [system overview](../../introduction/system-overview) (which we recommend reading), Loculus consists of numerous sub-services which need to be configured and wired together. All services are available as Docker images. For local development and the preview instances, we use Kubernetes and Helm for deployment but it is also possible to deploy Loculus without Kubernetes.
13 changes: 13 additions & 0 deletions preprocessing/nextclade/README.md
@@ -10,6 +10,7 @@ This preprocessing pipeline is still a work in progress. It requests unaligned n
1. Run Nextclade on sequences
1. Parse Nextclade results
1. Delete temporary directory
1. Perform other metadata checks and formatting (see [Preprocessing Checks](#preprocessing-checks))
1. Submit results to server

## Setup
@@ -94,6 +95,7 @@ However, the `preprocessing` field can be customized to take an arbitrary number
1. `process_date`: Take a date string and return a date field in the "%Y-%m-%d" format
2. `parse_timestamp`: Take a timestamp e.g. 2022-11-01T00:00:00Z and return that field in the "%Y-%m-%d" format
3. `concatenate`: Take multiple metadata fields (including the accessionVersion) and concatenate them in the order specified by the `arg.order` parameter. Fields will first be processed based on their `arg.type` (the order of the types should correspond to the order of fields specified by the order argument).
4. `process_options`: Only accept input that is in `args.options`; this check is case-insensitive. If the input value is not in the options, return null.
@@ -114,4 +116,15 @@
Using these functions in your `values.yaml` will look like:
```yaml
    args:
      order: [geo_loc_country, accession_version, sample_collection_date]
      type: [string, string, date]
- name: country
  preprocessing:
    function: process_options
    inputs:
      input: geo_loc_country
    args:
      options:
        - Argentina
        - Bolivia
        - Colombia
        - ...
```
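As a rough sketch of the `process_options` behaviour described above (case-insensitive matching, null when the value is not among the options). Returning the canonically cased option on a match is a design choice of this sketch; the real implementation may differ.

```python
def process_options(value, options):
    """Case-insensitive membership check against the configured options.

    Returns the canonically cased option on a match, otherwise None.
    Sketch of the documented behaviour; the real implementation also
    handles warnings and errors.
    """
    if value is None:
        return None
    lookup = {option.lower(): option for option in options}
    return lookup.get(value.strip().lower())
```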
45 changes: 24 additions & 21 deletions preprocessing/specification.md
@@ -8,31 +8,18 @@ The preprocessing pipeline prepares the data uploaded by the submitters for release.

### Tasks

In the following, we list a series of tasks that the preprocessing pipeline would usually perform. The developers of a preprocessing pipeline have much flexibility in deciding how and to what extent the pipeline does a task: the only rule is that the output of the pipeline has to conform to the format expected by the Loculus backend, see [Technical specification](#technical-specification). For example, a preprocessing pipeline can be very "generous and intelligent" and accept a wide range of values for a date (e.g., it may map "Christmas 2020" to "2020-12-25") or be very restrictive and throw an error for any value that does not follow the ISO-8601 format.

**Parsing:** The preprocessing pipeline receives the input data as strings and transforms them into the right format. For example, assuming there is a field `age` of type `integer`, given an input `{"age": "2"}` the preprocessing pipeline should transform it to `{"age": 2}` (simple type conversion). In another case, assuming there is a field `sequencingDate` of type `date`, the preprocessing pipeline might transform `{"sequencingDate": "30 August 2023"}` to the expected format of `{"sequencingDate": "2023-08-30"}`.

**Validation:** The preprocessing pipeline checks the input data and emits errors or warnings. As mentioned above, the only constraint is that the output of the preprocessing pipeline conforms to the right (technical) format. Otherwise, a pipeline may be generous (e.g., allow every value in the "country" field) or be more restrictive (e.g., only allow a fixed set of values in the "country" field).

**Alignment and translations:** The submitter only provides unaligned nucleotide sequences. If you want to allow searching by nucleotide and amino acid mutations, the preprocessing pipeline needs to perform the alignment and compute the translations to amino acid sequences.

**Annotation:** The preprocessing pipeline can add annotations such as clade/lineage classifications.

**Quality control (QC):** The preprocessing pipeline should check the quality of the sequences (and the metadata).
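The parsing and validation tasks can be sketched together as a function that converts a raw string value and collects an error instead of raising. The shape of the error object below (a `source` plus a human-readable `message`) is an illustrative assumption, not the backend's exact format.

```python
def parse_int_field(metadata: dict, field: str, errors: list):
    """Parse one string-valued metadata field into an int.

    On failure, append an error object instead of raising. The error
    shape used here is an illustrative assumption.
    """
    raw = metadata.get(field)
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        errors.append({
            "source": field,
            "message": f"'{raw}' is not a valid integer for field '{field}'",
        })
        return None
```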

### Glossary

- **Loculus instance:** one Loculus installation consisting of potentially multiple organism instances
- **Organism instance:** one organism-specific instance with a fixed set of possible metadata and a fixed reference genome
- **Organism instance schema:** the definition of the accepted metadata fields and information about the reference genome (names of the segments and genes/peptides). Each organism instance has a schema.
- **Backend:** The backend server is developed by the Loculus team. The same backend software is used across Loculus and organism instances. To support different organisms and metadata fields, it can be configured through a configuration file.
- **Preprocessing pipeline:** The preprocessing pipeline takes unpreprocessed data and generates preprocessed data. The Loculus team provides reference implementations but Loculus can be used with other implementations as long as they follow the specification detailed in this document.
- **LAPIS and SILO:** the data querying engine and API used by Loculus.
- **Sequence entry:** A sequence entry consists of a genome sequence (or sequences if the organism has a segmented genome) and associated metadata. It is the main entity of the Loculus application. Users submit sequence entries and search for sequence entries. Each sequence entry has its own accession. Changes to sequence entries are versioned, meaning that a sequence entry can have multiple versions.
- **Unpreprocessed data:** sequence entries as provided by the submitters
- **Preprocessed data:** sequence entries after being processed by the preprocessing pipeline. The preprocessed data must be consistent with the organism instance schema and will be passed to LAPIS and SILO.
- **Nucleotide sequence segment:** A nucleotide sequence consists of one or multiple segments. If there is only a single segment (e.g., as in SARS-CoV-2), the segment name should be `main`. For multi-segmented sequences, the segment names must match the corresponding reference genomes.

## Workflow overview

1. The preprocessing pipeline calls the backend and receives some unpreprocessed data.
@@ -52,8 +39,8 @@ To retrieve unpreprocessed data, the preprocessing pipeline sends a POST request
In the unprocessed NDJSON, each line contains a sequence entry represented as a JSON object and looks as follows:

```
{"accession": 1, "version": 1, "data": {"metadata": {...}, "unalignedNucleotideSequences": {...}}, "submitter": "insdc_ingest_user", ...}
{"accession": 2, "version": 1, "data": {"metadata": {...}, "unalignedNucleotideSequences": {...}}, "submitter": "john_smith", ...}
```
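The retrieval step can be sketched in Python. The endpoint path and query parameter below are assumptions for illustration, not the backend's documented API; only the NDJSON parsing is taken directly from the format above.

```python
import json
import urllib.request


def parse_ndjson(text: str) -> list:
    """Parse an NDJSON body into a list of sequence-entry dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def fetch_unprocessed(backend_url: str, organism: str, batch_size: int = 100):
    """Request a batch of unprocessed entries from the backend.

    The endpoint path and parameter name are illustrative assumptions;
    consult the backend API documentation for the real interface.
    """
    request = urllib.request.Request(
        f"{backend_url}/{organism}/extract-unprocessed-data"
        f"?numberOfSequenceEntries={batch_size}",
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_ndjson(response.read().decode())
```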

The `metadata` field contains a flat JSON object in which all values are strings. The fields and values correspond to the columns and values as provided by the submitter.
@@ -70,6 +57,10 @@ One JSON object has the following fields:
```
{
  ...
  data: {
    metadata: Record<string, string>,
    unalignedNucleotideSequences: Record<string, string>
  },
  submitter: string,
  submissionId: string,
  groupId: int,
  submittedAt: int
}
```

@@ -123,10 +114,11 @@ The `message` should contain a human-readable message describing the error.
The `metadata` field should contain a flat object consisting of the fields specified in the organism instance schema. The values must be correctly typed. Currently, the following types are available:

- `string`
- `int` (integer)
- `float`
- `date` (supplied as a string with complete ISO-8601 date, e.g., "2023-08-30")
- `pango_lineage` (supplied as a string with a properly formatted SARS-CoV-2 Pango lineage, e.g., "B.1.1.7")
- `authors` (comma-separated list of authors, treated as a string in the current preprocessing pipeline)
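To make the typing rules concrete, here is a sketch of a per-type validator for processed values. The `pango_lineage` pattern is a deliberate simplification for illustration; the real lineage grammar is richer.

```python
import re
from datetime import datetime


def _is_date(value) -> bool:
    """True if value is a complete ISO-8601 date string like 2023-08-30."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)),
    "date": _is_date,
    # Simplified lineage pattern, e.g. "B.1.1.7"; the real grammar is richer.
    "pango_lineage": lambda v: isinstance(v, str)
    and bool(re.fullmatch(r"[A-Za-z]{1,3}(\.\d+)*", v)),
    "authors": lambda v: isinstance(v, str),
}


def is_valid(value, field_type: str) -> bool:
    """Check a processed metadata value against one of the listed types."""
    return _CHECKS[field_type](value)
```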

#### Sequences

@@ -157,8 +149,19 @@ SARS-CoV-2 amino acid sequences:

#### Insertions

The `nucleotideInsertions` and `aminoAcidInsertions` fields contain dictionaries from the segment name or gene name to a list of `NucleotideInsertion`s or `AminoAcidInsertion`s. If there are no insertions the list should be empty.

Examples:

Nucleotide insertions for a multi-segmented organism:

```
{
"seg1": ["248:G", "21881:GAG"],
"seg2": [],
...
}
```
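Each insertion string in the example above encodes a position and the inserted bases, which a consumer can split apart:

```python
def parse_insertion(insertion: str) -> tuple:
    """Split an insertion like '248:G' or '21881:GAG' into
    (position, inserted_sequence)."""
    position, sequence = insertion.split(":")
    return int(position), sequence
```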

## Reprocessing

