Add CCHFV to loculus #1920

Merged on Jun 6, 2024. The diff below shows changes from 60 of the 81 commits.

Commits:
- `9277b4b` Add ccfv to yaml (anna-parker, May 10, 2024)
- `21d2877` Fix ingest for single segment case (anna-parker, May 15, 2024)
- `d65a076` Fix: values.yaml - nucleotideSequences need to be a list in prepro co… (anna-parker, May 15, 2024)
- `fef7ebe` Add correct genome annotations from NCBI (anna-parker, May 15, 2024)
- `c323743` Update configs to use githubusercontent for nextclade_datasets. (anna-parker, May 16, 2024)
- `309dfeb` Use new dataset link (anna-parker, May 16, 2024)
- `c94ba9f` Fix preprocessing issues after default values.yaml changes. (anna-parker, May 22, 2024)
- `4756ffe` Add segmented as a config param (anna-parker, May 22, 2024)
- `49ff8e2` Join segments based on isolate name. (anna-parker, May 23, 2024)
- `1d9df16` Fix some prepro issues (anna-parker, May 23, 2024)
- `e0f8801` Add default config changes (anna-parker, May 23, 2024)
- `9dea930` Update silo configs (anna-parker, May 23, 2024)
- `b3c7645` Remove preprocessing temp results file. (anna-parker, May 23, 2024)
- `530cb30` Fix cchfv table columns as metdata has now been renamed. (anna-parker, May 23, 2024)
- `5717cd4` Fix author_affiliations (anna-parker, May 23, 2024)
- `61bb4f7` Merge branch 'main' into ccfv (anna-parker, May 23, 2024)
- `7069e0b` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 23, 2024)
- `9d1eb2a` Fix merge issues with instanceName. (anna-parker, May 23, 2024)
- `f375fb7` Merge branch 'main' into ccfv (anna-parker, May 23, 2024)
- `1d915a1` Fix prepare_metdata bug. (anna-parker, May 23, 2024)
- `320d1f3` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 23, 2024)
- `9b299d4` Add back missing website metadata. (anna-parker, May 23, 2024)
- `303c630` Fix author list sorting, fix displayName. (anna-parker, May 23, 2024)
- `6e6e75c` Fix values.yaml (anna-parker, May 24, 2024)
- `ddbc4d9` Fix reingest. (anna-parker, May 25, 2024)
- `7afc4ed` Add segmented to ingest configs and make use in scripts consistent. (anna-parker, May 25, 2024)
- `60f5992` Update README. (anna-parker, May 25, 2024)
- `7e70b73` Fix little ingest bug (anna-parker, May 25, 2024)
- `1c6d8ea` Refactor ingest to make steps clearer. (anna-parker, May 27, 2024)
- `31d1af1` Fix webpage bug. (anna-parker, May 27, 2024)
- `b14e4ee` Small prepro fixes (anna-parker, May 27, 2024)
- `1c0c841` Remove unnecessary files from gitignore (anna-parker, May 27, 2024)
- `0fbb72c` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `38778f3` Small fixes (anna-parker, May 28, 2024)
- `80c52de` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `4fdfd1b` Clean up preprocessing (anna-parker, May 28, 2024)
- `fc6c7e4` add args (anna-parker, May 28, 2024)
- `f535691` Use links to sequences instead of full sequences in values.yaml. (anna-parker, May 28, 2024)
- `3cbbcc7` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `498a88d` Fix little bug (anna-parker, May 28, 2024)
- `3e1c6e0` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `8b85432` Fix length bug (anna-parker, May 28, 2024)
- `c177584` Merge branch 'main' into ccfv (anna-parker, May 28, 2024)
- `aacad98` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 28, 2024)
- `4b48c46` Fix merge bug (anna-parker, May 28, 2024)
- `3e5377d` Make check stricter (anna-parker, May 28, 2024)
- `0591f09` Update docs (anna-parker, May 28, 2024)
- `16bb0aa` Merge remote-tracking branch 'origin/main' into ccfv (anna-parker, May 31, 2024)
- `eaba61f` Merge branch 'main' into ccfv (anna-parker, May 31, 2024)
- `b16026e` Fix prepro bug introduced by merge (anna-parker, May 31, 2024)
- `aab29a1` Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv (anna-parker, May 31, 2024)
- `7e512a7` Remove ncbi_length from defaults - this was removed from values.yaml … (anna-parker, May 31, 2024)
- `0186f52` Update READMEs with suggestions. (anna-parker, May 31, 2024)
- `a29ca19` Resolve some issues (anna-parker, Jun 1, 2024)
- `8ede192` Change `segmented` to `per_segment`. (anna-parker, Jun 2, 2024)
- `28c9402` Remove the requirement for adding `segmented:True` to the config.yaml (anna-parker, Jun 3, 2024)
- `9026276` Fix backend bug (anna-parker, Jun 3, 2024)
- `f77b447` Fix bug (anna-parker, Jun 3, 2024)
- `9c14af7` Second try to fix bug (anna-parker, Jun 3, 2024)
- `22c29dd` Merge branch 'main' into ccfv (corneliusroemer, Jun 4, 2024)
- `542606d` Add dag for segmented (corneliusroemer, Jun 4, 2024)
- `0914133` Simplify segmentation inference (corneliusroemer, Jun 4, 2024)
- `18db327` Remove unnecessary/confusing functions (corneliusroemer, Jun 4, 2024)
- `3cb089a` Simplify extraction script, DRYer (corneliusroemer, Jun 4, 2024)
- `ef375a4` Reorder to never have rules do forward references (corneliusroemer, Jun 4, 2024)
- `94b9706` Remove unused function (corneliusroemer, Jun 4, 2024)
- `fe20091` Keep top level dir clean by moving images to folder (corneliusroemer, Jun 4, 2024)
- `3fb9060` Review segment parsing script (corneliusroemer, Jun 4, 2024)
- `47afdea` Switch default log level to INFO, debug is very verbose and there's n… (corneliusroemer, Jun 4, 2024)
- `14d784d` Log a few important lines at INFO, not everything at debug only (corneliusroemer, Jun 4, 2024)
- `1fbc79c` Avoid a very broad try/except block, if necessary, use in more locali… (corneliusroemer, Jun 4, 2024)
- `c6d0ceb` Mention all config in `params:` blocks, so snakemake can rerun rule o… (corneliusroemer, Jun 4, 2024)
- `1012ecb` Use input.script consistently (the advantage of using the script as i… (corneliusroemer, Jun 4, 2024)
- `2cfdb12` All config files to be used by Python MUST use snake case, not camel … (corneliusroemer, Jun 4, 2024)
- `de3461b` Fix ruff lints and unnecessary indentations (corneliusroemer, Jun 4, 2024)
- `fac633e` Update documentation of group_segments (anna-parker, Jun 6, 2024)
- `8543f6c` Fix issues raised in get_segment_details (anna-parker, Jun 6, 2024)
- `df9c57d` Fix weird error I introduced when merging changes (anna-parker, Jun 6, 2024)
- `77119db` Go back to old regex as this catches more cases. (anna-parker, Jun 6, 2024)
- `bcefc7c` Update ingest config file (anna-parker, Jun 6, 2024)
- `cd19b1c` Merge branch 'main' into ccfv (anna-parker, Jun 6, 2024)

40 changes: 39 additions & 1 deletion docs/src/content/docs/guides/getting-started.md
@@ -134,12 +134,50 @@ organisms:

In this example, the configuration for the "ebolavirus-sudan" organism is defined. It includes schema settings, website display options, silo configuration, preprocessing details, and reference genome information.

Note that the metadata section includes various fields controlling how the metadata of specific sequences should be displayed. Each metadata item must have a `name`, which will also be displayed on the page unless `displayName` is set. The `type` of the data, whether the field is `required`, and whether `autoComplete` is enabled can also be specified. Additionally, links from metadata entries to external websites can be added using the `customDisplay` option. Metadata can also be grouped into sections, specified by the `header` field. The `noInput` parameter specifies that a field is generated internally by Loculus (it can be set by the preprocessing pipeline) and should not be expected as input.
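
For example, a metadata entry using these options might look like the following sketch (field names here are illustrative, not taken from a real configuration):

```yaml
metadata:
  - name: collection_date
    displayName: Collection date
    type: date
    required: true
    autoComplete: true
    header: "Sample details"
  - name: completeness
    type: float
    noInput: true # generated internally, e.g. by the preprocessing pipeline
```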

Your preprocessing pipeline can be customized for each organism. Currently, we use `nextclade run` in our preprocessing pipeline and suggest it as a fast option for basic checks on input sequences. Given a `nextclade dataset` (in its simplest form, a reference sequence and a gene_annotation file), nextclade tries to align new sequences to the reference and will discard sequences that cannot be aligned. It will also compute mutations, insertions and deletions for the nucleotide sequence as well as for the corresponding genes. If you would like to use our preprocessing setup, you can add a nextclade dataset to your `values.yaml` as follows:

```yaml
preprocessing:
  - version: 1
    image: ghcr.io/loculus-project/preprocessing-nextclade
    args:
      - "prepro"
    configFile:
      log_level: DEBUG
      nextclade_dataset_name: nextstrain/ebola/zaire
      genes: [NP, VP35, VP40, GP, sGP, ssGP, VP30, VP24, L]
      batch_size: 100
```

Additionally, the `tableColumns` section defines which metadata fields are shown as columns in the search results.

You can add multiple organisms under the organisms section, each with its own unique configuration.

### Multi-segmented pathogens

In Loculus, sequence data from multi-segmented viruses is stored in accessioned sequence entries which group together the segments from a particular sample or isolate. Multi-segmented organisms should be annotated with a list of segment names supplied as `nucleotideSequences`. For CCHFV this looks like:

```yaml
organisms:
  cchf:
    schema:
      organismName: "Crimean-Congo Hemorrhagic Fever Virus"
      nucleotideSequences: [L, M, S]
      metadata:
        - name: length
          type: int
          header: "Length"
          per_segment: true
```

Additionally, if you are using the preprocessing or ingest pipelines, `nucleotideSequences` must also be defined in those sections of the config.

Metadata fields can be isolate- or segment-specific. By default, we assume metadata fields are isolate-specific (i.e. shared across all segments); therefore segment-specific fields must be marked as `per_segment` in the config file. Marking a field as `per_segment` results in one field per segment: in the example above, instead of a single metadata field called `length` there will be three fields called `length_L`, `length_M` and `length_S`.

Loculus expects multi-segmented pathogen sequences to be submitted in a specific format. FASTA files should have a separate entry/record for each segment, with a header of `>[submissionID]_[segmentName]`, e.g. `>sample123_L` for the `L` segment of the sample with the submissionID `sample123`. Metadata is uploaded for an entire sequence entry, rather than per segment, i.e. there will be only one row for each `submissionID`.
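
For example, a FASTA submission for the three CCHFV segments of a single sample might look like this (sequences truncated for illustration):

```
>sample123_L
ACTGACTG...
>sample123_M
ACTGACTG...
>sample123_S
ACTGACTG...
```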

## Secrets

Our secrets configuration supports three types of secrets.
26 changes: 20 additions & 6 deletions ingest/README.md
@@ -18,6 +18,12 @@ Using NCBI `datasets` CLI, download all sequences and corresponding NCBI curated

Sequences and metadata are transformed into (nd)json files to simplify (de)serialization and further processing.

### Segmented viruses

NCBI handles segmented viruses differently from Loculus. In NCBI, the primary unit of accession is a single segment of a genomic sequence, with each segment having its own metadata. In Loculus, a sample is uploaded with all its segments grouped under a collective accession ID, and metadata applies at the sample (or group) level. Downloaded FASTA files have each segment under a header of the form `>[accessionID]_[segmentName]`. (When uploaded to Loculus, they must instead use `>[submissionID]_[segmentName]`.)

The segment a sequence corresponds to can only be determined from the description of its FASTA record. In `get_segment_details.py` we discard all sequences with unclear segment annotations and add `segment` as a metadata field. (TODO #2079: Use nextclade instead of a regex search to determine which segment the sequence aligns with best, to keep as much data as possible.)
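
As a rough illustration of the regex-based approach (the actual patterns and edge-case handling live in `scripts/get_segment_details.py`; this sketch is not the real implementation):

```python
import re

# Hypothetical pattern; the real script's regex differs.
SEGMENT_PATTERN = re.compile(r"\bsegment[:\s]*([LMS])\b", re.IGNORECASE)

def infer_segment(fasta_description: str) -> str | None:
    """Return 'L', 'M' or 'S' if the description names exactly one segment."""
    matches = {m.group(1).upper() for m in SEGMENT_PATTERN.finditer(fasta_description)}
    if len(matches) == 1:
        return matches.pop()
    return None  # ambiguous or missing annotation: the sequence is discarded

print(infer_segment("CCHFV strain X segment L, complete sequence"))  # "L"
print(infer_segment("CCHFV partial genome"))  # None
```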

### Transforming values to conform with Loculus' expectations

Metadata as received from `datasets` is transformed to conform to Loculus' expectations. This includes for example:
@@ -26,15 +32,25 @@
- transforming values, e.g. turn author strings from `LastName1, Initial1, LastName2, Initial2` into `Initial1 LastName1, Initial2 LastName2`
- splitting fields, e.g. NCBI's single, complex collection country field (`Germany: Munich, Bavaria`) is split into multiple fields `country`, `state`, `division` (`Germany`, `Bavaria`, `Munich`)
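
As a sketch of the author-string transform described above (the real ingest scripts handle more edge cases, such as missing initials):

```python
def reformat_authors(ncbi_authors: str) -> str:
    """Turn 'LastName1, Initial1., LastName2, Initial2.' into
    'Initial1. LastName1, Initial2. LastName2'."""
    parts = [p.strip() for p in ncbi_authors.split(",")]
    pairs = zip(parts[0::2], parts[1::2])  # (last name, initials) pairs
    return ", ".join(f"{initials} {last}" for last, initials in pairs)

print(reformat_authors("Parker, A., Roemer, C."))  # "A. Parker, C. Roemer"
```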

Note that the `submissionId` is just the `genbank_accession` for non-segmented viruses; for segmented viruses it is the concatenation of the `genbank_accession` of each segment (with the segment name appended to each).

### Calculating a hash for each sequence entry

Every sequence entry is to be uploaded only once and must be ignored by future periodic ingest runs unless the metadata and/or sequence has changed.

To achieve this, an md5 hash is generated for each sequence entry based on the post-transform metadata and sequence content. The hash is based on all metadata fields submitted to Loculus as well as the sequence. Hence, changes to the ingest pipeline's transform step (above) can lead to changes in hash and resubmission - even without underlying data change on INSDC. Likewise, some changes to the INSDC data might not cause a sequence update on Loculus if what has been changed does not affect the post-transformed metadata.

For segmented viruses we calculate the md5 hash of each segment; after grouping segments, we concatenate the per-segment hashes and hash the result.
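
A minimal sketch of this hashing scheme (assuming, for illustration, JSON-serialized metadata and a fixed ordering of segment hashes; the real implementation may differ in details):

```python
import hashlib
import json

def record_hash(metadata: dict, sequence: str) -> str:
    """MD5 over the post-transform metadata plus the sequence."""
    payload = json.dumps(metadata, sort_keys=True) + sequence
    return hashlib.md5(payload.encode()).hexdigest()

def group_hash(segment_hashes: list[str]) -> str:
    """For segmented viruses: hash the concatenation of per-segment hashes."""
    return hashlib.md5("".join(segment_hashes).encode()).hexdigest()
```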

### Grouping segmented viruses

In NCBI, sequences are uploaded for each segment separately. To upload all segments from the same isolate together, we need to group the sequences. We do this by grouping NCBI segments based on `ncbi_isolate_name` and other isolate-specific attributes; segments are only uploaded together if all these attributes match. We also add additional checks to prevent multiple sequences of the same segment from being grouped together. If a check fails, or the segments do not have isolate information, the segments are ingested and uploaded to Loculus individually.

We group segments by adding a `joint_accession` field to the metadata, which consists of a concatenated list of the `genbank_accession` IDs of all segments in the group. Each FASTA record is also modified to use the `joint_accession`, with its segment name appended, as its ID (as required by Loculus).
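
A condensed sketch of this grouping logic (field names and the exact `joint_accession` format are illustrative; the authoritative version is `scripts/group_segments.py`):

```python
from collections import defaultdict

def group_segments(records: list[dict]) -> None:
    groups = defaultdict(list)
    for rec in records:
        # Isolate-specific attributes; segments are grouped only if all match.
        key = (rec.get("ncbi_isolate_name"), rec.get("sample_collection_date"))
        groups[key].append(rec)

    for key, members in groups.items():
        segments = [m["segment"] for m in members]
        if None in key or len(set(segments)) != len(segments):
            continue  # missing isolate info or duplicated segment: upload individually
        joint = "/".join(sorted(m["genbank_accession"] for m in members))
        for m in members:
            m["joint_accession"] = joint
```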

### Getting status and hashes of previously submitted sequences and triaging

Before uploading new sequences, the pipeline queries the Loculus backend for the status and hash of all previously submitted sequences. This is done to avoid uploading sequences that have already been submitted and have not changed. Furthermore, only accessions whose highest version is in status `APPROVED_FOR_RELEASE` can be updated through revision. Entries in other states cannot currently be updated (TODO: Potentially use `/submit-edited-data` endpoint to allow updating entries in more states).

Hashes and statuses are used to triage sequences into 4 categories which determine the action to be taken:

@@ -79,7 +95,7 @@ In production the ingest pipeline runs in a docker container that takes a config
We use the Snakemake workflow management system which also uses different config files:

- `profiles/default/config.yaml` sets default command line options while running Snakemake.
- `config/config.yaml` and `config/defaults.yaml` turn into a config dict in the `Snakefile` (attributes defined in these files become global variables in the `Snakefile`). The `config/config.yaml` used in production is generated by `kubernetes/loculus/templates/loculus-preprocessing-config.yaml`.

TLDR: The `Snakefile` contains workflows defined as rules with required input and expected output files. By default, Snakemake takes the first rule as the target and then constructs a graph of dependencies (a DAG) required to produce that rule's expected output. The target rule can be specified using `snakemake {rule}`.
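
For illustration, a toy Snakefile (not part of this repo): running bare `snakemake` targets the first rule, `all`, and Snakemake infers that `process` must run first to produce the required input.

```snakemake
rule all:
    input:
        "results/processed.txt"


rule process:
    input:
        "results/raw.txt"
    output:
        "results/processed.txt"
    shell:
        "sort {input} > {output}"
```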

@@ -109,6 +125,8 @@ Then run snakemake using `snakemake` or `snakemake {rule}`.

Note that by default the pipeline will submit sequences to main. If you want to change this to another branch (that has a preview tag) you can modify the `backend_url` and `keycloak_token_url` arguments in the `config.yaml` file. They are of the form `https://backend-{branch_name}.loculus.org/` and `https://authentication-{branch_name}.loculus.org`. Alternatively, if you are running the backend locally, you can also specify the local backend port: `http://localhost:8079` and the local keycloak port: `http://localhost:8083`.

The ingest pipeline requires config files, found in the `config` directory. `defaults.yaml` contains default values, which are overridden by `config.yaml`. To produce the `config.yaml` used in production, you can run `../generate_local_test_config.sh` and then copy the configs for your pathogen into `config.yaml`.

## Testing

Currently, there is no automated testing other than running the pipeline manually and in preview deployments.
@@ -127,10 +145,6 @@ To be able to run tests independently, we should use UUIDs for mock data.

One complication for testing is that we don't have ARM containers for the backend yet (see <https://github.com/loculus-project/loculus/issues/1765>).
> **Contributor:** We do now 😀


### Multi-segment support

Currently, the pipeline only supports single-segment sequences. We need to add support for multi-segment viruses, like CCHF and Influenza.

### Recover from processing errors

At some point we should be able to recover from processing errors by using `/submit-edited-data` to update entries in more states than just `APPROVED_FOR_RELEASE`.
112 changes: 103 additions & 9 deletions ingest/Snakefile
@@ -10,11 +10,20 @@ for key, value in defaults.items():
    if not key in config:
        config[key] = value

# Check if organism is segmented
if "nucleotideSequences" not in config:
    config["nucleotideSequences"] = ["main"]
config["segmented"] = not (
    len(config["nucleotideSequences"]) == 1
    and config["nucleotideSequences"][0] == "main"
)

Path("results").mkdir(parents=True, exist_ok=True)
with open("results/config.yaml", "w") as f:
f.write(yaml.dump(config))

TAXON_ID = config["taxon_id"]
SEGMENTED = config["segmented"]
ALL_FIELDS = ",".join(config["all_fields"])
COLUMN_MAPPING = config["column_mapping"]
LOG_LEVEL = config.get("log_level", "INFO")
@@ -50,17 +59,54 @@ rule fetch_ncbi_dataset_package:
        """


def get_extract_output(wildcard):
    if wildcard:
        return ("results/sequences_full.fasta",)
    else:
        return ("results/sequences.fasta",)


rule extract_ncbi_dataset_sequences:
    input:
        dataset_package="results/ncbi_dataset.zip",
    output:
        ncbi_dataset_sequences=get_extract_output(SEGMENTED),
    params:
        segmented=SEGMENTED,
    shell:
        """
        if [[ {params.segmented} ]]; then
            unzip -jp {input.dataset_package} \
                ncbi_dataset/data/genomic.fna \
            | seqkit seq -w0 \
            > {output.ncbi_dataset_sequences}
        else
            unzip -jp {input.dataset_package} \
                ncbi_dataset/data/genomic.fna \
            | seqkit seq -i -w0 \
            > {output.ncbi_dataset_sequences}
        fi
        """


rule get_segment_details:
    """Check if viruses are segmented, if so add segment to metadata"""
    input:
        sequences="results/sequences_full.fasta",
        script="scripts/get_segment_details.py",
        ncbi_dataset_tsv="results/metadata_post_rename.tsv",
        config="results/config.yaml",
    output:
        sequences_processed="results/sequences.fasta",
        ncbi_dataset_tsv="results/metadata_post_segment.tsv",
    shell:
        """
        python {input.script} \
            --config-file {input.config} \
            --input-seq {input.sequences} \
            --input-metadata {input.ncbi_dataset_tsv} \
            --output-seq {output.sequences_processed} \
            --output-metadata {output.ncbi_dataset_tsv}
        """


@@ -106,9 +152,16 @@ rule rename_columns:
        rename_columns(input.ncbi_dataset_tsv, output.ncbi_dataset_tsv)


def get_prepare_metadata(wildcard):
    if wildcard:
        return ("results/metadata_post_segment.tsv",)
    else:
        return ("results/metadata_post_rename.tsv",)


rule prepare_metadata:
    input:
        metadata=get_prepare_metadata(SEGMENTED),
> **Contributor:** No need for a function here, we can just use an inline foo if boolean else baz expression. Calling the variable wildcard looks like a ChatGPT hallucination 😀

        sequence_hashes="results/sequence_hashes.json",
        config="results/config.yaml",
        script="scripts/prepare_metadata.py",
@@ -127,6 +180,36 @@ rule prepare_metadata:
        """


rule group_segments:
    input:
        metadata="results/metadata_post_prepare.json",
        sequences="results/sequences.json",
        config="results/config.yaml",
        script="scripts/group_segments.py",
    output:
        metadata="results/metadata_post_group.json",
        sequences="results/sequences_post_group.json",
    params:
        log_level=LOG_LEVEL,
    shell:
        """
        python scripts/group_segments.py \
            --config-file {input.config} \
            --input-metadata {input.metadata} \
            --input-seq {input.sequences} \
            --output-metadata {output.metadata} \
            --output-seq {output.sequences} \
            --log-level {params.log_level} \
        """


def get_grouped_metadata(wildcard):
    if wildcard:
        return ("results/metadata_post_group.json",)
    else:
        return ("results/metadata_post_prepare.json",)

rule get_previous_submissions:
    """Download metadata and sequence hashes of all previously submitted sequences
    Produces mapping from INSDC accession to loculus id/version/hash
@@ -142,7 +225,7 @@ rule get_previous_submissions:
        ...
    """
    input:
        prepped_metadata=get_grouped_metadata(SEGMENTED),  # Reduce likelihood of race condition of multi-submission
        config="results/config.yaml",
        script="scripts/call_loculus.py",
    output:
@@ -163,8 +246,9 @@

rule compare_hashes:
    input:
        config="results/config.yaml",
        old_hashes="results/previous_submissions.json",
        metadata=get_grouped_metadata(SEGMENTED),
        script="scripts/compare_hashes.py",
    output:
        to_submit="results/to_submit.json",
@@ -177,6 +261,7 @@
    shell:
        """
        python scripts/compare_hashes.py \
            --config-file {input.config} \
            --old-hashes {input.old_hashes} \
            --metadata {input.metadata} \
            --to-submit {output.to_submit} \
@@ -188,10 +273,18 @@
        """


def get_grouped_sequences(wildcard):
    if wildcard:
        return ("results/sequences_post_group.json",)
    else:
        return ("results/sequences.json",)

rule prepare_files:
    input:
        config="results/config.yaml",
        metadata=get_grouped_metadata(SEGMENTED),
        sequences=get_grouped_sequences(SEGMENTED),
        to_submit="results/to_submit.json",
        to_revise="results/to_revise.json",
        script="scripts/prepare_files.py",
@@ -203,6 +296,7 @@
    shell:
        """
        python scripts/prepare_files.py \
            --config-file {input.config} \
            --metadata-path {input.metadata} \
            --sequences-path {input.sequences} \
            --to-submit-path {input.to_submit} \
9 changes: 9 additions & 0 deletions ingest/config/config.yaml
@@ -2,3 +2,12 @@ taxon_id: 186538
backend_url: https://backend-main.loculus.org/
keycloak_token_url: https://authentication-main.loculus.org/realms/loculus/protocol/openid-connect/token
organism: ebola-zaire

# taxon_id: 3052518
# backend_url: http://localhost:8079/
# keycloak_token_url: http://localhost:8083/realms/loculus/protocol/openid-connect/token
# organism: cchf
# nucleotideSequences:
# - M
# - L
# - S
26 changes: 20 additions & 6 deletions ingest/config/defaults.yaml
@@ -5,13 +5,17 @@ log_level: DEBUG
compound_country_field: ncbi_geo_location
fasta_id_field: genbank_accession
rename:
  bioprojects: bioproject_accessions
  country: geo_loc_country
  division: geo_loc_admin_1
  genbank_accession: insdc_accession_full
  ncbi_collection_date: sample_collection_date
  ncbi_host_name: host_name_scientific
  ncbi_host_tax_id: host_taxon_id
  ncbi_is_lab_host: is_lab_host
  ncbi_isolate_name: specimen_collector_sample_id
  ncbi_sra_accessions: sra_run_accession
  ncbi_submitter_affiliation: author_affiliations
  ncbi_submitter_names: authors
keep:
  - division
@@ -33,6 +37,16 @@ keep:
  - ncbi_virus_tax_id
  - sequence_md5
  - genbank_accession
  - joint_accession
segment_specific:
  - biosample_accession
  - bioproject_accessions
  - ncbi_completeness
  - sra_run_accession
  - ncbi_protein_count
  - insdc_accession_base
  - insdc_version
  - insdc_accession_full
all_fields:
  - accession
  - bioprojects