Add CCHFV to loculus #1920

anna-parker · 2024-05-15T14:27:25Z

Summary

This PR adds the multi-segmented pathogen CCHFV to loculus. It does the following:

- Update the ingest pipeline to work for multi-segmented viruses: take info from multiple segments, merge metadata, compare hashes across multiple segments, and implement reingest for multiple segments.
- Add the cchfv nextclade_dataset.
- Add a `segmented` option to metadata items in values.yaml. Setting it means that the metadata item differs per segment, so the config files create one metadata item per segment, each with the segment name appended to the item name, i.e. `name_segment`.
- Update the preprocessing pipelines to take multi-segmented input.
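
As a rough illustration of the `segmented` option described above, the expansion into per-segment metadata items might look like the sketch below. The function name `expand_segmented_metadata`, the item fields and the example item names are hypothetical and not taken from this PR. The segment names L, M and S are the three CCHFV genome segments. The actual config generation may well work differently.

```python
def expand_segmented_metadata(items: list[dict], segments: list[str]) -> list[dict]:
    """Return a flat list of metadata items, one per segment for segmented items."""
    expanded = []
    for item in items:
        # Drop the flag itself; it only steers the expansion.
        base = {key: value for key, value in item.items() if key != "segmented"}
        if item.get("segmented", False):
            # One copy per segment, with the segment name appended: name_segment.
            for segment in segments:
                expanded.append({**base, "name": f"{item['name']}_{segment}"})
        else:
            expanded.append(base)
    return expanded


# Hypothetical metadata items: one segmented, one shared across segments.
items = [
    {"name": "length", "type": "int", "segmented": True},
    {"name": "country", "type": "string"},
]
expanded_names = [
    item["name"] for item in expand_segmented_metadata(items, ["L", "M", "S"])
]
print(expanded_names)
```

Items without the flag pass through unchanged, so single-segment organisms would be unaffected under this assumption.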

resolves #1915

preview URL: https://ccfv.loculus.org/cchf/search

Screenshot

PR Checklist

Update format_segmented_viruses to also update metadata. Add cchfv nextclade_dataset (still need to figure out how to add clade_memberships to the auspice trees) and start to modify the preprocessing pipeline to allow for multiple segments. Add correct trees to nextclade_datasets and update preprocessing pipelines to take multiple segments.

anna-parker · 2024-06-05T09:09:06Z

> I only didn't approve initially because I was hoping @corneliusroemer might be able to have a look too. I still think that would be a good idea, but approving to avoid blocking if Cornelius isn't able to

Thanks Theo! @corneliusroemer just created #2100 as a review - I am looking through his suggestions and will merge them into this branch before pushing this branch to main

corneliusroemer · 2024-06-05T09:34:36Z

I don't mind either way, you could merge as is and I could make my PR a refactor PR, or Anya goes through it and we merge into this one first.

I should be able to finish my review/refactor today so it should be ok to wait. But if I stall, feel free to merge this one here.

anna-parker · 2024-06-06T17:41:31Z

As discussed offline with @corneliusroemer - I merged all of Cornelius' suggestions except for the refactor of the group_segments.py file. As the new grouping modifies my grouping slightly, I will review it separately and merge this branch into main now in its current state. Thanks @theosanderson and @corneliusroemer for all the comments and suggestions! I know this was a lot of work to review :-)

corneliusroemer · 2024-06-06T17:47:41Z

I might have been unclear, but I wasn't done yet, just made it to mid-ingest 😀 But as discussed, happy for this to be merged and I'll just keep a review branch.

chaoran-chen · 2024-06-06T18:08:24Z

Amazing work, thanks @anna-parker!

corneliusroemer · 2024-06-04T00:44:37Z

ingest/README.md

@@ -127,10 +145,6 @@ To be able to run tests independently, we should use UUIDs for mock data. Curren

 One complication for testing is that we don't have ARM containers for the backend yet (see <https://github.com/loculus-project/loculus/issues/1765>).


We do now 😀

corneliusroemer · 2024-06-04T00:58:00Z

ingest/Snakefile

 rule prepare_metadata:
    input:
-        metadata="results/metadata_post_rename.tsv",
+        metadata=get_prepare_metadata(SEGMENTED),


No need for a function here, we can just use an inline `foo if boolean else baz` expression. Calling the variable `wildcard` looks like a ChatGPT hallucination 😀
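
For illustration, the inline conditional suggested here could look like the following sketch. Only `results/metadata_post_rename.tsv` and the `SEGMENTED` flag come from the diff above. The path used for the segmented case is a made-up placeholder, and the flag is hard-coded so the snippet runs as plain Python. In the Snakefile the expression would sit directly in the `input:` block.

```python
SEGMENTED = True  # assumption: in the Snakefile this would be read from the config

# Hypothetical path for the segmented case; only the non-segmented
# path appears in the diff.
SEGMENTED_METADATA = "results/metadata_post_group.tsv"

# Inline conditional expression instead of a helper function.
metadata_input = SEGMENTED_METADATA if SEGMENTED else "results/metadata_post_rename.tsv"
print(metadata_input)
```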

corneliusroemer · 2024-08-28T16:13:30Z

ingest/scripts/prepare_metadata.py

@@ -45,7 +46,7 @@ def split_authors(authors: str) -> str:
        else:
            result.append(single_split[i].strip())

-    return ", ".join(result)
+    return ", ".join(sorted(result))


This is where authors get sorted @theosanderson
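
To make the effect of this one-line change concrete, here is a small sketch with made-up author names (not data from the pipeline). It only reproduces the two `join` variants from the diff, not the rest of `split_authors`.

```python
# Hypothetical list as it might look just before the return statement.
result = ["Zhang Y.", "Adams B.", "Miller C."]

# Before the change: authors keep the order in which they were parsed.
original_order = ", ".join(result)

# After the change: authors are joined in sorted (code point) order,
# so the original author order is no longer preserved.
sorted_order = ", ".join(sorted(result))

print(original_order)
print(sorted_order)
```

Note that `sorted` compares strings by code point, so uppercase letters sort before lowercase ones.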

anna-parker added and removed the preview label (triggers a deployment to argocd) May 15, 2024

anna-parker force-pushed the ccfv branch from 39674e2 to 547c459 Compare May 15, 2024 17:55

anna-parker added 3 commits May 22, 2024 09:17

Fix ingest for single segment case

21d2877

Fix: values.yaml - nucleotideSequences need to be a list in prepro co…

d65a076

…nfigs and a dict in referencegenome configs.

anna-parker force-pushed the ccfv branch from 5c44f59 to 43fc8cf Compare May 22, 2024 07:24

anna-parker added 5 commits May 22, 2024 17:50

Add correct genome annotations from NCBI

fef7ebe

Update configs to use githubusercontent for nextclade_datasets.

c323743

Use new dataset link

309dfeb

Fix preprocessing issues after default values.yaml changes.

c94ba9f

Add segmented as a config param

4756ffe

anna-parker force-pushed the ccfv branch from 73ed76b to 4756ffe Compare May 22, 2024 15:50

anna-parker added 7 commits May 23, 2024 08:04

Join segments based on isolate name.

49ff8e2

Fix some prepro issues

1d9df16

Add default config changes

e0f8801

Update silo configs

9dea930

Remove preprocessing temp results file.

b3c7645

Fix cchfv table columns as metdata has now been renamed.

530cb30

Fix author_affiliations

5717cd4

anna-parker marked this pull request as ready for review May 23, 2024 12:11

anna-parker added 6 commits May 23, 2024 14:15

Merge branch 'main' into ccfv

61bb4f7

Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv

7069e0b

Fix merge issues with instanceName.

9d1eb2a

Merge branch 'main' into ccfv

f375fb7

Fix prepare_metdata bug.

1d915a1

Merge branch 'ccfv' of github.com:loculus-project/loculus into ccfv

320d1f3

corneliusroemer and others added 21 commits June 6, 2024 13:37

Add dag for segmented

542606d

Simplify segmentation inference

0914133

Remove unnecessary/confusing functions

18db327

Simplify extraction script, DRYer

3cb089a

Reorder to never have rules do forward references

ef375a4

Remove unused function

94b9706

Keep top level dir clean by moving images to folder

fe20091

Review segment parsing script

3fb9060

It works, but is a bit brittle We probably want to use `nextclade_split` in the future rather than header information

Switch default log level to INFO, debug is very verbose and there's n…

47afdea

…o level lower than DEBUG we should use INFO as default, and debug only when we're actually debugging

Log a few important lines at INFO, not everything at debug only

14d784d

Avoid a very broad try/except block, if necessary, use in more locali…

1fbc79c

…zed manner

Mention all config in params: blocks, so snakemake can rerun rule o…

c6d0ceb

…n change Conditionally include segmented rules to avoid ruleorder ambiguity

Use input.script consistently (the advantage of using the script as i…

1012ecb

…nput is that this way snakemake will automatically rerun the rule when the script has been changed during development

All config files to be used by Python MUST use snake case, not camel …

2cfdb12

…case nucleotideSequences -> nucleotide_sequences for consistency

Fix ruff lints and unnecessary indentations

de3461b

Update documentation of group_segments

fac633e

Fix issues raised in get_segment_details

8543f6c

Fix weird error I introduced when merging changes

df9c57d

Go back to old regex as this catches more cases.

77119db

Update ingest config file

bcefc7c

Merge branch 'main' into ccfv

cd19b1c

anna-parker merged commit b753e78 into main Jun 6, 2024
13 checks passed

anna-parker deleted the ccfv branch June 6, 2024 17:52

anna-parker mentioned this pull request Jun 11, 2024

Tabs on seqDetails page for aligned sequences are incorrectly active for aligned sequences #2132

Closed

corneliusroemer reviewed Aug 28, 2024
