Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CCHFV to loculus #1920

Merged
merged 81 commits into from
Jun 6, 2024
Merged

Add CCHFV to loculus #1920

merged 81 commits into from
Jun 6, 2024

Conversation

anna-parker
Copy link
Contributor

@anna-parker anna-parker commented May 15, 2024

Summary

This PR adds the multi-segmented pathogen CCHFV to loculus. It does the following:

  • Update ingest pipeline to work for multi-segmented viruses: take info from multiple segments, merge metadata, compare hashes with multiple segments, implement reingest for multiple segments.

  • Add cchfv nextclade_dataset.

  • Add segmented option to metadata items in values.yaml. Adding the segmented option means that the metadata item is different for each segment. Thus, the config files will create #(segments) metadata items, each with the segment name concatenated to the end of the name, i.e. name_segment.

  • Update preprocessing pipelines: take multi segmented input.

resolves #1915

preview URL: https://ccfv.loculus.org/cchf/search

Screenshot

image

PR Checklist

  • Get the nextclade_datasets to work with docker -> Add CCHFV to nextclade_data. nextstrain/nextclade_data#199
  • Modify ingest pipeline further: properly hash for multiple segments, join segments based on isolate and join metadata correctly.
  • Make sure that all config files are correctly generated, fix some small bugs: get segments from config directly, make sure that displayNames are correctly modified
  • Fix reingest bug
  • Fix website sequence tab box bug
  • Refactor ingest to only merge groups after processing metadata
  • Fix preprocessing config file: instead of adding segmented field generate dictionary with renamed fields (like in other configs)
  • Upload ref sequences to corneliusroemer/seqs to avoid values.yaml getting too large
  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test. -> moved to a later PR

@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels May 15, 2024
Update format_segmented_viruses to also update metadata.

Add cchfv nextclade_dataset (still need to figure out how to add clade_memberships to the auspice trees) and start to modify the preprocessing pipeline to allow for multiple segments.

Add correct trees to nextclade_datasets and update preprocessing pipelines to take multiple segments.
…nfigs and a dict in referencegenome configs.
@anna-parker anna-parker marked this pull request as ready for review May 23, 2024 12:11
@anna-parker
Copy link
Contributor Author

I only didn't approve initially because I was hoping @corneliusroemer might be able to have a look too. I still think that would be a good idea, but approving to avoid blocking if Cornelius isn't able to

Thanks Theo! @corneliusroemer just created #2100 as a review - I am looking through his suggestions and will merge them into this branch before pushing this branch to main

@corneliusroemer
Copy link
Contributor

I don't mind either way, you could merge as is and I could make my PR a refactor PR, or Anya goes through it and we merge into this one first.

I should be able to finish my review/refactor today so it should be ok to wait. But if I stall, feel free to merge this one here.

corneliusroemer and others added 21 commits June 6, 2024 13:37
It works, but is a bit brittle
We probably want to use `nextclade_split` in the future rather than header information
…o level lower than DEBUG

we should use INFO as default, and debug only when we're actually debugging
…n change

Conditionally include segmented rules to avoid ruleorder ambiguity
…nput is that this way snakemake will automatically rerun the rule when the script has been changed during development
…case

nucleotideSequences -> nucleotide_sequences for consistency
@anna-parker
Copy link
Contributor Author

As discussed offline with @corneliusroemer - I merged all of Cornelius' suggestions except for the refactor of the group_segments.py file. As the new grouping modifies the my grouping slightly I will review it separately and merge this branch now in its current state into main. Thanks @theosanderson and @corneliusroemer for all the comments and suggestions! I know this was a lot of work to review :-)

@corneliusroemer
Copy link
Contributor

I might have been unclear but I wasn't done yet, just made it to mid-ingest 😀 but as discussed, happy to be merge and I'll just keep a review branch.

@anna-parker anna-parker merged commit b753e78 into main Jun 6, 2024
13 checks passed
@anna-parker anna-parker deleted the ccfv branch June 6, 2024 17:52
@chaoran-chen
Copy link
Member

Amazing work, thanks @anna-parker!

@@ -127,10 +145,6 @@ To be able to run tests independently, we should use UUIDs for mock data. Curren

One complication for testing is that we don't have ARM containers for the backend yet (see <https://github.com/loculus-project/loculus/issues/1765>).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do now 😀

ingest/Snakefile Outdated
rule prepare_metadata:
input:
metadata="results/metadata_post_rename.tsv",
metadata=get_prepare_metadata(SEGMENTED),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a function here, we can just use an inline foo if boolean else baz expression. Calling the variable wildcard looks like a ChatGPT hallucination 😀

@@ -45,7 +46,7 @@ def split_authors(authors: str) -> str:
else:
result.append(single_split[i].strip())

return ", ".join(result)
return ", ".join(sorted(result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where authors get sorted @theosanderson

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-segmented virus support for ingest and preprocessing
4 participants