Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest): Small ingest fixes for multi-segmented viruses #2316

Merged
merged 10 commits into from
Jul 23, 2024

Conversation

anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Jul 22, 2024

resolves #

preview URL: https://ingest-fixes.loculus.org/

Summary

Small changes to display:

  • do not show alignment states if length is 0, otherwise this gets ugly:
    image
  • allow option to group segment specific metadata under the same header (oneHeader argument in values.yaml) to reduce busy-ness of page.
  • image
  • do a hard word-break inside sequence details display to avoid issues like:
image

Alignment grouping fix:

  • We assign segment names using nextclade align. When a segment aligns to a reference nextclade can infer a clade name - we were using this name to assign segment names as in CCHF it was equivalent to the segment name. This is actually an exception and cannot be used in general. Here we change the code to use the actual segment name as defined by the nextclade dataset.

Small grouping changes/ Config changes

  • ncbi_update_date: as with taxon_id this is not necessarily the same and cannot be concatenated so we also set it to perSegment and oneHeader.

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jul 22, 2024
@anna-parker anna-parker added the review please PR waiting for final review label Jul 23, 2024
@theosanderson
Copy link
Member

theosanderson commented Jul 23, 2024

taxon_id is not always shared among all segments: as this is an integer we cannot just concatenate the different fields when they differ. Instead we need to set taxon_id as perSegment and oneHeader.

This doesn't make sense to me. I mean I get that maybe NCBI erroneously has different taxon_id s for different segments that should be grouped, but IMO then we should come up with a consensus taxon ID as part of ingest.

(Actually is there any case in our current set of organisms where we expect the taxon ID to [meaningfully/usefully] vary within an organism, irrespective of segments?)

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Jul 23, 2024

Great stuff, nice PR message, makes it easy to review!

A few non-blocking comments:

do not show alignment states if length is 0, otherwise this gets ugly:

We might actually want to state explicitly "segment not included" rather than length: 0 - or even not list the segment at all? This would reduce white space a lot. But probably not worth the complexity at this point.

allow option to group segment specific metadata under the same header (oneHeader argument in values.yaml) to reduce busy-ness of page.

I'm not sure I understand what this does. Do you mean grouping of segment specific metadata of different segments under one header, so NCBI_accession_S, NCBI_accession_M, NCBI_accession_L could all be under the same header?

Great to have more per_segment fields and allow them grouped under same header! I think I got it from context now later on in your post.

This is nice:

image

ingest/Snakefile Outdated Show resolved Hide resolved
@anna-parker
Copy link
Contributor Author

taxon_id is not always shared among all segments: as this is an integer we cannot just concatenate the different fields when they differ. Instead we need to set taxon_id as perSegment and oneHeader.

This doesn't make sense to me. I mean I get that maybe NCBI erroneously has different taxon_id s for different segments that should be grouped, but IMO then we should come up with a consensus taxon ID as part of ingest.

(Actually is there any case in our current set of organisms where we expect the taxon ID to [meaningfully/usefully] vary within an organism, irrespective of segments?)

This is apparently an influenza "feature" (or bug depending how you look at it)- different clades of different segments would sometimes have different specialized taxon_ids - so in influenza it should be a perSegment field however for CCHF and (I think?) most other viruses this should be the same for all segments... But yes maybe not good to have this as perSegment in our default configs then... I can remove it here and try to overwrite that field for influenza configs

@corneliusroemer
Copy link
Contributor

This is apparently an influenza "feature" (or bug depending how you look at it)- different clades of different segments would sometimes have different specialized taxon_ids - so in influenza it should be a perSegment field however for CCHF and (I think?) most other viruses this should be the same for all segments... But yes maybe not good to have this as perSegment in our default configs then... I can remove it here and try to overwrite that field for influenza configs

I couldn't find an example, at least in CCHF of where we currently group segments with different virus tax ids.

Do you have concrete evidence for influenza where segments of the same virus have different taxon_ids? I might have been wrong when I suggested this and you might have taken my word for it. I'm not confident right now that this can indeed happen.

For CCHF, there are a few non-majority tax ids, but so far they are all only on a single segment:

❯ tsv-filter -H ~/Downloads/details\ \(18\).tsv --not-empty ncbi_virus_tax_id_S --str-not-in-fld ncbi_virus_tax_id_S:"3052518" | tsv-select -H -f ncbi_virus_tax_id_L,ncbi_virus_tax_id_M,ncbi_virus_tax_id_S | tsv-pretty 
ncbi_virus_tax_id_L  ncbi_virus_tax_id_M  ncbi_virus_tax_id_S
                                                       170517
                                                       170517
                                                       170517
                                                       170517
                                                       402373
                                                       402369
                                                       402370
                                                       402371
                                                       402372
                                                       402052

@anna-parker
Copy link
Contributor Author

This is apparently an influenza "feature" (or bug depending how you look at it)- different clades of different segments would sometimes have different specialized taxon_ids - so in influenza it should be a perSegment field however for CCHF and (I think?) most other viruses this should be the same for all segments... But yes maybe not good to have this as perSegment in our default configs then... I can remove it here and try to overwrite that field for influenza configs

I couldn't find an example, at least in CCHF of where we currently group segments with different virus tax ids.

Do you have concrete evidence for influenza where segments of the same virus have different taxon_ids? I might have been wrong when I suggested this and you might have taken my word for it. I'm not confident right now that this can indeed happen.

For CCHF, there are a few non-majority tax ids, but so far they are all only on a single segment:

❯ tsv-filter -H ~/Downloads/details\ \(18\).tsv --not-empty ncbi_virus_tax_id_S --str-not-in-fld ncbi_virus_tax_id_S:"3052518" | tsv-select -H -f ncbi_virus_tax_id_L,ncbi_virus_tax_id_M,ncbi_virus_tax_id_S | tsv-pretty 
ncbi_virus_tax_id_L  ncbi_virus_tax_id_M  ncbi_virus_tax_id_S
                                                       170517
                                                       170517
                                                       170517
                                                       170517
                                                       402373
                                                       402369
                                                       402370
                                                       402371
                                                       402372
                                                       402052

I definitely saw when I was running locally that the main reason for not grouping was different taxon id and that there were a lot of different taxon Ids - I will try to find an example.

@corneliusroemer
Copy link
Contributor

Woops, E2E fails.

Here's my promised PR to use tsv-utils again, I recommend reading the docs, it's a neat tool!

#2329

@anna-parker
Copy link
Contributor Author

As this PR now no longer changes how we handle taxon_id I will merge and we can discuss further on slack if setting this to perSegment is actually required in cases such as influenza

@anna-parker
Copy link
Contributor Author

I was going to merge but realized that cchf is duplicated on the preview - this might be our odd edge -case but I will relaunch and check if this happens again in an hour or so.

@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd review please PR waiting for final review labels Jul 23, 2024
@corneliusroemer
Copy link
Contributor

corneliusroemer commented Jul 23, 2024 via email

@anna-parker
Copy link
Contributor Author

duplication happened again with ebola - but as cchf reingest is working ok I think this is a side effect of: #1777 and will merge this branch

@anna-parker anna-parker merged commit 52852d9 into main Jul 23, 2024
13 checks passed
@anna-parker anna-parker deleted the ingest_fixes branch July 23, 2024 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants