Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add opt-in rich fasta headers with metadata #3448

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Contributor

@corneliusroemer corneliusroemer commented Dec 16, 2024

resolves #2284

preview URL: https://fasta-header.loculus.org

Summary

Allow unaligned sequences to be downloaded with a rich FASTA header (configured at organism level for simplicity), e.g. displayName instead of accessionVersion.

This requires creating a new astro API endpoint, for now /[organism]/api/sequences because LAPIS does not want to implement a dynamic FASTA header feature. This endpoint fetches metadata and sequences separately and joins them before responding with the result.

To prevent excessive memory consumption, there's a hardcoded limit of 5k sequences that can be accessed via this endpoint (enforced via an initial request to /aggregated).

Compression is not supported for this endpoint but this is currently not worth the effort (rich header is for ad hoc analysis/viewing not for download of full data).

Screenshot

2024-12-16 23 10 13

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@corneliusroemer corneliusroemer added the preview Triggers a deployment to argocd label Dec 16, 2024
@corneliusroemer corneliusroemer marked this pull request as ready for review December 17, 2024 14:26
@theosanderson
Copy link
Member

Thanks for looking into this. I just tried it in the preview at https://fasta-header.loculus.org/ebola-sudan/search? , and it didn't seem to yield a difference?

I don't know if you saw, but I hacked something similar together at https://github.com/theosanderson/metadata-sequence-combiner to allow #3439 .

I'm a bit uncertain about whether we should add this to Loculus when it only support <5000 sequences. I do think it would be possible to stream from LAPIS from metadata and sequences and combine them on the fly in a stream, scaling to an unlimited number of sequences. I think my vote might be to keep this out of Loculus until we are able to do that, and rely on hacks on Pathoplexus to allow Nextclade and other integrations - which I think are the main use case atm?

Multi segmented

Refactor

Guard against excessive request size to prevent memory spikes

Refactor

Aadd UI for rich headers

Make configurable

f

proper base name

rich headers don't support compression
@corneliusroemer
Copy link
Contributor Author

I hadn't drilled through the new value into the website config (locally I'd just edited the config json directly), should be fixed now.

I'm a bit uncertain about whether we should add this to Loculus when it only support <5000 sequences. I do think it would be possible to stream from LAPIS from metadata and sequences and combine them on the fly in a stream, scaling to an unlimited number of sequences. I think my vote might be to keep this out of Loculus until we are able to do that, and rely on hacks on Pathoplexus to allow Nextclade and other integrations - which I think are the main use case atm?

I think I disagree on the first part (not doing it unless streaming) because imperfect but working is better than not having it all and if there's enough demand one can add streaming later. I see it like that: Right now one can download only 0 sequences with rich header, now one can do 5k, clear improvement, 90+% of total benefit achieved.

I do really like the tools UI/UX you've come up with, very smooth. I'm just a bit worried about deviating from our monorepo pattern for something that can be considered a core feature. But nothing stops us from going down the tools route and then insourcing it later, so happy either wray.

@rneher
Copy link

rneher commented Jan 19, 2025

I do think this is useful. Ideally we would allow custom composition of the header. This could be a little button that opens dialog where one can pick separator and metadata field selection/order.

@theosanderson
Copy link
Member

Removing preview for now due to being out of previews

@theosanderson theosanderson removed the preview Triggers a deployment to argocd label Feb 10, 2025
<RadioOptionBlock
name='richFastaHeaders'
title='FASTA header style'
options={[{ label: <>Accession only</> }, { label: <>Accession and metadata</> }]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe make the option Accession or Display Name (as we do label the display name as display name on the sequence view pages and which metadata fields are added is ambiguous)

@theosanderson
Copy link
Member

I updated my version of this endpoint with a new version that streams the data and so should scale indefinitely (as long as we let the Astro endpoint have as long as it needs). Here is a URL that demonstrates how it is used.

So we could adapt my approach to mash with what Cornelius has done here if we wanted to implement this?

@anna-parker
Copy link
Contributor

anna-parker commented Feb 12, 2025

I updated my version of this endpoint with a new version that streams the data and so should scale indefinitely (as long as we let the Astro endpoint have as long as it needs). Here is a URL that demonstrates how it is used.

So we could adapt my approach to mash with what Cornelius has done here if we wanted to implement this?

Omg this looks amazing Theo!! I would vote for adding this into Loculus if that is possible as I do think this is quite useful (wasn't 100% sure if your suggestion was to update this PR to point to your endpoint or to implement your endpoint here)

@emmahodcroft
Copy link
Member

Technically I can't comment much but having a version that scales seems like a nice idea as we won't have to come whack this mole again later. I assume that for anything streaming the compression is not really possible so this is a clear trade off, but one that's ok.
It seems like this is likely largely aligned now, at least in a minimal version, and we can add customization to the metadata header later?

@fhennig fhennig self-assigned this Feb 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants