-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ena-submission): Create insdc submission service #2186
Conversation
8ac82cb
to
6ace49f
Compare
a6f600c
to
0bf88ad
Compare
The ena-submission pods are stuck in |
(just adding it back to take a look) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have another pass over things tomorrow –for now just to say that in addition to my nitpicks this is really exciting and looks great
One additional thought: the get-released-data endpoint returns the entire alignment as well as metadata fields. This is a lot of data to process. As the ena-submission-pod anyways has database access - do think it would be better if the ena-submission-pod reads the main view of the loculus DB and performs these queries there? I am not sure if this would be bad design for a microservice but I could see it being a big performance increase. Tagging @chaoran-chen for this |
Summary of offline discussion: Querying the DB directly is indeed discouraged for microservices. Instead we can use the sample/details endpoint in LAPIS (https://lapis-main.loculus.org/ebola-sudan) to get all specified metadata fields of sequences which match given features: e.g.
|
I don't like the idea of going via LAPIS, it adds a whole bunch of intermediary steps, latency, potential of corruption/failure etc. Regarding potential inefficiency of calling
Why avoid LAPIS?
|
Just discussed this with @corneliusroemer offline. The backend is our source of truth wrt state of sequences and release state. Querying LAPIS instead of the backend for the state of sequences adds an additional layer of complexity and possible issues. If get-released-data does in fact turn into a bottleneck we can optimize the code then - adding a filtering option to the get-released-data endpoint might be a better choice than querying LAPIS. For now I think it makes most sense for me to update the code to filter out sequences from the get-released-data response (and check for "dataUseTerms": "OPEN" as I missed this). I will create a follow up issue to look into optimizing this code. @chaoran-chen please let me know if you have any issues with this - we can also discuss more tomorrow as this seems like an important design issue :-) |
Okay, sounds good to me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this
resolves #2264
preview URL: https://ena-submission-pod.loculus.org/
Summary
This is the first step in creating an INSDC submission service. This pod has the authorization of
external_metadata_updater
.At pod initialization:
ena-submission
inside the loculus database. This schema will include 4 tables: 1 with general state of submission to ENA, and then one for each stage of the submission process: project, sample and assembly creation.After initialization is complete the pod runs
snakemake get_ena_submission_list
: For the moment this just calls the get-released-data endpoint for data in state APPROVED_FOR_RELEASE. The data is then filtered:This PR also contains the rule
submit_external_metadata
:Screenshot
PR Checklist
For future PRs: (these should also be performed in a loop to check for changing state)
project_tables
sample_tables
.assembly_tables
.