-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: describe ways to design the schema #2409
Conversation
f7d5ae5
to
1bf1d2b
Compare
1bf1d2b
to
a369c82
Compare
@theosanderson, @anna-parker, thanks for your reviews! I adapted the text, could you please check whether this is better? |
|
||
### Multiple references for an organism | ||
|
||
An organism has one unaligned sequence (per segment) but multiple aligned ones. Users submit an unaligned nucleotide sequence and the processing pipeline aligns it against all multiple references. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe still add this would require a custom prepro pipeline - like this it sounds like prepro could do this now and it can't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have low confidence that this would work at all atm even with a custom pipeline (I could imagine that it might but I'm just not remotely confident, given we haven't tried it and given that often when one first tries something one finds a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a general note that the development of a preprocessing pipeline might be needed. I didn't make it model-specific because I would consider the need to develop a preprocessing pipeline as very "normal". We provide one opinionated pipeline with a few features but in many cases, I imagine that someone would want to develop a new pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give an example of what I mean: presumably the custom-built preprocessing pipeline would take as input e.g. the L
segment (which would be the unaligned sequence), and would have the name L
and would output two alignments LalignedToSequenceX
and LalignedToSequenceY
(not necessarily those names, but some name like them). But it looks like our code for displaying aligned sequences iterates over the same nucleotideSegmentNames
for both aligned and unaligned sequences:
{nucleotideSegmentNames.map((segmentName) => ( |
IMO we can't suggest this model while that's the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What one can do is to have all three segments: L
, LalignedToSequenceX
and LalignedToSequenceY
. Users only upload sequences to the L
segment whereas the pipeline will generate aligned sequences for the other two. The UI at this moment wouldn't be very nice but once we fix #2376, it should be mostly alright. (Still not perfect as in the download panel, one would be able to select "Aligned L" and "Unaligned LalignedToSequenceX" but I wouldn't consider it as a blocker.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting idea. But if we do that (if I'm understanding correctly), then our UI will show up like this for submission:
and fairly soon I think we would like to add submission through a form as an option, and that will then have a box for people to paste their LalignedtoX
sequence. Actually potentially our edit form might already show this? (not sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not ideal but whether this is acceptable depends a lot on the use case in my opinion. For a public-facing instance with many external users, this is probably problematic. For a lab-internal instance or if someone would like to build a dashboard or run analysis on top of LAPIS and use Loculus just for storing and preprocessing the data (as we do in GenSpectrum), this might be entirely fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! I think my issue is that we don't really communicate to the user that this model is a hack that may work, which we haven't tested, whereas the "default" model we have tried a lot. (We now have a caveat that we haven't tried all the models, but we don't say which). I expect in the future we will actually design features around the ability to do this, at which point the configuration would distinguish between the input sequences and the aligned sequences, so we wouldn't have these issues. I made #2433 to try to make clear to users that we're not in that place yet, if that makes sense
Co-authored-by: Theo Sanderson <[email protected]>
Thanks for the feedback, @anna-parker and @theosanderson! I'll merge this now but I agree that it's not perfect and am happy to continue improving it. |
I reused a Slack message I sent a while ago and added a page to the docs to describe different ways to model the schema of a Loculus instance. Very happy to receive suggestions for improvement!