Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: describe ways to design the schema #2409

Merged
merged 4 commits into from
Aug 14, 2024
Merged

Conversation

chaoran-chen
Copy link
Member

I reused a Slack message I sent a while ago and added a page to the docs to describe different ways to model the schema of a Loculus instance. Very happy to receive suggestions for improvement!

@chaoran-chen
Copy link
Member Author

@theosanderson, @anna-parker, thanks for your reviews! I adapted the text, could you please check whether this is better?


### Multiple references for an organism

An organism has one unaligned sequence (per segment) but multiple aligned ones. Users submit an unaligned nucleotide sequence and the processing pipeline aligns it against all multiple references.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe still add this would require a custom prepro pipeline - like this it sounds like prepro could do this now and it can't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have low confidence that this would work at all atm even with a custom pipeline (I could imagine that it might but I'm just not remotely confident, given we haven't tried it and given that often when one first tries something one finds a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a general note that the development of a preprocessing pipeline might be needed. I didn't make it model-specific because I would consider the need to develop a preprocessing pipeline as very "normal". We provide one opinionated pipeline with a few features but in many cases, I imagine that someone would want to develop a new pipeline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give an example of what I mean: presumably the custom-built preprocessing pipeline would take as input e.g. the L segment (which would be the unaligned sequence), and would have the name L and would output two alignments LalignedToSequenceX and LalignedToSequenceY (not necessarily those names, but some name like them). But it looks like our code for displaying aligned sequences iterates over the same nucleotideSegmentNames for both aligned and unaligned sequences:

{nucleotideSegmentNames.map((segmentName) => (

IMO we can't suggest this model while that's the case

Copy link
Member Author

@chaoran-chen chaoran-chen Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What one can do is to have all three segments: L, LalignedToSequenceX and LalignedToSequenceY. Users only upload sequences to the L segment whereas the pipeline will generate aligned sequences for the other two. The UI at this moment wouldn't be very nice but once we fix #2376, it should be mostly alright. (Still not perfect as in the download panel, one would be able to select "Aligned L" and "Unaligned LalignedToSequenceX" but I wouldn't consider it as a blocker.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting idea. But if we do that (if I'm understanding correctly), then our UI will show up like this for submission:
image

and fairly soon I think we would like to add submission through a form as an option, and that will then have a box for people to paste their LalignedtoX sequence. Actually potentially our edit form might already show this? (not sure)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not ideal but whether this is acceptable depends a lot on the use case in my opinion. For a public-facing instance with many external users, this is probably problematic. For a lab-internal instance or if someone would like to build a dashboard or run analysis on top of LAPIS and use Loculus just for storing and preprocessing the data (as we do in GenSpectrum), this might be entirely fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! I think my issue is that we don't really communicate to the user that this model is a hack that may work, which we haven't tested, whereas the "default" model we have tried a lot. (We now have a caveat that we haven't tried all the models, but we don't say which). I expect in the future we will actually design features around the ability to do this, at which point the configuration would distinguish between the input sequences and the aligned sequences, so we wouldn't have these issues. I made #2433 to try to make clear to users that we're not in that place yet, if that makes sense

@chaoran-chen
Copy link
Member Author

Thanks for the feedback, @anna-parker and @theosanderson! I'll merge this now but I agree that it's not perfect and am happy to continue improving it.

@chaoran-chen chaoran-chen merged commit cec2fa9 into main Aug 14, 2024
10 checks passed
@chaoran-chen chaoran-chen deleted the docs-organism-modeling branch August 14, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants