docs: add caveats about some schema models #2433

Open
wants to merge 3 commits into main

Conversation

theosanderson
Member

@theosanderson theosanderson commented Aug 14, 2024

We've been having some discussions in #2409 about descriptions of alternative schemas. I think my main concern isn't that we mention them but that we don't provide enough context to users about how tested they each are and how well they are expected to work. This tries to address that.

image

@theosanderson
Member Author

theosanderson commented Aug 14, 2024

Old version with warnings that we are no longer using:

image

@chaoran-chen
Member

I am happy with the note regarding the confusing UI elements, but I don't think that we should warn about the need to develop a pipeline. If someone wants to use UShER to call pango lineages or Dengue-GLUE to call dengue lineages, or just wants to do something with metadata that the Nextclade pipeline does not support, they also have to develop a new pipeline. Instead of warning people, I would actually want to encourage people to build new pipelines.

@theosanderson
Member Author

I think this comes in part from my experience working with someone to set up a Loculus instance for another organism. It was quite a hard, involved process, with a lot of troubleshooting, despite a Nextclade dataset already existing for their organism and them fitting into our default model! I think anyone who wants to use Loculus should probably start off by setting up an instance with an off-the-shelf pipeline (as soon as we can, we should provide an off-the-shelf pipeline for the no-alignment case, which could be what they would use). Once they are confident about how to do that, they can move on to writing a pipeline. Trying to write a pipeline as part of deploying one's very first instance would be a really, really steep learning curve - you wouldn't know if problems were due to the general configuration or the pipeline. I think we need to communicate that one of these things is a lot harder than the other.

@theosanderson
Member Author

And essentially, some people may want to write their own custom pipelines, and some people may want to use a Nextclade-based pipeline, and so people should know what their schema choice will mean about which of those options are available to them, IMO

@chaoran-chen
Member

chaoran-chen commented Aug 15, 2024

I think anyone who wants to use Loculus should probably start off by setting up an instance with an off-the-shelf pipeline

I think for training/learning about Loculus, we should create a tutorial where one would set up a new instance for a specific pathogen and use an existing pipeline (and more tutorials on how to get more advanced and write new pipelines). For a real instance, however, it depends on the actual use case and requirements whether an existing pipeline is sufficient or not.

And essentially, some people may want to write their own custom pipelines, and some people may want to use a Nextclade-based pipeline, and so people should know what their schema choice will mean about which of those options are available to them, IMO

There are many factors that determine whether our Nextclade pipeline works or not, so I don't think that this should be specifically added to the schema page. What about creating a new page with a list of reusable/well-maintained preprocessing pipelines that will initially only consist of our Nextclade pipeline but can later easily be extended with other pipelines from the community? We will add a short description to each pipeline about the features it supports. For an instance, you can then check the list and determine whether an existing pipeline is likely to work or not. For the Nextclade pipeline, we can explicitly say that it only supports the first example schema model at this moment.

In other words, I think that it is a "limitation" of the Nextclade pipeline (limitation in quotes as we may simply consider it to be out-of-scope and that's entirely fine) that it does not support certain schema models, and not a limitation of the schema models/core Loculus.

Contributor

@anna-parker anna-parker left a comment

I like adding this warning for the case of multiple references for an organism. If we want to encourage use, maybe we could say we are happy to help if people have issues, but I also agree we should let people know, because writing a preprocessing pipeline for Loculus from scratch currently requires a lot of knowledge about the structure of the expected input and output for the backend and website, and is a lot of work.

I think we might be able to remove the warning for No alignments though, and just say they can use the dummy prepro pipeline, which essentially does nothing (unless we think this is also not tested enough, in which case I'm also fine with a warning).

@chaoran-chen
Member

I know that writing a pipeline takes effort, and setting Loculus up in general takes effort, but these are not intrinsic to the schema models, so I think that a warning in this document is inappropriate. It makes much more sense to me to mention it in a document about pipelines.

@theosanderson
Member Author

theosanderson commented Aug 15, 2024

I think for training/learning about Loculus, we should create a tutorial where one would set up a new instance for a specific pathogen and use an existing pipeline (and more tutorials on how to get more advanced and write new pipelines).

Very much agreed!

For a real instance, however, it depends on the actual use case and requirements whether an existing pipeline is sufficient or not.

Agreed, although I think the Nextclade pipelines will be widely applicable, at least as first versions.

What about creating a new page with a list of reusable/well-maintained preprocessing pipelines that will initially only consist of our Nextclade pipeline but can later easily be extended with other pipelines from the community?

This sounds good!

I know that writing a pipeline takes effort, and setting Loculus up in general takes effort, but these are not intrinsic to the schema models, so I think that a warning in this document is inappropriate.

It is currently a property of the schemas that some are supported by already-written pipelines and others are not. This would definitely feed into my decision as to what schema to pick, and I think that's a totally reasonable consideration. (Yes, sometimes I wouldn't have a choice about what schema to pick, but other times I would.) It would also give me realistic expectations - e.g. if I'd just followed the tutorial where I used a pre-written pipeline and everything was easy. Currently the "Getting started" docs say "the first thing to do is to pick a schema", with a link to this page. When people are making that decision, I think this is relevant info.

I will hold off merging this for the moment and maybe we can chat through in the next meeting and get more feedback.

@emmahodcroft
Member

I think my general take is:

It's good to ensure people have a good understanding of where they would have more or less support at any decision points in setting up Loculus. I think pre-processing is an important part of this.

I don't think that information needs to be negative, just factual. I am not even sure if we need 'Note' boxes - just something (maybe just text) at the top or bottom that says (perhaps for each of the examples if we want to be 'equal'):

  • Examples of this approach exist using the Nextclade preprocessing pipeline and user interface, and this may be usable for other similar set-ups.
  • This approach has not yet been specifically implemented with an existing preprocessing pipeline and/or user interface.

(I'm on the fence about including the UI bit, could take that out)

I think it would also be appropriate to have something in the description of the Nextclade preprocessing pipeline that highlights that while "existing datasets exist for some pathogens, not all have them, so you might have to develop your own (Link to Nextclade list or something)". I couldn't tell if we already have this or not.

I just think this may help people gauge how much and what kind of work it would be to set up a specific type of instance, so they can figure out a balance between the resources they have and what they want.

@chaoran-chen
Member

I just opened a PR for an alternative suggestion: #2450

@theosanderson force-pushed the docs--add-caveats-about-some-schema-models branch from de01fe9 to 8cf67df on August 19, 2024 14:34
@theosanderson added and removed the preview label ("Triggers a deployment to argocd") on Aug 19, 2024
@corneliusroemer added and removed the preview label ("Triggers a deployment to argocd") on Sep 16, 2024
5 participants