docs: add caveats about some schema models #2433
base: main
I am happy with the note regarding the confusing UI elements, but I don't think that we should warn about the need to develop a pipeline. If someone wants to use UShER to call pango lineages or Dengue-GLUE to call dengue lineages, or just wants to do something with metadata that the Nextclade pipeline does not support, they also have to develop a new pipeline. Instead of warning people, I would actually want to encourage people to build new pipelines.
I think this comes in part from my experience working with someone to set up a Loculus instance for another organism. It was quite a hard, involved process, with a lot of troubleshooting, despite a Nextclade organism already existing for their organism and them fitting into our default model! I think anyone who wants to use Loculus should probably start off by setting up an instance with an off-the-shelf pipeline (as soon as we can, we should provide an off-the-shelf pipeline for the no-alignment case, which could be what they would use). Once they are confident about how to do that, they can move on to writing a pipeline. Trying to write a pipeline as part of deploying one's very first instance would be a really steep learning curve: you wouldn't know if problems were due to the general configuration or the pipeline. I think we need to communicate that one of these things is a lot harder than the other.
And essentially, some people may want to write their own custom pipelines, and some people may want to use a Nextclade-based pipeline, and so people should know what their schema choice will mean about which of those options are available to them, IMO.
I think for training/learning about Loculus, we should create a tutorial where one would set up a new instance for a specific pathogen and use an existing pipeline (and more tutorials on how to get more advanced and write new pipelines). For a real instance, however, it depends on the actual use case and requirements whether an existing pipeline is sufficient or not.
There are many factors that determine whether our Nextclade pipeline works or not, so I don't think that this should be specifically added to the schema page. What about creating a new page with a list of reusable, well-maintained preprocessing pipelines that will initially consist only of our Nextclade pipeline but can later easily be extended with other pipelines from the community? We will add a short description to each pipeline about the features it supports. For an instance, you can then check the list and determine whether an existing pipeline is likely to work or not. For the Nextclade pipeline, we can explicitly say that it only supports the first example schema model at this moment. In other words, I think that it is a "limitation" of the Nextclade pipeline (limitation in quotes as we may simply consider it to be out of scope, and that's entirely fine) that it does not support certain schema models, and not a limitation of the schema models or core Loculus.
I like adding this warning for the multiple references for an organism. Maybe we could say we are happy to help if people have issues, if we want to encourage use, but I also agree we should let people know, because writing a preprocessing pipeline for Loculus from scratch at the moment requires a lot of knowledge about the structure of the expected input and output for the backend and website, and is a lot of work.
I think we might be able to remove the warning for "No alignments" though, and just say they can use the dummy prepro pipeline, which essentially does nothing (unless we think this is also not tested enough, in which case I'm also fine with a warning).
I know that writing a pipeline takes effort, and setting Loculus up in general takes effort, but these are not intrinsic to the schema models, so I think that a warning in this document is inappropriate. It makes much more sense to me to mention it in a document about pipelines.
Very much agreed!
Agreed, although I think the Nextclade pipelines will be widely applicable, at least as first versions.
This sounds good!
There is currently a property of the schema that some schemas are supported by already-written pipelines and others are not. This would definitely feed into my decision as to what schema to pick, and I think that's a totally reasonable consideration. (Yes, sometimes I wouldn't have a choice about what schema to pick, but other times I would.) It would also give me realistic expectations, e.g. if I'd just followed the tutorial where I used a pre-written pipeline and everything was easy. Currently the "Getting started" docs say "the first thing to do is to pick a schema", with a link to this page. When people are making that decision, I think this is relevant info. I will hold off merging this for the moment, and maybe we can chat it through in the next meeting and get more feedback.
I think my general take is: it's good to ensure people have a good understanding of where they would have more or less support at any decision points in setting up Loculus. I think pre-processing is an important part of this. I don't think that information needs to be negative, just factual. I am not even sure if we need 'Note' boxes, just something (maybe just text) at the top or bottom that says (perhaps for each of the examples if we want to be 'equal'):
(I'm on the fence about including the UI bit; could take that out.) I think it would also be appropriate to have something in the description of the Nextclade preprocessing pipeline that highlights that while "existing datasets exist for some pathogens, not all have them, so you might have to develop your own (Link to Nextclade list or something)". I couldn't tell if we already have this or not. I just think this may help people gauge how much and what kind of work it would be to set up a specific type of instance, so they can figure out a balance between the resources they have and what they want.
I just opened a PR with an alternative suggestion: #2450
de01fe9 to 8cf67df
We've been having some discussions in #2409 about descriptions of alternative schemas. I think my main concern isn't that we mention them but that we don't provide enough context to users about how tested they each are and how well they are expected to work. This tries to address that.