How notebooks will work in production #108

Closed
rdvelazquez opened this issue Jul 13, 2017 · 12 comments

@rdvelazquez
Member

@wisygig, @dcgoss and/or @dhimmel, what are your thoughts on how the Jupyter Notebook part of the application will work? I'm specifically interested in:

  • Where and how will the notebook be hosted/executed?
  • How will the back end interface with the notebooks?
  • How will the specific information from the user's query (genes and diseases) be inputted/updated in the notebook?
  • How will the specifics of the classifier be selected (what list of parameters to include in cross-validation, whether to include or exclude covariates, and potentially, in the future, which classifier pipeline to use)? We have discussed automatically selecting some parameters (n_components) based on the query (Selecting the number of components returned by PCA #106), and we have also discussed letting the user select some parameters (l1_ratio) based on their preference (#106). See the sketch after this list.
  • What are we thinking for the MVP... one notebook template to cover all queries or a number of different notebook templates for different situations?
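
For concreteness on the parameter question above, here is a rough sketch of the kind of knobs involved, written as a scikit-learn pipeline plus a cross-validation grid. The specific values, and even the exact estimators, are illustrative rather than what 2.mutation-classifier.ipynb actually does.

```python
# Illustrative sketch only: the kinds of parameters under discussion, expressed
# as a scikit-learn pipeline and cross-validation grid. Values are placeholders.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("pca", PCA()),  # n_components could be chosen automatically from the query (#106)
    ("classify", SGDClassifier(loss="log", penalty="elasticnet", random_state=0)),
])  # the loss is named "log_loss" in newer scikit-learn releases

param_grid = {
    "pca__n_components": [30, 50, 100],      # or set from the number of samples
    "classify__l1_ratio": [0.0, 0.15, 1.0],  # or exposed as a user preference
    "classify__alpha": [1e-3, 1e-2, 1e-1],
}

cv_pipeline = GridSearchCV(pipeline, param_grid, scoring="roc_auc")
```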

I know this may be getting ahead of ourselves so feel free to defer this until later, but I thought I'd at least mention that these topics are starting to come up. This issue spans a few different repos but I thought the machine-learning repo might be the best place for it... I'll also tag #63 from cognoma/cognoma.

@dcgoss
Member

dcgoss commented Jul 13, 2017

Hey Ryan!
We are actually much further along than you think. Here is how the notebooks run:

  1. Frontend sends request to core-service (https://github.com/cognoma/core-service) to create a classifier.
  2. Classifier object is created, containing the genes and diseases selected by the user, as well as various system-generated metadata related to queueing (priority, completed_at, failed_at, etc).
  3. https://github.com/cognoma/ml-workers provides the container which will actually run the notebooks. Definitely check out this repo if you're interested. This container runs as a sort of daemon - it polls the /classifiers/queue endpoint of core-service and pulls any open processing tasks off that queue one by one (see the sketch after this list).
  4. Once an ml-worker pulls a classifier off the queue, it executes the notebook. The worker sets an environment variable for the genes and diseases selected, and the notebook reads those environment variables. I have made some changes to 2.mutation-classifier.ipynb for this, including slicing the data so that only samples with the correct diseases are examined and so that expressions across multiple gene types can be checked for.
  5. Once the notebook finishes processing, ml-worker writes the output and then uploads the completed notebook file to the /classifiers/id/upload endpoint of core-service.
  6. core-service stores this file in AWS S3 and emails the user using AWS SES a link to their completed notebook.
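
For anyone skimming, here is a rough sketch of what steps 3-5 look like from the worker's side. The endpoint paths come from the list above; the queue payload fields, environment-variable names, and nbconvert invocation are assumptions for illustration, not the actual ml-workers code.

```python
# Hypothetical sketch of the ml-worker loop in steps 3-5 above. Endpoint paths
# come from that list; the queue payload fields, environment-variable names,
# and nbconvert invocation are illustrative assumptions.
import json
import os
import subprocess
import time

import requests

CORE_SERVICE = "https://core-service.example"  # placeholder base URL

while True:
    task = requests.get(f"{CORE_SERVICE}/classifiers/queue").json()
    if not task:
        time.sleep(30)  # nothing queued; poll again later
        continue

    # Step 4: pass the user's query to the notebook via environment variables
    env = dict(os.environ)
    env["GENE_IDS"] = json.dumps(task["genes"])            # assumed names
    env["DISEASE_ACRONYMS"] = json.dumps(task["diseases"])

    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", "output.ipynb", "2.mutation-classifier.ipynb"],
        env=env, check=True,
    )

    # Step 5: upload the completed notebook back to core-service
    with open("output.ipynb", "rb") as fp:
        requests.post(
            f"{CORE_SERVICE}/classifiers/{task['id']}/upload",
            files={"notebook": fp},
        )
```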

All of this takes place in our AWS ECS cluster. core-service is currently deployed; ml-workers is not yet. Looking to deploy ml-workers either today or tomorrow.

Let me know if you have any questions!

@rdvelazquez
Member Author

Thanks for the update Derek! I'm glad I asked ;) This is useful info. I'll take a look at ml-workers in more detail later to understand how it works and may get back to you with some questions. I'm loving the progress!

@wisygig

wisygig commented Jul 19, 2017

@rdvelazquez , @dcgoss

How will the specific information from the user's query (genes and diseases) be inputted/updated in the notebook?

I've got a bit of code for creating templates from jupyter notebook files, as well as filling in certain keywords. I'll add a pull request to https://github.com/cognoma/ml-workers once I've finished a couple of use examples.
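
The general idea is that a .ipynb file is just JSON, so keyword filling can happen before execution. A minimal sketch of that idea (the {{KEY}} placeholder syntax and function names are illustrative, not necessarily what the PR will use):

```python
# Minimal sketch of the templating idea: a .ipynb file is JSON, so keywords in
# code cells can be substituted before execution. The {{KEY}} placeholder syntax
# is an assumption, not necessarily what the actual PR uses.
import json

def fill_notebook_template(template_path, output_path, substitutions):
    """Copy a notebook, replacing {{KEY}} placeholders in code cells with repr(value)."""
    with open(template_path) as fp:
        notebook = json.load(fp)

    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        source = "".join(cell["source"])
        for key, value in substitutions.items():
            source = source.replace("{{" + key + "}}", repr(value))
        cell["source"] = source.splitlines(keepends=True)

    with open(output_path, "w") as fp:
        json.dump(notebook, fp, indent=1)

# Example: hard-code the user's query into a copy of the template notebook
fill_notebook_template(
    "2.mutation-classifier.ipynb", "classifier-query.ipynb",
    {"GENE_IDS": [7157], "DISEASE_ACRONYMS": ["GBM", "LGG"]},
)
```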

@rdvelazquez
Member Author

That sounds good @wisygig. I'm looking forward to seeing how that works.

@dcgoss I haven't gotten a chance to look at ml-workers enough to really understand it, but that was actually one of my questions/comments. It looks like you are inputting the query-specific information with an environment variable, but this may break the portability of the notebook. I've only briefly looked at the code so just ignore that comment if it doesn't make sense ;)

@dcgoss
Member

dcgoss commented Jul 19, 2017

@rdvelazquez That's correct, at the moment the notebook just uses environment variables, so while you can print out those values at runtime, they aren't actually hardcoded into the notebook itself. Looking forward to your PR @wisygig :)
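
For reference, the notebook side of this is just a cell along these lines (the variable names are assumptions):

```python
# Hypothetical notebook cell that reads the worker-provided environment
# variables; the variable names are assumptions.
import json
import os

gene_ids = json.loads(os.environ["GENE_IDS"])
disease_acronyms = json.loads(os.environ["DISEASE_ACRONYMS"])
# The values show up in the executed output, but are not hardcoded in the template
print(gene_ids, disease_acronyms)
```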

@wisygig

wisygig commented Jul 26, 2017

@dcgoss I've added the PR: cognoma/ml-workers#13
@rdvelazquez: I hope the example is useful. Let me know if there's anything unclear.

@rdvelazquez
Member Author

This looks pretty cool! I skimmed through the code and I'm going to try to get it to work on my computer with my own example. I'll let you know if/when I have issues. And yes, the example (and README) are very useful.

@rdvelazquez
Member Author

As cognoma.org is currently set up, there seems to be no option to select a specific classifier/notebook. Are we thinking this will be the way cognoma works in production (i.e. only one notebook to handle all queries)?

@dhimmel you would know better than me, but I think having the option to have multiple notebooks that the user could choose from would be ideal. This could potentially be built in an open-ended way: if, in the future, someone comes up with an interesting analysis, they could submit a pull request to have their notebook added to the list of options, and the cognoma.org interface would then let anyone run a similar analysis with different gene/disease combinations using that notebook template. The cognoma.org interface seems like too nice a tool to limit to one specific notebook.

This is also of interest for how we handle certain choices (for lack of a better word) in the classifier implementation (number of PCA components, l1_ratio, test/train split size, etc.). I know some of these things could just be changed by hand in the notebook by the user but I wonder if that is the ideal solution, specifically if we want cognoma to be used by people with limited data science experience.

A similar and related topic is how, if at all, cognoma will handle queries that are not well suited to the analysis. For example, if a user selects a gene that has very few (or no) mutated samples for the selected diseases, will cognoma.org run the notebook anyway? Raise an error on the webpage? A warning? Should the notebook raise errors/warnings?

My recommendations would be:

  1. If building the options for selecting from a list of notebooks is easy, implement that.
  2. If it is much easier to build the MVP to use only one notebook template, build the MVP that way but where possible, build it so that adding the ability to choose from multiple notebooks in the future is possible/easier.
  3. Regarding the warnings... It would be ideal if these warnings were raised on the webpage (so that the user knows that the query won't run prior to submitting it), but it's likely much easier to include them in the notebook (i.e. just add a cell that checks the number of mutated samples and, if needed, prints "The selected query contains fewer than 20 mutated samples; the selected analysis is not recommended for such situations" and stops the notebook; see the sketch below).
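
For recommendation 3, the guard cell could look something like this (the threshold and variable names are placeholders, not an agreed-on design):

```python
# Hypothetical guard cell for recommendation 3: stop the notebook early when the
# query has too few mutated samples. The threshold and the name of the mutation
# status vector (y) are placeholders.
MIN_MUTATED_SAMPLES = 20

n_mutated = int(y.sum())  # y: binary mutation-status vector built from the user's query
if n_mutated < MIN_MUTATED_SAMPLES:
    raise ValueError(
        f"The selected query contains only {n_mutated} mutated samples; "
        "the selected analysis is not recommended for such situations."
    )
```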

@dcgoss
Member

dcgoss commented Aug 1, 2017

@rdvelazquez core-service and ml-workers are built to accommodate different types of notebooks; however, we would need to make a few small changes.
All classifiers have a title attribute which serves to indicate the type of notebook they are running (this field is currently useless as there is only one notebook). ml-worker can and does pull specific types of classifiers off the queue using this title field; however, since there is currently only one type, this doesn't do anything either.

The process to implement a new notebook would be:

  • PR to ml-workers with new notebook and proposed title
  • provide/update an endpoint in core-service with the available titles, so that the frontend can display the options
  • include title in frontend request
  • ensure ml-workers reads the title on the classifier correctly and chooses the appropriate notebook (sketched below)
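
A minimal sketch of what that last step could look like in ml-workers (the title strings and template paths are illustrative, not existing conventions):

```python
# Hypothetical sketch of title-based notebook selection in ml-workers.
# The title strings and template paths are illustrative, not existing conventions.
NOTEBOOK_TEMPLATES = {
    "mutation-classifier": "notebooks/2.mutation-classifier.ipynb",
    # future PRs would register new titles here, e.g.:
    # "expression-clustering": "notebooks/expression-clustering.ipynb",
}

def notebook_for(classifier):
    """Map a classifier's title field to the notebook template to execute."""
    title = classifier["title"]
    try:
        return NOTEBOOK_TEMPLATES[title]
    except KeyError:
        raise ValueError(f"No notebook registered for title {title!r}")
```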

@dhimmel
Member

dhimmel commented Aug 10, 2017

having the option to have multiple notebooks that the user could choose from would be ideal

I agree this would be nice. Today was @dcgoss's last day of his summer internship... so I don't want to task him with a big backend change, especially if we're not definitely going to need it. Perhaps the right approach would be to upgrade the backend to support multiple notebooks only once/if we have multiple production-ready notebooks.

Raise an error on the webpage? A warning? Should the notebook raise errors/warnings?

The frontend should do some checks... like enforcing a minimum number of positives and a minimum number of negatives.

Ideally, we can prevent the notebook from erroring by catching the failure modes before query submission. Warnings in the notebook make sense if we detect something that appears problematic.
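
Wherever the check ends up living (frontend and/or core-service), the logic itself is simple; a sketch in Python for illustration, with placeholder thresholds:

```python
# Illustrative version of the pre-submission check: require a minimum number of
# positives (mutated samples) and negatives. Thresholds are placeholders, and
# the real check would live in the frontend and/or core-service, not here.
MIN_POSITIVES = 15
MIN_NEGATIVES = 15

def validate_query(n_mutated, n_samples):
    """Return a list of problems; an empty list means the query is safe to submit."""
    n_not_mutated = n_samples - n_mutated
    problems = []
    if n_mutated < MIN_POSITIVES:
        problems.append(f"only {n_mutated} mutated (positive) samples; need >= {MIN_POSITIVES}")
    if n_not_mutated < MIN_NEGATIVES:
        problems.append(f"only {n_not_mutated} non-mutated (negative) samples; need >= {MIN_NEGATIVES}")
    return problems
```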

@rdvelazquez
Member Author

@dcgoss Congrats on finishing the internship... Productive summer!

@dhimmel Makes sense to keep it to just one notebook for now to limit the needed changes. There may be a chicken or egg dilemma (no one's going to make new notebooks if cognoma.org won't support them; cognoma.org won't support new notebooks if no one makes them) but we can cross that bridge if/when we get to it.

The frontend checks and warnings in the notebook also make sense.

Thanks for the responses!

@rdvelazquez
Member Author

Closing this for now. @wisygig, feel free to open this back up if you want to implement templating or revisit anything here.
