Generalize predict processes for ML models #396
Conversation
A couple of notes and suggestions.
However, I'm not a big ML expert either, so it would be good to collect some more input from other reviewers who do ML on a daily basis.
"id": "predict_random_forest", | ||
"summary": "Predict values based on a Random Forest model", | ||
"description": "Applies a Random Forest machine learning model to an array and predict a value for it.", | ||
"id": "predict_ml_model", |
`predict_ml_model` looks a bit weird to me: it reads like you will be predicting the model (which is a confusing statement), instead of letting the model predict classes. Something like `predict_class`, `ml_model_predict` or even `ml_predict` would feel better.

Note that for the array, text and date related processes we also use prefix-based naming (`array_append`, `array_apply`, `date_shift`) instead of postfix-based naming.
My aim was to align with `predict_curve` so that in the docs they would be listed side by side. Unfortunately, we are not very consistent with prefixes/suffixes (`array_apply` / `date_shift` / `load_collection` / `save_result` / `reduce_dimension`)...

Is there any other case where `predict_class` / `predict_probabilities` could become useful without ML? That was the main reason I added `ml_model` in the first place...
The thing with the name `predict_class` is that the process would not only work for ML classification, but also ML regression, so ideally the term `class` should be avoided. Unless there is a good reason to have separate inference/predict processes for ML classification and ML regression, but as far as I know that's not the case.

> Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

To me, "predict" implies "machine learning", so having both "ml" and "predict" in the name is slightly redundant. However, it might be better for discoverability and self-documenting reasons to keep that bit of redundancy.
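For context, a minimal scikit-learn sketch (not part of this PR's spec files; data and model are made up) illustrating why a generic prediction process covers both classification and regression, while probabilities are classification-only:

```python
# Sketch: the same `predict` API works for classifiers and regressors,
# while `predict_proba` only exists on classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(100, 4)              # 100 samples, 4 bands/features
y_class = np.random.randint(0, 3, 100)  # class labels
y_reg = np.random.rand(100)             # continuous target

clf = RandomForestClassifier().fit(X, y_class)
reg = RandomForestRegressor().fit(X, y_reg)

clf.predict(X[:5])        # class labels
clf.predict_proba(X[:5])  # per-class probabilities
reg.predict(X[:5])        # continuous values; regressors have no predict_proba
```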
@@ -0,0 +1,45 @@
{
"id": "predict_ml_model_probabilities",
Like above, this process name looks a bit weird to me. I'd prefer something like `predict_probabilities`, `ml_model_predict_probabilities` or `ml_predict_probabilities`.
Thinking about the recent discussion (prefix is based on the primary input), it should probably be `ml_predict_probabilities` (or `ml_model_predict_probabilities`, which is rather long).
}
],
"returns": {
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.", | |
"description": "The predicted class probabilities. Returns `null` if any of the given values in the array is a no-data value.", |
I'm not sure about the Returns null if any ...
: that depends on the capability of the ML model to handle null/nodata I think
This is actually tricky - some thoughts:

- E.g. in deep learning, if I've anticipated missing data, I could have used `torch.nan_to_num` to replace NaN with a carefully chosen value in both training and inference (see the sketch after this comment). If I then export that model with e.g. ONNX, that transformation will be baked in and my model will know how to deal with missing data. I'm not sure how other frameworks handle this (e.g. sklearn or xgboost) - but it's super important that this value is the same during inference as it was during training!
- If the model handles NaNs, and the `predict_ml_model` process just ignores NaN values and lets them through, then I get the correct behaviour.
- However, if it fails on NaN values, then the only way the user can fix this is to replace NaN values with the correct value in an extra openEO process beforehand. To do that I need to know what that value was during training - this might not be easily available!
- If the model doesn't handle NaNs (either they didn't add NaN handling to the inference code, or just didn't train on samples with NaNs at all, etc.), then what would happen really depends on the framework. It might just crash, or it might run through, with the NaN values subtly impacting the predictions.
- Some options I can think of:
  - Make this a parameter of the process (`fail_on_nan` or similar), fail by default and allow overriding if the user knows that their model can handle NaNs out of the box or is willing to throw the dice.
  - Add information about what to replace NaN values with to the `ml-model` STAC extension and use that if available.

I think as a user my preference would be for this process to assume that the model has already been constructed to handle NaNs correctly in both training and inference and therefore doesn't try to interfere.

Hope this is useful!
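A minimal sketch of the "baked in" NaN handling mentioned above, assuming PyTorch and ONNX export (the fill value, the linear layer, and the file name are placeholders, not part of this PR):

```python
# Sketch: NaN replacement is part of the model's forward pass, so the exported
# ONNX graph uses the same fill value at inference time as during training.
import torch
import torch.nn as nn

class NanSafeModel(nn.Module):
    def __init__(self, n_features: int, n_classes: int, fill_value: float = 0.0):
        super().__init__()
        self.fill_value = fill_value
        self.linear = nn.Linear(n_features, n_classes)  # stand-in for the real network

    def forward(self, x):
        # Replace NaNs with the same value that was used during training.
        x = torch.nan_to_num(x, nan=self.fill_value)
        return self.linear(x)

model = NanSafeModel(n_features=4, n_classes=3)
dummy_input = torch.full((1, 4), float("nan"))
torch.onnx.export(
    model, dummy_input, "nan_safe_model.onnx",
    input_names=["bands"], output_names=["logits"],
    dynamic_axes={"bands": {0: "batch"}},
)
```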
If I understand correctly, you agree that it would be better to drop "Returns `null` if any of the given values in the array is a no-data value." from the general description of `predict_..._probabilities`?
> If I understand correctly, you agree that it would be better to drop "Returns `null` if any of the given values in the array is a no-data value." from the general description of `predict_..._probabilities`?

Yeah, exactly!
Another side note (not necessarily to tackle in this PR):
Potentially interesting for "bring your own model": https://onnx.ai/
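For reference, a hedged sketch of what inference against such a user-provided ONNX model could look like with ONNX Runtime (the file name, array shape and output meaning are assumptions, not anything defined in this PR):

```python
# Sketch: running a user-supplied ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")      # hypothetical model file
input_name = session.get_inputs()[0].name

samples = np.random.rand(10, 4).astype(np.float32)  # e.g. 10 pixels, 4 bands
outputs = session.run(None, {input_name: samples})   # e.g. class scores/probabilities
print(outputs[0].shape)
```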
We continue in #441 - we still need to bring some of the comments over.
As discussed in #368, this is a first draft that generalizes the former Random-Forest-specific ML prediction processes into one process for simple class predictions and another process for class probabilities. Please check carefully, I don't have an ML background ;-)