Generalize predict processes for ML models #396
Conversation
A couple of notes and suggestions.
However, I'm not a big ML expert either, so it would be good to collect some more input from other reviewers who do ML on a daily basis.
"id": "predict_random_forest", | ||
"summary": "Predict values based on a Random Forest model", | ||
"description": "Applies a Random Forest machine learning model to an array and predict a value for it.", | ||
"id": "predict_ml_model", |
`predict_ml_model` looks a bit weird to me: it reads like you will be predicting the model (which is a confusing statement), instead of letting the model predict classes. Something like `predict_class`, `ml_model_predict` or even `ml_predict` would feel better.

Note that for the array, text and date related processes we also use prefix-based naming (`array_append`, `array_apply`, `date_shift`) instead of postfix-based naming.
My aim was to align with `predict_curve` so that in the docs they would be listed side by side. Unfortunately, we are not very consistent with prefixes/suffixes (`array_apply` / `date_shift` / `load_collection` / `save_result` / `reduce_dimension`)...

Is there any other case where `predict_class` / `predict_probabilities` could become useful without ML? That was the main reason I added `ml_model` in the first place...
The thing with the name `predict_class` is that the process would not only work for ML classification, but also ML regression, so ideally the term `class` should be avoided. Unless there is a good reason to have separate inference/predict processes for ML classification and ML regression, but as far as I know that's not the case.

> Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

To me, "predict" implies "machine learning", so having both "ml" and "predict" in the name is slightly redundant. However, it might be better for discoverability and self-documenting reasons to keep that bit of redundancy.
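For context, a minimal scikit-learn sketch (not part of this PR's spec files; data and model are made up) illustrating why a generic prediction process covers both classification and regression, while probabilities are classification-only:

```python
# Sketch: the same `predict` API works for classifiers and regressors,
# while `predict_proba` only exists on classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(100, 4)              # 100 samples, 4 bands/features
y_class = np.random.randint(0, 3, 100)  # class labels
y_reg = np.random.rand(100)             # continuous target

clf = RandomForestClassifier().fit(X, y_class)
reg = RandomForestRegressor().fit(X, y_reg)

clf.predict(X[:5])        # class labels
clf.predict_proba(X[:5])  # per-class probabilities
reg.predict(X[:5])        # continuous values; regressors have no predict_proba
```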
@@ -0,0 +1,45 @@
{
"id": "predict_ml_model_probabilities",
Like above, this process name looks a bit weird to me. I'd prefer something like `predict_probabilities`, `ml_model_predict_probabilities` or `ml_predict_probabilities`.
Thinking about the recent discussion (prefix is based on the primary input), it should probably be `ml_predict_probabilities` (or `ml_model_predict_probabilities`, which is rather long).
}
],
"returns": {
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.", | |
"description": "The predicted class probabilities. Returns `null` if any of the given values in the array is a no-data value.", |
I'm not sure about the Returns null if any ...
: that depends on the capability of the ML model to handle null/nodata I think
This is actually tricky - some thoughts:

- E.g. in deep learning, if I've anticipated missing data, I could have used `torch.nan_to_num` to replace NaN with a carefully chosen value in both training and inference (see the sketch after this comment). If I then export that model with e.g. ONNX, that transformation will be baked in and my model will know how to deal with missing data. I'm not sure how other frameworks handle this (e.g. sklearn or xgboost) - but it's super important that this value is the same during inference as it was during training!
- If the model handles NaNs, and the `predict_ml_model` process just ignores NaN values and lets them through, then I get the correct behaviour.
- However, if it fails on NaN values, then the only way the user can fix this is to replace NaN values with the correct value in an extra openEO process beforehand. To do that I need to know what that value was during training - this might not be easily available!
- If the model doesn't handle NaNs (either they didn't add NaN handling to the inference code, or just didn't train on samples with NaNs at all, etc.), then what would happen really depends on the framework. It might just crash, or it might run through, with the NaN values subtly impacting the predictions.
- Some options I can think of:
  - Make this a parameter of the process (`fail_on_nan` or similar), fail by default and allow overriding if the user knows that their model can handle NaNs out of the box or is willing to throw the dice.
  - Add information about what to replace NaN values with to the `ml-model` STAC extension and use that if available.

I think as a user my preference would be for this process to assume that the model has already been constructed to handle NaNs correctly in both training and inference and therefore doesn't try to interfere.

Hope this is useful!
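A minimal sketch of the "baked in" NaN handling mentioned above, assuming PyTorch and ONNX export (the fill value, the linear layer, and the file name are placeholders, not part of this PR):

```python
# Sketch: NaN replacement is part of the model's forward pass, so the exported
# ONNX graph uses the same fill value at inference time as during training.
import torch
import torch.nn as nn

class NanSafeModel(nn.Module):
    def __init__(self, n_features: int, n_classes: int, fill_value: float = 0.0):
        super().__init__()
        self.fill_value = fill_value
        self.linear = nn.Linear(n_features, n_classes)  # stand-in for the real network

    def forward(self, x):
        # Replace NaNs with the same value that was used during training.
        x = torch.nan_to_num(x, nan=self.fill_value)
        return self.linear(x)

model = NanSafeModel(n_features=4, n_classes=3)
dummy_input = torch.full((1, 4), float("nan"))
torch.onnx.export(
    model, dummy_input, "nan_safe_model.onnx",
    input_names=["bands"], output_names=["logits"],
    dynamic_axes={"bands": {0: "batch"}},
)
```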
If I understand correctly, you agree that it would be better to drop "Returns `null` if any of the given values in the array is a no-data value." from the general description of `predict_..._probabilities`?
> If I understand correctly, you agree that it would be better to drop "Returns `null` if any of the given values in the array is a no-data value." from the general description of `predict_..._probabilities`?

Yeah, exactly!
Another side note (not necessarily to tackle in this PR):
Potentially interesting for "bring your own model": https://onnx.ai/
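For reference, a hedged sketch of what inference against such a user-provided ONNX model could look like with ONNX Runtime (the file name, array shape and output meaning are assumptions, not anything defined in this PR):

```python
# Sketch: running a user-supplied ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")      # hypothetical model file
input_name = session.get_inputs()[0].name

samples = np.random.rand(10, 4).astype(np.float32)  # e.g. 10 pixels, 4 bands
outputs = session.run(None, {input_name: samples})   # e.g. class scores/probabilities
print(outputs[0].shape)
```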
We continue in #441 - we still need to bring some of the comments over.
As discussed in #368, this is a first draft that generalizes the former Random-Forest-specific ML prediction processes into one process for simple class predictions and another process for class probabilities. Please check carefully, I don't have an ML background ;-)