ML Generic API #497

PondiB · 2024-02-21T11:02:22Z

I saw the need for this as an alternative for #441 , if we keep following the pattern of ml_fit*_ as #484 , #483 , #493 and so forth. We will end up with a long list of ml processes.

Getting inspiration from tools like caret in R ecosystem. We can ideally achieve a minimalistic API.

Labels can be used to identify a classification and regression problem.

Parameters for various ml models can be passed as dictionaries: e.g.

Random Forest :
ml_fit(predictor, label, ml_method ="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42})
Multi-layer Perceptron classifier
ml_fit(predictor, label, ml_method = "mlp", params = {"hidden_layers": [ 512, 512, 512], "dropout_rates": [0.4, 0.3, 0.2], "epochs": 100, "activation_function": "relu", "optimizer": "adam"}

Well, there will be need a for clear documentation dedicated to ML on the openEO website.

Additionally, a need for 2 more methods on the client side list_ml_models() for listing the available models and model_param_blueprint(model_name, task_type="classification") for listing the default params of the models. The information should ideally be on the website too.

Opinions, suggestions and critics are welcomed as I am still exploring this.

soxofaan · 2024-02-26T17:23:36Z

We can ideally achieve a minimalistic API.

I'm not sure that that should be the goal.
You indeed have a small API at the level of the processes (just one process ml_fit), but you pay the cost in other places: those list_ml_models() and model_param_blueprint() have to be defined, you need to provide schema definition/validation for the params of each model, all clients have to start supporting that, all back-ends have to start implementing that, ... It feels like a lot of reinventing things we already have in our current ecosystem of processes.
I'm not sure there is a net win here.

We will end up with a long list of ml processes

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

PondiB · 2024-02-28T13:45:57Z

I'm not sure that that should be the goal.

Thanks for your solid comments that cannot be overlooked, it reminds me of this article ; Simple is not easy. There will be an initial Engineering overhead for sure as you have pointed out. My goal for this was more targeted to the general public/users i.e. Greater accessibility and ease of use.

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

On my end, It does not mean we should have 20+ processes for ML/DL.

soxofaan · 2024-02-28T17:17:52Z

My goal for this was more targeted to the general public/users i.e. Greater accessibility and ease of use.

Even for the user I think it will be worse.

e.g. with the python client, the current way of working would result in something like

ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

Which is easy/simple and highly accessible I would think: it just uses the basics of how functions are called, everything is documented the normal way, not only the available arguments and their defaults, but also basics of random forest, ...

the proposed approach would result in something like

ml_fit(predictors=cube1, labels=cube2, ml_method="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42})

This is less easy/simple for the user: the params is now a generic and hard to document thing and heavily depends on ml_method, ml_fit has to explain all possible ML methods, there is no reasonable default for ml_method, ...

PondiB · 2024-02-29T13:41:06Z

This is less easy/simple for the user: the params is now a generic and hard to document thing and heavily depends on ml_method, ml_fit has to explain all possible ML methods, there is no reasonable default for ml_method, ...

Fair enough. I will continue with the specific APIs. Now, one last thought, for classification and regression tasks if we are limiting the number of params to tune and they are similar then there is no need to have 2 APIs for the categories. i.e.

# classification
ml_fit_class_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)
# regression
ml_fit_regr_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

the 2 above can be replaced with

# classification or regression
ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

Therefore internally it can be determined from the labels if this is a classification or a regression task. Do you foresee any bottlenecks with this?

soxofaan · 2024-03-13T14:52:51Z

Therefore internally it can be determined from the labels if this is a classification or a regression task

I'm not sure this is a safe assumption.
I can imagine there are users that use e.g. integer values as classification labels.

PondiB · 2024-05-28T14:35:50Z

I'm not sure this is a safe assumption.
I can imagine there are users that use e.g. integer values as classification labels.

I was on vacation for over a month during this time, so I left this discussion on hold. I'll look into your point.

Anyway, regarding the parameters, I think we should not separate the predictor and target. Instead, we can have a "training-data" parameter that includes both predictors and targets. A target variable should just specify the column name of the targets/labels within the training data.

FYI: An example training data:

PondiB · 2024-07-17T13:48:30Z

Closing this, I will continue with the already existing approach with a few changes as in #517

PondiB added 2 commits February 20, 2024 23:33

generic ml fit

a0e0ba4

generic ml fit

0d4b897

PondiB added work in progress ML discussion needed DL Deep Learning labels Feb 21, 2024

PondiB closed this Jul 17, 2024

PondiB deleted the ml-generic-api branch September 9, 2024 10:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML Generic API #497

ML Generic API #497

PondiB commented Feb 21, 2024 •

edited

Loading

soxofaan commented Feb 26, 2024

PondiB commented Feb 28, 2024

soxofaan commented Feb 28, 2024

PondiB commented Feb 29, 2024 •

edited

Loading

soxofaan commented Mar 13, 2024

PondiB commented May 28, 2024

PondiB commented Jul 17, 2024

ML Generic API #497

ML Generic API #497

Conversation

PondiB commented Feb 21, 2024 • edited Loading

soxofaan commented Feb 26, 2024

PondiB commented Feb 28, 2024

soxofaan commented Feb 28, 2024

PondiB commented Feb 29, 2024 • edited Loading

soxofaan commented Mar 13, 2024

PondiB commented May 28, 2024

PondiB commented Jul 17, 2024

PondiB commented Feb 21, 2024 •

edited

Loading

PondiB commented Feb 29, 2024 •

edited

Loading