-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ML Generic API #497
ML Generic API #497
Conversation
I'm not sure that that should be the goal.
I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine. |
Thanks for your solid comments that cannot be overlooked, it reminds me of this article ; Simple is not easy. There will be an initial Engineering overhead for sure as you have pointed out. My goal for this was more targeted to the general public/users i.e. Greater accessibility and ease of use.
On my end, It does not mean we should have 20+ processes for ML/DL. |
Even for the user I think it will be worse. e.g. with the python client, the current way of working would result in something like ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42) Which is easy/simple and highly accessible I would think: it just uses the basics of how functions are called, everything is documented the normal way, not only the available arguments and their defaults, but also basics of random forest, ... the proposed approach would result in something like ml_fit(predictors=cube1, labels=cube2, ml_method="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42}) This is less easy/simple for the user: the |
Fair enough. I will continue with the specific APIs. Now, one last thought, for classification and regression tasks if we are limiting the number of params to tune and they are similar then there is no need to have 2 APIs for the categories. i.e. # classification
ml_fit_class_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)
# regression
ml_fit_regr_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42) the 2 above can be replaced with # classification or regression
ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42) Therefore internally it can be determined from the labels if this is a classification or a regression task. Do you foresee any bottlenecks with this? |
I'm not sure this is a safe assumption. |
I was on vacation for over a month during this time, so I left this discussion on hold. I'll look into your point. Anyway, regarding the parameters, I think we should not separate the predictor and target. Instead, we can have a "training-data" parameter that includes both predictors and targets. A target variable should just specify the column name of the targets/labels within the training data. |
Closing this, I will continue with the already existing approach with a few changes as in #517 |
I saw the need for this as an alternative for #441 , if we keep following the pattern of ml_fit*_ as #484 , #483 , #493 and so forth. We will end up with a long list of ml processes.
Getting inspiration from tools like caret in R ecosystem. We can ideally achieve a minimalistic API.
Labels can be used to identify a classification and regression problem.
Parameters for various ml models can be passed as dictionaries: e.g.
ml_fit(predictor, label, ml_method ="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42})
ml_fit(predictor, label, ml_method = "mlp", params = {"hidden_layers": [ 512, 512, 512], "dropout_rates": [0.4, 0.3, 0.2], "epochs": 100, "activation_function": "relu", "optimizer": "adam"}
Well, there will be need a for clear documentation dedicated to ML on the openEO website.
Additionally, a need for 2 more methods on the client side
list_ml_models()
for listing the available models andmodel_param_blueprint(model_name, task_type="classification")
for listing the default params of the models. The information should ideally be on the website too.Opinions, suggestions and critics are welcomed as I am still exploring this.