Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] GLM with multiple distributions and link function support #384

Merged
merged 21 commits into from
Jun 21, 2024

Conversation

ShreeshaM07
Copy link
Contributor

@ShreeshaM07 ShreeshaM07 commented Jun 12, 2024

Reference Issues/PRs

fixes #383 and closes #230

What does this implement/fix? Explain your changes.

This creates an adapter converting the statsmodels GLM families to skpro equivalent giving GLMs a broader capability of distributions and link functions.

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07 ShreeshaM07 marked this pull request as draft June 12, 2024 20:20
@ShreeshaM07
Copy link
Contributor Author

I have thought of 2 possible ways to pass the offset and exposure parameters while predicting.
Design 1 and Design2 .

Design 1

Idea behind this is that I initialize the object of the class with whether offset and exposure is going to be passed while predicting and will contain boolean values. Since we cannot have an argument offset and exposure in the _predict function user must pass additional columns with names offset and exposure which will be removed/dropped while fitting and predict_proba but will be converted to array and passed to statsmodels predict thereby keeping consistency in the number of columns and rows.
Pros

  • It will allow user to keep changing the offset and exposure params even after fitting data.
  • User doesn't have to keep re initializing the model in case they want to change the offset/exposure values.

Cons

  • User will have to create 2 extra columns with the exact same name of offset and exposure each time in the exog/X variable.
  • No way to add a test setting to check this as it cannot add extra columns with the name offset/exposure.

Design 2

Idea here is that I pass the offset/exposure array while initializing the object itself. It must have the same length as the number of rows in the X that will be passed while predicting.
Pros

  • Simple code :)
  • User doesn't have to struggle having to modify the X while fitting.

Cons

  • shape of X passed to predict must be known while initializing/constructor the object itself.
  • If the size of prediction X needs to be changed the offset/exposure must be changed for which it will need to be re-initialized again.
  • Re-fitting the model will need to happen each time X size changes for predict.
  • Even for this method there is no way to add a test setting as size of X used for predicting is not known prior.

My Conclusion

Both the designs above are not a foolproof way but they both give correct answers. Since we cannot add a test setting for both these I am not sure I can think of a better way to do it where it can be added to the test setting too.

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 16, 2024

A strong design principle in sklearn-like designs is separating data from model specification, because only that allows reliable model composition.

As Option 2 violates that principle (offset and exposure are part of the data), as your cons imply, I would have a very strong preference towards option 1.

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

Re testing, this will require a separate test added to a glm speific test module.

@ShreeshaM07
Copy link
Contributor Author

ShreeshaM07 commented Jun 17, 2024

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in Design1.
If we are adding along with these then I think there would be no need to have the extra columns in X that the user has to pass while fitting,predicting etc. This would again mean that we are not separating the model specification from the data.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 17, 2024

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in

I was suggesting to replace these with two more informative variables, concretely:

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

The type would be a single pandas index element. For instance, str. If str, then it would point ot the column loc. I also suggest to use integers iloc rather than loc.

This would again mean that we are not separating the model specification from the data.

I think it is fine to pass data schema references to the specification, that is different from the data itself (i.e., the entries of the data frame).

def __init__(
self,
family="Normal",
Copy link
Collaborator

@fkiraly fkiraly Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would have been great to have these as the first params from the get-go, but right now we cannot add these at the start of the parameter list due to the deprecation policy - it would make user code break.

We have to add the new params at the end, and I would suggest following deprecation policy to move them to the start eventually, see https://www.sktime.net/en/latest/developer_guide/deprecation.html

@ShreeshaM07
Copy link
Contributor Author

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Yea this makes more sense and I have completed implementing it that way and also re ordered the new params to the end. Next we will have to work on adding test setting for it.

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 17, 2024

great! btw, if you want to move the position of the parameters later, we should follow the "move parameter position" recipe - we can make the change right away.

@fkiraly fkiraly added enhancement module:regression probabilistic regression module interfacing algorithms Interfacing existing algorithms/estimators from third party packages labels Jun 17, 2024
@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review June 19, 2024 15:32
@@ -201,6 +332,88 @@ def __init__(
self.max_start_irls = max_start_irls
self.add_constant = add_constant

if family == "0":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks excessive - have you looked at the deprecation handling? It has examples at the end.

I think we should avoid setting new defaults for unaffected parameters.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Great extension.

Small blocking change requests relate to:

  • the docstring should be clearer on how the new parameters are handled
  • the change pattern is a bit excessive, so there is risk of error. Have you checked that these are indeed the right defaults?
    • there is a lot of repetition - if this were "final" code I would say this should be made more DRY

@ShreeshaM07
Copy link
Contributor Author

I've made the changes based on the review please let me know if anything else needs modification.

@@ -182,11 +314,21 @@ def __init__(
disp=False,
max_start_irls=3,
add_constant=False,
family="0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is actually no condition that triggers the warning. The correct condition would be, "user passes a positional argument to __init__", but that is sth we cannot easily detect.

Unless you have an idea how to do that, I would suggest we use the final defaults already and not "0" etc, and we always raise the warning.

@fkiraly fkiraly merged commit df8c617 into sktime:main Jun 21, 2024
26 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement interfacing algorithms Interfacing existing algorithms/estimators from third party packages module:regression probabilistic regression module
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[ENH] Multiple link function support for GLMs [ENH] further work on GLMRegressor interfacing statsmodels GLM
2 participants