[ENH] `GLM` with multiple `distributions` and `link` function support #384

ShreeshaM07 · 2024-06-12T19:45:46Z

Reference Issues/PRs

fixes #383 and closes #230

What does this implement/fix? Explain your changes.

This creates an adapter converting the statsmodels GLM families to skpro equivalent giving GLMs a broader capability of distributions and link functions.

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
See here for full badge reference
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

For new estimators

I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
dependency isolation, see the estimator dependencies guide.

ShreeshaM07 · 2024-06-16T19:41:13Z

I have thought of 2 possible ways to pass the offset and exposure parameters while predicting.
Design 1 and Design2 .

Design 1

Idea behind this is that I initialize the object of the class with whether offset and exposure is going to be passed while predicting and will contain boolean values. Since we cannot have an argument offset and exposure in the _predict function user must pass additional columns with names offset and exposure which will be removed/dropped while fitting and predict_proba but will be converted to array and passed to statsmodels predict thereby keeping consistency in the number of columns and rows.
Pros

It will allow user to keep changing the offset and exposure params even after fitting data.
User doesn't have to keep re initializing the model in case they want to change the offset/exposure values.

Cons

User will have to create 2 extra columns with the exact same name of offset and exposure each time in the exog/X variable.
No way to add a test setting to check this as it cannot add extra columns with the name offset/exposure.

Design 2

Idea here is that I pass the offset/exposure array while initializing the object itself. It must have the same length as the number of rows in the X that will be passed while predicting.
Pros

Simple code :)
User doesn't have to struggle having to modify the X while fitting.

Cons

shape of X passed to predict must be known while initializing/constructor the object itself.
If the size of prediction X needs to be changed the offset/exposure must be changed for which it will need to be re-initialized again.
Re-fitting the model will need to happen each time X size changes for predict.
Even for this method there is no way to add a test setting as size of X used for predicting is not known prior.

My Conclusion

Both the designs above are not a foolproof way but they both give correct answers. Since we cannot add a test setting for both these I am not sure I can think of a better way to do it where it can be added to the test setting too.

fkiraly · 2024-06-16T20:31:25Z

A strong design principle in sklearn-like designs is separating data from model specification, because only that allows reliable model composition.

As Option 2 violates that principle (offset and exposure are part of the data), as your cons imply, I would have a very strong preference towards option 1.

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

Re testing, this will require a separate test added to a glm speific test module.

ShreeshaM07 · 2024-06-17T08:18:44Z

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in Design1.
If we are adding along with these then I think there would be no need to have the extra columns in X that the user has to pass while fitting,predicting etc. This would again mean that we are not separating the model specification from the data.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

fkiraly · 2024-06-17T13:49:48Z

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in

I was suggesting to replace these with two more informative variables, concretely:

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

The type would be a single pandas index element. For instance, str. If str, then it would point ot the column loc. I also suggest to use integers iloc rather than loc.

This would again mean that we are not separating the model specification from the data.

I think it is fine to pass data schema references to the specification, that is different from the data itself (i.e., the entries of the data frame).

fkiraly · 2024-06-17T13:52:23Z

skpro/regression/linear/_glm.py

    def __init__(
        self,
+        family="Normal",


it would have been great to have these as the first params from the get-go, but right now we cannot add these at the start of the parameter list due to the deprecation policy - it would make user code break.

We have to add the new params at the end, and I would suggest following deprecation policy to move them to the start eventually, see https://www.sktime.net/en/latest/developer_guide/deprecation.html

This reverts commit e6b20d1.

ShreeshaM07 · 2024-06-17T19:12:26Z

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Yea this makes more sense and I have completed implementing it that way and also re ordered the new params to the end. Next we will have to work on adding test setting for it.

fkiraly · 2024-06-17T20:23:05Z

great! btw, if you want to move the position of the parameters later, we should follow the "move parameter position" recipe - we can make the change right away.

skpro/regression/linear/_glm.py

fkiraly · 2024-06-19T23:21:46Z

skpro/regression/linear/_glm.py

@@ -201,6 +332,88 @@ def __init__(
        self.max_start_irls = max_start_irls
        self.add_constant = add_constant

+        if family == "0":


this looks excessive - have you looked at the deprecation handling? It has examples at the end.

I think we should avoid setting new defaults for unaffected parameters.

fkiraly

Nice! Great extension.

Small blocking change requests relate to:

the docstring should be clearer on how the new parameters are handled
the change pattern is a bit excessive, so there is risk of error. Have you checked that these are indeed the right defaults?
- there is a lot of repetition - if this were "final" code I would say this should be made more DRY

skpro/regression/linear/_glm.py

ShreeshaM07 · 2024-06-20T09:01:54Z

I've made the changes based on the review please let me know if anything else needs modification.

fkiraly · 2024-06-20T11:02:01Z

skpro/regression/linear/_glm.py

@@ -182,11 +314,21 @@ def __init__(
        disp=False,
        max_start_irls=3,
        add_constant=False,
+        family="0",


there is actually no condition that triggers the warning. The correct condition would be, "user passes a positional argument to __init__", but that is sth we cannot easily detect.

Unless you have an idea how to do that, I would suggest we use the final defaults already and not "0" etc, and we always raise the warning.

skpro/regression/linear/_glm.py

ShreeshaM07 and others added 6 commits June 11, 2024 13:43

[ENH] test_methods_p handling shuffle

eaa3aad

Merge branch 'sktime:main' into tmp

c822d3d

[ENH] GLM with multiple family and link support

2906e11

[ENH] GLMs with multiple link and distribution support

2975628

[ENH] GLMs with multiple distributions and links

c5802a3

default "Normal"

38385d8

ShreeshaM07 marked this pull request as draft June 12, 2024 20:20

fkiraly assigned ShreeshaM07 Jun 13, 2024

ShreeshaM07 and others added 5 commits June 13, 2024 17:03

modified gamma params

888d67a

link function support

f6ff39e

Merge branch 'sktime:main' into glm

40c9986

offset and exposure added as bool and part of X

4bdd71d

offset and exposure initialized in constructor itself with size

e6b20d1

Back to no offset/exposure

8945ebd

ShreeshaM07 mentioned this pull request Jun 17, 2024

[MENTEE] Shreesha M sktime/mentoring#40

Open

fkiraly reviewed Jun 17, 2024

View reviewed changes

ShreeshaM07 added 3 commits June 17, 2024 23:15

Revert "offset and exposure initialized in constructor itself with size"

064517c

This reverts commit e6b20d1.

offset_var and exposure_var implemented

e79cdd4

params order chaged for deprecation handling

53ec0d1

fkiraly added enhancement module:regression probabilistic regression module interfacing algorithms Interfacing existing algorithms/estimators from third party packages labels Jun 17, 2024

ShreeshaM07 added 2 commits June 19, 2024 00:25

added test_glm for offset and exposure

f0e9bcc

deprecation for changing sequence of params

a1523ea

ShreeshaM07 marked this pull request as ready for review June 19, 2024 15:32

removed sktime dependecy in warning

a75b097