[ENH] interface TweedieRegressor from sklearn as skpro regressor #423

Open
fkiraly opened this issue Jul 11, 2024 · 6 comments
Labels
feature request (New feature or request) · interfacing algorithms (Interfacing existing algorithms/estimators from third party packages) · module:regression (probabilistic regression module)

Comments

@fkiraly
Collaborator

fkiraly commented Jul 11, 2024

We should try to interface TweedieRegressor from sklearn as an skpro regressor.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html

Notes on implementation:

  • The current sklearn adapter will not work because TweedieRegressor does not follow the return_std interface, but we can use _prep_skl_df.
  • We would need a Tweedie distribution in skpro; it is currently not implemented.
  • Tweedie has three parameters: power, location, scale. Power is fixed in the sklearn TweedieRegressor and location is returned by predict, but it is unclear whether scale can be obtained from the fitted regressor. Perhaps @fsaforo1 has insight on this point.
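
To make the return_std point concrete, here is a minimal sketch on made-up toy data (the data and variable names are illustrative assumptions, not part of skpro). It only shows that the sklearn predict method accepts X alone and returns a mean prediction, so the scale has to come from somewhere else.

```python
import inspect

import numpy as np
from sklearn.linear_model import TweedieRegressor

# Toy, non-negative target; purely illustrative data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(0.5 + X[:, 0]))

# power is a fixed hyperparameter chosen at construction time.
reg = TweedieRegressor(power=1.5, link="log").fit(X, y)

# predict only takes X: there is no return_std argument to ask for a spread.
predict_params = list(inspect.signature(reg.predict).parameters)
print(predict_params)

# Point predictions of the conditional mean (the "location") only.
mu_hat = reg.predict(X[:5])
print(mu_hat)
```
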

FYI @ShreeshaM07, this is very similar to your previous work on statsmodels GLM!

@fkiraly fkiraly added module:regression probabilistic regression module interfacing algorithms Interfacing existing algorithms/estimators from third party packages feature request New feature or request labels Jul 11, 2024
@ShreeshaM07
Contributor

ShreeshaM07 commented Jul 15, 2024

Some points on this:

  • return_std is not available in the predict method of sklearn's TweedieRegressor, so we may not be able to find the value of scale in cases where the underlying distribution requires it, e.g., Normal.
  • Since this is just an extension of GLMRegressor, why can we not interface the Tweedie distribution and then add it to the family parameter of GLMRegressor? I am not sure where we can interface the distribution from, though.

One question about the TweedieRegressor: is it not just an interface to regressors for different families, e.g., Poisson, Gaussian, Gamma? If so, is there any difference in implementing the TweedieRegressor if it is just going to expose these different regressors?

@fkiraly
Collaborator Author

fkiraly commented Jul 15, 2024

To answer these:

  • I do not think this would be an extension of GLMRegressor, which interfaces the GLM from statsmodels. The sklearn TweedieRegressor is a completely different object. Of course it would be nice to add support for the Tweedie family in statsmodels; that is a different, useful issue, and may meet the use case of @fsafaro1.
  • This scipy issue discusses the Tweedie distribution: Add Tweedie distributions to scipy.stats scipy/scipy#11291 (comment). It concludes that the scipy interface is not general enough because the distribution is of mixed type. skpro is general enough, so with the pointers in there we could implement it, either entirely from scratch or by interfacing some of the component functions such as Bessel.
  • For the sklearn Tweedie regressor, the remaining question is still where to get the scale from. It would not be much of a Tweedie regressor if that were impossible to obtain...

is it not just an interface to possible regressors for different families for ex Poisson,Gaussian,Gamma

Yes, but for non-integer values of the p parameter these are very specific families that are also not available yet. It is a good question whether the distribution should internally decompose into these case distinctions.
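
As a rough check of the "yes" part, the sketch below fits sklearn's PoissonRegressor and a TweedieRegressor with power=1 and a log link on the same toy data and compares predictions. The data, the tolerance settings, and the variable names are assumptions chosen for illustration; the expectation that both fits agree rests on both minimising the same Poisson deviance with the same penalty.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

# Toy count data; purely illustrative.
rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 2))
y = rng.poisson(lam=np.exp(0.3 + 0.8 * X[:, 0]))

# Same penalty and a tight solver tolerance for both models.
kwargs = dict(alpha=0.1, tol=1e-8, max_iter=1000)
poisson = PoissonRegressor(**kwargs).fit(X, y)
tweedie_p1 = TweedieRegressor(power=1, link="log", **kwargs).fit(X, y)

# Largest disagreement between the two mean predictions.
max_abs_diff = np.max(np.abs(poisson.predict(X) - tweedie_p1.predict(X)))
print(max_abs_diff)

# A non-integer power is accepted too; this is the compound Poisson-Gamma case.
tweedie_p15 = TweedieRegressor(power=1.5, link="log", **kwargs).fit(X, y)
mu_p15 = tweedie_p15.predict(X)
```

The non-integer case still only yields a mean, which is why a dedicated distribution is needed on the skpro side.
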

@ShreeshaM07
Contributor

ShreeshaM07 commented Jul 16, 2024

this scipy issue discusses the Tweedie distribution: scipy/scipy#11291 (comment) and concludes that the scipy interface is not general enough because it is mixed type. skpro is general enough, so with the pointers in there we could implement it, either entirely from scratch, or interfacing some of the component functions such as Bessel.

From the conversation I can infer that we can implement this in skpro, as it allows for mixed-type distributions with pdf and pmf on different intervals. https://lorentzen.ch/index.php/2024/06/17/a-tweedie-trilogy-part-iii-from-wrights-generalized-bessel-function-to-tweedies-compound-poisson-distribution/ seems to be a very informative post explaining the Tweedie distribution. It also gives a code snippet for the point mass at zero and the pdf of the compound Poisson-Gamma distribution.

import numpy as np
from scipy.special import wright_bessel


def cpg_pmf(mu, phi, p):
    """Compound Poisson Gamma point mass at zero."""
    return np.exp(-np.power(mu, 2 - p) / (phi * (2 - p)))

def cpg_pdf(x, mu, phi, p):
    """Compound Poisson Gamma pdf."""
    if not (1 < p < 2):
        raise ValueError("1 < p < 2 required")
    theta = np.power(mu, 1 - p) / (1 - p)
    kappa = np.power(mu, 2 - p) / (2 - p)
    alpha = (2 - p) / (1 - p)
    t = ((p - 1) * phi / x)**alpha
    t /= (2 - p) * phi
    a = 1 / x * wright_bessel(-alpha, 0, t)
    return a * np.exp((x * theta - kappa) / phi)

This can be utilized along with the usage of the wright_bessel function in scipy.special.
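
As a sanity check on the snippet above, the following sketch repeats the two functions and verifies numerically that the point mass at zero plus the integral of the pdf is close to 1, and that the mean is close to mu. The parameter values and the finite integration bound of 30 are assumptions picked so that the tail mass is negligible; this is a check under those assumptions, not a general test.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import wright_bessel


def cpg_pmf(mu, phi, p):
    """Compound Poisson Gamma point mass at zero."""
    return np.exp(-np.power(mu, 2 - p) / (phi * (2 - p)))


def cpg_pdf(x, mu, phi, p):
    """Compound Poisson Gamma pdf on x > 0."""
    if not (1 < p < 2):
        raise ValueError("1 < p < 2 required")
    theta = np.power(mu, 1 - p) / (1 - p)
    kappa = np.power(mu, 2 - p) / (2 - p)
    alpha = (2 - p) / (1 - p)
    t = ((p - 1) * phi / x) ** alpha
    t /= (2 - p) * phi
    a = 1 / x * wright_bessel(-alpha, 0, t)
    return a * np.exp((x * theta - kappa) / phi)


# Illustrative parameter values (assumed, not from the thread).
mu, phi, p = 1.0, 1.0, 1.5

# Discrete part: probability of exactly zero.
p_zero = cpg_pmf(mu, phi, p)

# Continuous part: integrate the density over (0, 30); the tail beyond is negligible here.
mass_pos, _ = quad(cpg_pdf, 0, 30, args=(mu, phi, p))
total = p_zero + mass_pos

# The zero atom contributes nothing to the mean, so integrate x * pdf only.
mean, _ = quad(lambda x: x * cpg_pdf(x, mu, phi, p), 0, 30)

print(total, mean)
```
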

For the sklearn Tweedie regressor, the remaining question is still where to get the scale from. It would not be much of a Tweedie regressor if that were impossible to obtain...

I think there is a very roundabout way to do this: pass the X values to PoissonRegressor and GammaRegressor separately and work out the values of lambda, a, and b.
image
Since the mean is what predict returns, and the power parameter p is fixed, we could calculate phi (the scale) using the formula in the image. Would that not be possible?
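
The formula image is not available here, so the sketch below uses the standard compound Poisson-Gamma parameterisation as an assumption: Poisson rate lambda = mu^(2-p) / (phi (2-p)), Gamma shape (2-p)/(p-1), and Gamma scale phi (p-1) mu^(p-1). The function names are hypothetical. It shows how phi could be recovered from lambda and mu for a fixed p, and checks the mapping by simulation; whether per-sample component fits give a usable global phi in a regression setting is left open.

```python
import numpy as np


def cpg_params(mu, phi, p):
    """Map Tweedie (mu, phi, p), 1 < p < 2, to Poisson rate and Gamma shape/scale."""
    lam = mu ** (2 - p) / (phi * (2 - p))
    shape = (2 - p) / (p - 1)
    scale = phi * (p - 1) * mu ** (p - 1)
    return lam, shape, scale


def phi_from_components(lam, mu, p):
    """Invert the Poisson-rate relation to recover the dispersion phi."""
    return mu ** (2 - p) / (lam * (2 - p))


def sample_cpg(mu, phi, p, size, rng):
    """Draw Y = sum of N Gamma variables, N ~ Poisson(lam); Y = 0 when N = 0."""
    lam, shape, scale = cpg_params(mu, phi, p)
    n = rng.poisson(lam, size=size)
    y = np.zeros(size)
    pos = n > 0
    # A sum of n iid Gamma(shape, scale) variables is Gamma(n * shape, scale).
    y[pos] = rng.gamma(shape=n[pos] * shape, scale=scale)
    return y


mu, phi, p = 2.0, 0.5, 1.5
lam, shape, scale = cpg_params(mu, phi, p)
phi_back = phi_from_components(lam, mu, p)

rng = np.random.default_rng(0)
y = sample_cpg(mu, phi, p, size=200_000, rng=rng)
# Empirical mean and variance should be near mu and phi * mu**p.
print(y.mean(), y.var(), phi * mu ** p, phi_back)
```
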

@ShreeshaM07
Contributor

Some thoughts on the Tweedie distribution:

  • Since the power parameter p itself determines the type of distribution, we can call the pdf of Normal when p=0, the pmf of Poisson when p=1, the pdf of Gamma when p=2, and the code snippet in the comment above when p is in (1, 2).
    image
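
The case distinction on p can be sketched as a small lookup. The function names and the returned labels below are hypothetical; the family boundaries and the variance function phi * mu^p are the standard Tweedie facts, including that no Tweedie model exists for 0 < p < 1.

```python
def tweedie_family(p):
    """Return a label for the Tweedie family selected by the power parameter p."""
    if 0 < p < 1:
        raise ValueError("no Tweedie distribution exists for 0 < p < 1")
    if p < 0:
        return "extreme_stable"
    if p == 0:
        return "normal"
    if p == 1:
        return "poisson"
    if p < 2:
        return "compound_poisson_gamma"
    if p == 2:
        return "gamma"
    if p == 3:
        return "inverse_gaussian"
    return "positive_stable"


def tweedie_variance(mu, phi, p):
    """Variance implied by the Tweedie mean-variance relation Var = phi * mu**p."""
    return phi * mu ** p


for power in (0, 1, 1.5, 2, 3):
    print(power, tweedie_family(power), tweedie_variance(2.0, 0.5, power))
```
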

@fkiraly
Collaborator Author

fkiraly commented Jul 17, 2024

From the conversation I can infer that we can implement this in skpro as it allows for mixed type distributions with pdf and pmf in different intervals.

Yes, assuming you mean the p parameter. In places where the distribution is entirely discrete or continuous, the pdf or pmf will return zero.

Further, here's an interesting option, since multiple already implemented distributions figure as special cases:

  • we could implement the individual families separately, e.g., compound Poisson-Gamma
  • define Tweedie as a _DelegatedDistribution and delegate to one of the Tweedie ED families depending on p.
  • as you say, we need to ensure that the parameters are mapped correctly, e.g., Tweedie being parameterized by mu, sigma, and Gamma by alpha, beta.
  • probably we also want to change the _DelegatedDistribution to delegate private, not public methods. This could be done in a separate PR - the current delegator delegates public methods

Here is an illustration of the suggested delegator approach:
image
(Tweedie is a delegator compound of Tweedie ED families)
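
A minimal sketch of the delegation idea, using scipy.stats frozen distributions as stand-ins for the skpro families. The class TweedieSketch is hypothetical and is not skpro's _DelegatedDistribution; it only illustrates choosing a delegate from p and mapping (mu, phi) to the delegate's own parameters. The compound Poisson-Gamma case is left unimplemented here.

```python
from scipy import stats


class TweedieSketch:
    """Hypothetical delegator: picks an inner distribution from the power p."""

    def __init__(self, mu, phi, p):
        self.mu, self.phi, self.p = mu, phi, p
        self.delegate_ = self._build_delegate()

    def _build_delegate(self):
        mu, phi, p = self.mu, self.phi, self.p
        if p == 0:
            # Normal: variance phi, so standard deviation sqrt(phi).
            return stats.norm(loc=mu, scale=phi ** 0.5)
        if p == 1:
            if phi != 1:
                raise NotImplementedError("scaled Poisson not covered in this sketch")
            return stats.poisson(mu)
        if p == 2:
            # Gamma with shape 1/phi and scale mu*phi: mean mu, variance phi*mu**2.
            return stats.gamma(a=1 / phi, scale=mu * phi)
        if p == 3:
            # Inverse Gaussian in scipy's parameterisation: mean mu, variance phi*mu**3.
            return stats.invgauss(mu=mu * phi, scale=1 / phi)
        raise NotImplementedError(f"no delegate for p={p} in this sketch")

    def __getattr__(self, name):
        # Only called for attributes not found on self; forward them to the delegate.
        if name == "delegate_":
            raise AttributeError(name)
        return getattr(self.delegate_, name)


d = TweedieSketch(mu=2.0, phi=0.5, p=2)
print(d.mean(), d.var(), d.pdf(1.0))
```

Forwarding public methods this way mirrors the current delegator behaviour mentioned above; delegating private methods instead would need a different hook.
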

@fkiraly
Collaborator Author

fkiraly commented Jul 18, 2024

Opened a new issue on the Tweedie distribution for further discussion, as that does not seem too straightforward: #429
