Accept to use variable and categorical variable from dataframe index #211

Open
rknyip opened this issue Oct 9, 2024 · 1 comment

rknyip commented Oct 9, 2024

In panel regressions, fixed effects are very often implemented as categorical variables. Currently, short of a hacky workaround, patsy cannot read index levels as variables. See the example panel dataset below:

import statsmodels.api as sm
df_raw = sm.datasets.get_rdataset('pwt_sample', 'stevedata').data.set_index(['isocode', 'year']).drop(['country'], axis=1)
df = df_raw.dropna()
print(df)

And the panel dataframe looks like:

                     pop        hc        rgdpna         rgdpo         rgdpe     labsh          avh         emp          rnna
isocode year                                                                                                                 
AUS     1950    8.354106  2.667302  1.274612e+05  1.141350e+05  1.219940e+05  0.680492  2170.923406    3.429873  6.399912e+05
        1951    8.599923  2.674344  1.307031e+05  1.105431e+05  1.139294e+05  0.680492  2150.846928    3.523916  6.901136e+05
        1952    8.782430  2.681403  1.253531e+05  1.088834e+05  1.112199e+05  0.680492  2130.956115    3.591675  7.045624e+05
        1953    8.950892  2.688482  1.389522e+05  1.226885e+05  1.233289e+05  0.680492  2111.249251    3.653409  7.331073e+05
        1954    9.159148  2.695580  1.500607e+05  1.318364e+05  1.314721e+05  0.680492  2091.724634    3.731083  7.714542e+05
...                  ...       ...           ...           ...           ...       ...          ...         ...           ...
USA     2015  320.878310  3.728116  1.877616e+07  1.878487e+07  1.890040e+07  0.595646  1770.023174  150.248474  6.505781e+07
        2016  323.015995  3.733411  1.909750e+07  1.909468e+07  1.928048e+07  0.593773  1766.744125  152.396957  6.597406e+07
        2017  325.084756  3.738714  1.954298e+07  1.954298e+07  1.975004e+07  0.596151  1763.726676  154.672318  6.694270e+07
        2018  327.096265  3.744024  2.012858e+07  2.015604e+07  2.036575e+07  0.594326  1774.703811  156.675903  6.800735e+07
        2019  329.064917  3.749341  2.056359e+07  2.059635e+07  2.085650e+07  0.597091  1765.346390  158.299591  6.905906e+07

We often need to run a regression with from_formula, which uses patsy.dmatrices under the hood:

sm.OLS.from_formula('pop ~ rgdpna + year + C(isocode)', df_raw).fit().summary()

This raises an error:

PatsyError: Error evaluating factor: NameError: name 'isocode' is not defined
    pop ~ rgdpna + year + C(isocode)
                          ^^^^^^^^^^

The panel dimensions very often live in the index levels, and users would like to use them as fixed effects and as endog. Is there any chance patsy could support using the DataFrame index? Thanks.
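For anyone hitting the same error, one workaround is to move the index levels into ordinary columns with reset_index() before the formula is evaluated. This is only a sketch, not a patsy feature: the toy frame and its values are made up for illustration and merely mirror the (isocode, year) layout above.

```python
import pandas as pd
from patsy import dmatrices

# Toy panel with hypothetical values; the index mirrors the (isocode, year) layout.
panel = pd.DataFrame(
    {
        "pop": [8.4, 8.6, 8.8, 152.3, 154.9, 157.6],
        "rgdpna": [127.0, 131.0, 125.0, 2290.0, 2470.0, 2570.0],
    },
    index=pd.MultiIndex.from_product(
        [["AUS", "USA"], [1950, 1951, 1952]], names=["isocode", "year"]
    ),
)

# reset_index() turns the index levels into regular columns,
# which patsy can then look up by name.
flat = panel.reset_index()

y, X = dmatrices("pop ~ rgdpna + year + C(isocode)", flat, return_type="dataframe")
print(list(X.columns))
```

The same idea should carry over to the statsmodels call from the report, i.e. passing df_raw.reset_index() to sm.OLS.from_formula instead of df_raw.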

Contributor

bashtage commented Oct 9, 2024

patsy is in maintenance (only) mode, and so this behavior is unlikely to change. Looking in the index is also potentially problematic since there might be named indices with the same names as columns. Enabling this could mean perfectly valid code under the existing rules becomes ambiguous.
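To make the ambiguity concrete, here is a small sketch using only pandas. The frame is hypothetical: "year" names both the index and a column, so a bare name in a formula could refer to either. pandas itself refuses to flatten such an index until the clash is resolved by hand.

```python
import pandas as pd

# Hypothetical frame in which "year" names both the index and a column.
df = pd.DataFrame(
    {"year": [0, 1, 2], "pop": [8.4, 8.6, 8.8]},
    index=pd.Index([1950, 1951, 1952], name="year"),
)

column_values = df["year"].tolist()  # values stored in the column
index_values = df.index.tolist()     # values stored in the index

print(column_values)
print(index_values)

# reset_index() raises while the index name collides with a column name ...
try:
    df.reset_index()
    collided = False
except ValueError:
    collided = True

# ... so the index has to be renamed explicitly before it becomes a column.
# "cal_year" is an arbitrary name chosen for this example.
flat = df.rename_axis("cal_year").reset_index()
print(list(flat.columns))
```

Renaming the index first keeps the choice explicit in user code, which is the kind of disambiguation an automatic index lookup would have to guess at.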
