Allow using dataframe index levels as variables and categorical variables  #211

Open
@rknyip

Description

In panel regression, fixed effects are very often implemented as categorical variables. Currently, short of a hacky workaround, patsy cannot read index levels as variables. Consider the following example panel dataset:

import statsmodels.api as sm
df_raw = sm.datasets.get_rdataset('pwt_sample', 'stevedata').data.set_index(['isocode', 'year']).drop(['country'], axis=1)
df = df_raw.dropna()
print(df)

And the panel dataframe looks like:

                     pop        hc        rgdpna         rgdpo         rgdpe     labsh          avh         emp          rnna
isocode year                                                                                                                 
AUS     1950    8.354106  2.667302  1.274612e+05  1.141350e+05  1.219940e+05  0.680492  2170.923406    3.429873  6.399912e+05
        1951    8.599923  2.674344  1.307031e+05  1.105431e+05  1.139294e+05  0.680492  2150.846928    3.523916  6.901136e+05
        1952    8.782430  2.681403  1.253531e+05  1.088834e+05  1.112199e+05  0.680492  2130.956115    3.591675  7.045624e+05
        1953    8.950892  2.688482  1.389522e+05  1.226885e+05  1.233289e+05  0.680492  2111.249251    3.653409  7.331073e+05
        1954    9.159148  2.695580  1.500607e+05  1.318364e+05  1.314721e+05  0.680492  2091.724634    3.731083  7.714542e+05
...                  ...       ...           ...           ...           ...       ...          ...         ...           ...
USA     2015  320.878310  3.728116  1.877616e+07  1.878487e+07  1.890040e+07  0.595646  1770.023174  150.248474  6.505781e+07
        2016  323.015995  3.733411  1.909750e+07  1.909468e+07  1.928048e+07  0.593773  1766.744125  152.396957  6.597406e+07
        2017  325.084756  3.738714  1.954298e+07  1.954298e+07  1.975004e+07  0.596151  1763.726676  154.672318  6.694270e+07
        2018  327.096265  3.744024  2.012858e+07  2.015604e+07  2.036575e+07  0.594326  1774.703811  156.675903  6.800735e+07
        2019  329.064917  3.749341  2.056359e+07  2.059635e+07  2.085650e+07  0.597091  1765.346390  158.299591  6.905906e+07

We often need patsy to run a regression through from_formula, which uses patsy.dmatrices under the hood:

sm.OLS.from_formula('pop ~ rgdpna + year + C(isocode)', df_raw).fit().summary()

This raises an error:

PatsyError: Error evaluating factor: NameError: name 'isocode' is not defined
    pop ~ rgdpna + year + C(isocode)
                          ^^^^^^^^^^
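
For reference, the kind of workaround available today is to copy the index levels back into ordinary columns with `reset_index()` before handing the frame to patsy. The sketch below is only an illustration: it uses a small synthetic panel with made-up numbers (not the `pwt_sample` data), and the names `panel` and `flat` are hypothetical.

```python
import pandas as pd
from patsy import dmatrices

# Synthetic panel with made-up values; the panel dimensions live in a MultiIndex,
# mirroring the layout of the dataframe shown above.
panel = pd.DataFrame(
    {
        "isocode": ["AUS", "AUS", "AUS", "USA", "USA", "USA"],
        "year": [2000, 2001, 2002, 2000, 2001, 2002],
        "pop": [19.0, 19.3, 19.5, 282.0, 285.0, 288.0],
        "rgdpna": [700.0, 720.0, 750.0, 13000.0, 13100.0, 13300.0],
    }
).set_index(["isocode", "year"])

# Workaround: reset_index() turns the index levels into regular columns,
# so patsy can resolve the names `isocode` and `year` in the formula.
flat = panel.reset_index()

y, X = dmatrices("pop ~ rgdpna + year + C(isocode)", flat, return_type="dataframe")
print(list(X.columns))
```

The same `flat` frame should also work with `sm.OLS.from_formula`, but it forces every caller to flatten the index first, which is what this request would like to avoid.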

Very often the panel dimensions live in the index levels, and users would like to use them as fixed effects and in endog. Is there any chance patsy could support using the dataframe index? Thanks.
