Help understanding how to use the update functionality
#825
-
Hey folks, I wanted to ask about the intended usage of the update method. In my use case, I'm looking to estimate an OLS regression with a formula specification that's supplied by users of our app. For example, that might look like:
import pyfixest as pf
model = pf.feols(
'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
data = df,
)
However, sometimes either the number of samples is too high or the cardinality of the categorical variable is too high, and running the regression in one step crashes the thread. Reading the docs, it seems that there's an update method available, which is exactly what I want. However, it's not clear to me how to coerce my DataFrame into the format that it expects.
middle_index = len(df_sample) // 2
top_half = df_sample.iloc[:middle_index]
bottom_half = df_sample.iloc[middle_index:]
incremental_model = pf.feols(
'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
data = top_half,
)
incremental_model_2=pf.feols(
'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
data = top_half,
)
incremental_model.update(incremental_model_2._X, incremental_model_2._Y, inplace=True)
--------
2421 epsi_n_plus_1 = y_new - X_new @ self._beta_hat
2422 gamma_n_plus_1 = np.linalg.inv(X_n_plus_1.T @ X_n_plus_1) @ X_new.T
-> 2423 beta_n_plus_1 = self._beta_hat + gamma_n_plus_1 @ epsi_n_plus_1
2424 if inplace:
2425 self._X = X_n_plus_1
ValueError: operands could not be broadcast together with shapes (9,) (9,50000)
Could I please have some guidance on what the expected usage is for the update method?
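A note on the traceback: the shapes (9,) and (9,50000) are consistent with y_new arriving as a column vector of shape (n, 1) rather than a flat array of shape (n,). The NumPy-only sketch below reproduces the same kind of failure on small arrays. It assumes that `_Y` is stored two-dimensionally and that X_n_plus_1 is the old and new design matrices stacked row-wise; neither is confirmed in this thread, and `update_step` is a hypothetical stand-in for the three lines shown in the traceback, not pyfixest code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3  # tiny stand-ins for the 50000 rows and 9 columns in the traceback

X_old = rng.normal(size=(n, k))
X_new = rng.normal(size=(n, k))
beta_hat = rng.normal(size=k)
y_col = rng.normal(size=(n, 1))  # column vector, shape (n, 1)


def update_step(beta_hat, X_old, X_new, y_new):
    """Hypothetical re-creation of the lines in the traceback."""
    # Assumption: X_n_plus_1 is the old and new rows stacked together.
    X_n_plus_1 = np.vstack([X_old, X_new])
    epsi = y_new - X_new @ beta_hat
    gamma = np.linalg.inv(X_n_plus_1.T @ X_n_plus_1) @ X_new.T
    return beta_hat + gamma @ epsi


# (n, 1) minus (n,) broadcasts to (n, n), so gamma @ epsi has shape (k, n)
# and cannot be added to a coefficient vector of shape (k,).
try:
    update_step(beta_hat, X_old, X_new, y_col)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# Flattening y to shape (n,) keeps the residual one-dimensional.
beta_ok = update_step(beta_hat, X_old, X_new, y_col.flatten())
print(beta_ok.shape)
```

If this reading is right, flattening the outcome array before passing it in should avoid the broadcast error.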
Replies: 4 comments 2 replies
-
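Hi @DTchebotarev, thanks for reaching out about this! I've tried to use the update() method the other day and also stumbled over a couple of shortcomings. I think @apoorvalal's and my long-term plan was to brush it up at some point and implement anytime-valid inference for linear models as described here, but we never really got there.
You rightly point out that there is currently a clash between its numpy API and the formula API of pf.feols(). The best way to work around it at the moment is to use the (not directly exposed) model_matrix_fixest function (docs here):
import pyfixest as pf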
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest
data = pf.get_data()
fit = pf.feols("Y ~ X1 + f1 + f2", data=data)
FixestFormula = fit.FixestFormula
mm = model_matrix_fixest(FixestFormula, data)
mm
which is more or less a wrapper around [...] Maybe one improvement here could be to add another method, [...] Other things to improve for [...]
import pyfixest as pf
import numpy as np
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest
data = pf.get_data()
data_subsample = data.sample(frac=0.5)
m = pf.feols("Y ~ X1 + X2", data=data_subsample)
# current coefficient vector
m._beta_hat
# array([ 0.82133825, -0.92073708, -0.16408103])
mm = model_matrix_fixest(m.FixestFormula, data.sample(frac=0.1).reset_index())
Y = mm["Y"].to_numpy()
X = mm["X"].to_numpy()
m.update(X_new = X, y_new = Y.flatten())
# array([ 0.96541071, -1.04052026, -0.18548232])
# second update call: same result as before (no state update of the Feols class!)
m.update(X_new = X, y_new = Y.flatten())
# array([ 0.96541071, -1.04052026, -0.18548232])
And as stated above, we don't provide any CIs yet. Plus, update does not work with fixed effects. So tentatively, I'd say that the update method will unfortunately not work for your use case. Btw, maybe you can help me with Python vocab (still learning): would you consider the [...]
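For intuition about what the update step computes, the formula visible in the traceback earlier in the thread can be checked against a full refit in plain NumPy. This is only a sketch: `rls_update` is a hypothetical helper, the data is simulated, and stacking old and new rows to form X_n_plus_1 is an assumption about how the method builds its design matrix.

```python
import numpy as np


def rls_update(beta_hat, X_old, X_new, y_new):
    """Hypothetical helper: one least-squares update with a new batch."""
    # Assumption: the updated design matrix is old and new rows stacked.
    X_all = np.vstack([X_old, X_new])
    residual = y_new - X_new @ beta_hat  # prediction error on the new rows
    gain = np.linalg.solve(X_all.T @ X_all, X_new.T)
    return beta_hat + gain @ residual


rng = np.random.default_rng(42)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

X_old, X_new = X[:100], X[100:]
y_old, y_new = y[:100], y[100:]

# Fit on the first half, then update with the second half.
beta_old = np.linalg.lstsq(X_old, y_old, rcond=None)[0]
beta_updated = rls_update(beta_old, X_old, X_new, y_new)

# Reference: a single fit on all rows at once.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_updated, beta_full))
```

The update reproduces the full-sample coefficients only when it starts from the coefficients and design matrix of everything seen so far. That is why the missing state update noted in the comments above matters: a second call starting from the original state recomputes the same result instead of accumulating the new batch.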
-
There are a couple of other options for scaling your regression models, e.g. via compression algos as described here. These are implemented in either duckreg or in pyfixest via the [...] Other potential strategies: you could run batch regression (might cause problems with fixed effects) or simply use sparse solvers. Unfortunately, neither is supported by pyfixest. Btw, I just gave a presentation on how to speed up and scale regression models the other week. The slides are not yet on GitHub (I will reuse them for another talk and want to brush them up a bit), but I'm happy to send them to you if that would be helpful.
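To make the batch regression idea concrete, here is a minimal NumPy sketch that accumulates X'X and X'y over chunks and solves once at the end. It is not a pyfixest feature (as noted above, pyfixest does not support this); `ols_in_batches` is a hypothetical helper and the data is simulated.

```python
import numpy as np


def ols_in_batches(batches, n_features):
    """Hypothetical helper: OLS coefficients from (X, y) chunks."""
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    for X_b, y_b in batches:
        # Only the k x k and k-length sufficient statistics stay in memory.
        xtx += X_b.T @ X_b
        xty += X_b.T @ y_b
    return np.linalg.solve(xtx, xty)


rng = np.random.default_rng(0)
n, k = 1000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.arange(1.0, k + 1) + rng.normal(size=n)

# Feed the data in four chunks of 250 rows each.
batches = ((X[i:i + 250], y[i:i + 250]) for i in range(0, n, 250))
beta_batched = ols_in_batches(batches, k)

# Reference: a single fit on all rows at once.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_batched, beta_full))
```

One caveat for formula-based models: every chunk has to produce the same design-matrix columns in the same order. With a categorical term such as C(covariate_3), a level that is absent from one chunk would change that chunk's columns unless the encoding is fixed up front.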
-
Ah, I forgot to ask: can you tell me approximately how large your sample and number of fixed effects are? I think I've run models on tens of millions of observations and thousands of fixed effects, and so far it's mostly worked 😄 In your case, is it the demeaning algo that is failing, or do your kernels crash with memory errors? If the latter, you could try to set [...]
-
Thanks for the super detailed response! I found [...] I do think that's publicly exposed (as in it can be imported) but [...]
Edit: Reading is hard. After incorporating some of your tweaks, though, this actually works quite well, so many thanks for that! For posterity, this is what worked for me:
import pyfixest as pf
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest
middle_index = len(df_sample) // 2
top_half = df_sample.iloc[:middle_index]
bottom_half = df_sample.iloc[middle_index:]
incremental_model = pf.feols(
'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
data = top_half,
copy_data=False
)
mm = model_matrix_fixest(incremental_model.FixestFormula, bottom_half.reset_index())
incremental_model.update(mm['X'].to_numpy(), mm['Y'].to_numpy().flatten())
Though reading more about it, seems that [...]
I've got on the order of millions of observations, and I'm trying to see how far I can push the cardinality of covariates. So far I can't get to 1000. Part of why I'm so vague on the estimates is that I don't actually know ahead of time. This regression will be specified by users of our internal A/B testing tool, so the number of observations varies by the number of subjects in the A/B test, and the number of columns varies by the regression the data scientist chooses to specify. I did see your email with the slides. I haven't had a chance to go through them fully, but will see what other tricks I can use. Much appreciated, thanks for the help so far!
Hi @DTchebotarev, thanks for reaching out about this!
I've tried to use the update() method the other day and also stumbled over a couple of shortcomings. I think @apoorvalal's and my long-term plan was to brush it up at some point and implement anytime-valid inference for linear models as described here, but we never really got there.
You rightly point out that there is currently a clash between its numpy API and the formula API of pf.feols(). The best way to work around it at the moment is to use the (not directly exposed) model_matrix_fixest function (docs here):