Help understanding how to use the `update` functionality #825

DTchebotarev · 2025-03-04T20:34:17Z

DTchebotarev
Mar 4, 2025

Hey folks, I wanted to ask about the intended usage of the update method on the different models, particularly how to construct its inputs.

In my use case, I'm looking to estimate an OLS regression with a formula specification that's supplied by users of our app. For example, that might look like:

import pyfixest as pf
model = pf.feols(
    'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
    data = df,
)

However, sometimes either the number of samples or the cardinality of the categorical variable is too high, and running the regression in one step crashes the thread. Reading the docs, it seems there's an update method available, which is exactly what I want. However, update has a completely different API. The docstring says:

       X : np.ndarray
           Covariates for new data points. Users expected to ensure conformability
           with existing data.
       y : np.ndarray
           Outcome values for new data points

However, it's not clear to me how to coerce my DataFrame into the format that update expects. I looked around for an API that would let me do so, but there's no publicly exposed create_model_matrix-style function that works without running a full regression. Even when I thought I could cheat and create a throwaway regression object simply to get the transformed X and y matrices, I still ran into issues:

middle_index = len(df_sample) // 2
top_half = df_sample.iloc[:middle_index]
bottom_half = df_sample.iloc[middle_index:]

incremental_model = pf.feols(
    'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
    data = top_half,
)

incremental_model_2=pf.feols(
    'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
    data = bottom_half,
)
incremental_model.update(incremental_model_2._X, incremental_model_2._Y, inplace=True)
--------
   2421 epsi_n_plus_1 = y_new - X_new @ self._beta_hat
   2422 gamma_n_plus_1 = np.linalg.inv(X_n_plus_1.T @ X_n_plus_1) @ X_new.T
-> 2423 beta_n_plus_1 = self._beta_hat + gamma_n_plus_1 @ epsi_n_plus_1
   2424 if inplace:
   2425     self._X = X_n_plus_1

ValueError: operands could not be broadcast together with shapes (9,) (9,50000)

Could I please have some guidance on what the expected usage is for update with a non-trivial formula? Appreciate any help 🙏


s3alfisc · 2025-03-04T21:57:19Z

s3alfisc
Mar 4, 2025
Maintainer

Hi @DTchebotarev , thanks for reaching out about this!

I tried to use the update() method the other day and also stumbled over a couple of shortcomings. I think @apoorvalal's and my long-term plan was to brush it up at some point and implement anytime-valid inference for linear models as described here, but we never really got there.

You rightly point out that there is currently a clash between its numpy API and the formula API of pf.feols(). The best way to work around it at the moment is to use the (not-directly exposed) model_matrix_fixest function (docs here):

import pyfixest as pf
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest

data = pf.get_data()
fit = pf.feols("Y ~ X1 + f1 + f2", data=data)
FixestFormula = fit.FixestFormula

mm = model_matrix_fixest(FixestFormula, data)
mm

which is more or less a wrapper around formulaic plus some extra manipulations needed for fixed effects. This will provide you with design matrix X and dependent variable Y, which you could then feed into .update().

Maybe one improvement here could be to add another method, update_from_formula(), for which users would only provide the data, and internally, we would call model_matrix_fixest?

Other things to improve for .update() - as far as I know, it will not update the internal state of the Feols class, hence if you call it twice, coefficient estimates will not change:

import pyfixest as pf 
import numpy as np
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest

data = pf.get_data()

data_subsample = data.sample(frac=0.5)
m = pf.feols("Y ~ X1 + X2", data=data_subsample)
# current coefficient vector
m._beta_hat
# array([ 0.82133825, -0.92073708, -0.16408103])

mm = model_matrix_fixest(m.FixestFormula, data.sample(frac=0.1).reset_index())
Y = mm["Y"].to_numpy()
X = mm["X"].to_numpy()

m.update(X_new = X, y_new = Y.flatten())
# array([ 0.96541071, -1.04052026, -0.18548232])
# second update call: same result as before (no state update of the Feols class!)
m.update(X_new = X, y_new = Y.flatten())
# array([ 0.96541071, -1.04052026, -0.18548232])

And as stated above, we don't provide any CIs yet. Plus update does not work with fixed effects.

So tentatively, I'd say that the update method will not work for your use case unfortunately.
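For context on why the state matters: the update step visible in the traceback earlier in the thread has the form beta_new = beta_old + inv(X_all' X_all) X_new' (y_new - X_new beta_old). The following is a minimal plain-numpy sketch of that recurrence, not pyfixest's implementation; `ols` and `rls_update` are hypothetical helper names, and the data is simulated. It illustrates that chaining updates only reproduces the full-sample OLS fit if both the coefficients and the stacked design matrix are carried forward between calls.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def rls_update(beta, X_old, X_new, y_new):
    """One update step. Returns the new coefficients and the stacked design matrix."""
    X_all = np.vstack([X_old, X_new])
    gain = np.linalg.inv(X_all.T @ X_all) @ X_new.T  # shape (k, n_new)
    resid = y_new - X_new @ beta                     # shape (n_new,), y_new must be 1-D
    return beta + gain @ resid, X_all


# Simulated data, split into three chunks
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=300)
X1, y1 = X[:100], y[:100]
X2, y2 = X[100:200], y[100:200]
X3, y3 = X[200:], y[200:]

beta1 = ols(X1, y1)
# Carry BOTH the coefficients and the stacked X forward between updates
beta2, X12 = rls_update(beta1, X1, X2, y2)
beta3, X123 = rls_update(beta2, X12, X3, y3)

beta_full = ols(X, y)
print(np.allclose(beta3, beta_full))
```

If the second call reused `beta1` and `X1` instead of `beta2` and `X12`, it would return the fit on chunks 1 and 3 only, which is the analogue of calling update twice on a model whose internal state did not change.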

Btw, maybe you can help me with Python vocab (still learning) - would you consider the model_matrix function above publicly exposed? I am getting mixed answers from my LLM helpers =)


s3alfisc · 2025-03-04T22:12:24Z

s3alfisc
Mar 4, 2025
Maintainer

There are a couple of other options for scaling your regression models, e.g. via compression algos as described here. These are implemented in either duckreg or in pyfixest via the use_compression argument. If you have balanced panels with unit and time fixed effects, both can run very efficiently, but they can take a while if you either a) have multiple high-dimensional fixed effects or b) need clustered errors (both use a bootstrap). We have some benchmarks in this paper. In pyfixest, you can use compression algos for up to two fixed effects via the use_compression = True argument (in which case the two-way fixed effects model is estimated via 2-way Mundlak; both approaches are equivalent when panels are balanced).
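To illustrate the basic idea behind compression, here is a minimal plain-numpy sketch, assuming all covariates are discrete so that many rows of the design matrix are identical. It does not use duckreg or pyfixest and only covers point estimates; the variable names and simulated data are made up for illustration. Identical rows are collapsed into groups, and a weighted least-squares fit on the group means of y (weights equal to the group sizes) recovers the same coefficients as OLS on the full data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Discrete covariates: 2 x 5 = at most 10 distinct design rows
treat = rng.integers(0, 2, size=n)
segment = rng.integers(0, 5, size=n)
X = np.column_stack([np.ones(n), treat, segment]).astype(float)
y = 1.0 + 0.3 * treat - 0.1 * segment + rng.normal(size=n)

# Compress: one row per distinct covariate combination
X_unique, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.ravel()                  # guard against shape differences across numpy versions
counts = np.bincount(inverse)              # group sizes
y_mean = np.bincount(inverse, weights=y) / counts

# Weighted least squares on the compressed data (weights = group sizes)
XtWX = X_unique.T @ (X_unique * counts[:, None])
XtWy = X_unique.T @ (counts * y_mean)
beta_compressed = np.linalg.solve(XtWX, XtWy)

# Reference: OLS on all n rows
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(X_unique.shape[0], np.allclose(beta_compressed, beta_full))
```

The regression here runs on at most 10 rows instead of 10,000. Standard errors need additional per-group statistics (such as the sum of squared outcomes), which this sketch leaves out.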

Other potential strategies: you could run batch regression (might cause problems with fixed effects) or simply use sparse solvers - both are unfortunately not supported by pyfixest.
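As a rough illustration of the batch idea (outside of pyfixest, which does not support it), OLS without fixed effects only needs the k x k matrix X'X and the k-vector X'y, and both can be accumulated chunk by chunk. The sketch below uses plain numpy; `chunked_ols` is a hypothetical helper and the data is simulated. It assumes every chunk has the same columns in the same order, which for categorical covariates means the dummy encoding must be fixed up front.

```python
import numpy as np


def chunked_ols(chunks):
    """Accumulate X'X and X'y over (X, y) chunks, then solve the normal equations."""
    XtX, Xty = None, None
    for X, y in chunks:
        if XtX is None:
            k = X.shape[1]
            XtX = np.zeros((k, k))
            Xty = np.zeros(k)
        XtX += X.T @ X
        Xty += X.T @ y
    return np.linalg.solve(XtX, Xty)


rng = np.random.default_rng(2)
X = np.column_stack([np.ones(1_000), rng.normal(size=(1_000, 3))])
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(size=1_000)

# Only one 250-row chunk is processed at a time
chunks = ((X[i:i + 250], y[i:i + 250]) for i in range(0, 1_000, 250))
beta_chunked = chunked_ols(chunks)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_chunked, beta_full))

# Predictions only need the coefficient vector
y_hat = X[:5] @ beta_chunked
```

Solving the normal equations directly is less numerically stable than a QR- or SVD-based solver when columns are nearly collinear, so this is a sketch of the idea rather than a robust estimator.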

Btw, I just gave a presentation on how to speed & scale up regression models the other week. The slides are not yet on gh (I will reuse them for another talk and want to brush them up a bit), but I'm happy to send them to you if they might be helpful?


s3alfisc · 2025-03-04T22:17:10Z

s3alfisc
Mar 4, 2025
Maintainer

> However sometimes either the number of samples is too high, or the cardinality of the categorical variable is too high and running the regression in one step crashes the thread

Ah, I forgot to ask - can you tell me approximately how large your sample / number of fixed effects is? I think I've run models on tens of millions of observations and thousands of fixed effects & so far it's mostly worked 😄 In your case, is it the demeaning algo that is failing, or do your kernels crash with memory errors? If the latter, you could try to set copy_data = False and store_data = False and see if it helps?

2 replies

apoorvalal Mar 5, 2025
Collaborator

alternatively, use compression !

DTchebotarev Mar 7, 2025
Author

I do need predictions so unfortunately compression isn't an option.

DTchebotarev · 2025-03-06T22:53:54Z

DTchebotarev
Mar 6, 2025
Author

Thanks for the super detailed response!

I found model_matrix_fixest in the code and had tried something similar, but I think my mistake was trying to parse the formula from string instead of passing in the first model's existing formula attribute - and I was missing some numpy tricks like flattening the Y vector.
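A small numpy-only sketch of why the flattening matters, using made-up shapes rather than pyfixest objects: when y has shape (n, 1), subtracting the (n,) vector of fitted values broadcasts to an (n, n) matrix, and the later addition to the (k,) coefficient vector fails. This is the same pattern as the "(9,) (9,50000)" error in the original post.

```python
import numpy as np

n, k = 6, 3
rng = np.random.default_rng(3)
X_new = rng.normal(size=(n, k))
beta = np.zeros(k)
gain = rng.normal(size=(k, n))       # stand-in for inv(X'X) @ X_new.T

y_col = rng.normal(size=(n, 1))      # column vector, as a model matrix typically returns it
y_flat = y_col.flatten()             # 1-D vector of shape (n,)

resid_bad = y_col - X_new @ beta     # (n, 1) - (n,) broadcasts to (n, n)
resid_ok = y_flat - X_new @ beta     # (n,)

update_ok = beta + gain @ resid_ok   # (k,) + (k,) works

try:
    beta + gain @ resid_bad          # (k,) + (k, n) cannot be broadcast
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(resid_bad.shape, resid_ok.shape, broadcast_failed)
```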

I do think that's publicly exposed (as in it can be imported) but ~~it's not documented~~. 🤷‍♂️ naming is hard.

Edit: Reading is hard.

After incorporating some of your tweaks though, this actually works quite well - so many thanks for that!

For posterity, this is what worked for me:

import pyfixest as pf
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest

middle_index = len(df_sample) // 2
top_half = df_sample.iloc[:middle_index]
bottom_half = df_sample.iloc[middle_index:]

incremental_model = pf.feols(
    'metric_value ~ variant * (covariate_1 + covariate_2 + C(covariate_3))',
    data = top_half,
    copy_data=False
)

mm = model_matrix_fixest(incremental_model.FixestFormula, bottom_half.reset_index())

incremental_model.update(mm['X'].to_numpy(), mm['Y'].to_numpy().flatten())

Though reading more about it, seems that update won't work with predict, so I think I'll go the route of optimizing the one-shot.

copy_data=False has given me better performance, but I still get kernel crashing, jupyter simply says

Kernel Restarting
The kernel for demo.ipynb appears to have died. It will restart automatically.

I've got on the order of millions of observations, and I'm trying to see how far I can push the cardinality of covariates. So far I can't get to 1000. Part of why I'm so vague on the estimates is that I don't actually know ahead of time. This regression will be specified by users of our internal A/B testing tool, so the number of observations varies by the number of subjects in the A/B test, and the number of columns varies by the regression the data scientist chooses to specify.
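A back-of-the-envelope sketch of how quickly memory grows with cardinality for a formula like the one above. It assumes variant is a single numeric or binary column, that C(covariate_3) is treatment-coded with one reference level dropped, and that the design matrix is held as a dense float64 array. Whether and how many times pyfixest materializes such a matrix is not something this thread establishes, so treat the numbers as a rough lower bound under those assumptions. Both function names are hypothetical.

```python
def design_columns(levels: int) -> int:
    """Column count for `variant * (covariate_1 + covariate_2 + C(covariate_3))`."""
    main = 1 + 1 + 2 + (levels - 1)   # intercept, variant, two numeric covariates, dummies
    interactions = 2 + (levels - 1)   # variant interacted with each covariate term
    return main + interactions


def dense_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_rows = 2_000_000
for levels in (10, 100, 1_000):
    cols = design_columns(levels)
    print(f"levels={levels:>5}  columns={cols:>5}  dense X ~ {dense_gib(n_rows, cols):.1f} GiB")
```

Under these assumptions, a categorical with 1,000 levels yields about 2,000 columns, so a dense matrix for two million rows would need roughly 30 GiB before any copies are made.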

I did see your email with the slides - haven't had a chance to go through them fully but will see what other tricks I can use. Much appreciated, thanks for the help so far!

