adds vif metric from stats models #676
for more information, see https://pre-commit.ci
| import mne | ||
| import numpy as np | ||
| from mne.utils import logger | ||
| from statsmodels.stats.outliers_influence import variance_inflation_factor |
statsmodels is not currently a dependency, so this functionality should probably be optional (or we can start to require statsmodels... but for a single check in a single function it seems like overkill). So can you:
- Nest this import in a try/except, then only proceed with the VIF analysis if it's present (and logger.info that the check was skipped if it wasn't done)
- Add statsmodels to some group so that some tests / CIs use it
- Add some test that when statsmodels is installed (e.g., using pytest.importorskip("statsmodels") at the top of the test) the VIF is reported for some make_first_level_design_matrix calls? Bonus points if you add one that logger.warns because it's bad, or find one that's already bad and assert it.
We could add a new group full to the optional-dependencies here that has statsmodels in it for example
Line 53 in 5acfff0
Then make sure at least some of the CIs use it.
And CircleCI should definitely use it, so we can see the output in some examples...
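A minimal sketch of the optional-import pattern described in the first bullet above. The `report_vif` helper and the `vif_sketch` logger name are hypothetical stand-ins (the real code would use `mne.utils.logger`); only the `variance_inflation_factor(exog, exog_idx)` call comes from statsmodels.

```python
import logging

# Stand-in logger for this sketch; the PR itself uses mne.utils.logger.
logger = logging.getLogger("vif_sketch")

try:
    from statsmodels.stats.outliers_influence import variance_inflation_factor
except ImportError:  # statsmodels is an optional dependency
    variance_inflation_factor = None


def report_vif(design, names):
    """Log one VIF per design-matrix column, or note that the check was skipped.

    ``design`` is assumed to be a 2D NumPy array and ``names`` its column names.
    """
    if variance_inflation_factor is None:
        logger.info("statsmodels not installed, skipping VIF check")
        return None
    vifs = [variance_inflation_factor(design, idx) for idx in range(design.shape[1])]
    for name, vif in zip(names, vifs):
        logger.info(f"{name} with VIF of {vif:.3f}")
    return vifs
```

With this shape, callers never need to know whether statsmodels is present: the function either logs the values or logs that the check was skipped.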
Hey, thank you so much for your feedback!
I think we should opt out of statsmodels. I wrote an equivalent of variance_inflation_factor and verified on some test data that it gives results equivalent to the statsmodels functionality.
Let me know what you think.
Yeah if it's simple enough feel free to do it that way
You could even write a little helper test that uses statsmodels (only in the test!) to verify equivalence of the helper function if you want (and then have CIs install it just for testing)
sounds good, I will add in a little helper test
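For illustration, a NumPy-only VIF helper along the lines discussed above could look like the sketch below. It is not the code from this PR: `compute_vif` is a hypothetical name, and the sketch assumes the input holds only non-constant regressors, adding an intercept to each auxiliary regression itself. statsmodels works on the design matrix as given, so a comparison test would likely need to pass it a matrix that already contains a constant column.

```python
import numpy as np


def compute_vif(predictors):
    """Return 1 / (1 - R^2) per column, with R^2 from regressing it on the rest."""
    predictors = np.asarray(predictors, dtype=float)
    n_samples, n_cols = predictors.shape
    vifs = np.empty(n_cols)
    for idx in range(n_cols):
        target = predictors[:, idx]
        # Remaining columns plus an explicit intercept (an assumption of this sketch).
        others = np.column_stack(
            [np.ones(n_samples), np.delete(predictors, idx, axis=1)]
        )
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        # Centered R^2 of the auxiliary regression.
        r_squared = 1.0 - residual @ residual / np.sum((target - target.mean()) ** 2)
        vifs[idx] = 1.0 / (1.0 - r_squared)
    return vifs
```

With an intercept in every auxiliary regression, these values coincide with the diagonal of the inverse correlation matrix of the predictors, which gives a cheap cross-check that does not need statsmodels.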
…nirs into glm-multicollinearity
Let me know when I should look!
@larsoner I am ready for you to take a look. I am currently trying to get all checks to pass.
I do have an optional flag, "vif_export", as a way to test against statsmodels, but it causes this error in some of the HTML examples, as shown in CircleCI. Let me know what you think and the best way I can patch this.
| import numpy as np | ||
|
|
||
| # for the comparison of vif we need these two libraries | ||
| from statsmodels.stats.outliers_influence import variance_inflation_factor |
This should be imported inside the test in case people don't have it installed, and in the test you can have:
pytest.importorskip("statsmodels")
| # whereas statsmodels has its own implementation before extracting the vif values | ||
| # note vif will come with a level of uncertainty +/- 0.05 of what is reported | ||
| for key in vif: | ||
| <<<<<<< HEAD |
Looks like some problem with rebasing
| vif export : bool, optional | ||
| default set to False; if set to True, will export vif values; |
Hmm... this probably shouldn't be part of the public API, so let's remove it
If we really had to have a way to get the values for the test, we would want to create a private _make_first_level_design_matrix with this option available. The public make_first_level_design_matrix would always call it with return_vif=False, and the test could use the private one with return_vif=True.
| for name, vif_idx in zip(predictor_names, vif_all): | ||
| msg = f"{name} with VIF of {vif_idx:.3f}" | ||
| if vif_idx > 4: | ||
| logger.warning("High collinearity " + msg) | ||
| else: | ||
| logger.info(msg) |
You are already logging the VIFs so in the test you can do something like the following to recover the values:
    from mne.utils import use_log_level, catch_logging
    ...
    def some_test():
        ...
        with use_log_level("info"), catch_logging() as log:
            ... make_first_level_design_matrix(...)
        log = log.getvalue()
        vifs = np.array([line.split()[-1] for line in log.splitlines() if " VIF " in line], float)
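The same log-parsing idea can be tried without MNE installed. The sketch below swaps in the standard-library `logging` module for `use_log_level` and `catch_logging`; `log_vifs`, `capture_vifs`, and the logger name are hypothetical, and the message format is copied from the diff above.

```python
import io
import logging

# Stand-in for mne.utils.logger, isolated from the root logger.
logger = logging.getLogger("vif_log_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False


def log_vifs(names, vifs, threshold=4):
    """Log each VIF, warning when it exceeds the threshold (as in the diff)."""
    for name, vif in zip(names, vifs):
        msg = f"{name} with VIF of {vif:.3f}"
        if vif > threshold:
            logger.warning("High collinearity " + msg)
        else:
            logger.info(msg)


def capture_vifs(names, vifs):
    """Run log_vifs and recover the logged values from the captured text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        log_vifs(names, vifs)
    finally:
        logger.removeHandler(handler)
    lines = stream.getvalue().splitlines()
    # The value is the last whitespace-separated token of each VIF line.
    return [float(line.split()[-1]) for line in lines if " VIF " in line]
```

Because the values pass through the `:.3f` format, anything recovered this way is rounded to three decimals, so a comparison against a reference implementation should use a matching tolerance.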
Reference issue
Multi-collinearity #413.
What does this implement/fix?
Add metrics to quantify collinearity in the design matrix. One simple way to deal with high multi-collinearity (typically VIF > 4–5) is to combine very similar regressors or drop the ones that are problematic.
There are other ways to handle high VIF, such as using principal component analysis (PCA), but I haven't explored those yet. I am open to suggestions.
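For reference, the standard definition of the metric, where $R_j^2$ is the coefficient of determination obtained by regressing regressor $j$ on all the other regressors in the design matrix:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 4 \iff R_j^2 > 0.75,
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 0.8
```

So the 4–5 threshold mentioned above corresponds to a regressor for which 75–80% of the variance is explained by the remaining regressors.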