adds vif metric from stats models #676

Open
kikiluvbrains wants to merge 13 commits into mne-tools:main from kikiluvbrains:glm-multicollinearity

Conversation

@kikiluvbrains

Reference issue

Multi-collinearity #413.

What does this implement/fix?

Add metrics to quantify collinearity in the design matrix. One simple way to deal with high multi-collinearity (typically VIF > 4–5) is to combine very similar regressors or drop the ones that are problematic.

There are other ways to handle high VIF, such as principal component analysis (PCA), but I haven't explored those yet; I am open to suggestions.
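For reference, a minimal sketch (not from this PR; the column names and data are made up) of how statsmodels reports a VIF per regressor of a design matrix, with a near-duplicate column to trigger the high-collinearity case:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy design matrix (hypothetical data): two near-identical regressors plus
# an independent one, so the collinear pair should show a large VIF.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
dm = pd.DataFrame({
    "tapping_left": x,
    "tapping_left_copy": x + 0.01 * rng.standard_normal(100),  # near-duplicate
    "drift": rng.standard_normal(100),
})

vifs = {name: variance_inflation_factor(dm.values, i)
        for i, name in enumerate(dm.columns)}
for name, vif in vifs.items():
    flag = "high collinearity" if vif > 4 else "ok"
    print(f"{name}: VIF {vif:.2f} ({flag})")
```

VIF_i is 1 / (1 - R²_i), where R²_i comes from regressing column i on the remaining columns, so a near-duplicate pair drives both of their VIFs up together.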

Member

@larsoner left a comment


Looks like a good start!

import mne
import numpy as np
from mne.utils import logger
from statsmodels.stats.outliers_influence import variance_inflation_factor
Member


statsmodels is not currently a dependency, so this functionality should probably be optional (or we could start to require statsmodels, but for a single check in a single function that seems like overkill). So can you:

  1. Nest this import in a try/except, then only proceed with the VIF analysis if it's present (and logger.info that the check was skipped if it wasn't done)
  2. Add statsmodels to some dependency group so that some tests / CIs use it
  3. Add some test that when statsmodels is installed (e.g., using pytest.importorskip("statsmodels") at the top of the test) the VIF is reported for some make_first_level_design_matrix calls? Bonus points if you add one that logger.warns because it's bad, or find one that's already bad and assert it.

We could add a new group full to the optional-dependencies here that has statsmodels in it for example

docs = ["sphinx", "sphinx-gallery", "pydata_sphinx_theme", "numpydoc", "matplotlib"]
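That could look like the following pyproject.toml fragment (group name full as suggested above):

```toml
[project.optional-dependencies]
docs = ["sphinx", "sphinx-gallery", "pydata_sphinx_theme", "numpydoc", "matplotlib"]
full = ["statsmodels"]
```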

Then make sure at least some of the CIs use it.

And CircleCI should definitely use it, so we can see the output in some examples...

Author


Hey, thank you so much for your feedback!

I think we should opt out of statsmodels: I wrote an equivalent implementation of variance_inflation_factor and verified it on some test data, which gave results equivalent to the statsmodels functionality.

Let me know what you think.

Member


Yeah if it's simple enough feel free to do it that way

You could even write a little helper test that uses statsmodels (only in the test!) to verify equivalence of the helper function if you want (and then have CIs install it just for testing)

Author


Sounds good, I will add a little helper test.

@larsoner
Member

Let me know when I should look!

@kikiluvbrains
Author

@larsoner I am ready for you to take a look; currently I am trying to pass all checks.

@kikiluvbrains
Author

@larsoner

I do have an optional flag, "vif_export", as a way to test against statsmodels, but it causes this error in some of the HTML examples, as shown in CircleCI. Let me know what you think and the best way I can patch this:

        ../examples/general/plot_11b_kf2_fingertapping.py failed leaving traceback:
    
        Traceback (most recent call last):
          File "/home/circleci/project/examples/general/plot_11b_kf2_fingertapping.py", line 416, in <module>
            design_matrix = make_first_level_design_matrix(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          File "/home/circleci/project/mne_nirs/experimental_design/_experimental_design.py", line 106, in make_first_level_design_matrix
            dm = make_first_level_design_matrix(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        TypeError: make_first_level_design_matrix() got an unexpected keyword argument 'vif_export'

import numpy as np

# for the comparison of vif we need these two libraries
from statsmodels.stats.outliers_influence import variance_inflation_factor
Member


This should be imported inside the test in case people don't have it installed, and in the test you can have:

pytest.importorskip("statsmodels")

Author


will do

# whereas statsmodels has its own implementation before extracting the VIF values
# note VIF will come with a level of uncertainty of +/- 0.05 of what is reported
for key in vif:
<<<<<<< HEAD
Member


Looks like some problem with rebasing

Comment on lines +77 to +78
vif export : bool, optional
deafult set to false, if set to True will export vif values;
Member


Hmm... this probably shouldn't be part of the public API, so let's remove it

If we really had to have a way to get the values for the test, we would create a private _make_first_level_design_matrix with this option available; the public make_first_level_design_matrix would always call it with return_vif=False, and the test could use the private one with return_vif=True.

Comment on lines +148 to +153
for name, vif_idx in zip(predictor_names, vif_all):
msg = f"{name} with VIF of {vif_idx:.3f}"
if vif_idx > 4:
logger.warning("High collinearity " + msg)
else:
logger.info(msg)
Member


You are already logging the VIFs so in the test you can do something like the following to recover the values:

from mne.utils import use_log_level, catch_logging

...
def some_test():
    ...
    with use_log_level("info"), catch_logging() as log:
        ... make_first_level_design_matrix(...)
    log = log.getvalue()
    vifs = np.array([line.split()[-1] for line in log.splitlines() if " VIF " in line], float)
