Description
This bit has been harder to compare than earlier parts, but I think I can say with a high degree of confidence now that RANSAC correlations differ between PyPREP and MATLAB PREP.
Before I get into the details, some other relevant stuff I've learned:
- MatPREP's internal
spherical_interpolation
function and MNE's internal_make_interpolation_matrix
functions produce identical outputs, given the same inputs! This is a huge relief, and one less thing we have to worry about or figure out. - Conversely, there's something weird about how MNE's
mne.channels.read_custom_montage
imports.elp
files, like the one bundled with EEGLAB we've been using for the MatPREP data. Specifically, if I just load my pre-cleanlined and montage'd .set/.fdt files into MNE, the Cartesian X/Y/Z coordinates for the electrodes more-or-less line up with MATLAB (the X and Y axes are swapped) and the interpolation matrices match exactly. However, if I re-read thestandard-10-5-cap385.elp
montage into MNE and apply it to the file, the coordinates and interpolation matrices both differ quite a bit. Not sure what this is all about, but regardless I don't think it's a PyPREP bug.
Anyway, the gist of the difference (and why it's been so hard to dig into): MatPREP and PyPREP have two very different approaches of getting the same RANSAC correlation results. Here's a quick overview:
MATLAB PREP's RANSAC:
- Choose random channels for each RANSAC sample.
- Generate interpolation matrix for all channels for each RANSAC sample.
- For each 5 second RANSAC window in the data, calculate the predicted signal for each channel across each of the 50 RANSAC samples, and then correlate the predicted values with the actual values to get a correlation for each channel in that window.
- After calculating a matrix of correlations of size [number_of_ransac_windows, number_of_channels], apply thresholding and identify channels that fall below target.
PyPREP's RANSAC:
- Choose random channels for each RANSAC sample.
- Determine the amount of available RAM and split the channels-to-predict into chunks accordingly.
- For each chunk of channels to predict:
- For each of the 50 RANSAC samples, calculate the interpolation matrix and predict the full signal for those channels, saving the output into one large
eeg_predictions
matrix. - After interpolating the signal for all channels in the chunk for all 50 samples, calculate the median EEG signal for each channel from those 50 individual predictions.
- For each 5 second RANSAC window in the data, calculate the correlation between the median predicted signal and the actual signal for the channels in that chunk.
- For each of the 50 RANSAC samples, calculate the interpolation matrix and predict the full signal for those channels, saving the output into one large
- After calculating a matrix of correlations of size [number_of_ransac_windows, number_of_channels], apply thresholding and identify channels that fall below target.
Anyway, the output correlations of each implementation differ, even with the same input. I know there's a deliberate difference somewhere regarding a median calculation, but I couldn't remember what it was and given how different the code layout is it hasn't popped out to me. @sappelhoff @yjmantilla @jasmainak do any of you remember what this difference was, so I can try reversing it and see if it makes things match up?
Also regardless of other factors, the MatPREP approach seems to have much lower RAM demands since at no point are they estimating signals across the whole duration of the file. Should we consider changing PyPREP to work in a similar manner?