Preprocessing Gap Analysis: What's Done vs. What's Missing

Summary from Paper (via Gemini + PDF)

Preprocessing Pipeline Described in Paper

Deconvolution: Fast non-negative deconvolution on ΔF/F₀ traces
Temporal downsampling: 2× downsampling (sum adjacent bins) → 0.276s bins
Locomotion filtering: Exclude trials with speed > 0.2 mm/s
Time integration: Integrate spike counts over [0.5s, 2.0s]
Mean response subtraction: "the mean stimulus-evoked response of the cell was subtracted from each trace after separating the trials for each of the two visual stimuli"
Correlation computation: Pearson correlation on residuals

What's Already in Your Dataset ✅

Based on the parquet file structure and your analysis:

✅ 1. Deconvolution - ALREADY DONE

Your amplitude column contains deconvolved spike estimates
Non-negative deconvolution ensures no negative values
Status: Pre-applied to data

✅ 2. Temporal Downsampling - ALREADY DONE

You have 14 time bins at 0.275s resolution (close to paper's 0.276s)
This matches the 2× downsampling mentioned
Status: Pre-applied to data

✅ 3. Locomotion Filtering - ALREADY DONE

No locomotion_speed column in dataset
Trial counts (217-332 per stimulus) match paper's post-filtering range
Status: Pre-applied to data

What You're Currently Doing ✅

✅ 4. Time Integration - CORRECTLY IMPLEMENTED

Your code (trial_filtering.py line 50):

.agg(pl.col('amplitude').sum().alias('integrated_amplitude'))

Status: ✅ Correctly summing over time bins [start_bin, end_bin]

✅ 5. Mean Response Subtraction - CORRECTLY IMPLEMENTED!

Your code (noise_correlations.py lines 56-63):

for stim in unique_stimuli:
    # Extract trials for this stimulus
    mask = stimulus_labels == stim
    responses_i = cell_i_responses[mask]
    responses_j = cell_j_responses[mask]
    
    # Remove mean (isolate noise)
    noise_i = responses_i - responses_i.mean()
    noise_j = responses_j - responses_j.mean()

Status: ✅ You ARE doing this correctly!

Separating trials by stimulus
Subtracting mean per stimulus per cell
Computing correlations on residuals

🤔 So Why the Discrepancy? (0.037 vs 0.06)

Since you're implementing the analysis pipeline correctly, the discrepancy must come from:

Hypothesis 1: Amplitude Column Preprocessing ⚠️

The amplitude values in your parquet file might have different preprocessing than what the paper used in their analysis.

What Gemini Found:

Paper used "fast non-negative deconvolution"
BUT: No details on exact algorithm or parameters

Possible issues:

Different deconvolution algorithm used to generate your dataset
Different deconvolution parameters (tau, threshold, etc.)
Additional preprocessing applied to dataset that wasn't mentioned
Dataset is from different analysis (e.g., Extended Data uses threshold 0.5)

Hypothesis 2: Time Window Difference ⚠️

Paper: [0.5s, 2.0s]
Your data: 14 bins at 0.275s = [0s, 3.85s] total

Current config (analysis_config.yaml):

integration_method: "discrete_extended"  # bins [1,7] = [0.275s, 2.20s]

Bins to seconds:

Bin 1: 0.275s
Bin 7: 2.20s (exclusive end)
Duration: 1.925s

Paper target: [0.5s, 2.0s] = 1.5s duration

Your window: [0.275s, 2.20s] = 1.925s duration

This is actually longer than the paper's window and includes earlier times. Let me calculate what bins exactly match [0.5s, 2.0s]:

0.5s / 0.275s = 1.818 → Start at bin 2 (0.55s)
2.0s / 0.275s = 7.273 → End at bin 7 (2.20s)

You should test bins [2,7] (conservative method) vs bins [1,7] (extended method).

Hypothesis 3: Spike Threshold ⚠️

From Gemini: Extended Data Fig 4c used threshold 0.5 for spike counts.

Current implementation: No thresholding applied.

Test: Apply threshold where amplitude < 0.5 is set to 0 before time integration.

Hypothesis 4: Data Source Mismatch ⚠️

Critical question: Is your dataset the exact same data used for Figure 2d/2e?

Possibilities:

Dataset might be for different figures
Dataset might be preliminary/processed version
Dataset might include trials/cells that were filtered out in paper

Action Plan: Systematic Testing

Test 1: Time Window (Quick)

You already have configs for this:

# Test bins [2,7] - conservative
integration_method: "discrete_conservative"

# vs bins [1,7] - extended  
integration_method: "discrete_extended"

Run both and compare:

# Edit config to use discrete_conservative
python run_analysis.py
# Note the mean correlation

# Edit config to use discrete_extended  
python run_analysis.py
# Note the mean correlation

Test 2: Spike Threshold (Medium)

Implement amplitude thresholding:

Add to trial_filtering.py before integration:

def apply_spike_threshold(
    data: pl.DataFrame,
    threshold: float = 0.5
) -> pl.DataFrame:
    """
    Apply spike detection threshold to amplitude values.
    
    Paper: Extended Data Fig 4c used threshold 0.5 for spike counts.
    
    Args:
        data: Neural data with 'amplitude' column
        threshold: Minimum amplitude to count as spike
        
    Returns:
        Data with thresholded amplitudes
    """
    return data.with_columns(
        pl.when(pl.col('amplitude') >= threshold)
        .then(pl.col('amplitude'))
        .otherwise(0.0)
        .alias('amplitude')
    )

Update config:

preprocessing:
  # Spike detection threshold
  spike_threshold:
    enabled: true
    value: 0.5  # From Extended Data Fig 4c

Test: Run with threshold 0.5 and compare.

Test 3: Ask Author (High Priority!)

Email the author with specific questions:

Hi [Author],

I'm working with the coding_fidelity_bounds.dataset.parquet file to recreate Figure 2d/2e from your Nature 2020 paper.

I've implemented the analysis pipeline correctly (per-stimulus mean subtraction, time integration [0.5s-2.0s], correlation averaging), but I'm getting mean noise correlation = 0.037 vs the paper's 0.06.

Questions about the dataset:

Does the amplitude column in the parquet file represent the raw output of fast non-negative deconvolution, or was additional preprocessing applied?

For the main noise correlation analysis (Fig 2d/2e), was any spike threshold applied to the amplitude values? (I see threshold 0.5 was used for Extended Data Fig 4c)

What are the exact time bin indices you used for the [0.5s, 2.0s] window? With 0.275s bins, I calculate:

0.5s / 0.275s = 1.82 → bin 2?

2.0s / 0.275s = 7.27 → bin 7?

Can you confirm the dataset is the exact data used for Figure 2d/2e?

Is there any additional preprocessing applied to amplitudes between deconvolution and correlation analysis that isn't mentioned in the methods?

Thanks for your help!

Current Status Summary

Step	Paper Description	Your Implementation	Status
Deconvolution	Fast non-negative	Pre-applied in dataset	✅ Done
Downsampling	2× (0.276s bins)	0.275s bins in data	✅ Done
Locomotion filter	Speed < 0.2 mm/s	Pre-applied (no column)	✅ Done
Time integration	[0.5s, 2.0s]	Bins [1,7] or [2,7]	⚠️ Test both
Mean subtraction	Per stimulus, per cell	`noise_i = responses_i - mean`	✅ Correct
Spike threshold	Mentioned for Extended Data Fig 4c	Not applied	⚠️ Test 0.5
Correlation	Pearson, averaged	`np.corrcoef()` averaged	✅ Correct

Most Likely Causes (Ranked)

Time window mismatch (bins [1,7] vs [2,7])
- Quick to test
- Could significantly affect results
Spike threshold 0.5 not applied
- Would reduce noise floor
- Increase correlation magnitude
Amplitude preprocessing in dataset differs from analysis
- Need author confirmation
- Can't fix without knowing what was done

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preprocessing Gap Analysis: What's Done vs. What's Missing

Summary from Paper (via Gemini + PDF)

Preprocessing Pipeline Described in Paper

What's Already in Your Dataset ✅

✅ 1. Deconvolution - ALREADY DONE

✅ 2. Temporal Downsampling - ALREADY DONE

✅ 3. Locomotion Filtering - ALREADY DONE

What You're Currently Doing ✅

✅ 4. Time Integration - CORRECTLY IMPLEMENTED

✅ 5. Mean Response Subtraction - CORRECTLY IMPLEMENTED!

🤔 So Why the Discrepancy? (0.037 vs 0.06)

Hypothesis 1: Amplitude Column Preprocessing ⚠️

Hypothesis 2: Time Window Difference ⚠️

Hypothesis 3: Spike Threshold ⚠️

Hypothesis 4: Data Source Mismatch ⚠️

Action Plan: Systematic Testing

Test 1: Time Window (Quick)

Test 2: Spike Threshold (Medium)

Test 3: Ask Author (High Priority!)

Current Status Summary

Most Likely Causes (Ranked)

FilesExpand file tree

PREPROCESSING_GAP_ANALYSIS.md

Latest commit

History

PREPROCESSING_GAP_ANALYSIS.md

File metadata and controls

Preprocessing Gap Analysis: What's Done vs. What's Missing

Summary from Paper (via Gemini + PDF)

Preprocessing Pipeline Described in Paper

What's Already in Your Dataset ✅

✅ 1. Deconvolution - ALREADY DONE

✅ 2. Temporal Downsampling - ALREADY DONE

✅ 3. Locomotion Filtering - ALREADY DONE

What You're Currently Doing ✅

✅ 4. Time Integration - CORRECTLY IMPLEMENTED

✅ 5. Mean Response Subtraction - CORRECTLY IMPLEMENTED!

🤔 So Why the Discrepancy? (0.037 vs 0.06)

Hypothesis 1: Amplitude Column Preprocessing ⚠️

Hypothesis 2: Time Window Difference ⚠️

Hypothesis 3: Spike Threshold ⚠️

Hypothesis 4: Data Source Mismatch ⚠️

Action Plan: Systematic Testing

Test 1: Time Window (Quick)

Test 2: Spike Threshold (Medium)

Test 3: Ask Author (High Priority!)

Current Status Summary

Most Likely Causes (Ranked)