Skip to content

[BUG] lfc_shrink() generates wrong results #347

Open
@martinPasen

Description

@martinPasen

Describe the bug
lfc_shrink() generates wrong results. In most cases it acts as expected. For some genes it increases absolute value of fold change. For one differentially expressed gene with high expression (base_mean = 407277.128029) it even changes the sign (
log2FoldChange = -0.753026; shrankLog2FoldChange = 1.844100).

To Reproduce

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

inference = DefaultInference(n_cpus=32)

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata_df,
    design_factors='condition',
    inference=inference,
)
dds.deseq2()

stat_res = DeseqStats(
    dds,
    contrast=['condition', 'B', 'A'],
    inference=inference
)
stat_res.summary()
stat_res_df = stat_res.results_df.copy()

stat_res.lfc_shrink('condition_B_vs_A')
stat_res_shrinked_df = stat_res.results_df.copy()
stat_res_shrinked_df = stat_res_shrinked_df.rename(columns={'log2FoldChange': 'log2FoldChange_shrank'})

merged_df = pd.merge(stat_res_df, stat_res_shrinked_df, left_index=True, right_index=True)
merged_df = merged_df[(merged_df['padj_x'].notna())]

sns.scatterplot(
    merged_df,
    x='log2FoldChange',
    y='log2FoldChange_shrank'
)

# replace the sns.scatterplot() with code below to get rid of seaborn dependency
# plt.scatter(
#     x=merged_df['log2FoldChange'],
#     y=merged_df['log2FoldChange_shrank']
#)

plt.axline((0, 0), (1,1))
Fitting size factors...
... done in 0.02 seconds.

Fitting dispersions...
... done in 1.83 seconds.

Fitting dispersion trend curve...
... done in 0.55 seconds.

Fitting MAP dispersions...
... done in 1.79 seconds.

Fitting LFCs...
... done in 2.18 seconds.

Calculating cook's distance...
... done in 0.02 seconds.

Replacing 0 outlier genes.

Running Wald tests...
... done in 2.47 seconds.

Fitting MAP LFCs...
Log2 fold change & Wald test p-value: condition B vs A
                       baseMean  log2FoldChange     lfcSE      stat    pvalue  \
GENEID                                                                          
ENSG00000000003.15  6312.552040       -0.406453  0.092641 -4.387384  0.000011   
ENSG00000000005.6      5.337949       -5.852983  2.030001 -2.883242  0.003936   
ENSG00000000419.14  3926.067459        0.396839  0.089676  4.425236  0.000010   
ENSG00000000457.14   986.626013       -0.033945  0.132989 -0.255247  0.798532   
ENSG00000000460.17  3054.441682       -0.008417  0.079740 -0.105558  0.915933   
...                         ...             ...       ...       ...       ...   
ENSG00000289714.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289715.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289716.1     55.728811        1.269762  0.336892  3.769053  0.000164   
ENSG00000289718.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289719.1     11.956688        2.075506  0.837589  2.477953  0.013214   

                        padj  
GENEID                        
ENSG00000000003.15  0.000048  
ENSG00000000005.6   0.010336  
ENSG00000000419.14  0.000041  
ENSG00000000457.14  0.863010  
ENSG00000000460.17  0.944234  
...                      ...  
ENSG00000289714.1        NaN  
ENSG00000289715.1        NaN  
ENSG00000289716.1   0.000564  
ENSG00000289718.1        NaN  
ENSG00000289719.1   0.030276  

[61125 rows x 6 columns]
/home/.../pydeseq2/utils.py:1260: RuntimeWarning: overflow encountered in exp
  counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))

Shrunk log2 fold change & Wald test p-value: condition B vs A
                       baseMean  log2FoldChange     lfcSE      stat    pvalue  \
GENEID                                                                          
ENSG00000000003.15  6312.552040       -0.405783  0.092409 -4.387384  0.000011   
ENSG00000000005.6      5.337949       -6.123571  2.817030 -2.883242  0.003936   
ENSG00000000419.14  3926.067459        0.387581  0.089467  4.425236  0.000010   
ENSG00000000457.14   986.626013       -0.035274  0.131941 -0.255247  0.798532   
ENSG00000000460.17  3054.441682       -0.007822  0.079505 -0.105558  0.915933   
...                         ...             ...       ...       ...       ...   
ENSG00000289714.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289715.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289716.1     55.728811        1.162090  0.337499  3.769053  0.000164   
ENSG00000289718.1      0.000000             NaN       NaN       NaN       NaN   
ENSG00000289719.1     11.956688        1.440262  0.866340  2.477953  0.013214   

                        padj  
GENEID                        
ENSG00000000003.15  0.000048  
ENSG00000000005.6   0.010336  
ENSG00000000419.14  0.000041  
ENSG00000000457.14  0.863010  
ENSG00000000460.17  0.944234  
...                      ...  
ENSG00000289714.1        NaN  
ENSG00000289715.1        NaN  
ENSG00000289716.1   0.000564  
ENSG00000289718.1        NaN  
ENSG00000289719.1   0.030276  

[61125 rows x 6 columns]
... done in 1.97 seconds.

Expected behavior
No increase in absolute value of fold change (if i understand correctly what apeGLM shrinking is doing).
No change in direction of fold change in very abundant differentially expressed genes.
No instability based on the name of conditions.

Screenshots
image

Desktop (please complete the following information):

  • OS: Ubuntu 20.04.3 LTS
  • pydeseq2: 0.4.12

Additional context
The behaviour is not stable based on the name of groups. If i exchange A to B the values are different and the overflow error doent pop up, however the results are still suspicious. The data i can not publicly share, but potentially i would be able to share them withthout geneids. The most weird and unstable behaviour was seen in genes with high expression.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions