Description
Describe the bug
lfc_shrink() generates wrong results. In most cases it behaves as expected, but for some genes it increases the absolute value of the fold change. For one differentially expressed gene with high expression (base_mean = 407277.128029) it even changes the sign (log2FoldChange = -0.753026; shrankLog2FoldChange = 1.844100).
To Reproduce
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
inference = DefaultInference(n_cpus=32)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata_df,
    design_factors='condition',
    inference=inference,
)
dds.deseq2()
stat_res = DeseqStats(
    dds,
    contrast=['condition', 'B', 'A'],
    inference=inference
)
stat_res.summary()
stat_res_df = stat_res.results_df.copy()
stat_res.lfc_shrink('condition_B_vs_A')
stat_res_shrinked_df = stat_res.results_df.copy()
stat_res_shrinked_df = stat_res_shrinked_df.rename(columns={'log2FoldChange': 'log2FoldChange_shrank'})
merged_df = pd.merge(stat_res_df, stat_res_shrinked_df, left_index=True, right_index=True)
merged_df = merged_df[(merged_df['padj_x'].notna())]
sns.scatterplot(
    merged_df,
    x='log2FoldChange',
    y='log2FoldChange_shrank'
)
# replace the sns.scatterplot() with code below to get rid of seaborn dependency
# plt.scatter(
#     x=merged_df['log2FoldChange'],
#     y=merged_df['log2FoldChange_shrank']
# )
plt.axline((0, 0), (1, 1))
Fitting size factors...
... done in 0.02 seconds.
Fitting dispersions...
... done in 1.83 seconds.
Fitting dispersion trend curve...
... done in 0.55 seconds.
Fitting MAP dispersions...
... done in 1.79 seconds.
Fitting LFCs...
... done in 2.18 seconds.
Calculating cook's distance...
... done in 0.02 seconds.
Replacing 0 outlier genes.
Running Wald tests...
... done in 2.47 seconds.
Fitting MAP LFCs...
Log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue \
GENEID
ENSG00000000003.15 6312.552040 -0.406453 0.092641 -4.387384 0.000011
ENSG00000000005.6 5.337949 -5.852983 2.030001 -2.883242 0.003936
ENSG00000000419.14 3926.067459 0.396839 0.089676 4.425236 0.000010
ENSG00000000457.14 986.626013 -0.033945 0.132989 -0.255247 0.798532
ENSG00000000460.17 3054.441682 -0.008417 0.079740 -0.105558 0.915933
... ... ... ... ... ...
ENSG00000289714.1 0.000000 NaN NaN NaN NaN
ENSG00000289715.1 0.000000 NaN NaN NaN NaN
ENSG00000289716.1 55.728811 1.269762 0.336892 3.769053 0.000164
ENSG00000289718.1 0.000000 NaN NaN NaN NaN
ENSG00000289719.1 11.956688 2.075506 0.837589 2.477953 0.013214
padj
GENEID
ENSG00000000003.15 0.000048
ENSG00000000005.6 0.010336
ENSG00000000419.14 0.000041
ENSG00000000457.14 0.863010
ENSG00000000460.17 0.944234
... ...
ENSG00000289714.1 NaN
ENSG00000289715.1 NaN
ENSG00000289716.1 0.000564
ENSG00000289718.1 NaN
ENSG00000289719.1 0.030276
[61125 rows x 6 columns]
/home/.../pydeseq2/utils.py:1260: RuntimeWarning: overflow encountered in exp
counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))
Shrunk log2 fold change & Wald test p-value: condition B vs A
baseMean log2FoldChange lfcSE stat pvalue \
GENEID
ENSG00000000003.15 6312.552040 -0.405783 0.092409 -4.387384 0.000011
ENSG00000000005.6 5.337949 -6.123571 2.817030 -2.883242 0.003936
ENSG00000000419.14 3926.067459 0.387581 0.089467 4.425236 0.000010
ENSG00000000457.14 986.626013 -0.035274 0.131941 -0.255247 0.798532
ENSG00000000460.17 3054.441682 -0.007822 0.079505 -0.105558 0.915933
... ... ... ... ... ...
ENSG00000289714.1 0.000000 NaN NaN NaN NaN
ENSG00000289715.1 0.000000 NaN NaN NaN NaN
ENSG00000289716.1 55.728811 1.162090 0.337499 3.769053 0.000164
ENSG00000289718.1 0.000000 NaN NaN NaN NaN
ENSG00000289719.1 11.956688 1.440262 0.866340 2.477953 0.013214
padj
GENEID
ENSG00000000003.15 0.000048
ENSG00000000005.6 0.010336
ENSG00000000419.14 0.000041
ENSG00000000457.14 0.863010
ENSG00000000460.17 0.944234
... ...
ENSG00000289714.1 NaN
ENSG00000289715.1 NaN
ENSG00000289716.1 0.000564
ENSG00000289718.1 NaN
ENSG00000289719.1 0.030276
[61125 rows x 6 columns]
... done in 1.97 seconds.
Expected behavior
No increase in the absolute value of the fold change (if I understand correctly what apeGLM shrinkage does).
No change in the direction of the fold change for very abundant differentially expressed genes.
No instability depending on the names of the conditions.
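The expectations above can be checked programmatically. The following is a minimal sketch, assuming a merged table with the column names used in the reproduction script (log2FoldChange and log2FoldChange_shrank). The helper name flag_suspicious_shrinkage and the row label high_expression_gene are hypothetical, since the report does not give the ID of the highly expressed gene.

```python
import pandas as pd


def flag_suspicious_shrinkage(merged_df, raw_col="log2FoldChange",
                              shrunk_col="log2FoldChange_shrank", tol=1e-8):
    """Return rows where shrinkage grew |LFC| or flipped its sign.

    Hypothetical helper for illustration; not part of PyDESeq2.
    """
    raw = merged_df[raw_col]
    shrunk = merged_df[shrunk_col]
    out = merged_df[[raw_col, shrunk_col]].copy()
    # Shrinkage is expected to pull estimates towards zero, not away from it
    out["abs_increased"] = shrunk.abs() > raw.abs() + tol
    # A negative product means the two estimates have opposite signs
    out["sign_flipped"] = (raw * shrunk) < 0
    return out[out["abs_increased"] | out["sign_flipped"]]


# Values copied from the report; "high_expression_gene" is a placeholder label
toy = pd.DataFrame(
    {
        "log2FoldChange": [-0.406453, -5.852983, -0.753026],
        "log2FoldChange_shrank": [-0.405783, -6.123571, 1.844100],
    },
    index=["ENSG00000000003.15", "ENSG00000000005.6", "high_expression_gene"],
)
flagged = flag_suspicious_shrinkage(toy)
print(flagged)
```

Rows with NaN fold changes compare as False in both checks, so genes with zero counts are not flagged.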
Desktop (please complete the following information):
- OS: Ubuntu 20.04.3 LTS
- pydeseq2: 0.4.12
Additional context
The behaviour is not stable with respect to the names of the groups. If I swap A and B, the values are different and the overflow warning doesn't appear, but the results are still suspicious. I cannot share the data publicly, but I might be able to share it without gene IDs. The strangest and most unstable behaviour was seen in genes with high expression.
Activity
BorisMuzellec commented on Dec 5, 2024
Hi @martinPasen, thanks for reporting this bug.
I understand that you cannot share your data, but do you think you could design a dummy example from it (maybe a few dummy genes and samples) on which you encounter similar issues?
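One way to build such a dummy dataset is sketched below. It draws synthetic negative binomial counts with NumPy and lays them out as samples by genes, with a matching metadata table. All numbers here (group size, gene count, mean range, dispersion, seed) are assumptions chosen for illustration, and nothing guarantees that this particular data triggers the issue.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_per_group, n_genes = 5, 20

# Hypothetical per-gene mean counts; the last gene is very highly expressed
base_means = np.concatenate([rng.uniform(10, 5000, n_genes - 1), [4e5]])
dispersion = 0.1  # assumed common negative binomial dispersion

# NumPy parameterizes the negative binomial by (n, p), with mean n * (1 - p) / p.
# Setting n = 1 / dispersion and p = n / (n + mean) gives the requested mean.
n_param = 1.0 / dispersion
p_param = n_param / (n_param + base_means)
counts = rng.negative_binomial(n_param, p_param, size=(2 * n_per_group, n_genes))

samples = [f"sample{i}" for i in range(2 * n_per_group)]
genes = [f"gene{j}" for j in range(n_genes)]

# Rows are samples and columns are genes
counts_df = pd.DataFrame(counts, index=samples, columns=genes)
metadata_df = pd.DataFrame(
    {"condition": ["A"] * n_per_group + ["B"] * n_per_group}, index=samples
)
```

The resulting counts_df and metadata_df can be passed to the reproduction script in place of the private data.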
martinPasen commented on Dec 6, 2024
Hi @BorisMuzellec,
thanks for the reply, and thank you all for a very nice package.
This is a self-contained example:
BorisMuzellec commented on Dec 6, 2024
Thanks, I've been able to reproduce the issue. I'll try to give it a look next week.
BorisMuzellec commented on Dec 6, 2024
@martinPasen it seems like decreasing ftol and gtol in the nbinomGLM solver does the trick:
DESeq2 output:
PyDESeq2 output (with the new convergence criteria):
I'll try to open a PR on Monday if I have time
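To illustrate why tighter ftol and gtol values can matter, here is a minimal sketch using SciPy's L-BFGS-B solver, which accepts both options. That the solver inside nbinomGLM behaves the same way is an assumption, and the Rosenbrock function is only a stand-in objective, not the apeGLM likelihood.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# Loose tolerances: the solver may stop while still far from the optimum
loose = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                 options={"ftol": 1e-2, "gtol": 1e-2})

# Tight tolerances: the solver keeps iterating from the same path
tight = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                 options={"ftol": 1e-12, "gtol": 1e-9})

print("loose:", loose.nit, "iterations, objective", loose.fun)
print("tight:", tight.nit, "iterations, objective", tight.fun)
```

The tolerances only change the stopping test, so the tight run follows the same iterates as the loose one and simply continues for longer.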
nbinomGLM returns before convergence #349
wgsim commented on Feb 27, 2025
Hi PyDESeq2 team. Has this bug been fixed? I see the same behaviour with my data: larger absolute values in the shrunk results, along with the same runtime warning:
"RuntimeWarning: overflow encountered in exp
counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))"
I don't see any large sign changes, though.
OS: Apple Silicon M2 Max
pyDESeq2 Version: 0.5.0
Environment: Conda
P.S. Could this happen because I used only a subset of the samples for the group comparison?
I have 35 samples in total, and I got the warning above when I compared two groups of 5 samples each without filtering out any samples. However, when I removed the other 20 samples before processing, the warning did not appear.
BorisMuzellec commented on Feb 28, 2025
Hi @wgsim, thanks for reporting this. This looks like a numerical stability issue.
Could you try forking this PR (#370) and re-running your code?
Alternatively, could you provide an example with some data (it can be synthetic) for which you're having the same issue?
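The numerical stability concern can be illustrated on the expression quoted in the warning. This is a sketch under assumptions: the function names and toy values are hypothetical, and it is not the change made in the PR. It compares the quoted form with an algebraically equivalent one that uses np.logaddexp, so that exp is never applied to a large positive argument.

```python
import warnings

import numpy as np


def grad_term_naive(counts, size, xbeta, offset):
    # Same form as the expression quoted in the RuntimeWarning
    return counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))


def grad_term_stable(counts, size, xbeta, offset):
    # log(1 + size * exp(-z)) computed without forming exp(-z) directly
    log_denom = np.logaddexp(0.0, np.log(size) - xbeta - offset)
    return counts - (counts + size) * np.exp(-log_denom)


# Toy inputs; the second xbeta makes exp(800) overflow in the naive form
counts = np.array([10.0, 10.0])
size = np.array([2.0, 2.0])
offset = np.array([0.0, 0.0])
xbeta = np.array([1.0, -800.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive = grad_term_naive(counts, size, xbeta, offset)
naive_warned = any("overflow" in str(w.message) for w in caught)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stable = grad_term_stable(counts, size, xbeta, offset)
stable_warned = any("overflow" in str(w.message) for w in caught)

print(naive, naive_warned)
print(stable, stable_warned)
```

In this toy case both forms return the same values, because the overflowed term tends to the correct limit. The warning is therefore better read as a sign that the linear predictor has reached extreme values during optimization than as proof that this particular expression is wrong.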
wgsim commented on Feb 28, 2025
Hi @BorisMuzellec, thank you for the suggestion. I tried #370 and re-ran my code, but the results are identical and the same errors appear.
I've attached two files for testing: a count matrix and a metadata file. I ran DESeq2 between group1 and group2.
test_data_for_pydeseq2.txt
sample.metadata.test_for_pydeseq2.txt
Thank you!