Description
Code Sample
import pandas as pd
import numpy as np

# 60,000 float32 values, all exactly 1e6 (1e6 is exactly representable in float32)
data = pd.DataFrame(np.ones(60000).astype('float32') * 1e6)

print(np.mean(data))         # dispatches to DataFrame.mean (uses bottleneck if installed)
print(np.mean(data.values))  # plain NumPy on the underlying float32 array
print(np.std(data))          # dispatches to DataFrame.std
print(np.std(data.values))   # plain NumPy
which results in the following output:
0    999645.0
dtype: float32
1000000.06
0    354.936188
dtype: float32
0.0625
Problem description
I have a pandas DataFrame with 32-bit floating-point numbers inside and try to calculate the mean and std values; these calculations can go horribly wrong. I am, however, able to reproduce the issue with the simple example above.
I know pandas uses internal functions for mean and std instead of the NumPy functions. This seems to be the issue here. If I add .values, the result is a lot more accurate. The problem also disappears if I use float64.
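For reference, a minimal sketch of the float64 workaround mentioned above (the expected output in the comments follows from every value being exactly 1e6; this snippet is not part of the original report):

# Upcasting to float64 before reducing avoids the float32 accumulation error
data64 = data.astype('float64')
print(np.mean(data64))  # 0    1000000.0 -- exact
print(np.std(data64))   # 0    0.0 -- the true std of constant data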
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252
pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Activity
TomAugspurger commented on Aug 16, 2018
Looks like a bottleneck issue. You may want to check with them to see if this is expected. You can disable bottleneck with pd.options.compute.use_bottleneck.
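A minimal sketch of doing that via pandas' documented option API, assuming the goal is to switch the option off:

import pandas as pd

# Fall back to pandas' NumPy-based implementations instead of bottleneck
pd.set_option('compute.use_bottleneck', False)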
TomAugspurger commented on Aug 16, 2018
Perhaps a known issue: pydata/bottleneck#193
That references nansum / nanmean, but not nanstd.
I don't think it's worth silently astyping to float64 when people have float32 data and are using bottleneck, but I would be curious to hear others' thoughts.
m-rossi commented on Aug 16, 2018
Thanks, I will use your suggestion for now, and we should watch the bottleneck issue. But as I said, for my real data the values were horribly wrong, especially for the std.
TomAugspurger commented on Aug 16, 2018
Indeed. My usual inclination is to fix this upstream and not add temporary workarounds, but given the incorrectness of the result, workarounds may be worth considering.
agoodm commented on Aug 17, 2018
I am the one who opened the associated issue in the bottleneck repo, because it led to significant errors in a workflow I was doing with xarray; see pydata/xarray#2370.
Yes, it also affects nanstd, and yes, it can lead to far more significant errors than what the OP is showing under the right conditions (a large enough sample with a relatively narrow distribution). The xarray example I posted showed the standard deviation being nearly two orders of magnitude higher than its actual value.
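A minimal sketch of that failure mode, assuming bottleneck is installed (exact numbers vary by run and bottleneck version; the array here is illustrative, not the one from the linked xarray issue):

import bottleneck as bn
import numpy as np

# Large float32 sample with a narrow distribution around a large offset
x = np.random.normal(loc=1e6, scale=1.0, size=1_000_000).astype('float32')

print(bn.nanstd(x))  # can come out wildly inflated: bottleneck accumulates in float32
print(np.nanstd(x))  # NumPy's pairwise summation typically stays close to 1.0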
TomAugspurger commented on Aug 17, 2018
Thanks for the context. Let's coordinate with xarray here (cc @shoyer), though it still isn't clear to me what the right thing to do is.
jbrockmendel commented on Jun 10, 2019
Any updates here?
TomAugspurger commented on Jun 13, 2019
I'm not sure what's best here. Earlier I suggested upcasting float32 -> float64 so that we could use bottleneck. I don't think that's smart; rather, we would keep float32 and use NumPy.
@agoodm did Xarray make a decision on what to do for float32?
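A hypothetical sketch of that proposal (illustrative only; mean_keeping_float32 is not a real pandas function or code path):

import numpy as np
import pandas as pd

def mean_keeping_float32(s):
    # Proposed behavior: for float32 data, skip bottleneck and let NumPy's
    # pairwise summation reduce in the original dtype
    if s.dtype == np.float32:
        return np.mean(s.values)
    return s.mean()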
agoodm commented on Jun 13, 2019
I'd ping them in the associated issue thread I linked earlier. As far as I know, nothing has been changed on their end to address this, since the issue is still open. I think their sentiment was similar to yours with regard to automatically upcasting float32 -> float64.