Mean and std for float32 dataframe #22385

Open
@m-rossi

Description

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame(np.ones(60000).astype('float32') * 1e6)

print(np.mean(data))
print(np.mean(data.values))

print(np.std(data))
print(np.std(data.values))

which results in the following output

0    999645.0
dtype: float32
1000000.06
0    354.936188
dtype: float32
0.0625

Problem description

I have a pandas DataFrame of 32-bit floating-point numbers, and computing its mean and std can go horribly wrong. I am able to reproduce the issue with the example above.

I know pandas uses internal functions for mean and std instead of using the numpy functions. This seems to be the issue here. If I add .values the result is a lot more accurate. The problem disappears if I use float64.
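The size of the error is consistent with how the sum is accumulated. The following is a minimal sketch, assuming a plain left-to-right running sum kept in float32 (emulated here with np.cumsum; this is an illustration, not necessarily what pandas or its dependencies do internally), next to NumPy's default float32 reduction and the same running sum kept in float64:

```python
import numpy as np

# Same data as the report: 60000 float32 values, all exactly 1e6.
x = np.full(60000, 1e6, dtype=np.float32)

# Left-to-right running sum kept in float32. Once the total is large,
# each added 1e6 is rounded to the accumulator's coarse spacing.
sequential_f32_mean = np.cumsum(x, dtype=np.float32)[-1] / np.float32(x.size)

# The same running sum, but with a float64 accumulator.
sequential_f64_mean = np.cumsum(x, dtype=np.float64)[-1] / x.size

# NumPy's own float32 reduction on the raw array.
default_mean = x.mean()

print(sequential_f32_mean, sequential_f64_mean, default_mean)
```

On this data the float32 running sum drifts visibly from 1e6, while the other two stay close to it, which matches the observation that the problem disappears with float64.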

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

TomAugspurger (Contributor) commented on Aug 16, 2018

Looks like a bottleneck issue:

In [6]: import bottleneck as bn

In [7]: bn.nanmean
Out[7]: <function bottleneck.reduce.nanmean>

In [8]: bn.nanmean(data.values)
Out[8]: 999645.0

You may want to check with them to see if this is expected.

You can disable bottleneck by setting pd.options.compute.use_bottleneck to False.
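A short sketch of that workaround, using the data from the report. The option name is the one mentioned above; the variable names are only illustrative, and which code path pandas takes with the option off depends on the installed pandas version:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.full(60000, 1e6, dtype=np.float32))

# Switch bottleneck off only inside this block; the previous setting
# is restored when the block exits.
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bn = data.mean()
    std_without_bn = data.std()

# To change it for the rest of the session instead:
# pd.set_option("compute.use_bottleneck", False)

print(mean_without_bn)
print(std_without_bn)
```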

added this to the No action milestone on Aug 16, 2018

TomAugspurger (Contributor) commented on Aug 16, 2018

Perhaps a known issue: pydata/bottleneck#193

That references nansum / nanmean, but not nanstd.

I don't think it's worth silently astyping to float64 when people have float32 data and are using bottleneck, but I would be curious to hear others' thoughts.
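For context on the trade-off, NumPy lets a caller request a float64 accumulator without converting the stored data. The sketch below contrasts that with an explicit astype copy. It only illustrates the two options on a raw array and does not describe what pandas or bottleneck do:

```python
import numpy as np

x = np.full(60000, 1e6, dtype=np.float32)

# Option 1: upcast the whole array first. This allocates a float64 copy
# that is twice the size of the original data.
upcast = x.astype(np.float64)
mean_via_copy = upcast.mean()

# Option 2: leave the data as float32 and only accumulate in float64.
mean_via_accumulator = x.mean(dtype=np.float64)

print(x.nbytes, upcast.nbytes, mean_via_copy, mean_via_accumulator)
```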

m-rossi (Author) commented on Aug 16, 2018

Thanks, I will use your suggestion for now and we should watch the issue at bottleneck. But as I said, for my real data the values were horribly wrong, especially for the std.

TomAugspurger (Contributor) commented on Aug 16, 2018

Indeed. My usual inclination is to fix this upstream and not add temporary workarounds, but given the incorrectness of the result, workarounds may be worth considering.

agoodm commented on Aug 17, 2018

I am the one who opened the associated issue in the bottleneck repo because it led to significant errors in a workflow I was running with xarray; see pydata/xarray#2370

Yes it also affects nanstd, and yes it can lead to far more significant errors than what OP is showing under the right conditions (a large enough sample with a relatively narrow distribution). The xarray example I posted showed the standard deviation being nearly two orders of magnitude higher than its actual value.
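To illustrate why the standard deviation can suffer even more than the mean, here is a sketch that assumes a two-pass std whose running sums are kept in float32. The helper sequential_std_float32 is hypothetical and written for this example; it is not bottleneck's source. An inaccurate mean shifts every deviation by the same amount, and that bias is squared and summed rather than averaged out:

```python
import numpy as np


def sequential_std_float32(x):
    """Two-pass std with left-to-right float32 running sums (illustrative only)."""
    n = np.float32(x.size)
    # First pass: mean from a float32 running sum.
    mean = np.cumsum(x, dtype=np.float32)[-1] / n
    # Second pass: squared deviations from that (possibly inaccurate) mean.
    sq_dev = (x - mean) ** 2  # stays float32
    return np.sqrt(np.cumsum(sq_dev, dtype=np.float32)[-1] / n)


# Constant data, so the true standard deviation is exactly zero.
x = np.full(60000, 1e6, dtype=np.float32)

naive_std = sequential_std_float32(x)
reference_std = x.std(dtype=np.float64)  # float64 intermediate arithmetic

print(naive_std, reference_std)
```

With constant input any nonzero result is pure accumulation error, which makes the effect easy to see.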

TomAugspurger (Contributor) commented on Aug 17, 2018

Thanks for the context. Let's coordinate with xarray here (cc @shoyer), though it still isn't clear to me what the right thing to do is.

removed this from the No action milestone on Aug 19, 2018

jbrockmendel (Member) commented on Jun 10, 2019

Any updates here?

TomAugspurger (Contributor) commented on Jun 13, 2019

I'm not sure what's best here. Earlier I implied upcasting float32 -> float64 so that we could use bottleneck. I don't think that's smart. Rather, we would keep float32 and use NumPy.

@agoodm did Xarray make a decision on what to do for float32?

agoodm commented on Jun 13, 2019

I'd ping them in the associated issue thread I linked earlier. As far as I know, nothing has changed on their end to address this, since the issue is still open. I think their sentiment was similar to yours with regard to automatically upcasting float32 -> float64.


Participants: @TomAugspurger, @m-rossi, @agoodm, @jbrockmendel, @gfyoung