Mean and std for float32 dataframe #22385

Open
@m-rossi

Code Sample

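import pandas as pd
import numpy as np

# 60000 identical float32 values, so the exact mean is 1e6 and the exact std is 0
data = pd.DataFrame(np.ones(60000).astype('float32') * 1e6)

# Mean: on the DataFrame (pandas reduction) vs. on the raw NumPy array
print(np.mean(data))
print(np.mean(data.values))

# Std: same comparison
print(np.std(data))
print(np.std(data.values))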

This produces the following output:

0    999645.0
dtype: float32
1000000.06
0    354.936188
dtype: float32
0.0625

Problem description

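I have a pandas DataFrame holding 32-bit floating-point numbers and want to calculate its mean and std. These calculations can go badly wrong: in the example above, the data is constant at 1e6, yet the reported mean is off by about 355 and the reported std is about 354.9 instead of 0. I am able to reproduce the issue with the example above.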

I know pandas uses its own internal functions for mean and std instead of the NumPy functions, and this seems to be the issue here. If I add .values, so that NumPy reduces the underlying array directly, the result is far more accurate. The problem disappears if I use float64.
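One plausible explanation, offered as an assumption rather than a confirmed diagnosis of the pandas code path, is that the internal reduction adds the values one after another in a float32 accumulator, whereas NumPy's `np.mean` uses a more accurate summation order. The sketch below uses NumPy only; `naive_float32_mean` is a hypothetical helper written for illustration and is not a pandas or NumPy function. It shows how much a strictly sequential float32 sum can drift on this data, and that requesting a float64 accumulator avoids the drift.

```python
import numpy as np

# Same data as the report: 60000 float32 values, each exactly 1e6.
values = np.ones(60000, dtype=np.float32) * np.float32(1e6)


def naive_float32_mean(arr):
    """Hypothetical helper: left-to-right sum rounded to float32 at every step."""
    total = np.float32(0.0)
    for x in arr:
        # Once the running total is large, each addition of 1e6 is rounded
        # to the nearest value float32 can represent, and the error accumulates.
        total = np.float32(total + x)
    return total / np.float32(len(arr))


naive_mean = naive_float32_mean(values)           # sequential float32 accumulation
numpy_mean = np.mean(values)                      # NumPy's float32 reduction
float64_mean = np.mean(values, dtype=np.float64)  # float64 accumulator
float64_std = np.std(values, dtype=np.float64)

print("naive float32 mean:", naive_mean)
print("np.mean (float32): ", numpy_mean)
print("np.mean (float64): ", float64_mean)
print("np.std  (float64): ", float64_std)
```

If this is indeed the mechanism, converting the frame with `data.astype('float64')` before reducing is the straightforward workaround, which matches the observation that the problem disappears with float64. pandas also exposes a `compute.use_bottleneck` option; whether switching it off changes the result in this case is untested here.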

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None
