BUG: parquet serialization/deserialization adds all dict keys into column #56842

Open
@arogozhnikov

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({'dictcol': [{'a': 1}, {'b': 2}, {'c': None}]}).to_parquet('/tmp/data.pqt')
pd.read_parquet('/tmp/data.pqt')
# the loaded dataframe contains all keys ('a', 'b', 'c') in every row

Issue Description

I have a column of type dict[str, int]. If I save the dataframe to parquet and load it back, every entry in the column is filled with all keys.

So there are two problems:
1. The loaded dataframe does not faithfully represent what was saved.
2. The data blows up in size, because many keys are present in only one or two rows but end up in every row.

Maybe relevant (not sure): #55776

Expected Behavior

Saved and loaded dataframes are identical.
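
Until the round trip behaves this way, one workaround (a sketch, not something pandas does for you) is to store each dict as a JSON string and decode it after loading. The helper names `encode_dicts` and `decode_dicts` are hypothetical and operate on plain lists so the idea can be checked without writing a file.

```python
import json

def encode_dicts(values):
    """Serialize each dict to a JSON string so it is stored as plain text."""
    return [json.dumps(v) for v in values]

def decode_dicts(values):
    """Parse the JSON strings back into dicts after loading."""
    return [json.loads(s) for s in values]

original = [{"a": 1}, {"b": 2}, {"c": None}]
encoded = encode_dicts(original)   # list of str, safe to store as a string column
restored = decode_dicts(encoded)
print(restored == original)        # True
```

With a dataframe, the same idea would be applied as `df['dictcol'].map(json.dumps)` before `to_parquet` and `.map(json.loads)` after `read_parquet`. This only works for JSON-serializable values, and non-string keys would come back as strings.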

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.24.3
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.3.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None
