[BUG] cuDF-Pandas reads missing values as <NA> whereas Pandas reads as NaN #18504

Open
@cdeotte

Describe the bug
When the cudf.pandas magic command is used to accelerate Pandas with zero code changes, CSV files are read differently.

In Pandas, missing values are read as NaN; in cuDF-Pandas, they are read as <NA>. This makes a big difference downstream when users combine columns with df[COL1].astype('str')+'_'+df[COL2].astype('str'). In Pandas, the result keeps both values: for example, when COL1=NaN and COL2=3.0, the combined string is nan_3.0. In cuDF-Pandas, the result for such a row is always <NA>, so the information in COL2 is lost.
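The propagation described above can be sketched in plain Pandas, without a GPU. This is only an illustration under an assumption: that a cuDF-Pandas null behaves like a Pandas nullable <NA> in this string combination, which is what the report describes. The Series values are made up for the example.

```python
import pandas as pd

# Assumption: a cuDF-Pandas null propagates like pandas' nullable <NA> here.
a = pd.Series([1.0, None], dtype="Float64")  # second value is <NA>
b = pd.Series([4, 7], dtype="Int64")

# <NA> survives the cast to string and then swallows the whole concatenation,
# so the value 7 from column b is lost in the second row.
combined = a.astype("string") + "_" + b.astype("string")
print(combined)
```

The first row combines normally, while the second row is <NA> even though b holds a valid value there.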

Steps/Code to reproduce bug

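import pandas as pd  # for the cuDF-Pandas case, run %load_ext cudf.pandas first

csv_content = """a,b
1,4
2,5
3,6
,7
"""
with open("example.csv", "w") as f:
    f.write(csv_content)
df = pd.read_csv("example.csv")
print(df)  # last row of column a: NaN in Pandas, <NA> in cuDF-Pandas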

Expected behavior
It would be nice if cuDF-Pandas matched the behavior of Pandas and read missing values as NaN.
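Until the two backends agree, one possible workaround is to write the missing value as text explicitly before combining, so the result does not depend on how the backend represents it. The sketch below is an assumption, not a confirmed fix: combine_as_str is a hypothetical helper (not part of Pandas or cuDF), and it has only been reasoned through against plain Pandas.

```python
import io

import pandas as pd

# Same CSV as in the reproduction: column "a" has one missing value.
csv_content = "a,b\n1,4\n2,5\n3,6\n,7\n"
df = pd.read_csv(io.StringIO(csv_content))


def combine_as_str(frame, left, right, sep="_", na_token="nan"):
    """Join two columns as strings, writing missing values as na_token.

    Hypothetical helper, not part of pandas or cuDF.
    """
    parts = [
        # Cast to the nullable string dtype, then replace missing entries
        # with a literal token so they no longer propagate as <NA>.
        frame[col].astype("string").fillna(na_token)
        for col in (left, right)
    ]
    return parts[0] + sep + parts[1]


combined = combine_as_str(df, "a", "b")
print(combined.tolist())
```

Filling explicitly keeps the row with the missing value in a, at the cost of choosing the placeholder text yourself.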

Environment overview
RAPIDS '25.02.02'


Labels: Python (affects Python cuDF API), bug (something isn't working)
