BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float' #58707

kerrickstaley · 2024-05-13T16:25:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])
df_concat['a'].max()

Issue Description

The above code raises the exception:

TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

This code is a simplified version of some prod code that caused issues at my work.

I believe what's happening is df1 has dtype datetime[ns, America/New_York] for the column and df2 has dtype datetime[ns], and when you concat them, the resulting dtype is object. Then, .max() coerces pd.NaT to NaN, and you get a comparison between a Timestamp and a float.

Expected Behavior

I expect the code to return pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York').

I think Pandas should be robust and handle this case, even though the dtypes aren't perfectly "correct". I think the right place to fix this is in the .max() function: the code

max([pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT])

(where max is builtins.max) works fine, so you would also expect the Pandas equivalent to work.

You could maybe also make an argument that pd.concat should special-case this and return a column with dtype datetime64[ns, America/New_York], but I'm less sure about that.

Longer-term, I feel like Pandas should support datetime columns with heterogenous timezones; the requirement that timezones be the same for a whole column feels like an artificial constraint and many real-world datasets will naturally have heterogenous timezones.

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2e python : 3.12.3.final.0 python-bits : 64 OS : Darwin OS-release : 23.4.0 Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : 8.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

asishm · 2024-05-13T21:55:04Z

Thanks for the report. As a workaround, df_concat['a'].convert_dtypes().max() works

li00-lab · 2024-07-24T03:29:21Z

take

li00-lab · 2024-07-26T06:53:24Z

It seems like the original code, will change the data type to object after concatenating both dataframes, therefore the comparison df_concat['a'].max() failed.

import pandas as pd
df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])

print(df_concat.iloc[0])
print(df_concat.iloc[1])

#output
a    2024-05-13 12:00:00-04:00
Name: 0, dtype: object
a    NaT
Name: 0, dtype: object

However, if we have initialize the dataframe as, then the comparison works as the data type is datetime

df_concat = pd.DataFrame({
    "a": [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT]
})
df_concat['a'].max()  # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')

print(df_concat.iloc[0])
print(df_concat.iloc[1])

# Output
a   2024-05-13 12:00:00-04:00
Name: 0, dtype: datetime64[ns, America/New_York]
a   NaT
Name: 1, dtype: datetime64[ns, America/New_York]

For the workaround you provided, it will convert the datatype into datetime and the .max() works as well.

df_concat['a'].convert_dtypes().dtypes # datetime64[ns, America/New_York]
df_concat['a'].convert_dtypes().max() # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')

Therefore, it seems like the issue here is due to concat() rather than max(). I hope this make sense and let me research further into this.

kerrickstaley added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 13, 2024

rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Timezones Timezone data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 21, 2024

github-actions bot assigned li00-lab Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float' #58707

BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float' #58707

kerrickstaley commented May 13, 2024 •

edited

Loading

asishm commented May 13, 2024

li00-lab commented Jul 24, 2024

li00-lab commented Jul 26, 2024

BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float' #58707

BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float' #58707

Comments

kerrickstaley commented May 13, 2024 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

asishm commented May 13, 2024

li00-lab commented Jul 24, 2024

li00-lab commented Jul 26, 2024

kerrickstaley commented May 13, 2024 •

edited

Loading