DOC: Series.diff with boolean dtype does not return a series of dtype float #57565

Open
1 task done
from-nowhere opened this issue Feb 22, 2024 · 4 comments
Labels
Docs Dtype Conversions Unexpected or buggy dtype conversions Needs Discussion Requires discussion from core team before further action Transformations e.g. cumsum, diff, rank

from-nowhere commented Feb 22, 2024

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.Series.diff.html#pandas.Series.diff
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html#pandas.DataFrame.diff

Documentation problem

The documentation for pandas.Series.diff and pandas.DataFrame.diff states that, no matter the dtype of the original series/column, the output will be of dtype float64. This is not true for series/columns of dtype bool: there the output is of dtype object.

For example:

import pandas as pd
# pd.__version__ == '2.2.0'

s = pd.Series([True, True, False, False, True])
d = s.diff()

# d.dtype is now 'object'

Indeed, the underlying function algorithms.diff explicitly differentiates between boolean and integer dtypes.
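
To make the boolean special case concrete, here is a minimal sketch that compares `Series.diff` on a boolean series against a hand-written element-wise XOR. It assumes the pandas 2.x behaviour reported in this issue; the names `values` and `manual` are only for illustration.

```python
import operator

import pandas as pd

s = pd.Series([True, True, False, False, True])
d = s.diff()

# The first entry has no predecessor, so it is missing (NaN).
# The remaining entries should match an XOR of each value with the previous one.
values = s.tolist()
manual = [operator.xor(cur, prev) for cur, prev in zip(values[1:], values[:-1])]

print(d.dtype)                # reported as 'object' on pandas 2.2.0 in this issue
print(d.iloc[1:].tolist())    # compare with `manual`
print(manual)
```

In other words, for booleans the result marks where the value changed, rather than an arithmetic difference.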

Suggested fix for documentation

The Notes section should read something like this:

Notes
-----
For boolean dtypes, this uses :func:`operator.xor` rather than
:func:`operator.sub` and the result's dtype is ``object``.
Otherwise, the result is calculated according to the current dtype in {klass},
however the dtype of the result is always float64.
@from-nowhere from-nowhere added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 22, 2024
@from-nowhere
Author

I am also open to changing the function's behaviour so that it returns a float64 dtype, as advertised. This would also solve another problem: in 2.2.0, downcasting object dtypes via fillna etc. is deprecated, so s.diff().fillna(False) produces a warning, and the best replacement I could come up with is s.diff().astype(float).fillna(0.0).astype(bool).
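
The workaround above can be wrapped in a small helper. This is only a sketch: `bool_diff` is a hypothetical name, not a pandas function, and it treats the first element as "unchanged" (`False`), which is an assumption about what the caller wants.

```python
import pandas as pd


def bool_diff(s: pd.Series) -> pd.Series:
    """Hypothetical helper: boolean diff that comes back as a bool Series.

    Goes through float so that the leading NaN can be filled
    without relying on the deprecated object-dtype downcasting.
    """
    return s.diff().astype(float).fillna(0.0).astype(bool)


s = pd.Series([True, True, False, False, True])
changed = bool_diff(s)
print(changed.tolist())
print(changed.dtype)
```

The detour through float is what avoids calling fillna on an object-dtype series.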

@rhshadrach
Member

Thanks for the report. Returning object is the behavior on 1.0.0; haven't tried prior versions. I'm +1 on agreeing with the docs here and returning float64 instead of object dtype.

It might also be worth adding a fill_value argument, similar to shift.
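
For reference, this is roughly how `fill_value` behaves on `shift` today, and how a diff with a filled leading gap could be emulated by hand. `diff` itself has no such argument at the time of this thread; the XOR against the shifted series below is only an illustration of the idea, not a proposed implementation.

```python
import pandas as pd

s = pd.Series([True, True, False, False, True])

# Without fill_value the leading gap is missing, so bool cannot be kept.
plain = s.shift(1)

# With fill_value the gap is filled and the bool dtype is preserved.
filled = s.shift(1, fill_value=False)

# Emulated boolean diff: XOR against the filled, shifted series.
changed = s ^ filled

print(plain.dtype, filled.dtype)
print(changed.tolist())
```

Note that the first element is now compared against the fill value, so with `fill_value=False` a leading `True` counts as a change.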

Could use some other eyes - cc @MarcoGorelli.

@rhshadrach rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions Transformations e.g. cumsum, diff, rank Needs Discussion Requires discussion from core team before further action Needs Triage Issue that has not been reviewed by a pandas team member and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 7, 2024
@mutricyl
Contributor

mutricyl commented Apr 3, 2024

Looking at issue 821 in pandas-stubs, it looks like Series.diff() does not always return float64; it can also return pandas.Timedelta.

Besides, it looks strange to me that taking the difference of two integers returns a float. It may be due to pandas/numpy internals, but mathematically speaking int - int = int (and uint - uint = int, to link with #4899).
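
A likely reason for the float result is the missing first element: a NumPy-backed integer series cannot hold NaN, so the result is cast up. The sketch below shows this, together with the nullable `Int64` extension dtype, which can represent a missing value and so has no need for the cast. This is my reading of the behaviour, not something stated by the maintainers in this thread.

```python
import pandas as pd

ints = pd.Series([1, 2, 4])
d = ints.diff()

# The first row has no predecessor, so it is NaN, which forces a float dtype.
print(d.tolist())
print(d.dtype)

# The nullable integer dtype can store a missing value directly.
nullable = pd.Series([1, 2, 4], dtype="Int64").diff()
print(nullable.dtype)
```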

I made a short review of current behavior:

float ➡️ float
int ➡️ float
uint ➡️ float
timestamp ➡️ timedelta
period ➡️ object
timedelta ➡️ timedelta64
interval ➡️ TypeError: IntervalArray has no 'diff' method. Convert to a suitable dtype prior to calling 'diff'.
bool ➡️ object
object ➡️ object
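
A review like the one above can be reproduced with a short script. The sample series below are my own choice (the original test data is not shown in this thread), and the results may differ between pandas versions, so no expected output is listed.

```python
import pandas as pd

# Illustrative inputs, one per dtype family from the list above.
samples = {
    "float": pd.Series([1.0, 2.5, 4.0]),
    "int": pd.Series([1, 2, 4]),
    "uint": pd.Series([1, 2, 4], dtype="uint64"),
    "timestamp": pd.Series(pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-06"])),
    "timedelta": pd.Series(pd.to_timedelta([1, 2, 4], unit="D")),
    "period": pd.Series(pd.period_range("2024-01", periods=3, freq="M")),
    "interval": pd.Series(pd.interval_range(0, 3)),
    "bool": pd.Series([True, False, True]),
    "object": pd.Series([1, 2, 4], dtype=object),
}


def diff_dtype(s: pd.Series) -> str:
    """Return the dtype of s.diff() as a string, or the error it raises."""
    try:
        return str(s.diff().dtype)
    except Exception as exc:  # report the failure instead of stopping the survey
        return f"{type(exc).__name__}: {exc}"


report = {name: diff_dtype(s) for name, s in samples.items()}
for name, result in report.items():
    print(f"{name:>10} -> {result}")
```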

@mutricyl
Contributor

mutricyl commented Apr 6, 2024

I ran some more tests to cover the remaining possible Series types:

bytes ➡️ numpy.core._exceptions._UFuncNoLoopError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('S21'), dtype('S21')) -> None
dtype ➡️ TypeError: unsupported operand type(s) for -: 'type' and 'type'
datetime.date ➡️ timedelta
datetime.time ➡️ TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.time'
complex ➡️ complex
BaseOffset (already treated as object) ➡️ object

It is pretty strange that datetime.time throws an error. Could that be a bug?
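
For what it's worth, the same error shows up in plain Python without pandas, which suggests it comes from the standard library's `datetime.time` not supporting subtraction rather than from `diff` itself. A small sketch, including one possible workaround (anchoring both times to the same arbitrary date, here `date.min`):

```python
import datetime as dt

# date - date is supported and gives a timedelta.
d = dt.date(2024, 4, 6) - dt.date(2024, 4, 3)

# time - time is not supported by the standard library.
try:
    dt.time(12, 0) - dt.time(11, 0)
    time_error = None
except TypeError as exc:
    time_error = exc

# Possible workaround: attach both times to the same date first.
delta = (
    dt.datetime.combine(dt.date.min, dt.time(12, 0))
    - dt.datetime.combine(dt.date.min, dt.time(11, 0))
)

print(d)
print(time_error)
print(delta)
```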

3 participants