DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968

Open
DM-Berger opened this issue Oct 4, 2024 · 5 comments
Labels: Closing Candidate (may be closeable, needs more eyeballs), Docs

Comments

@DM-Berger

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Documentation problem

The docs currently read for this function:

Add a column to indicate NaNs, if False NaNs are ignored.

However, when no NaN values are present, a useless constant NaN indicator column is still added:

>>> df = pd.DataFrame(pd.Series([0, 1], dtype="int64"))
>>> df
   0
0  0
1  1
>>> pd.get_dummies(df, columns=[0], dummy_na=True, drop_first=True)
   0_1.0  0_nan
0  False  False
1   True  False
>>> pd.get_dummies(df, columns=[0], dummy_na=True)
   0_0.0  0_1.0  0_nan
0   True  False  False
1  False   True  False

This is arguably quite unexpected behaviour, as constant columns do not contain information except in some very rare cases and for specific custom models.

I.e. for almost any kind of model this column will be ignored (but will then annoyingly clutter e.g. .feature_importances_ attributes, and may needlessly increase compute times for algorithms that scale significantly with the number of features but have no way to ignore constant input features). For data with a lot of binary features, and pipelines or models that also do e.g. conversion to floating-point dtypes, all these useless extra constant features can also significantly increase memory requirements (especially in multiprocessing contexts).

I imagine the intended design rationale is that if you do e.g. df1, df2 = train_test_split(df), and it turns out that df1 has no NaN values for some feature "f" but df2 does, then at least with the current implementation the user can be assured that the following does not raise an AssertionError:

dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
assert dummies1.columns.tolist() == dummies2.columns.tolist()

But still, that is a pretty strange use-case, as dummification should generally happen right at the beginning on the full data.

In my opinion the default behaviour should be to only add a NaN indicator column if NaN values are actually present. I would actually consider this to be an implementation or design bug for something like 99% of use-cases. But at bare minimum this undesirable and unexpected behaviour should be documented, with some reasoning.
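For anyone who wants the proposed behaviour today, here is a minimal sketch of a wrapper (the helper name is hypothetical, not a pandas API) that drops the constant indicator columns for features containing no NaNs:

```python
import pandas as pd

def get_dummies_nan_if_present(df, columns, **kwargs):
    # Hypothetical wrapper, not a pandas API: encode with dummy_na=True,
    # then drop the "<col>_nan" indicator for columns with no actual NaNs.
    out = pd.get_dummies(df, columns=columns, dummy_na=True, **kwargs)
    drop = [f"{c}_nan" for c in columns if not df[c].isna().any()]
    return out.drop(columns=drop)

df = pd.DataFrame({"a": [0, 1, 1], "b": [0.0, None, 1.0]})
res = get_dummies_nan_if_present(df, ["a", "b"])
# "a" has no NaNs, so "a_nan" is dropped; "b_nan" survives
```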

Suggested fix for documentation

Just change the docs to:

Add a column to indicate NaNs. If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here]. If False, NaNs are ignored [actually "ignored" is extremely unclear too, as per this issue].

@DM-Berger added the Docs and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Oct 4, 2024
@saldanhad
Contributor

saldanhad commented Oct 5, 2024

In an ideal modeling scenario one would only proceed to the encoding step after completing the data cleaning process, when no null values are present, in which case the extra column generated is harmless, since it has no variance to be picked up as a feature, irrespective of the model. Having said that, I would not want to remove this feature, because it could potentially be used as a final data prep step to detect whether there are any pending nulls left to be treated.

However, I do agree with adding this to the docs. Maybe a brief comment on the dummy_na=True example would suffice.
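As a sketch of that null-detection use, one could scan the generated *_nan columns for any True values (the frame and variable names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0.0, np.nan, 1.0]})
dummies = pd.get_dummies(df, columns=["x"], dummy_na=True)
# Any True in a *_nan column flags nulls still left to be treated
pending = [c for c in dummies.columns if c.endswith("_nan") and dummies[c].any()]
```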

@DM-Berger
Author

DM-Berger commented Oct 5, 2024

In an ideal modeling scenario one would only proceed to encoding step if they have completed the data cleaning process and no null values are present,

This is not quite right IMO, as there are many ways to deal with NaN values, one of them being to convert them to indicators rather than to interpolate them. I.e. this function can just as much be regarded as a cleaning step.

Another example of where this behaviour is extremely counter-intuitive / unproductive: if a DataFrame X is such that all n features are binary indicator values, but with NaN values possible and only present in k of the n features, then using pd.get_dummies(X, columns=X.columns, dummy_na=True, drop_first=True) will produce a new DataFrame with 2n columns, n - k of which (the NaN indicators for the features that contain no NaNs) are all constant and identical to each other.
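A small demonstration of this, assuming a toy frame with three binary features where only "f3" contains NaNs:

```python
import numpy as np
import pandas as pd

# n = 3 binary features; NaNs are possible but present only in "f3" (k = 1)
X = pd.DataFrame({
    "f1": [0, 1, 0, 1],
    "f2": [1, 1, 0, 0],
    "f3": [0.0, np.nan, 1.0, np.nan],
})
out = pd.get_dummies(X, columns=X.columns.tolist(), dummy_na=True, drop_first=True)
# 2n = 6 columns come back; the n - k = 2 indicators "f1_nan" and "f2_nan"
# are constant (all False) and identical to each other
constant = [c for c in out.columns if out[c].nunique() == 1]
```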

This is not only extremely counter-intuitive, but if, as you said, encoding is supposed to be done "after cleaning", then this behaviour basically forces the user to clean the data again after encoding, meaning it breaks your ideal modeling scenario data flow as well (necessitating clean -> encode -> clean again).

Of course, it is trivial to do this step yourself using DataFrame.isnull and pd.concat, but pd.get_dummies would seem to be the function one would want to reach for to do this automatically.
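A minimal sketch of that manual approach (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 1], "b": [0.0, np.nan, 1.0]})

# Build NaN indicators only for columns that actually contain NaNs,
# then encode with the default dummy_na=False and concatenate
with_nans = [c for c in df.columns if df[c].isnull().any()]
indicators = df[with_nans].isnull().add_suffix("_nan")
encoded = pd.concat([pd.get_dummies(df, columns=list(df.columns)), indicators], axis=1)
# "b_nan" is present; no constant "a_nan" column is created
```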

the extra column generated is harmless since it has no variance to be picked up as a feature irrespective of the model

This is unfortunately just not true at all. There is nothing really harmless about e.g. k extra constant and identical columns if the later model is, say, a deep learner (in which case these constant columns will often be converted to floating point of some kind, and then there will be extra gradient computations due to a bunch of useless constants), or if that data is being passed to e.g. a stepwise feature selection model. As I also said in a multiprocessing context, k unexpected useless new features (especially if converting to float) can be a significant amount of extra memory, especially if on a cluster with e.g. 80 cores. This can happen in automated ML pipelines and the like. Whether extra constant and identical columns are harmless or not depends on the use-case and downstream task.

But anyway, I agree this is all fixed by just making the documentation clearer. It would also be nice to know the reasoning though, because I don't think "constant columns are harmless" is really generally true. If this is an assumption baked into pd.get_dummies behaviour, the user should be informed.

@rhshadrach
Member

In my opinion the default behaviour should be to only add a NaN indicator column if NaN values are actually present...

In my opinion, it would be surprising to pass dummy_na=True and not get the corresponding columns. In addition, the proposed behavior would make the schema of the result dependent on the values of the input. This is very undesirable behavior.

I'm strongly opposed here.

@rhshadrach added the Closing Candidate (May be closeable, needs more eyeballs) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Oct 5, 2024
@DM-Berger
Author

DM-Berger commented Oct 5, 2024

@rhshadrach I can definitely understand why people might have different expectations here about the behaviour, but that is why I ultimately filed this as a documentation issue. I wouldn't want behaviour to change for people who expected the current behaviour.

Are you saying you oppose changing the documentation as well, though?

EDIT: And if the reasoning above is that we want pd.get_dummies outputs not to depend on the values of the inputs, that is also not currently the case, since the number of columns produced by get_dummies obviously does usually depend on the input values:

>>> df = pd.DataFrame({"a": [0, 1, 1], "b": [0, 0, 1]})
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_nan  b_1.0  b_nan
0  False  False  False  False
1   True  False  False  False
2   True  False   True  False

>>> df.iloc[2, 0] = 2
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
   a_1.0  a_2.0  a_nan  b_1.0  b_nan
0  False  False  False  False  False
1   True  False  False  False  False
2  False   True  False   True  False

@DM-Berger
Author

DM-Berger commented Oct 5, 2024

Perhaps just to support the idea that people might have very different intuitions here: e.g. scikit-learn's MissingIndicator has the following argument:

features {‘missing-only’, ‘all’}, default=’missing-only’

Whether the imputer mask should represent all or a subset of features.

  • If 'missing-only' (default), the imputer mask will only represent features containing missing values during fit time.
  • If 'all', the imputer mask will represent all features.

I.e. the default behaviour is arguably closer to what I expected. I think a simple documentation change would make this clearer to users of pd.get_dummies.
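For comparison, a small example of MissingIndicator's default 'missing-only' behaviour (the API shown is scikit-learn's; the toy array is mine):

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy array: only column 0 contains a missing value
X = np.array([[0.0, 1.0], [np.nan, 0.0], [1.0, 1.0]])
mi = MissingIndicator()  # features='missing-only' is the default
mask = mi.fit_transform(X)
# mi.features_ lists only the columns that had missing values during fit,
# so mask has a single indicator column, True where column 0 was NaN
```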
