Hi all,
I've noticed an issue with mutate when a variable is defined using a delay function after a group_by. I think the underlying problem is simply that mutate doesn't respect group_by, but I haven't tested this extensively. For example:
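from dplython import DelayFunction, X, diamonds, group_by, mutate, head

@DelayFunction
def lead(series, i=1):
    # Shift the values down by i rows, keeping the original index.
    index = series.index
    shifted = series.shift(i)
    shifted.index = index
    return shifted

diamonds >> group_by(X.cut) >> mutate(price_lead = lead(X.price)) >> head(6)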
   Unnamed: 0  carat        cut color clarity  depth  table  price     x  \
0           1   0.23      Ideal     E     SI2   61.5   55.0    326  3.95
1           2   0.21    Premium     E     SI1   59.8   61.0    326  3.89
2           3   0.23       Good     E     VS1   56.9   65.0    327  4.05
3           4   0.29    Premium     I     VS2   62.4   58.0    334  4.20
4           5   0.31       Good     J     SI2   63.3   58.0    335  4.34
5           6   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94

      y     z  price_lead
0  3.98  2.43         NaN
1  3.84  2.31       326.0
2  4.07  2.31       326.0
3  4.23  2.63       327.0
4  4.35  2.75       334.0
5  3.96  2.48       335.0
The lead delay function should operate independently on each group, but instead it is operating on the entire dataframe regardless of group.
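To make the expected behaviour concrete, here is a minimal sketch in plain pandas (no dplython), using only the cut and price values from the six rows above. The column names price_lead_global and price_lead_grouped are made up for this illustration, and the groupby/transform call is just my assumption of what a group-aware mutate should be equivalent to, not how dplython implements it.

```python
import pandas as pd

# The cut and price columns of the six rows shown in the output above.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "price": [326, 326, 327, 334, 335, 336],
})

def lead(series, i=1):
    # Same shift as the delay function in this issue, without the decorator.
    return series.shift(i)

# What I currently get: the shift runs over the whole column.
df["price_lead_global"] = lead(df["price"])

# What I would expect after group_by(X.cut): the shift runs within each cut.
df["price_lead_grouped"] = df.groupby("cut")["price"].transform(lead)

print(df)
# price_lead_global:  NaN, 326, 326, 327, 334, 335
# price_lead_grouped: NaN, NaN, NaN, 326, 327, NaN
```

In the grouped column the first row of every cut is NaN, and rows 3 and 4 pick up the previous price from their own cut (Premium and Good) rather than from the row directly above.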
I solved this in my own fork of dplython by removing mutate from the handled classes in the DplyFrame class. I assume, however, that you put it in the handled classes for a reason, so I don't consider this a great fix (for example, arrange broke because of this change and I had to modify it to get it working again).
Curious to hear your opinion on this.
P.S. There are tons of changes and additions in that personal fork that I should make pull requests for, but a lot has changed including the formatting and so I've been lazy about it...