Question about blockdedup and call to count() #232
Open
Description
In the method blockdedup, we have the following code:
val distinctPartitions = blocked.select(partColumn).distinct().count()
val hashPart = new HashPartitioner(distinctPartitions.toInt)
val blockedRDD = blocked.rdd
  .keyBy(x => x.getString(x.fieldIndex(partColumn)))
  .partitionBy(hashPart)
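If I understand correctly, calling .count() triggers a Spark job that evaluates the dataframe, and the later keyBy/partitionBy evaluates it again. Wouldn't it be beneficial to persist the dataframe first, then do the count, and then do the keyBy?
Also, why can't we just partition by the column name directly instead of going through keyBy and a HashPartitioner?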
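To make the two suggestions concrete, here is a rough sketch of what I have in mind. It is not the project's actual code: the object and method names (`BlockPartitioningSketch`, `partitionWithPersist`, `partitionByColumn`), the storage level, and the `math.max(..., 1)` guard are my own assumptions for illustration. `blocked` and `partColumn` are the same values as in the snippet above.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of the project.
object BlockPartitioningSketch {

  // Option A: persist before counting, so the count job and the later
  // keyBy/partitionBy job can both read the cached rows instead of
  // recomputing the lineage of `blocked`.
  def partitionWithPersist(blocked: DataFrame, partColumn: String): RDD[(String, Row)] = {
    blocked.persist(StorageLevel.MEMORY_AND_DISK) // storage level is an assumption

    val distinctPartitions = blocked.select(partColumn).distinct().count()

    // Assumption: guard against an empty input, where the count would be 0.
    val hashPart = new HashPartitioner(math.max(distinctPartitions.toInt, 1))

    blocked.rdd
      .keyBy(x => x.getString(x.fieldIndex(partColumn)))
      .partitionBy(hashPart)
    // The caller should call blocked.unpersist() once the result is no longer needed.
  }

  // Option B: stay in the DataFrame API and hash-partition by the column.
  // The number of partitions then comes from spark.sql.shuffle.partitions
  // rather than from the distinct count, so no count() is needed.
  def partitionByColumn(blocked: DataFrame, partColumn: String): DataFrame =
    blocked.repartition(col(partColumn))
}
```

In both variants, rows with the same value of the partition column end up in the same partition, but a partition may still hold several distinct values because of hash collisions. Note that `DataFrameWriter.partitionBy` does accept column names, but it controls the directory layout of written output, not the in-memory partitioning, so it would not replace the code above.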