Question about blockdedup and call to count() #232

@yeikel

In the method blockdedup, we have the following code:

val distinctPartitions = blocked.select(partColumn).distinct().count()
val hashPart = new HashPartitioner(distinctPartitions.toInt)

val blockedRDD = blocked.rdd
  .keyBy(x => x.getString(x.fieldIndex(partColumn)))
  .partitionBy(hashPart)

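If I understand correctly, calling .count() triggers a job that evaluates the DataFrame, and the later keyBy/partitionBy on blocked.rdd evaluates it a second time. Wouldn't it be beneficial to persist the DataFrame first, do the count, and then do the keyBy on the cached data?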
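To make the suggestion concrete, here is a minimal sketch of what I have in mind. The helper name partitionByBlock, the MEMORY_AND_DISK storage level, and the guard against an empty input are my own assumptions, not something taken from the library's code.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper (not part of the library): same logic as the snippet
// above, but with `blocked` persisted before the count.
def partitionByBlock(blocked: DataFrame, partColumn: String): RDD[(String, Row)] = {
  // persist() is lazy; the count() below is the action that fills the cache.
  // The storage level is an assumption, pick whatever fits the data size.
  val cached = blocked.persist(StorageLevel.MEMORY_AND_DISK)

  val distinctPartitions = cached.select(partColumn).distinct().count()

  // Assumption: use at least one partition so an empty input does not
  // produce a zero-partition partitioner.
  val hashPart = new HashPartitioner(math.max(distinctPartitions.toInt, 1))

  // The intent is that this second pass reads the cached data instead of
  // recomputing `blocked` from its source.
  cached.rdd
    .keyBy(x => x.getString(x.fieldIndex(partColumn)))
    .partitionBy(hashPart)
}
```

The caller would still need to call unpersist() on the DataFrame once the downstream work is done, and whether this pays off depends on how expensive it is to recompute blocked.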

Also, why can't we just partition by the field name directly, instead of counting the distinct values and building a HashPartitioner?
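For the second question, this is roughly the alternative I am thinking of, using the DataFrame API rather than the RDD one. The helper name repartitionByBlock is hypothetical, and I am assuming the downstream code could work on a DataFrame instead of a keyed RDD, which may not hold.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of the library): hash-partition the rows by
// the blocking column without counting the distinct values first.
def repartitionByBlock(blocked: DataFrame, partColumn: String): DataFrame =
  // Rows with the same value of partColumn end up in the same partition.
  // The number of partitions comes from spark.sql.shuffle.partitions,
  // not from the number of distinct blocks.
  blocked.repartition(col(partColumn))
```

The trade-off is that the partition count is no longer tied to the number of distinct blocks, so several blocks can share a partition. As far as I can tell that is also possible with the HashPartitioner version, since different keys can hash to the same partition.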
