@chenliu0831

Issue #, if available:

See awslabs/python-deequ#254

Description of changes:

Initial effort to evolve PyDeequ to use Spark Connect instead of the current, fragile Py4J-based bridge.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


// The transform method receives protobuf Any from Spark Connect
// Scala compiler sees com.google.protobuf.Any in the interface signature
override def transform(


Feel free to ignore

In Spark 4.x the signature changed from relation: protobuf.Any to relation: Array[Byte]. To avoid pain during the migration, I would strongly recommend keeping transform as small as possible, ideally in a separate class. In GraphFrames we separated the plugin implementation from the plugin logic so that we can ship two versions, one per Spark major version. You can see an example here: spark3 and spark4

Otherwise you may need to duplicate the whole logic the day you start working on Spark 4.x support.
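To make the suggested split concrete, here is a minimal sketch of the pattern, not the GraphFrames or PyDeequ code. It deliberately avoids Spark and protobuf dependencies: `AnyMessage`, `Plan`, `DeequPluginLogic`, `Spark3Adapter`, and `Spark4Adapter` are all hypothetical names, and the real plugin would use `com.google.protobuf.Any`, Spark's plan type, and the actual plugin interface. How each adapter decodes its input into a common payload is also an assumption left out here.

```scala
// Hypothetical stand-ins so the sketch compiles without Spark on the classpath.
// A real plugin would use com.google.protobuf.Any and Spark's logical plan type.
final case class AnyMessage(typeUrl: String, value: Array[Byte])
final case class Plan(description: String)

// Version-agnostic core: all real work lives here and only ever sees raw bytes,
// so it does not change when the plugin entry-point signature changes.
object DeequPluginLogic {
  def buildPlan(payload: Array[Byte]): Option[Plan] =
    if (payload.isEmpty) None // nothing to handle
    else Some(Plan(s"deequ relation (${payload.length} bytes)"))
}

// Thin adapter shaped like the Spark 3.x entry point (receives an Any-like message).
class Spark3Adapter {
  def transform(relation: AnyMessage): Option[Plan] =
    DeequPluginLogic.buildPlan(relation.value)
}

// Thin adapter shaped like the Spark 4.x entry point (receives bytes).
class Spark4Adapter {
  def transform(relation: Array[Byte]): Option[Plan] =
    DeequPluginLogic.buildPlan(relation)
}
```

With this layout, only the few lines in each adapter differ between Spark versions; everything in `DeequPluginLogic` is shared.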

Author


Great call, thanks. I haven't thought much about the Spark 3.x to 4.x breaking changes yet (they seem more annoying than I expected). Let me revisit this in a new revision.
