Avoid pickle/dill hell due to collections.namedtuple

Several libraries monkeypatch `collections.namedtuple` because namedtuples are difficult to pickle, namely PySpark (https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/serializers.py#L385) and Beam (https://github.com/apache/beam/blob/v2.21.0/sdks/python/apache_beam/internal/pickler.py#L150).

This makes it difficult to use TFX when PySpark is installed in the same environment, as both libraries try to hijack pickling at the same time.

I know the root cause lies in PySpark and Beam monkeypatching namedtuple, but I'm wondering whether it's possible to move away from namedtuples altogether in the TFX BSL codebase. I found the issue while trying to launch a Dataflow job from my JupyterLab environment, which also has PySpark installed.

The issue occurs when dill serializes the namedtuple used for the default values in: https://github.com/tensorflow/tfx-bsl/blob/a49260a3935ef0f35456ec119a827b158eba3b2a/tfx_bsl/coders/csv_decoder.py#L56
The serialization ends up going through PySpark's code, because PySpark has hijacked the serializer.

Is there any way to avoid using namedtuples in `tfx_bsl` altogether? That would spare future users in a similar environment the chaos I had to go through to track this down.
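As a rough sketch of what moving away from namedtuples could look like: a plain class is not created through `collections.namedtuple`, so a patched factory never touches it, and the standard `pickle` module serializes it by class reference. `ColumnDefault` and its fields are hypothetical names for illustration, not the actual `tfx_bsl` type, and I have not verified this against the Dataflow setup described above.

```python
import pickle


class ColumnDefault:
    """Hypothetical stand-in for a namedtuple-based record."""

    def __init__(self, name, default):
        self.name = name
        self.default = default

    def __eq__(self, other):
        # Value-based equality, similar to what a namedtuple provides.
        if not isinstance(other, ColumnDefault):
            return NotImplemented
        return (self.name, self.default) == (other.name, other.default)

    def __repr__(self):
        return f"ColumnDefault(name={self.name!r}, default={self.default!r})"


original = ColumnDefault("age", 0)
# Round-trip through pickle; no namedtuple machinery is involved.
round_tripped = pickle.loads(pickle.dumps(original))
```

The trade-off is losing tuple behavior such as unpacking and indexing, so any call sites that rely on it would need updating.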

Related:  https://issues.apache.org/jira/browse/SPARK-22674
Filed new issue in PySpark: https://issues.apache.org/jira/browse/SPARK-32079
