Avoid pickle/dill hell due to collections.namedtuple #7

@casassg

Several libraries monkeypatch collections.namedtuple because it is difficult to pickle (namely pyspark: https://github.com/apache/spark/blob/ee8d66105885929ac0c0c087843d70bf32de31a1/python/pyspark/serializers.py#L385 and beam: https://github.com/apache/beam/blob/v2.21.0/sdks/python/apache_beam/internal/pickler.py#L150).

This makes it difficult to use TFX when pyspark is installed in the same environment, as both libraries try to hijack pickling at the same time.

I know this is really an issue of pyspark and beam both trying to solve the problem by monkey patching, but I'm wondering if it's possible to move away from namedtuples entirely in the TFX BSL codebase. I found the issue while trying to launch a Dataflow job from my JupyterLab environment, which has PySpark installed.

The issue arose when dill serialized the namedtuple used for the default values in:

ColumnInfo = collections.namedtuple(

Serialization goes through pyspark's code here, because pyspark has hijacked the serializer.

Is there any way to avoid using namedtuples in tfx_bsl altogether? That would spare people in a similar environment the chaos I had to go through to track this down.
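As a rough idea of what that could look like, here is a minimal sketch that swaps the namedtuple for a frozen dataclass. The field names `name` and `dtype` are hypothetical, since the real ColumnInfo fields are not shown above, and whether a dataclass fits tfx_bsl's actual usage (for example code that relies on tuple unpacking or indexing) is an assumption.

```python
import dataclasses
import pickle


@dataclasses.dataclass(frozen=True)
class ColumnInfo:
    # Hypothetical fields, for illustration only.
    name: str
    dtype: str


def roundtrip(obj):
    """Serialize and deserialize obj with the standard pickle module."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    info = ColumnInfo(name="age", dtype="int64")
    # A dataclass is not a tuple subclass and is not created via
    # collections.namedtuple, so a patched namedtuple factory is not involved.
    assert not isinstance(info, tuple)
    assert roundtrip(info) == info
```

Note that typing.NamedTuple would not help here, since it builds its classes through collections.namedtuple as well.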

Related: https://issues.apache.org/jira/browse/SPARK-22674
Filed new issue in PySpark: https://issues.apache.org/jira/browse/SPARK-32079
