Add pluggable artifact serializer framework #3117

saeidbarati157 wants to merge 9 commits into Netflix:master
Conversation
Greptile Summary: This PR replaces the hardcoded pickle logic in task_datastore.py with a pluggable, priority-dispatched serializer framework.
Confidence Score: 4/5. Safe to merge after fixing the lazy_import idempotency / test mismatch; the rest of the framework is solid. One confirmed P1: a test introduced in this PR directly contradicts the implementation change and will fail. All other findings are P2 style/defensive improvements. Architecture and the backward-compat path look correct. Important files changed: metaflow/datastore/artifacts/serializer.py (lazy_import idempotency logic) and test/unit/test_artifact_serializer.py (test_lazy_import_rejects_double_assignment).
Reviews (12): Last reviewed commit: "Apply suggestions from code review"
This PR, on top of Netflix#3117, adds the minimal contract that extension authors target when they want to ship typed artifacts. Only the abstract base lives in OSS; concrete scalar/tensor/struct/etc. types and the bridge serializer live downstream (per review with Romain — core should not carry opinions about which types exist, how enums are wire-encoded, tensor byte order, dataclass inference, etc.). Standard Python primitives (``int``, ``str``, ``list``, ``dict``, ...) continue to flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for types that need metadata or invariants attached.

IOType contract
- ``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the signature that Netflix#3117 added on ``ArtifactSerializer``. The same ``WIRE`` / ``STORAGE`` constants govern dispatch so a single subclass owns both representations:
  - ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for the datastore save path.
  - ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
- Subclasses implement four hooks: ``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``.
- ``type_name`` + ``to_spec()`` support JSON schema generation.
- ``IOType`` itself is abstract; instantiating without implementing the four hooks raises ``TypeError``.

Tests
- ``test/unit/io_types/test_base.py`` — covers abstract instantiation, WIRE and STORAGE round-trips, default format, invalid format rejection, descriptor-mode spec output, and equality/hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
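The four-hook dispatch described above can be sketched as follows. This is a hedged, dependency-free illustration rather than the PR's actual code: the real ``STORAGE`` path returns ``(List[SerializedBlob], metadata_dict)``, while this sketch returns plain bytes, and the constant names follow the description above.

```python
from abc import ABC, abstractmethod

WIRE, STORAGE = "wire", "storage"  # mirrors the constants named above

class IOType(ABC):
    type_name = "base"

    def serialize(self, format=STORAGE):
        # Single entry point dispatching on format, as described above.
        if format == WIRE:
            return self._wire_serialize()
        if format == STORAGE:
            return self._storage_serialize()
        raise ValueError("unknown format: %r" % (format,))

    @classmethod
    def deserialize(cls, data, format=STORAGE, **kw):
        if format == WIRE:
            return cls._wire_deserialize(data, **kw)
        if format == STORAGE:
            return cls._storage_deserialize(data, **kw)
        raise ValueError("unknown format: %r" % (format,))

    # The four hooks subclasses must implement.
    @abstractmethod
    def _wire_serialize(self): ...

    @classmethod
    @abstractmethod
    def _wire_deserialize(cls, data, **kw): ...

    @abstractmethod
    def _storage_serialize(self): ...

    @classmethod
    @abstractmethod
    def _storage_deserialize(cls, data, **kw): ...

    def to_spec(self):
        return {"type": self.type_name}
```

Because ``IOType`` derives from ``ABC`` and all four hooks are abstract, a subclass that skips any of them fails at instantiation with ``TypeError``, matching the contract above.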
Force-pushed from 4c17f76 to 79ed3ef
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept deliberately small:
- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone value in core (wire format for CLI/IPC, cross-language JSON bytes on storage, self-describing schemas via ``to_spec()``, no pickle code-execution risk).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load through the datastore just works.

What's intentionally *not* in this PR
- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard Python numbers and strings flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for cases where you want constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain JSON is mostly ``to_spec()`` — not enough on their own for core.

Contract
``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same ``WIRE`` / ``STORAGE`` constants, so one subclass owns both representations:
- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
Subclasses implement four hooks (``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``). ``type_name`` + ``to_spec()`` support JSON schema generation. Instantiating without the hooks raises ``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC`` with ``PRIORITY=50`` — ahead of the default 100 so it catches ``IOType`` instances before a generic catch-all, and always ahead of the ``PickleSerializer`` fallback (9999). It implements only ``STORAGE``; wire encoding is produced by calling ``IOType.serialize(format=WIRE)`` directly.

Tests
- ``test/unit/io_types/test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test/unit/io_types/test_json_type.py`` — round-trips.
- ``test/unit/io_types/test_struct_type.py`` — dataclass round-trip, dict round-trip, ``to_spec()`` with dataclass fields.
- ``test/unit/io_types/test_iotype_serializer.py`` — bridge ``can_serialize``/``can_deserialize``, round-trip through dataclass reconstruction, rejection of non-IOType classes in metadata (security), WIRE not supported on the bridge.
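The priority claim above (50 runs before the default 100 and the pickle fallback 9999) can be illustrated with a toy dispatch loop; the classes below are stand-ins for illustration, not the actual Metaflow implementations.

```python
class IOType:
    """Stand-in marker base for wrapped typed artifacts."""

class IOTypeSerializer:
    PRIORITY = 50

    @staticmethod
    def can_serialize(obj):
        return isinstance(obj, IOType)

class PickleSerializer:
    PRIORITY = 9999  # universal fallback: claims everything

    @staticmethod
    def can_serialize(obj):
        return True

def pick_serializer(obj, serializers):
    # First serializer, in ascending PRIORITY order, that claims the
    # object wins; the fallback guarantees the loop always returns.
    for s in sorted(serializers, key=lambda s: s.PRIORITY):
        if s.can_serialize(obj):
            return s
    raise RuntimeError("no serializer claimed the artifact")
```

With this ordering, an ``IOType`` instance is routed to the bridge and everything else falls through to pickle, which is exactly the invariant the registration priorities encode.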
Force-pushed from 79ed3ef to 0044f28
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept deliberately small:
- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone value in core: wire format for CLI/IPC, cross-language JSON bytes on storage, no pickle code-execution risk. ``Struct`` also walks directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))`` round-trips back to its original type (generic containers like ``List[Inner]`` come back as raw JSON values — wrap those explicitly when you need richer reconstruction).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load through the datastore just works.

What's intentionally *not* in this PR
- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard Python numbers and strings flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for cases where you want constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain JSON is mostly schema emission — not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship primitive wrappers can override to emit fully-typed schemas; core just returns ``{"type": "struct"}``.

Contract
``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same ``WIRE`` / ``STORAGE`` constants, so one subclass owns both representations:
- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
Subclasses implement four hooks (``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``). Instantiating without the hooks raises ``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC`` with ``PRIORITY=50`` — ahead of the default 100 so it catches ``IOType`` instances before a generic catch-all, and always ahead of the ``PickleSerializer`` fallback (9999). It implements only ``STORAGE``; wire encoding is produced by calling ``IOType.serialize(format=WIRE)`` directly.

Safety
- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize`` both require the class named in artifact metadata to be an actual class (``isinstance(..., type)``) before any further checks. This excludes module-level dataclass *instances* (``is_dataclass`` alone returns ``True`` for those) and other callables that could be invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level side-effect code; the ``Struct`` docstring calls this out so callers don't load artifacts from untrusted sources.

Tests
- ``test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test_json_type.py`` — wire and storage round-trips.
- ``test_struct_type.py`` — dataclass round-trip, dict round-trip, directly-nested dataclass round-trip, container-field pass-through, rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py`` — bridge ``can_serialize``/``can_deserialize``, round-trip through dataclass reconstruction, rejection of non-IOType classes in metadata, WIRE not supported on the bridge.
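The ``isinstance(..., type)`` guard described under Safety is easy to demonstrate in isolation. A minimal sketch; ``is_safe_struct_target`` is a hypothetical name for illustration, not a function from the PR.

```python
from dataclasses import dataclass, is_dataclass

def is_safe_struct_target(candidate):
    # Must be an actual class first: this excludes module-level dataclass
    # *instances* (for which is_dataclass() alone also returns True) and
    # arbitrary callables that could be invoked with attacker-controlled
    # kwargs. Only then check that it is a dataclass at all.
    return isinstance(candidate, type) and is_dataclass(candidate)
```

The order of the two checks is the point: ``is_dataclass`` is deliberately not trusted on its own, because it is true for both the class and its instances.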
Force-pushed from 0044f28 to dcccf73
Force-pushed from dcccf73 to 3e9a437
Codecov Report ❌

Additional details and impacted files:

@@ Coverage Diff @@
## master #3117 +/- ##
=========================================
Coverage ? 26.10%
=========================================
Files ? 380
Lines ? 51908
Branches ? 9139
=========================================
Hits ? 13549
Misses ? 37551
Partials ? 808
Introduce ArtifactSerializer ABC with priority-based dispatch, enabling custom serializers to be plugged in alongside the default pickle path. This is the foundation for the IOType system, which adds typed serialization for Metaflow artifacts.

New abstractions:
- ArtifactSerializer: base class with can_serialize/can_deserialize/serialize/deserialize interface
- SerializerStore: metaclass for auto-registration and deterministic priority-ordered dispatch
- SerializationMetadata: namedtuple for artifact metadata routing
- SerializedBlob: supports both new bytes and references to already-stored data
- PickleSerializer: universal fallback (PRIORITY=9999) wrapping existing pickle logic

No existing code is modified. These are inert until wired into TaskDataStore in a follow-up commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
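A minimal sketch of the metaclass auto-registration idea, assuming the convention that the abstract base carries ``TYPE = None``. The real SerializerStore also handles priorities, lazy entries, and abstract-subclass skipping; none of that is shown here.

```python
from abc import ABCMeta, abstractmethod

class SerializerStore(ABCMeta):
    # Shared registry populated as a side effect of class definition.
    registry = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        if getattr(cls, "TYPE", None) is not None:
            # Concrete serializer: record it at definition time so no
            # explicit register() call is needed anywhere.
            SerializerStore.registry.append(cls)
        return cls

class ArtifactSerializer(metaclass=SerializerStore):
    TYPE = None       # the base class itself is never registered
    PRIORITY = 100

    @abstractmethod
    def serialize(self, obj):
        ...
```

Because ``SerializerStore`` extends ``ABCMeta`` rather than ``type``, subclasses still get full abstract-method enforcement alongside the registration side effect.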
Register PickleSerializer through the standard Metaflow plugin
mechanism so extensions (e.g., mli-metaflow-custom) can add their
own serializers via ARTIFACT_SERIALIZERS_DESC.
- Add artifact_serializer category to _plugin_categories
- Add ARTIFACT_SERIALIZERS_DESC with PickleSerializer in plugins/__init__.py
- Resolve via ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")
Importing metaflow.plugins now triggers PickleSerializer registration
in SerializerStore, ensuring the registry is populated before
TaskDataStore needs it (commit 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
save_artifacts now loops through priority-ordered serializers to find
one that can_serialize the object. load_artifacts routes deserialization
through metadata.encoding via can_deserialize. PickleSerializer handles
all existing Python objects as the universal fallback.
Behavioral changes:
- New artifacts get encoding "pickle-v4" (was "gzip+pickle-v4")
- _info[name] gains optional "serializer_info" dict for custom serializers
- Removed hardcoded pickle import from task_datastore.py
Backward compatible:
- Old artifacts with "gzip+pickle-v2" or "gzip+pickle-v4" encoding
load correctly (PickleSerializer.can_deserialize handles both)
- Missing encoding defaults to "gzip+pickle-v2"
- Missing serializer_info defaults to {}
- _objects[name] stays as single string (multi-blob deferred to PR 1.5)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
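The backward-compat rules above amount to a small routing function on the load path. A sketch under assumed names (``route_deserializer`` is illustrative, not a real Metaflow function; the real path routes through can_deserialize on each registered serializer):

```python
class PickleSerializer:
    @staticmethod
    def can_deserialize(encoding):
        # Accepts the new encoding plus both legacy encodings, per the
        # backward-compat notes above.
        return encoding in ("pickle-v4", "gzip+pickle-v2", "gzip+pickle-v4")

def route_deserializer(info, serializers):
    # Missing encoding defaults to the oldest legacy value, as documented.
    encoding = info.get("encoding", "gzip+pickle-v2")
    for s in serializers:
        if s.can_deserialize(encoding):
            return s
    raise RuntimeError("no deserializer claimed encoding %r" % (encoding,))
```

The explicit default keeps pre-framework artifacts loadable without any metadata migration, while an unknown encoding fails loudly instead of silently falling back to pickle.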
- SerializerStore now extends ABCMeta (not type) so @abstractmethod is enforced — incomplete subclasses raise TypeError when instantiated
- Validate blobs non-empty before accessing blobs[0] in save_artifacts
- Add NOTE on compress_method not yet wired into save path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire vs storage format
- Added WIRE and STORAGE constants in metaflow.datastore.artifacts.serializer.
- ArtifactSerializer.serialize/deserialize now accept a ``format`` kwarg so a single class can own both the storage path (datastore blobs + metadata) and the wire path (string for CLI args, protobuf payloads, cross-process IPC).
- PickleSerializer implements STORAGE only; WIRE raises NotImplementedError with an explanation (pickle bytes are not safe as a wire payload).
- Serializers that want wire support implement it on the same class; there's no need for a second class per format.

Lazy import-hook registry
- New module metaflow/datastore/artifacts/lazy_registry.py with SerializerConfig, an importlib.abc.MetaPathFinder interceptor, and public register_serializer_for_type / load_serializer_class entry points.
- If the target type's module is already in sys.modules, registration is immediate. Otherwise a hook is installed on sys.meta_path and registration fires the first time the user's code imports the target module. This defers the cost of serializer-module imports (torch, pyarrow, fastavro, ...) until those dependencies are actually in play.
- find_spec temporarily removes the interceptor from sys.meta_path during its lookup to avoid recursion.

Review nits
- serializer.py: rename SerializationMetadata.type field to obj_type to avoid shadowing the type() builtin (dict key inside _info stays "type" for backward compatibility with existing datastores).
- serializer.py: SerializerStore skips TYPE=None and any subclass that is still abstract, via inspect.isabstract().
- serializer.py: get_ordered_serializers memoizes the sorted list and invalidates on new registration. Drops the O(n²) list.index tiebreaker — Python 3.7+ dicts preserve insertion order.
- task_datastore.py: lift SerializerStore / SerializationMetadata imports to module top (were on hot paths: __init__ and load_artifacts).
- task_datastore.py: the "no serializer claimed this artifact" branch now raises DataException with a message pointing at the PickleSerializer fallback invariant, instead of the misleading UnpicklableArtifactException.
- task_datastore.py: "no deserializer claimed this artifact" now looks up ``serializer_info["source"]`` and hints at a missing extension.
- SerializedBlob.compress_method: removed. It was documented as "not wired into the save path" — shipping an unwired knob invites extension authors to rely on it and discover it is a no-op. Will come back with its consuming code in a later change.
- serialize() docstring rewritten: the side-effect-free contract is there to support retries, caching, and parallel dispatch, and to keep serializers testable; I/O belongs in hooks.

Tests
- New test/unit/test_lazy_serializer_registry.py (9 tests covering config validation, eager registration for already-imported types, deferred registration through the import hook, and the recursion guard).
- test_artifact_serializer.py gains format-dispatch coverage (STORAGE and WIRE round-trip on a toy dual-format serializer; PickleSerializer raises on WIRE).
- Removed tests for compress_method and adjusted tests that poked at _registration_order so they go through the public API only.
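The import-hook mechanism described above (eager registration when the target module is already loaded, deferred otherwise, with a self-removal guard in find_spec) can be sketched with a minimal MetaPathFinder. This illustrates the trick, not the lazy_registry code itself; the real hook carries serializer configs rather than bare callbacks.

```python
import importlib.abc
import importlib.util
import sys

class ImportInterceptor(importlib.abc.MetaPathFinder):
    def __init__(self):
        self.callbacks = {}  # module name -> callable to fire on import

    def watch(self, module_name, callback):
        if module_name in sys.modules:
            callback()  # already imported: register eagerly
        else:
            self.callbacks[module_name] = callback

    def find_spec(self, fullname, path=None, target=None):
        cb = self.callbacks.pop(fullname, None)
        if cb is None:
            return None  # not a watched module; defer to other finders
        # Temporarily step off sys.meta_path so the delegated lookup
        # below cannot re-enter this finder.
        sys.meta_path.remove(self)
        try:
            spec = importlib.util.find_spec(fullname)
        finally:
            sys.meta_path.insert(0, self)
        cb()  # the watched module is being imported: fire the registration
        return spec
```

Deferring the callback this way is what keeps heavy serializer dependencies (torch, pyarrow, ...) off the critical path until user code actually imports them.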
- Convert STORAGE/WIRE module constants to a SerializationFormat enum (str-backed so existing equality checks keep working). Module-level STORAGE/WIRE aliases are dropped.
- Add Python type hints to can_serialize, can_deserialize, serialize, and deserialize so editors surface the types documented in docstrings.
- Drop the unused `context` parameter from deserialize() across the ABC, PickleSerializer, the TaskDataStore call site, and tests.
- Move the `_serializers` property explanation onto the property itself; leave only a short note next to `_serializers_override`.
- save_artifacts now validates blob count BEFORE mutating `_info[name]`, so a rejected artifact never leaves partial metadata behind.
- Multi-blob error message now points users at the Metaflow team for multi-blob use cases.
- "No deserializer claimed artifact" error now includes the full serializer_info dict in addition to the source hint.
- SerializerStore.get_ordered_serializers tracks already-materialized lazy classes so the ordered cache is only rebuilt when a new lazy class actually becomes importable — previously, any registered lazy config forced a rebuild on every call.

Tests
- New regression test for the ordered-cache steady-state behavior.
- New regression tests verifying `_info[name]` is not populated when save_artifacts rejects empty or multi-blob serializer output.
- Updated existing tests for the enum, type hints, and context removal.
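A str-backed enum behaves as described above: members compare equal to the old bare strings, so existing equality checks keep working. A minimal sketch with the member names from the bullet list; the actual enum lives in the PR.

```python
from enum import Enum

class SerializationFormat(str, Enum):
    # str mixin: each member IS a str, so "storage" == STORAGE holds
    # and the enum can be passed anywhere a plain string was expected.
    STORAGE = "storage"
    WIRE = "wire"
```

The mixin also gives round-tripping for free: ``SerializationFormat("wire")`` resolves back to the ``WIRE`` member, which is handy when the format arrives as a plain string from metadata.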
Thanks Romain! Addressed in fed2c78:
All 77 serializer/pickle/lazy/integration unit tests pass.
Replace the two-path registration scheme (ARTIFACT_SERIALIZERS_DESC +
register_serializer_for_type) with a single declarative mechanism.
Extensions ship one serializer file per serializer; heavy imports
live in a setup_imports() classmethod whose ModuleNotFoundError
parks the serializer on an import hook and retries once the awaited
module appears in sys.modules.
Author-facing API
-----------------
    class TorchSerializer(ArtifactSerializer):
        TYPE = "torch"
        PRIORITY = 50

        @classmethod
        def setup_imports(cls, context=None):
            cls.lazy_import("torch")
            cls.lazy_import("pyarrow", alias="pa")

        @classmethod
        def can_serialize(cls, obj):
            return isinstance(obj, cls.torch.Tensor)
        ...

    # mfextinit_myext.py
    ARTIFACT_SERIALIZERS_DESC = [
        ("torch", "my_ext.serializers.torch_serializer.TorchSerializer"),
    ]
Mechanism
---------
- SerializerStore.bootstrap() walks every extension's
ARTIFACT_SERIALIZERS_DESC directly (the serializer category owns its
own lifecycle rather than flowing through resolve_plugins), applies
ENABLED_ARTIFACT_SERIALIZER +/- toggles, and drives each entry
through a state machine:
known → importing → class_loaded → importing_deps → active
with terminal states `broken` / `disabled` and the retry state
`pending_on_imports`.
- lazy_import(module_path, alias=None) imports the module, stashes it
on cls, returns it. Rejects reserved names (TYPE, PRIORITY, dispatch
methods, leading underscore) and double-assignment within one
setup_imports() call.
- A sys.meta_path interceptor watches the pending modules; when any
imports, SerializerStore._on_module_imported re-runs the lifecycle
for every parked entry waiting on that module. Loop guard: the same
ImportError.name raising twice → broken.
- Dispatch reads _active_serializers only. Exceptions raised by
can_serialize / can_deserialize on a currently-active serializer are
caught, counted in the record's dispatch_error_count, and the
serializer is skipped for that artifact only — preserving the
PickleSerializer fallback.
- On PRIORITY ties, last-registered wins; a lexicographic tiebreak on
class_path provides cross-environment reproducibility.
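The lifecycle above can be sketched as a small transition table. This is an illustration of the states named in the commit message, not the store's actual code:

```python
# `broken` and `disabled` are terminal, so they have no outgoing edges.
TRANSITIONS = {
    "known": {"importing", "disabled"},
    "importing": {"class_loaded", "broken"},
    "class_loaded": {"importing_deps"},
    "importing_deps": {"active", "pending_on_imports", "broken"},
    "pending_on_imports": {"importing_deps", "broken"},
}


def advance(state, target):
    # Reject any transition the machine does not allow.
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError("illegal transition %s -> %s" % (state, target))
    return target
```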
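The sys.meta_path interception can be sketched like this: a finder that delegates module location to the rest of the chain, then wraps the real loader so a callback fires once a watched module finishes executing. A minimal illustration only; the PR's interceptor additionally re-runs parked lifecycles and tracks repeat failures:

```python
import importlib.abc
import sys


class ImportWatcher(importlib.abc.MetaPathFinder):
    """Fire callback(fullname) after a watched module finishes importing."""

    def __init__(self, watched, callback):
        self.watched = set(watched)
        self.callback = callback

    def find_spec(self, fullname, path, target=None):
        if fullname not in self.watched:
            return None
        # Ask the rest of the finder chain to locate the module for us.
        for finder in sys.meta_path:
            if finder is self:
                continue
            find_spec = getattr(finder, "find_spec", None)
            spec = find_spec(fullname, path, target) if find_spec else None
            if spec is not None and spec.loader is not None:
                break
        else:
            return None
        orig_loader, watcher = spec.loader, self

        class _NotifyingLoader(importlib.abc.Loader):
            def create_module(self, spec):
                return orig_loader.create_module(spec)

            def exec_module(self, module):
                orig_loader.exec_module(module)
                watcher.watched.discard(fullname)
                watcher.callback(fullname)  # e.g. re-run parked entries

        spec.loader = _NotifyingLoader()
        return spec
```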
Diagnostic API
--------------
metaflow.datastore.artifacts.list_serializer_status() returns a list
of per-entry dicts with name, class_path, state, awaiting_modules,
last_error, priority, type, import_trigger, and dispatch_error_count.
Primary use case: answering "why isn't my custom serializer active?"
without reading source.
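Assuming the record shape above, a triage helper over the returned dicts could look like this (the helper and sample records are illustrative, not part of the API):

```python
def explain_inactive(status):
    # Summarize every non-active entry: its state plus the most useful
    # hint available (last error first, then the awaited modules).
    lines = []
    for rec in status:
        if rec["state"] == "active":
            continue
        hint = rec.get("last_error") or ", ".join(rec.get("awaiting_modules") or ())
        lines.append("%s [%s]: %s" % (rec["name"], rec["state"], hint or "n/a"))
    return lines
```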
Removed
-------
- register_serializer_for_type() (public API)
- SerializerConfig, register_serializer_config, iter_registered_configs,
load_serializer_class, plus the backing declarative-config plumbing
inside lazy_registry.py — nothing in the unified path uses them.
- ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer") in
metaflow/plugins/__init__.py (no readers in the codebase).
- "artifact_serializer" from _plugin_categories in
metaflow/extension_support/plugins.py.
Testing
-------
- 102 unit tests across test_artifact_serializer, test_pickle_serializer,
test_serializer_integration, test_serializer_lifecycle, and
test_serializer_public_api. Coverage: state-machine transitions,
retry and loop guard, disabled toggle, dispatch exception handling,
lazy_import reserved-name + collision guards, setup_imports signature
compatibility (both def setup_imports(cls) and
def setup_imports(cls, context=None) are supported), subclass
inheritance, and public API surface assertions.
- SerializerStore._reset_for_tests() clears registry + interceptor state
and walks MRO to delattr lazy-imported attributes for test isolation.
When save_artifacts dispatches to a serializer, inject the serializer's source label into ``serializer_info["source"]`` so the "no deserializer claimed artifact" load error can point at the package providing the missing serializer. The source is derived at bootstrap time: ``"metaflow"`` for core serializers, the extension's ``package_name`` (via ``metaflow.extension_support.plugins.get_modules``) for extensions.

Authors who set their own ``source`` in the returned ``serializer_info`` are not overridden, so an extension can publish a different identifier (e.g. a logical name rather than the pip distribution name) by passing it through ``serialize()``. The source is also exposed on each diagnostic record returned by ``list_serializer_status()``.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
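The precedence rule above (author-provided source wins over the bootstrap-derived one) reduces to a helper along these lines; ``attach_source`` is a hypothetical name, not the PR's actual code:

```python
def attach_source(serializer_info, bootstrap_source):
    # Copy so the serializer's returned dict is not mutated; only fill
    # "source" when the author did not set one themselves.
    info = dict(serializer_info)
    info.setdefault("source", bootstrap_source)
    return info
```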
    def test_lazy_import_rejects_double_assignment():
        """Calling lazy_import twice with the same alias on the same cls raises."""

        class _LazyDup(ArtifactSerializer):
            TYPE = "test_lazy_dup"

            @classmethod
            def can_serialize(cls, obj):
                return False

            @classmethod
            def can_deserialize(cls, metadata):
                return False

            @classmethod
            def serialize(cls, obj, format=SerializationFormat.STORAGE):
                raise NotImplementedError

            @classmethod
            def deserialize(cls, data, metadata=None, format=SerializationFormat.STORAGE):
                raise NotImplementedError

        try:
            _LazyDup.lazy_import("json")
            with pytest.raises(ValueError, match="already set"):
                _LazyDup.lazy_import("sys", alias="json")
        finally:
            SerializerStore._all_serializers.pop("test_lazy_dup", None)
            SerializerStore._ordered_cache = None
            if hasattr(_LazyDup, "json"):
                delattr(_LazyDup, "json")
            if hasattr(_LazyDup, "_lazy_imported_names"):
                delattr(_LazyDup, "_lazy_imported_names")
Test contradicts the new idempotent lazy_import behaviour
test_lazy_import_rejects_double_assignment asserts that calling lazy_import("sys", alias="json") after lazy_import("json") raises ValueError("already set"), but the implementation was changed in this PR to return idempotently for any already-registered alias (lines 614-618 of serializer.py). The test will fail with Failed: DID NOT RAISE <class 'ValueError'> because the idempotent branch fires first and returns the cached json module, regardless of the differing module_path argument.
The two cases need to be distinguished: re-importing the same module under the same alias (the retry use-case the idempotent fix was meant to support) should be silent, while binding a different module to an existing alias (a programming error, which this test guards) should still raise.
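One way to encode that distinction is sketched below: the alias registry remembers which module path each alias was bound to, so a repeat of the same binding returns silently while a conflicting rebind raises. This is an illustration of the proposed fix, not the PR's code, and the attribute names are made up:

```python
import importlib


def lazy_import(cls, module_path, alias=None):
    name = alias or module_path.split(".")[0]
    registered = cls.__dict__.get("_lazy_imported_names")
    if registered is None:
        registered = {}
        setattr(cls, "_lazy_imported_names", registered)
    if name in registered:
        if registered[name] == module_path:
            # Same module under the same alias: the retry case, silent.
            return getattr(cls, name)
        # Different module for an existing alias: a programming error.
        raise ValueError("attribute %r already set to %r" % (name, registered[name]))
    module = importlib.import_module(module_path)
    registered[name] = module_path
    setattr(cls, name, module)
    return module
```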
Summary

Replace the hardcoded pickle logic in `TaskDataStore` with a pluggable artifact serializer framework. Existing pickle behavior is preserved exactly — `PickleSerializer` is a built-in universal fallback, so flows that never register a custom serializer behave identically to before.

- `ArtifactSerializer` ABC with priority-based dispatch, `SerializationMetadata` namedtuple, `SerializedBlob` value type.
- `SerializerStore` metaclass auto-registers subclasses; caches the sorted dispatch list and invalidates on new registration.
- Extensions register via `ARTIFACT_SERIALIZERS_DESC` (standard Metaflow plugin pattern); heavy dependencies (`torch`, `pyarrow`) are not imported until the user's own code imports the target type.

Design

- Dispatch order follows `PRIORITY` (lower first; ties broken by registration order). On save, the first `can_serialize(obj) == True` wins. On load, `can_deserialize(metadata)` routes by the `encoding` field in artifact metadata.
- `serialize(obj, format=...)` / `deserialize(data, format=..., **kw)` accept a `format` kwarg — `STORAGE` (default) returns `(List[SerializedBlob], SerializationMetadata)` for the datastore save path; `WIRE` returns a `str` for CLI args, protobuf payloads, and cross-process IPC. One class owns both representations.
- `PickleSerializer` implements `STORAGE` only — pickle bytes are not a wire-safe payload — and raises `NotImplementedError` for `WIRE`.
- `serialize()` must not perform I/O or mutate global state: it may be called multiple times (caching, retries, parallel dispatch) and must stay safely idempotent. Side effects belong in hooks, not serializers.
- `metaflow.datastore.artifacts.lazy_registry.register_serializer_for_type("torch.Tensor", "my_ext.TorchSerializer")` stores a declarative config; an `importlib.abc.MetaPathFinder` watches for `torch` to be imported and only then loads the serializer class. If the target module is already in `sys.modules`, registration is immediate.
- Backward compatibility: artifacts saved with `gzip+pickle-v2` / `gzip+pickle-v4` encoding load correctly. A missing `encoding` defaults to `gzip+pickle-v2`.

Error paths

- `save_artifacts`: if no serializer claims an object, raise `DataException` with a message pointing at the `PickleSerializer` fallback invariant (instead of the misleading `UnpicklableArtifactException` for a non-pickle failure).
- `load_artifacts`: if no deserializer claims an artifact, the error includes `serializer_info["source"]` when present, hinting at a missing extension.

Test plan

- Unit tests for `ArtifactSerializer` / `SerializerStore` / `SerializedBlob` / `SerializationMetadata`, including wire/storage dispatch and the ABC's abstract-method enforcement.
- `PickleSerializer` round-trips across scalars, containers, custom classes; `WIRE` rejection; encoding validation.
- `TaskDataStore` round-trips, custom serializer priority, and backward compat (pre-existing `spin` failures remain, unrelated).
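Under the Design rules above, save-side selection reduces to a first-match scan of the priority-ordered list; a probe that raises skips only that serializer, which preserves the fallback. A sketch under those assumptions — names here are illustrative:

```python
def pick_serializer(serializers, obj):
    # serializers: already sorted by PRIORITY (lower first). The first
    # serializer that claims the object wins; a can_serialize probe
    # that raises skips that serializer for this artifact only.
    for s in serializers:
        try:
            if s.can_serialize(obj):
                return s
        except Exception:
            continue
    return None
```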