Add pluggable artifact serializer framework#3117

Open
saeidbarati157 wants to merge 9 commits into Netflix:master from saeidbarati157:feat/pluggable-serializers

Conversation

@saeidbarati157 saeidbarati157 commented Apr 15, 2026

Summary

Replace the hardcoded pickle logic in TaskDataStore with a pluggable artifact serializer framework. Existing pickle behavior is preserved exactly — PickleSerializer is a built-in universal fallback so flows that never register a custom serializer behave identically to before.

  • ArtifactSerializer ABC with priority-based dispatch, SerializationMetadata namedtuple, SerializedBlob value type.
  • SerializerStore metaclass auto-registers subclasses; caches the sorted dispatch list and invalidates on new registration.
  • Extensions register serializers via ARTIFACT_SERIALIZERS_DESC (standard Metaflow plugin pattern).
  • A lazy import-hook registry lets extensions defer serializer-module imports (e.g. torch, pyarrow) until the user's own code imports the target type.

Design

  • Dispatch. Serializers are sorted by PRIORITY (lower first; ties broken by registration order). On save, the first can_serialize(obj) == True wins. On load, can_deserialize(metadata) routes by the encoding field in artifact metadata.
  • Wire vs storage format. serialize(obj, format=...) / deserialize(data, format=..., **kw) accept a format kwarg — STORAGE (default) returns (List[SerializedBlob], SerializationMetadata) for the datastore save path; WIRE returns a str for CLI args, protobuf payloads, and cross-process IPC. One class owns both representations. PickleSerializer implements STORAGE only — pickle bytes are not a wire-safe payload — and raises NotImplementedError for WIRE.
  • Side-effect-free contract. serialize() must not perform I/O or mutate global state: it may be called multiple times (caching, retries, parallel dispatch) and must stay safely idempotent. Side effects belong in hooks, not serializers.
  • Lazy registration. metaflow.datastore.artifacts.lazy_registry.register_serializer_for_type("torch.Tensor", "my_ext.TorchSerializer") stores a declarative config; an importlib.abc.MetaPathFinder watches for torch to be imported and only then loads the serializer class. If the target module is already in sys.modules, registration is immediate.
  • Backward compatible. Old artifacts with gzip+pickle-v2 / gzip+pickle-v4 encoding load correctly. Missing encoding defaults to gzip+pickle-v2.
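In sketch form, the save/load dispatch described above looks roughly like this (class names are illustrative, not Metaflow's actual internals):

```python
def pick_serializer(obj, serializers):
    """Save path: lowest PRIORITY first; the first claimant wins."""
    for s in sorted(serializers, key=lambda s: s.PRIORITY):
        if s.can_serialize(obj):
            return s
    raise RuntimeError("no serializer claimed %r" % (obj,))

def pick_deserializer(metadata, serializers):
    """Load path: route on the 'encoding' field; a missing encoding
    defaults to the legacy gzip+pickle-v2."""
    metadata.setdefault("encoding", "gzip+pickle-v2")
    for s in sorted(serializers, key=lambda s: s.PRIORITY):
        if s.can_deserialize(metadata):
            return s
    raise RuntimeError("no deserializer for %r" % metadata["encoding"])

class Tensorish:
    PRIORITY = 50  # claims tensor-like objects ahead of the fallback

    def can_serialize(self, obj):
        return hasattr(obj, "shape")

    def can_deserialize(self, metadata):
        return metadata["encoding"] == "tensorish-v1"

class PickleFallback:
    PRIORITY = 9999  # universal fallback, always consulted last

    def can_serialize(self, obj):
        return True

    def can_deserialize(self, metadata):
        return metadata["encoding"] in (
            "gzip+pickle-v2", "gzip+pickle-v4", "pickle-v4")

SERIALIZERS = [Tensorish(), PickleFallback()]
```

A plain list claims `PickleFallback`, an empty metadata dict routes to the legacy pickle encoding, and `{"encoding": "tensorish-v1"}` routes to the custom serializer.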

Error paths

  • save_artifacts: if no serializer claims an object, raise DataException with a message pointing at the PickleSerializer fallback invariant (instead of the misleading UnpicklableArtifactException for a non-pickle failure).
  • load_artifacts: if no deserializer claims an artifact, the error includes serializer_info["source"] when present, hinting at a missing extension.

Test plan

  • 29 unit tests for ArtifactSerializer / SerializerStore / SerializedBlob / SerializationMetadata, including wire/storage dispatch and the ABC's abstract-method enforcement.
  • 36 unit tests for PickleSerializer (round-trips across scalars, containers, custom classes; WIRE rejection; encoding validation).
  • 9 unit tests for the lazy registry (config validation, eager registration, deferred registration via the import hook, recursion guard).
  • 6 integration tests for TaskDataStore round-trips, custom serializer priority, and backward compat.
  • 245 existing unit tests pass (only pre-existing spin failures remain, unrelated).

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR replaces the hardcoded pickle logic in TaskDataStore with a pluggable ArtifactSerializer framework. The architecture is well-thought-out — SerializerStore correctly uses ABCMeta, _serializers is now a property (resolving the stale-snapshot issue from prior review), and empty-blob / multi-blob validation guards are in place.

  • P1 — test will fail: test_lazy_import_rejects_double_assignment expects ValueError("already set") when lazy_import is called with a different module_path but an already-registered alias. The idempotent-return fix silently returns the cached module instead, so pytest.raises(ValueError, match="already set") will fail with DID NOT RAISE. The two cases (same-module retry vs. different-module rebind) need to be distinguished.

Confidence Score: 4/5

Safe to merge after fixing the lazy_import idempotency / test mismatch; the rest of the framework is solid.

One confirmed P1: a test introduced in this PR directly contradicts the implementation change and will fail. All other findings are P2 style/defensive improvements. Architecture and backward-compat path look correct.

Files to fix: metaflow/datastore/artifacts/serializer.py (lazy_import idempotency logic) and test/unit/test_artifact_serializer.py (test_lazy_import_rejects_double_assignment).

Important Files Changed

Filename Overview
test/unit/test_artifact_serializer.py test_lazy_import_rejects_double_assignment expects ValueError("already set") but the implementation now returns idempotently — this test will fail.
metaflow/datastore/artifacts/serializer.py Core registry and ABC; idempotent lazy_import fix silently drops the "different module, same alias" guard that the test still asserts.
metaflow/datastore/task_datastore.py Pickle dispatch replaced with pluggable serializer framework; _serializers is now a property that calls get_ordered_serializers() on each access, resolving the stale-snapshot concern from prior review.
metaflow/plugins/datastores/serializers/pickle_serializer.py Universal fallback serializer; unconditionally uses pickle protocol 4, dropping the protocol-2 / large-object fallback path from the original code.
metaflow/datastore/artifacts/lazy_registry.py Import-hook interceptor that watches for awaited modules and triggers SerializerStore._on_module_imported; implementation looks correct.
metaflow/plugins/__init__.py Adds ARTIFACT_SERIALIZERS_DESC and calls SerializerStore.bootstrap() at import time, making PickleSerializer available for dispatch.
metaflow/datastore/artifacts/diagnostic.py New diagnostic dataclass and list_serializer_status() helper; straightforward, no issues.

Last reviewed commit: "Apply suggestions from code review"

saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 17, 2026
This PR, on top of Netflix#3117, adds the tiny contract that extension authors
target when they want to ship typed artifacts. Only the abstract base
lives in OSS; concrete scalar/tensor/struct/etc. types and the bridge
serializer live downstream (per review with Romain — core should not
carry opinions about which types exist, how enums are wire-encoded,
tensor byte order, dataclass inference, etc.).

Standard Python primitives (``int``, ``str``, ``list``, ``dict``, ...)
continue to flow through ``PickleSerializer`` unchanged. Wrapping is
opt-in, for types that need metadata or invariants attached.

IOType contract
- ``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
  mirror the signature that Netflix#3117 added on ``ArtifactSerializer``. The
  same ``WIRE`` / ``STORAGE`` constants govern dispatch so a single
  subclass owns both representations:
  - ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for the
    datastore save path.
  - ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
    cross-process IPC.
- Subclasses implement four hooks: ``_wire_serialize``,
  ``_wire_deserialize``, ``_storage_serialize``,
  ``_storage_deserialize``.
- ``type_name`` + ``to_spec()`` support JSON schema generation.
- ``IOType`` itself is abstract; instantiating without implementing
  the four hooks raises ``TypeError``.
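The serialize side of that contract can be sketched as below. This is a hedged, minimal illustration — `IOTypeSketch` and `JsonValue` are made-up names, and only two of the four hooks are shown:

```python
import json
from abc import ABC, abstractmethod

WIRE, STORAGE = "wire", "storage"  # stand-ins for the real constants

class IOTypeSketch(ABC):
    """One class owns both representations; dispatch on `format`."""

    def serialize(self, format=STORAGE):
        if format == WIRE:
            return self._wire_serialize()
        if format == STORAGE:
            return self._storage_serialize()
        raise ValueError("unknown format: %r" % (format,))

    @abstractmethod
    def _wire_serialize(self):
        ...

    @abstractmethod
    def _storage_serialize(self):
        ...

class JsonValue(IOTypeSketch):
    def __init__(self, value):
        self.value = value

    def _wire_serialize(self):
        # WIRE: a str suitable for CLI args / IPC payloads.
        return json.dumps(self.value)

    def _storage_serialize(self):
        # STORAGE: (blobs, metadata) for the datastore save path.
        blob = json.dumps(self.value).encode("utf-8")
        return [blob], {"encoding": "json-v1"}
```

Because `IOTypeSketch` is abstract, instantiating it directly raises `TypeError`, mirroring the behavior the contract promises.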

Tests
- ``test/unit/io_types/test_base.py`` — covers abstract instantiation,
  WIRE and STORAGE round-trips, default format, invalid format
  rejection, descriptor-mode spec output, and equality/hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 4c17f76 to 79ed3ef Compare April 17, 2026 16:21
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 18, 2026
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept
deliberately small:

- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone
  value in core (wire format for CLI/IPC, cross-language JSON bytes
  on storage, self-describing schemas via ``to_spec()``, no pickle
  code-execution risk).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance
  into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load
  through the datastore just works.

What's intentionally *not* in this PR

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard
  Python numbers and strings flow through ``PickleSerializer``
  unchanged. Wrapping is opt-in, for cases where you want
  constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an
  extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain
  JSON is mostly ``to_spec()`` — not enough on their own for core.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same
``WIRE`` / ``STORAGE`` constants, so one subclass owns both
representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for
  persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
  cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``,
``_wire_deserialize``, ``_storage_serialize``,
``_storage_deserialize``). ``type_name`` + ``to_spec()`` support JSON
schema generation. Instantiating without the hooks raises
``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC``
with ``PRIORITY=50`` — ahead of the default 100 so it catches
``IOType`` instances before a generic catch-all, and always ahead of
the ``PickleSerializer`` fallback (9999). It implements only
``STORAGE``; wire encoding is produced by calling
``IOType.serialize(format=WIRE)`` directly.

Tests

- ``test/unit/io_types/test_base.py`` — abstract instantiation,
  WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test/unit/io_types/test_json_type.py`` — round-trips.
- ``test/unit/io_types/test_struct_type.py`` — dataclass round-trip,
  dict round-trip, ``to_spec()`` with dataclass fields.
- ``test/unit/io_types/test_iotype_serializer.py`` — bridge
  ``can_serialize``/``can_deserialize``, round-trip through
  dataclass reconstruction, rejection of non-IOType classes in
  metadata (security), WIRE not supported on the bridge.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 79ed3ef to 0044f28 Compare April 18, 2026 01:49
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 18, 2026
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept
deliberately small:

- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone
  value in core: wire format for CLI/IPC, cross-language JSON bytes
  on storage, no pickle code-execution risk. ``Struct`` also walks
  directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))``
  round-trips back to its original type (generic containers like
  ``List[Inner]`` come back as raw JSON values — wrap those
  explicitly when you need richer reconstruction).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance
  into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load
  through the datastore just works.

What's intentionally *not* in this PR

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard
  Python numbers and strings flow through ``PickleSerializer``
  unchanged. Wrapping is opt-in, for cases where you want
  constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an
  extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain
  JSON is mostly schema emission — not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship
  primitive wrappers can override to emit fully-typed schemas; core
  just returns ``{"type": "struct"}``.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same
``WIRE`` / ``STORAGE`` constants, so one subclass owns both
representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for
  persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
  cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``,
``_wire_deserialize``, ``_storage_serialize``,
``_storage_deserialize``). Instantiating without the hooks raises
``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC``
with ``PRIORITY=50`` — ahead of the default 100 so it catches
``IOType`` instances before a generic catch-all, and always ahead of
the ``PickleSerializer`` fallback (9999). It implements only
``STORAGE``; wire encoding is produced by calling
``IOType.serialize(format=WIRE)`` directly.

Safety

- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize``
  both require the class named in artifact metadata to be an actual
  class (``isinstance(..., type)``) before any further checks. This
  excludes module-level dataclass *instances* (``is_dataclass`` alone
  returns ``True`` for those) and other callables that could be
  invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level
  side-effect code; the ``Struct`` docstring calls this out so callers
  don't load artifacts from untrusted sources.
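The class-vs-instance guard can be sketched like this (a hypothetical `resolve_struct_class` helper, not the actual implementation):

```python
from dataclasses import dataclass, is_dataclass

@dataclass
class Point:
    x: int = 0

POINT_SINGLETON = Point()  # a module-level dataclass *instance*

def resolve_struct_class(candidate):
    """Reject anything that is not an actual dataclass class.
    is_dataclass() alone returns True for instances too, which
    could let attacker-controlled kwargs reach a callable."""
    if not isinstance(candidate, type):
        raise TypeError("metadata must name a class, got %r" % (candidate,))
    if not is_dataclass(candidate):
        raise TypeError("%s is not a dataclass" % candidate.__name__)
    return candidate
```

The `isinstance(..., type)` check comes first precisely because `is_dataclass(POINT_SINGLETON)` is also `True`.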

Tests

- ``test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch,
  invalid format, equality/hash, spec.
- ``test_json_type.py`` — wire and storage round-trips.
- ``test_struct_type.py`` — dataclass round-trip, dict round-trip,
  directly-nested dataclass round-trip, container-field pass-through,
  rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py`` — bridge
  ``can_serialize``/``can_deserialize``, round-trip through dataclass
  reconstruction, rejection of non-IOType classes in metadata, WIRE
  not supported on the bridge.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 0044f28 to dcccf73 Compare April 18, 2026 02:58
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from dcccf73 to 3e9a437 Compare April 18, 2026 03:16
codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 82.09256% with 89 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@eb38423). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
metaflow/datastore/artifacts/serializer.py | 80.78% | 44 missing and 15 partials
metaflow/datastore/task_datastore.py | 77.02% | 10 missing and 7 partials
metaflow/datastore/artifacts/lazy_registry.py | 77.58% | 9 missing and 4 partials
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3117   +/-   ##
=========================================
  Coverage          ?   26.10%           
=========================================
  Files             ?      380           
  Lines             ?    51908           
  Branches          ?     9139           
=========================================
  Hits              ?    13549           
  Misses            ?    37551           
  Partials          ?      808           


@romain-intel romain-intel left a comment


Minor comments.

Saeid Barati and others added 6 commits April 23, 2026 15:12
Introduce ArtifactSerializer ABC with priority-based dispatch, enabling
custom serializers to be plugged in alongside the default pickle path.
This is the foundation for the IOType system which adds typed
serialization for Metaflow artifacts.

New abstractions:
- ArtifactSerializer: base class with can_serialize/can_deserialize/
  serialize/deserialize interface
- SerializerStore: metaclass for auto-registration and deterministic
  priority-ordered dispatch
- SerializationMetadata: namedtuple for artifact metadata routing
- SerializedBlob: supports both new bytes and references to
  already-stored data
- PickleSerializer: universal fallback (PRIORITY=9999) wrapping
  existing pickle logic

No existing code is modified. These are inert until wired into
TaskDataStore in a follow-up commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register PickleSerializer through the standard Metaflow plugin
mechanism so extensions (e.g., mli-metaflow-custom) can add their
own serializers via ARTIFACT_SERIALIZERS_DESC.

- Add artifact_serializer category to _plugin_categories
- Add ARTIFACT_SERIALIZERS_DESC with PickleSerializer in plugins/__init__.py
- Resolve via ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")

Importing metaflow.plugins now triggers PickleSerializer registration
in SerializerStore, ensuring the registry is populated before
TaskDataStore needs it (commit 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
save_artifacts now loops through priority-ordered serializers to find
one that can_serialize the object. load_artifacts routes deserialization
through metadata.encoding via can_deserialize. PickleSerializer handles
all existing Python objects as the universal fallback.

Behavioral changes:
- New artifacts get encoding "pickle-v4" (was "gzip+pickle-v4")
- _info[name] gains optional "serializer_info" dict for custom serializers
- Removed hardcoded pickle import from task_datastore.py

Backward compatible:
- Old artifacts with "gzip+pickle-v2" or "gzip+pickle-v4" encoding
  load correctly (PickleSerializer.can_deserialize handles both)
- Missing encoding defaults to "gzip+pickle-v2"
- Missing serializer_info defaults to {}
- _objects[name] stays as single string (multi-blob deferred to PR 1.5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SerializerStore now extends ABCMeta (not type) so @abstractmethod
  is enforced — incomplete subclasses raise TypeError at definition
- Validate blobs non-empty before accessing blobs[0] in save_artifacts
- Add NOTE on compress_method not yet wired into save path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire vs storage format
- Added WIRE and STORAGE constants in metaflow.datastore.artifacts.serializer.
- ArtifactSerializer.serialize/deserialize now accept a ``format`` kwarg so a
  single class can own both the storage path (datastore blobs + metadata) and
  the wire path (string for CLI args, protobuf payloads, cross-process IPC).
- PickleSerializer implements STORAGE only; WIRE raises NotImplementedError
  with an explanation (pickle bytes are not safe as a wire payload).
- Serializers that want wire support implement it on the same class; there's
  no need for a second class per format.
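
A sketch of the single-class dual-format contract (the enum member values and serializer classes here are illustrative assumptions):

```python
import json
import pickle
from enum import Enum


class SerializationFormat(str, Enum):  # member values assumed
    STORAGE = "storage"
    WIRE = "wire"


class DualFormatSerializer:
    """One class owns both representations."""

    @classmethod
    def serialize(cls, obj, format=SerializationFormat.STORAGE):
        if format == SerializationFormat.STORAGE:
            # Storage path: blobs + metadata for the datastore.
            return [json.dumps(obj).encode("utf-8")], {"encoding": "json-v1"}
        if format == SerializationFormat.WIRE:
            # Wire path: a plain str, safe for CLI args / protobuf / IPC.
            return json.dumps(obj)
        raise ValueError("unknown format %r" % (format,))


class StorageOnlySerializer:
    """Mirrors PickleSerializer: pickle bytes are not a wire-safe payload."""

    @classmethod
    def serialize(cls, obj, format=SerializationFormat.STORAGE):
        if format == SerializationFormat.WIRE:
            raise NotImplementedError("pickle has no wire representation")
        return [pickle.dumps(obj)], {"encoding": "pickle-v4"}
```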

Lazy import-hook registry
- New module metaflow/datastore/artifacts/lazy_registry.py with
  SerializerConfig, an importlib.abc.MetaPathFinder interceptor, and public
  register_serializer_for_type / load_serializer_class entry points.
- If the target type's module is already in sys.modules, registration is
  immediate. Otherwise a hook is installed on sys.meta_path and registration
  fires the first time the user's code imports the target module. This
  defers the cost of serializer-module imports (torch, pyarrow, fastavro,
  ...) until those dependencies are actually in play.
- find_spec temporarily removes the interceptor from sys.meta_path during
  its lookup to avoid recursion.
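
A deliberately simplified sketch of that interceptor (the real lazy_registry defers registration more carefully; here the callback merely records the event):

```python
import importlib.abc
import importlib.util
import sys


class LazyRegistrationHook(importlib.abc.MetaPathFinder):
    """Fire a callback the first time a watched module is imported."""

    def __init__(self, watched_module, callback):
        self.watched_module = watched_module
        self.callback = callback

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.watched_module:
            return None  # not ours; let the regular finders handle it
        try:
            # Recursion guard: importlib.util.find_spec walks sys.meta_path,
            # so take ourselves out while we look the module up.
            sys.meta_path.remove(self)
            spec = importlib.util.find_spec(fullname)
        finally:
            sys.meta_path.insert(0, self)
        if spec is not None:
            sys.meta_path.remove(self)  # one-shot: unhook before firing
            self.callback(fullname)     # deferred registration happens here
        return spec
```

Installing the hook with `sys.meta_path.insert(0, hook)` means a later `import torch` in user code is what finally triggers registration of the torch serializer.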

Review nits
- serializer.py: rename SerializationMetadata.type field to obj_type to avoid
  shadowing the type() builtin (dict key inside _info stays "type" for
  backward compatibility with existing datastores).
- serializer.py: SerializerStore skips TYPE=None AND any subclass that is
  still abstract, via inspect.isabstract().
- serializer.py: get_ordered_serializers memoizes the sorted list and
  invalidates on new registration. Drops the O(n²) list.index tiebreaker —
  Python 3.7+ dicts preserve insertion order.
- task_datastore.py: lift SerializerStore / SerializationMetadata imports to
  module top (were on hot paths: __init__ and load_artifacts).
- task_datastore.py: the "no serializer claimed this artifact" branch now
  raises DataException with a message pointing at the PickleSerializer
  fallback invariant, instead of the misleading UnpicklableArtifactException.
- task_datastore.py: "no deserializer claimed this artifact" now looks up
  ``serializer_info["source"]`` and hints at a missing extension.
- SerializedBlob.compress_method: removed. It was documented as "not wired
  into the save path" — shipping an unwired knob invites extension authors
  to rely on it and discover it is a no-op. Will come back with its
  consuming code in a later change.
- serialize() docstring rewritten: the side-effect-free contract is there
  to support retries, caching, and parallel dispatch, and to keep
  serializers testable; I/O belongs in hooks.

Tests
- New test/unit/test_lazy_serializer_registry.py (9 tests covering config
  validation, eager registration for already-imported types, deferred
  registration through the import hook, and the recursion guard).
- test_artifact_serializer.py gains format-dispatch coverage (STORAGE and
  WIRE round-trip on a toy dual-format serializer; PickleSerializer raises
  on WIRE).
- Removed tests for compress_method and adjusted tests that poked at
  _registration_order so they go through the public API only.
- Convert STORAGE/WIRE module constants to a SerializationFormat enum
  (str-backed so existing equality checks keep working). Module-level
  STORAGE/WIRE aliases are dropped.
- Add Python type hints to can_serialize, can_deserialize, serialize,
  and deserialize so editors surface the types documented in docstrings.
- Drop the unused `context` parameter from deserialize() across the ABC,
  PickleSerializer, the TaskDataStore call site, and tests.
- Move the `_serializers` property explanation onto the property itself;
  leave only a short note next to `_serializers_override`.
- save_artifacts now validates blob count BEFORE mutating `_info[name]`,
  so a rejected artifact never leaves partial metadata behind.
- Multi-blob error message now points users at the Metaflow team for
  multi-blob use cases.
- "No deserializer claimed artifact" error now includes the full
  serializer_info dict in addition to the source hint.
- SerializerStore.get_ordered_serializers tracks already-materialized
  lazy classes so the ordered cache is only rebuilt when a new lazy
  class actually becomes importable — previously, any registered lazy
  config forced a rebuild on every call.

Tests
- New regression test for the ordered-cache steady-state behavior.
- New regression tests verifying `_info[name]` is not populated when
  save_artifacts rejects empty or multi-blob serializer output.
- Updated existing tests for the enum, type hints, and context removal.
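
Why str-backed: enum members compare equal to the old plain-string constants, so pre-enum call sites keep working unchanged. A sketch (member values assumed for illustration):

```python
from enum import Enum


class SerializationFormat(str, Enum):  # values assumed
    STORAGE = "storage"
    WIRE = "wire"


def is_storage(fmt):
    # Works whether callers pass the enum member or a legacy bare string.
    return fmt == SerializationFormat.STORAGE
```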
@saeidbarati157
Author

Thanks Romain! Addressed in fed2c78:

  • serializer.py method declarations — added full type annotations to the ABC method signatures.
  • context param — stripped from deserialize across the ABC, PickleSerializer, the TaskDataStore caller, and tests. Can re-add if/when we have a concrete use case.
  • STORAGE/WIRE as an enum — promoted to SerializationFormat(str, Enum). Subclassing str keeps existing format == 'storage' comparisons working so no downstream migration is needed.
  • _serializers_override comment — reworded to describe the test override; moved the property-rationale block next to the property definition itself.
  • Multi-blob error — appended "at this time. If you have a need for multi blob serializers, please reach out to the Metaflow team."
  • Deserializer-not-found error — now prints the serializer_info dict alongside encoding.

All 77 serializer/pickle/lazy/integration unit tests pass.

saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 23, 2026
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept
deliberately small:

- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone
  value in core: wire format for CLI/IPC, cross-language JSON bytes
  on storage, no pickle code-execution risk. ``Struct`` also walks
  directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))``
  round-trips back to its original type (generic containers like
  ``List[Inner]`` come back as raw JSON values — wrap those
  explicitly when you need richer reconstruction).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance
  into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load
  through the datastore just works.
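
The directly-nested-dataclass walk described above can be sketched with hypothetical helpers (not the actual ``Struct`` implementation):

```python
import dataclasses
import json


def to_jsonable(obj):
    # Dataclass instances become dicts; everything else passes through
    # as raw JSON values (including generic containers like List[Inner]).
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: to_jsonable(getattr(obj, f.name))
                for f in dataclasses.fields(obj)}
    return obj


def from_jsonable(cls, data):
    kwargs = {}
    for f in dataclasses.fields(cls):
        value = data[f.name]
        # Only directly-nested dataclass fields are reconstructed;
        # container fields come back as raw JSON values.
        if isinstance(f.type, type) and dataclasses.is_dataclass(f.type):
            value = from_jsonable(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)


@dataclasses.dataclass
class Inner:
    x: int


@dataclasses.dataclass
class Outer:
    inner: Inner
    label: str
```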

What's intentionally *not* in this PR

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard
  Python numbers and strings flow through ``PickleSerializer``
  unchanged. Wrapping is opt-in, for cases where you want
  constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an
  extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain
  JSON is mostly schema emission — not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship
  primitive wrappers can override to emit fully-typed schemas; core
  just returns ``{"type": "struct"}``.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same
``WIRE`` / ``STORAGE`` constants, so one subclass owns both
representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for
  persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
  cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``,
``_wire_deserialize``, ``_storage_serialize``,
``_storage_deserialize``). Instantiating without the hooks raises
``TypeError``.
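
The four-hook contract can be sketched as an abstract base whose incomplete subclasses cannot be instantiated (illustrative; the real ``IOType`` ABC in that PR may differ in signatures):

```python
from abc import ABC, abstractmethod


class IOTypeSketch(ABC):
    """Subclasses must implement all four codec hooks."""

    @abstractmethod
    def _wire_serialize(self):
        ...

    @abstractmethod
    def _wire_deserialize(self, data):
        ...

    @abstractmethod
    def _storage_serialize(self):
        ...

    @abstractmethod
    def _storage_deserialize(self, blobs, metadata):
        ...


class Incomplete(IOTypeSketch):
    # Implements only the wire pair, so the class is still abstract.
    def _wire_serialize(self):
        return ""

    def _wire_deserialize(self, data):
        return None
```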

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC``
with ``PRIORITY=50`` — ahead of the default 100 so it catches
``IOType`` instances before a generic catch-all, and always ahead of
the ``PickleSerializer`` fallback (9999). It implements only
``STORAGE``; wire encoding is produced by calling
``IOType.serialize(format=WIRE)`` directly.

Safety

- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize``
  both require the class named in artifact metadata to be an actual
  class (``isinstance(..., type)``) before any further checks. This
  excludes module-level dataclass *instances* (``is_dataclass`` alone
  returns ``True`` for those) and other callables that could be
  invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level
  side-effect code; the ``Struct`` docstring calls this out so callers
  don't load artifacts from untrusted sources.
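
The ``isinstance(..., type)`` guard matters because ``is_dataclass`` alone also returns ``True`` for dataclass *instances*; a sketch of the check (hypothetical helper name):

```python
import dataclasses


@dataclasses.dataclass
class Point:
    x: int
    y: int


MODULE_LEVEL_INSTANCE = Point(0, 0)


def is_safe_target(candidate):
    # Require an actual class before any further checks; this excludes
    # module-level dataclass instances that metadata could name.
    return isinstance(candidate, type) and dataclasses.is_dataclass(candidate)
```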

Tests

- ``test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch,
  invalid format, equality/hash, spec.
- ``test_json_type.py`` — wire and storage round-trips.
- ``test_struct_type.py`` — dataclass round-trip, dict round-trip,
  directly-nested dataclass round-trip, container-field pass-through,
  rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py`` — bridge
  ``can_serialize``/``can_deserialize``, round-trip through dataclass
  reconstruction, rejection of non-IOType classes in metadata, WIRE
  not supported on the bridge.
Replace the two-path registration scheme (ARTIFACT_SERIALIZERS_DESC +
register_serializer_for_type) with a single declarative mechanism.
Extensions ship one serializer file per serializer; heavy imports
live in a setup_imports() classmethod whose ModuleNotFoundError
parks the serializer on an import hook and retries once the awaited
module appears in sys.modules.

Author-facing API
-----------------

    class TorchSerializer(ArtifactSerializer):
        TYPE = "torch"
        PRIORITY = 50

        @classmethod
        def setup_imports(cls, context=None):
            cls.lazy_import("torch")
            cls.lazy_import("pyarrow", alias="pa")

        @classmethod
        def can_serialize(cls, obj):
            return isinstance(obj, cls.torch.Tensor)
        ...

    # mfextinit_myext.py
    ARTIFACT_SERIALIZERS_DESC = [
        ("torch", "my_ext.serializers.torch_serializer.TorchSerializer"),
    ]

Mechanism
---------

- SerializerStore.bootstrap() walks every extension's
  ARTIFACT_SERIALIZERS_DESC directly (the serializer category owns its
  own lifecycle rather than flowing through resolve_plugins), applies
  ENABLED_ARTIFACT_SERIALIZER +/- toggles, and drives each entry
  through a state machine:
      known → importing → class_loaded → importing_deps → active
  with terminal states `broken` / `disabled` and the retry state
  `pending_on_imports`.
- lazy_import(module_path, alias=None) imports the module, stashes it
  on cls, returns it. Rejects reserved names (TYPE, PRIORITY, dispatch
  methods, leading underscore) and double-assignment within one
  setup_imports() call.
- A sys.meta_path interceptor watches the pending modules; when any
  imports, SerializerStore._on_module_imported re-runs the lifecycle
  for every parked entry waiting on that module. Loop guard: the same
  ImportError.name raising twice → broken.
- Dispatch reads _active_serializers only. Exceptions raised by
  can_serialize / can_deserialize on a currently-active serializer are
  caught, counted in the record's dispatch_error_count, and the
  serializer is skipped for that artifact only — preserving the
  PickleSerializer fallback.
- On PRIORITY ties, last-registered wins; a lexicographic tiebreak on
  class_path provides cross-environment reproducibility.
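
The defensive dispatch rule can be sketched like this (record shape and field names are assumptions, not the real SerializerStore internals):

```python
def dispatch(obj, active_records):
    # Exceptions from can_serialize are caught and counted; the serializer
    # is skipped for this artifact only, so the fallback still catches it.
    for rec in active_records:
        try:
            if rec["cls"].can_serialize(obj):
                return rec["cls"]
        except Exception:
            rec["dispatch_error_count"] += 1
    return None


class Flaky:
    @classmethod
    def can_serialize(cls, obj):
        raise RuntimeError("boom")


class Fallback:
    @classmethod
    def can_serialize(cls, obj):
        return True


RECORDS = [
    {"cls": Flaky, "dispatch_error_count": 0},
    {"cls": Fallback, "dispatch_error_count": 0},
]
```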

Diagnostic API
--------------

metaflow.datastore.artifacts.list_serializer_status() returns a list
of per-entry dicts with name, class_path, state, awaiting_modules,
last_error, priority, type, import_trigger, and dispatch_error_count.
Primary use case: answering "why isn't my custom serializer active?"
without reading source.

Removed
-------

- register_serializer_for_type() (public API)
- SerializerConfig, register_serializer_config, iter_registered_configs,
  load_serializer_class, plus the backing declarative-config plumbing
  inside lazy_registry.py — nothing in the unified path uses them.
- ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer") in
  metaflow/plugins/__init__.py (no readers in the codebase).
- "artifact_serializer" from _plugin_categories in
  metaflow/extension_support/plugins.py.

Testing
-------

- 102 unit tests across test_artifact_serializer, test_pickle_serializer,
  test_serializer_integration, test_serializer_lifecycle, and
  test_serializer_public_api. Coverage: state-machine transitions,
  retry and loop guard, disabled toggle, dispatch exception handling,
  lazy_import reserved-name + collision guards, setup_imports signature
  compatibility (both def setup_imports(cls) and
  def setup_imports(cls, context=None) are supported), subclass
  inheritance, and public API surface assertions.
- SerializerStore._reset_for_tests() clears registry + interceptor state
  and walks MRO to delattr lazy-imported attributes for test isolation.
@romain-intel romain-intel force-pushed the feat/pluggable-serializers branch from fed2c78 to b6f6cf4 Compare April 24, 2026 07:04
Comment thread metaflow/datastore/artifacts/serializer.py Outdated
When save_artifacts dispatches to a serializer, inject the serializer's
source label into ``serializer_info["source"]`` so the "no deserializer
claimed artifact" load error can point at the package providing the
missing serializer. The source is derived at bootstrap time:
``"metaflow"`` for core serializers, the extension's ``package_name``
(via ``metaflow.extension_support.plugins.get_modules``) for extensions.

Authors who set their own ``source`` in the returned ``serializer_info``
are not overridden, so an extension can publish a different identifier
(e.g. a logical name rather than the pip distribution name) by passing
it through ``serialize()``.

The source is also exposed on each diagnostic record returned by
``list_serializer_status()``.
Comment thread metaflow/plugins/datastores/serializers/pickle_serializer.py Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment on lines +569 to +601
def test_lazy_import_rejects_double_assignment():
    """Calling lazy_import twice with the same alias on the same cls raises."""

    class _LazyDup(ArtifactSerializer):
        TYPE = "test_lazy_dup"

        @classmethod
        def can_serialize(cls, obj):
            return False

        @classmethod
        def can_deserialize(cls, metadata):
            return False

        @classmethod
        def serialize(cls, obj, format=SerializationFormat.STORAGE):
            raise NotImplementedError

        @classmethod
        def deserialize(cls, data, metadata=None, format=SerializationFormat.STORAGE):
            raise NotImplementedError

    try:
        _LazyDup.lazy_import("json")
        with pytest.raises(ValueError, match="already set"):
            _LazyDup.lazy_import("sys", alias="json")
    finally:
        SerializerStore._all_serializers.pop("test_lazy_dup", None)
        SerializerStore._ordered_cache = None
        if hasattr(_LazyDup, "json"):
            delattr(_LazyDup, "json")
        if hasattr(_LazyDup, "_lazy_imported_names"):
            delattr(_LazyDup, "_lazy_imported_names")
Contributor


P1 Test contradicts the new idempotent lazy_import behaviour

test_lazy_import_rejects_double_assignment asserts that calling lazy_import("sys", alias="json") after lazy_import("json") raises ValueError("already set"), but the implementation was changed in this PR to return idempotently for any already-registered alias (lines 614-618 of serializer.py). The test will fail with Failed: DID NOT RAISE <class 'ValueError'> because the idempotent branch fires first and returns the cached json module, regardless of the differing module_path argument.

The two cases need to be distinguished: re-importing the same module under the same alias (the retry use-case the idempotent fix was meant to support) should be silent, while binding a different module to an existing alias (a programming error, which this test guards) should still raise.
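
The distinction the review asks for can be sketched as follows (a hypothetical simplification of the ``lazy_import`` classmethod, using a plain dict registry):

```python
import importlib


def lazy_import(registry, module_path, alias=None):
    alias = alias or module_path.rsplit(".", 1)[-1]
    if alias in registry:
        stored_path, stored_module = registry[alias]
        if stored_path == module_path:
            # Idempotent retry: same module under the same alias is silent.
            return stored_module
        # Programming error: a different module bound to an existing alias.
        raise ValueError(
            "alias %r already set to module %r" % (alias, stored_path)
        )
    module = importlib.import_module(module_path)
    registry[alias] = (module_path, module)
    return module
```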
