Add IOType system with typed artifact serialization #3118

saeidbarati157 wants to merge 12 commits into Netflix:master from
Conversation
Introduce ArtifactSerializer ABC with priority-based dispatch, enabling custom serializers to be plugged in alongside the default pickle path. This is the foundation for the IOType system, which adds typed serialization for Metaflow artifacts.

New abstractions:
- ArtifactSerializer: base class with a can_serialize/can_deserialize/serialize/deserialize interface
- SerializerStore: metaclass for auto-registration and deterministic priority-ordered dispatch
- SerializationMetadata: namedtuple for artifact metadata routing
- SerializedBlob: supports both new bytes and references to already-stored data
- PickleSerializer: universal fallback (PRIORITY=9999) wrapping the existing pickle logic

No existing code is modified. These are inert until wired into TaskDataStore in a follow-up commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
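For orientation, a condensed sketch of how these pieces could fit together. The method signatures, registry internals, and singleton registration below are assumptions drawn from this PR's commit messages, not the actual diff; the encoding strings come from commit 3.

```python
from abc import ABCMeta, abstractmethod
import inspect
import pickle


class SerializerStore(ABCMeta):
    """Metaclass: concrete ArtifactSerializer subclasses self-register."""

    _registry = []

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Skip the ABC itself and any subclass that is still abstract.
        if bases and not inspect.isabstract(cls):
            SerializerStore._registry.append(cls())  # register a singleton

    @classmethod
    def get_ordered_serializers(mcs):
        # Lower PRIORITY dispatches first; registration order breaks ties
        # (sorted() is stable).
        return sorted(mcs._registry, key=lambda s: s.PRIORITY)


class ArtifactSerializer(metaclass=SerializerStore):
    PRIORITY = 100

    @abstractmethod
    def can_serialize(self, obj): ...

    @abstractmethod
    def can_deserialize(self, encoding): ...

    @abstractmethod
    def serialize(self, obj): ...

    @abstractmethod
    def deserialize(self, blobs, metadata): ...


class PickleSerializer(ArtifactSerializer):
    PRIORITY = 9999  # universal fallback: always consulted last

    def can_serialize(self, obj):
        return True

    def can_deserialize(self, encoding):
        return encoding in ("pickle-v4", "gzip+pickle-v4", "gzip+pickle-v2")

    def serialize(self, obj):
        return [pickle.dumps(obj, protocol=4)], {"encoding": "pickle-v4"}

    def deserialize(self, blobs, metadata):
        return pickle.loads(blobs[0])
```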
Register PickleSerializer through the standard Metaflow plugin
mechanism so extensions (e.g., mli-metaflow-custom) can add their
own serializers via ARTIFACT_SERIALIZERS_DESC.
- Add artifact_serializer category to _plugin_categories
- Add ARTIFACT_SERIALIZERS_DESC with PickleSerializer in plugins/__init__.py
- Resolve via ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")
Importing metaflow.plugins now triggers PickleSerializer registration
in SerializerStore, ensuring the registry is populated before
TaskDataStore needs it (commit 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
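As a rough illustration of the registration flow this commit describes. The DESC tuple shape and the resolution helper below are assumptions modeled on Metaflow's existing plugin categories, not the actual diff:

```python
# Illustrative sketch; resolve_plugins("artifact_serializer") in
# metaflow/plugins/__init__.py is assumed to do roughly this.

ARTIFACT_SERIALIZERS_DESC = [
    # (name, "dotted.path.to.Class"): PickleSerializer ships in core;
    # extensions such as mli-metaflow-custom append their own entries.
    ("pickle", "metaflow.datastore.artifacts.serializer.PickleSerializer"),
]


def resolve_artifact_serializers():
    import importlib

    classes = []
    for name, path in ARTIFACT_SERIALIZERS_DESC:
        module_path, _, cls_name = path.rpartition(".")
        # Importing the module triggers SerializerStore registration as a
        # side effect of class creation (see the metaclass sketch above).
        classes.append(getattr(importlib.import_module(module_path), cls_name))
    return classes
```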
save_artifacts now loops through priority-ordered serializers to find
one that can_serialize the object. load_artifacts routes deserialization
through metadata.encoding via can_deserialize. PickleSerializer handles
all existing Python objects as the universal fallback.
Behavioral changes:
- New artifacts get encoding "pickle-v4" (was "gzip+pickle-v4")
- _info[name] gains optional "serializer_info" dict for custom serializers
- Removed hardcoded pickle import from task_datastore.py
Backward compatible:
- Old artifacts with "gzip+pickle-v2" or "gzip+pickle-v4" encoding
load correctly (PickleSerializer.can_deserialize handles both)
- Missing encoding defaults to "gzip+pickle-v2"
- Missing serializer_info defaults to {}
- _objects[name] stays as single string (multi-blob deferred to PR 1.5)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
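A minimal sketch of the save-side dispatch this describes, assuming instance-based serializers and a stand-in DataException; the real logic lives in task_datastore.py and differs in detail:

```python
class DataException(Exception):
    """Stand-in for Metaflow's DataException, to keep the sketch self-contained."""


def _serialize_artifact(name, obj, serializers):
    # Walk serializers in priority order; PickleSerializer (PRIORITY=9999)
    # claims everything, so falling off the loop indicates a broken registry.
    for serializer in serializers:
        if not serializer.can_serialize(obj):
            continue
        blobs, metadata = serializer.serialize(obj)
        if not blobs:  # guard before any blobs[0] access downstream
            raise DataException(
                "serializer %r returned no blobs for %r" % (serializer, name)
            )
        return blobs, metadata
    raise DataException(
        "No serializer claimed artifact %r; the PickleSerializer fallback "
        "should have matched" % name
    )
```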
- SerializerStore now extends ABCMeta (not type) so @abstractmethod is enforced: incomplete subclasses raise TypeError at definition
- Validate blobs non-empty before accessing blobs[0] in save_artifacts
- Add NOTE on compress_method not yet being wired into the save path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
romain-intel left a comment
Initial comments:
- the auto-registration on types doesn't seem to be included here
- we have to verify that the client will work too (i.e., serialization_metadata)
- something about compression (does it work right now?)
- need to discuss io_types (that may belong in an extension, since it may be a bit specific)
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
## master #3118 +/- ##
=========================================
Coverage ? 26.79%
=========================================
Files ? 387
Lines ? 51937
Branches ? 9117
=========================================
Hits ? 13919
Misses ? 37223
Partials ? 795

☔ View full report in Codecov by Sentry.
Force-pushed from 603a087 to 15d649a (Compare)
Greptile Summary

This PR ships the typed-artifact layer for Metaflow.

Confidence Score: 5/5

Safe to merge: all previously flagged P1 concerns are resolved, and the only remaining finding is a P2 ergonomics issue for future extension authors. Every P0/P1 issue from prior review rounds has been addressed: routing-key precedence is correct, multi-blob raises loudly. No files require special attention.
Reviews (5). Last reviewed commit: "Harden P1 fixes: frozen dataclasses + nu..."
Force-pushed from 15d649a to 8c94f49 (Compare)
Wire vs storage format
- Added WIRE and STORAGE constants in metaflow.datastore.artifacts.serializer.
- ArtifactSerializer.serialize/deserialize now accept a ``format`` kwarg so a single class can own both the storage path (datastore blobs + metadata) and the wire path (string for CLI args, protobuf payloads, cross-process IPC).
- PickleSerializer implements STORAGE only; WIRE raises NotImplementedError with an explanation (pickle bytes are not safe as a wire payload).
- Serializers that want wire support implement it on the same class; there's no need for a second class per format.

Lazy import-hook registry
- New module metaflow/datastore/artifacts/lazy_registry.py with SerializerConfig, an importlib.abc.MetaPathFinder interceptor, and public register_serializer_for_type / load_serializer_class entry points.
- If the target type's module is already in sys.modules, registration is immediate. Otherwise a hook is installed on sys.meta_path and registration fires the first time the user's code imports the target module. This defers the cost of serializer-module imports (torch, pyarrow, fastavro, ...) until those dependencies are actually in play.
- find_spec temporarily removes the interceptor from sys.meta_path during its lookup to avoid recursion.

Review nits
- serializer.py: rename the SerializationMetadata.type field to obj_type to avoid shadowing the type() builtin (the dict key inside _info stays "type" for backward compatibility with existing datastores).
- serializer.py: SerializerStore skips TYPE=None AND any subclass that is still abstract, via inspect.isabstract().
- serializer.py: get_ordered_serializers memoizes the sorted list and invalidates it on new registration. Drops the O(n²) list.index tiebreaker; Python 3.7+ dicts preserve insertion order.
- task_datastore.py: lift SerializerStore / SerializationMetadata imports to module top (they were on hot paths: __init__ and load_artifacts).
- task_datastore.py: the "no serializer claimed this artifact" branch now raises DataException with a message pointing at the PickleSerializer fallback invariant, instead of the misleading UnpicklableArtifactException.
- task_datastore.py: "no deserializer claimed this artifact" now looks up ``serializer_info["source"]`` and hints at a missing extension.
- SerializedBlob.compress_method: removed. It was documented as "not wired into the save path"; shipping an unwired knob invites extension authors to rely on it and discover it is a no-op. It will come back with its consuming code in a later change.
- serialize() docstring rewritten: the side-effect-free contract is there to support retries, caching, and parallel dispatch, and to keep serializers testable; I/O belongs in hooks.

Tests
- New test/unit/test_lazy_serializer_registry.py (9 tests covering config validation, eager registration for already-imported types, deferred registration through the import hook, and the recursion guard).
- test_artifact_serializer.py gains format-dispatch coverage (STORAGE and WIRE round-trip on a toy dual-format serializer; PickleSerializer raises on WIRE).
- Removed tests for compress_method and adjusted tests that poked at _registration_order so they go through the public API only.
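A minimal sketch of the interceptor pattern described above. This is not the real lazy_registry.py (which also has SerializerConfig and re-arms its hook); the one-shot finder below just shows the deferred-callback and recursion-guard mechanics:

```python
import importlib.abc
import importlib.util
import sys


class _DeferredRegistration(importlib.abc.MetaPathFinder):
    """One-shot finder: fires `callback` when `module_name` is first imported."""

    def __init__(self, module_name, callback):
        self.module_name = module_name
        self.callback = callback

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.module_name:
            return None
        # Step aside before delegating to the remaining finders; calling
        # find_spec with ourselves still installed would recurse forever.
        sys.meta_path.remove(self)
        spec = importlib.util.find_spec(fullname)
        if spec is None or spec.loader is None:
            return None
        inner_exec, callback = spec.loader.exec_module, self.callback

        def exec_module(module):
            inner_exec(module)  # run the real import first
            callback(module)    # then register the serializer for its types

        spec.loader.exec_module = exec_module
        return spec


def register_for_module(module_name, callback):
    if module_name in sys.modules:
        callback(sys.modules[module_name])  # already imported: fire eagerly
    else:
        sys.meta_path.insert(0, _DeferredRegistration(module_name, callback))
```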
Force-pushed from 8c94f49 to 358f3cd (Compare)
- Add type annotations to ArtifactSerializer method signatures for better editor/IDE support.
- Remove the unused ``context`` parameter from ``deserialize`` (call sites never passed a meaningful value).
- Promote ``STORAGE`` / ``WIRE`` to a ``SerializationFormat`` enum; it subclasses ``str`` so the existing string-literal comparisons keep working.
- Reword the ``_serializers_override`` comment and move the property rationale next to the property definition.
- Extend the multi-blob error message with "at this time. If you have a need for multi blob serializers, please reach out to the Metaflow team.".
- Include ``serializer_info`` in the "No deserializer claimed artifact" error message.
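The str-subclass trick is standard library behavior; a tiny sketch (the member values "storage"/"wire" are assumptions, the class name comes from this commit):

```python
from enum import Enum


class SerializationFormat(str, Enum):
    # Mixing in str means each member IS a string, so comparisons against
    # plain string literals keep working at existing call sites.
    STORAGE = "storage"
    WIRE = "wire"


assert SerializationFormat.STORAGE == "storage"
assert "wire" == SerializationFormat.WIRE
```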
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept deliberately small:

- ``IOType`` ABC: the contract extension authors target.
- ``Json`` and ``Struct``: two concrete types with clear standalone value in core: wire format for CLI/IPC, cross-language JSON bytes on storage, no pickle code-execution risk. ``Struct`` also walks directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))`` round-trips back to its original type (generic containers like ``List[Inner]`` come back as raw JSON values; wrap those explicitly when you need richer reconstruction).
- ``IOTypeSerializer``: the bridge that plugs any ``IOType`` instance into the ``ArtifactSerializer`` dispatch added by Netflix#3117, so save/load through the datastore just works.

What's intentionally *not* in this PR:

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard Python numbers and strings flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for cases where you want constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain JSON is mostly schema emission; not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship primitive wrappers can override it to emit fully-typed schemas; core just returns ``{"type": "struct"}``.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same ``WIRE`` / ``STORAGE`` constants, so one subclass owns both representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``). Instantiating without the hooks raises ``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC`` with ``PRIORITY=50``: ahead of the default 100 so it catches ``IOType`` instances before a generic catch-all, and always ahead of the ``PickleSerializer`` fallback (9999). It implements only ``STORAGE``; wire encoding is produced by calling ``IOType.serialize(format=WIRE)`` directly.

Safety

- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize`` both require the class named in artifact metadata to be an actual class (``isinstance(..., type)``) before any further checks. This excludes module-level dataclass *instances* (``is_dataclass`` alone returns ``True`` for those) and other callables that could be invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level side-effect code; the ``Struct`` docstring calls this out so callers don't load artifacts from untrusted sources.

Tests

- ``test_base.py``: abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test_json_type.py``: wire and storage round-trips.
- ``test_struct_type.py``: dataclass round-trip, dict round-trip, directly-nested dataclass round-trip, container-field pass-through, rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py``: bridge ``can_serialize``/``can_deserialize``, round-trip through dataclass reconstruction, rejection of non-IOType classes in metadata, WIRE not supported on the bridge.
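A toy sketch of what targeting the four hooks could look like. The hook names come from this commit; the stand-in base class, signatures, and CsvRow type are invented for illustration (raw bytes stand in for SerializedBlob):

```python
import json


class IOType:  # stand-in base: just enough state for the sketch
    def __init__(self, value):
        self._value = value


class CsvRow(IOType):
    """Toy IOType: one row of comma-separated values."""

    def _wire_serialize(self):
        # WIRE: a plain string, safe for CLI args and IPC payloads
        return ",".join(str(v) for v in self._value)

    @classmethod
    def _wire_deserialize(cls, data):
        return cls(data.split(","))

    def _storage_serialize(self):
        # STORAGE: blob bytes plus the metadata dict the datastore routes on
        return [json.dumps(self._value).encode("utf-8")], {"type": "csv_row"}

    @classmethod
    def _storage_deserialize(cls, blobs, metadata):
        return cls(json.loads(blobs[0].decode("utf-8")))
```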
The metadata service stores only the artifact encoding string
("iotype:<type_name>"), not the full serializer_info dict. So when the
Flow client reconstructs an artifact, IOTypeSerializer.deserialize had no
iotype_module/iotype_class to import and KeyError'd.
Populate a registry via IOType.__init_subclass__ keyed by type_name, and
have IOTypeSerializer.deserialize resolve the class from the encoding
suffix first. iotype_module/iotype_class hints remain a secondary lookup
for introspecting artifacts whose owning extension isn't loaded locally.
Fixes round-trip of IOType artifacts through Flow(...).latest_run
.end_task[...].data without touching the metadata service or client
core.py.
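A minimal sketch of that registry pattern; ``type_name`` and the ``iotype:`` encoding prefix come from this commit, while the helper shape is assumed:

```python
class IOType:
    _registry = {}   # type_name -> concrete IOType subclass
    type_name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.type_name is not None:
            IOType._registry[cls.type_name] = cls


def resolve_from_encoding(encoding):
    # "iotype:<type_name>" -> registered class; iotype_module/iotype_class
    # hints in serializer_info remain the secondary lookup path.
    return IOType._registry.get(encoding.partition(":")[2])
```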
``dataclasses.fields()`` includes fields declared with ``field(init=False, ...)``, but those are not accepted by the generated ``__init__``. Passing them as kwargs raised ``TypeError: __init__() got an unexpected keyword argument``. ``dataclasses.asdict`` does include them in its output, so the storage blob carried them.

Fix: partition fields into init-eligible kwargs and post-construction ``setattr`` fields. The serialized value wins over ``__post_init__``'s recomputation or a field default.

Adds a regression test that mutates the init=False field after construction and verifies the mutated value survives a storage round-trip.
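A rough sketch of the partitioning fix; the ``_reconstruct`` name is taken from a later commit in this PR, the details are assumed:

```python
import dataclasses


def _reconstruct(cls, data):
    init_kwargs, post_fields = {}, {}
    for f in dataclasses.fields(cls):
        if f.name not in data:
            continue
        (init_kwargs if f.init else post_fields)[f.name] = data[f.name]
    obj = cls(**init_kwargs)  # only init-eligible fields as kwargs
    for name, value in post_fields.items():
        # serialized value wins over __post_init__ recomputation/defaults;
        # a later commit in this PR switches this to object.__setattr__
        # so frozen dataclasses work too
        setattr(obj, name, value)
    return obj
```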
The base ``IOType.__hash__`` does ``hash((type(self), self._value))``, which
raises ``TypeError`` when ``_value`` is inherently unhashable — a dict
wrapped by ``Json({...})`` or a dataclass with mutable fields (``list``,
``dict``) wrapped by ``Struct(...)``. Both are the primary ways users
create instances of these types, so the wrapper ends up misbehaving in
any set/dict or cache keyed by artifacts.
Override ``__hash__`` on both subclasses to hash the canonical JSON
representation (already produced by the wire/storage path). This is:
- Stable — ``sort_keys=True`` makes the JSON deterministic across equal
dicts/dataclasses regardless of key insertion order.
- Contract-safe — ``__eq__`` compares raw values; equal values flatten to
identical sorted-key JSON, so ``a == b`` implies ``hash(a) == hash(b)``.
- Descriptor-aware — when ``_value is _UNSET`` (no value, pure type
descriptor), fall back to hashing the sentinel directly.
Adds regression tests covering ``Json({...})``, ``Json([...])``,
``Struct(dc_with_list)``, plain ``Struct({...})``, and the ``Struct()``
descriptor.
Self-review before pushing caught two subtle regressions in the prior two
commits:
1. ``_reconstruct`` used plain ``setattr`` to restore ``init=False`` fields
after construction, which raises ``FrozenInstanceError`` on
``@dataclass(frozen=True)``. Switch to ``object.__setattr__`` — the same
bypass frozen dataclasses themselves use in ``__post_init__``.
2. ``Json.__hash__`` / ``Struct.__hash__`` hashed the canonical JSON string.
That renders ``1`` and ``1.0`` as distinct strings, violating the
``__eq__`` / ``__hash__`` contract when users mix numeric types
(``1 == 1.0 == True`` in Python, but ``hash('1') != hash('1.0')``).
Replace with a recursive ``_make_hashable`` that converts dicts to
``frozenset`` of items and lists to ``tuple``s. Python's built-in
hashing then collapses ``1 == 1.0 == True`` to the same bucket, and
equal wrapped values produce equal hashes.
``_make_hashable`` lives in ``base.py`` so any future IOType wrapping
unhashable values can reuse it.
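A minimal sketch of that helper, assuming the recursive shape described (sets and other unhashable types are left out here):

```python
def _make_hashable(value):
    # Recursively convert unhashable containers into hashable equivalents:
    # dicts become frozensets of (key, value) pairs, lists become tuples.
    if isinstance(value, dict):
        return frozenset((k, _make_hashable(v)) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(_make_hashable(v) for v in value)
    return value


# Python's numeric hashing then keeps __eq__/__hash__ consistent:
assert hash(_make_hashable({"x": 1})) == hash(_make_hashable({"x": 1.0}))
assert hash(_make_hashable([1, 2, 3])) == hash(_make_hashable([1.0, 2.0, 3.0]))
```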
Regression tests added for:
- Frozen dataclass with ``init=False`` field round-trip.
- ``Json({'x': 1}) == Json({'x': 1.0})`` and ``hash()`` equality.
- Same for ``Struct({'x': 1})`` / ``Struct({'x': 1.0})``.
- ``Json({'x': True}) == Json({'x': 1})`` hash equality.
- ``Json([1, 2, 3]) == Json([1.0, 2.0, 3.0])`` hash equality.
Force-pushed from 07d78a0 to 4e1dfb8 (Compare)
Latest push addresses both P1s Greptile flagged:
Other cleanup folded into the rebase:
Stale review threads resolved. Greptile re-review: 5/5 (up from 3/5). 120/120 io_types + pluggable-serializer unit tests pass. Optional P2 remaining (per Greptile summary): the base
Summary
On top of #3117, this ships the typed-artifact layer for Metaflow. Scope kept small — the pieces here pull their weight in core; everything else belongs in extensions.
- `IOType` ABC: the contract extension authors target.
- `Json` and `Struct`: two concrete types with clear standalone value: wire format for CLI/IPC, cross-language JSON bytes on storage, no pickle code-execution risk. `Struct` also walks directly-nested `@dataclass` fields so `Outer(inner=Inner(...))` round-trips back to its original type.
- `IOTypeSerializer`: the datastore bridge that plugs any `IOType` into the `ArtifactSerializer` dispatch from #3117 (Add pluggable artifact serializer framework), so save/load through the datastore just works (PRIORITY=50, STORAGE-only).

Intentionally not included
- Primitive wrappers (`Int32`/`Int64`/`Float32`/`Float64`/`Bool`/`Text`). Standard Python numbers and strings flow through `PickleSerializer` unchanged. Wrapping is opt-in, for cases that need constraints/metadata attached.
- `Tensor`. Numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- `List`/`Map`/`Enum`. Thin wrappers whose value over plain JSON is mostly schema emission; not enough on their own for core.
- Rich schema emission from `Struct.to_spec()`. Extensions that ship primitive wrappers can override it to emit fully-typed schemas; core just returns `{"type": "struct"}`.

Design
- `serialize(format=...)` / `deserialize(data, format=..., **kw)` mirror the `ArtifactSerializer` signature from #3117 (Add pluggable artifact serializer framework). Same `WIRE`/`STORAGE` constants, so a single subclass owns both representations.
- `STORAGE` → `(List[SerializedBlob], metadata_dict)` for the datastore save path.
- `WIRE` → `str` for CLI args, protobuf payloads, cross-process IPC.
- Subclasses implement four hooks: `_wire_serialize`, `_wire_deserialize`, `_storage_serialize`, `_storage_deserialize`. Instantiating without them raises `TypeError`.
- `IOTypeSerializer` handles only `STORAGE`; wire encoding is produced by calling `IOType.serialize(format=WIRE)` directly.
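Hypothetical usage under this contract. The import path for `Json` and whether `deserialize` is a classmethod are assumptions; `WIRE`/`STORAGE` live in `metaflow.datastore.artifacts.serializer` per #3117:

```python
from metaflow.datastore.artifacts.serializer import WIRE, STORAGE
# assumption: Json importable from the io_types package this PR adds

doc = Json({"rows": 3, "source": "s3://bucket/key"})

wire = doc.serialize(format=WIRE)            # -> str for CLI/IPC
blobs, meta = doc.serialize(format=STORAGE)  # -> (List[SerializedBlob], dict)

same = Json.deserialize(wire, format=WIRE)
assert same == doc  # __eq__ compares raw values
```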
Safety

- `Struct._storage_deserialize` and `IOTypeSerializer.deserialize` both require the class named in artifact metadata to be an actual class (`isinstance(..., type)`) before any further checks. This excludes module-level dataclass instances (which `is_dataclass()` alone considers valid) and other callables that could be invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level side-effect code; the `Struct` docstring calls this out: don't load artifacts from untrusted sources.

Test plan
- `test_base.py`: abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- `test_json_type.py`: wire and storage round-trips.
- `test_struct_type.py`: dataclass round-trip, dict round-trip, directly-nested dataclass round-trip, container-field pass-through, rejection of non-dataclass and dataclass-instance metadata.
- `test_iotype_serializer.py`: bridge `can_serialize`/`can_deserialize`, dataclass round-trip, WIRE not supported on the bridge, security (rejects non-IOType classes in metadata).