[marin] _discover_files treats artifact.json sidecar as JSONL data #5864

@Helw150

TL;DR

_discover_files matches the .json extension, so it sweeps the executor
sidecar artifact.json in as a data file; load_jsonl then parses it as
JSONL. Production symptom today: transforms whose fn returns None
write the literal null to artifact.json, and normalize crashes
with AttributeError: 'NoneType' object has no attribute 'get'. The
deeper bug is type mis-detection: a single-document JSON file is
treated as a JSONL stream, so other sidecar shapes fail differently
(see table below). Regressed by #5732, which renamed .artifact
(a dotfile, skipped) to artifact.json (not skipped). Fix: skip marin
sidecars by basename in _discover_files.

Failure modes (all same root cause)

| artifact.json payload | Outcome |
|---|---|
| null (what Artifact.save(None, ...) writes) | AttributeError in has_text (verified in production) |
| Multi-line pretty-printed JSON | msgspec.DecodeError: Input data was truncated (synthetic) |
| Single-line JSON dict with a text field | Silent phantom record in output (synthetic) |
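
The three outcomes can be illustrated without marin at all. The sketch below is not the marin code path: it swaps in the standard-library json module for msgspec (so the decode failure surfaces as json.JSONDecodeError rather than msgspec.DecodeError), and parse_as_jsonl, has_text, and classify are hypothetical stand-ins that assume the reader decodes one JSON document per line and the filter calls .get on each record.

```python
import json


def parse_as_jsonl(payload: str) -> list:
    """Decode each non-empty line as its own JSON document, as a JSONL reader would."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def has_text(record) -> bool:
    # Hypothetical stand-in for the record filter; assumes every record is a dict.
    return bool(record.get("text"))


def classify(payload: str) -> str:
    """Report what happens when a single-document JSON payload is read as JSONL."""
    try:
        kept = [r for r in parse_as_jsonl(payload) if has_text(r)]
    except AttributeError:
        return "AttributeError"  # a non-dict record (e.g. None) reached has_text
    except json.JSONDecodeError:
        return "decode error"  # one document was split across several lines
    return f"{len(kept)} record(s) kept"


if __name__ == "__main__":
    for payload in ("null", '{\n"x":1\n}', json.dumps({"text": "phantom"})):
        print(repr(payload), "->", classify(payload))
```

The same root cause produces a crash, a decode error, or a silently accepted record depending only on how the sidecar happens to be serialized, which is why filtering by filename is more robust than hardening the parser.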

To Reproduce

import os, tempfile, json
import pyarrow as pa, pyarrow.parquet as pq
from marin.datakit.normalize import normalize_to_parquet

with tempfile.TemporaryDirectory() as d:
    pq.write_table(
        pa.table({"id": ["a"], "text": ["hello"]}),
        os.path.join(d, "data-00000-of-00001.parquet"),
    )
    with open(os.path.join(d, "artifact.json"), "w") as f:
        f.write("null")  # or '{\n"x":1\n}' or json.dumps({"text": "phantom"})
    normalize_to_parquet(input_path=d, output_path=os.path.join(d, "out"))

Expected behavior

_discover_files should not surface marin executor sidecars
(artifact.json, provenance.json) as data files. Skipping by exact
basename — the same way dotfiles are skipped today — is the natural
fix.
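
A minimal sketch of what such a basename filter could look like, independent of the actual _discover_files implementation. SIDECAR_BASENAMES, DATA_EXTENSIONS, and is_data_file are hypothetical names, and the extension tuple is an assumption for illustration rather than the real contents of the marin or zephyr extension list.

```python
import posixpath

# Hypothetical constant: sidecar filenames named in this issue.
SIDECAR_BASENAMES = frozenset({"artifact.json", "provenance.json"})

# Assumed extension list, for illustration only.
DATA_EXTENSIONS = (".parquet", ".jsonl", ".json")


def is_data_file(path: str) -> bool:
    """Return True if path looks like a data file rather than a dotfile or sidecar."""
    name = posixpath.basename(path)
    if name.startswith("."):
        return False  # dotfiles are skipped
    if name in SIDECAR_BASENAMES:
        return False  # exact-basename match, so my_artifact.json is still data
    return name.endswith(DATA_EXTENSIONS)


if __name__ == "__main__":
    candidates = [
        "out/data-00000-of-00001.parquet",
        "out/artifact.json",
        "out/provenance.json",
        "out/.artifact",
    ]
    print([p for p in candidates if is_data_file(p)])
```

Matching on the exact basename rather than a suffix or substring keeps legitimate data files whose names merely contain "artifact" from being dropped.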

Additional context

  • Discovery / extension filter: lib/marin/src/marin/datakit/normalize.py:198-208
  • Artifact filename constant: lib/marin/src/marin/execution/artifact.py:18
  • Same class of bug in decon._discover_eval_files
    (lib/marin/src/marin/datakit/decon.py:179-197): docstring claims
    provenance.json is skipped, code does not.
  • Broader cleanup option: drop .json from
    zephyr.readers.SUPPORTED_EXTENSIONS so single-document JSON can't
    masquerade as JSONL.

Labels: agent-generated, bug
