TL;DR
_discover_files matches .json and so sweeps the executor sidecar
artifact.json in as a data file; load_jsonl then parses it as
JSONL. Production symptom today: transforms whose fn returns None
write the literal null to artifact.json, and normalize crashes
with AttributeError: 'NoneType' object has no attribute 'get'. The
deeper bug is type mis-detection — a single-document JSON file is
treated as a JSONL stream — so other sidecar shapes fail differently
(see table below). Regressed by #5732, which renamed .artifact
(a dotfile, skipped) to artifact.json (not skipped). Fix: skip marin
sidecars by basename in _discover_files.
Failure modes (all same root cause)
| artifact.json payload | Outcome |
| --- | --- |
| null (what Artifact.save(None, ...) writes) | AttributeError in has_text (verified in production) |
| Multi-line pretty-printed JSON | msgspec.DecodeError: Input data was truncated (synthetic) |
| Single-line JSON dict with a text field | Silent phantom record in output (synthetic) |
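The three outcomes in the table can be imitated without marin installed. The sketch below is only an approximation under stated assumptions: it uses the standard-library json module rather than msgspec (so the decode error type and message differ from the production ones), and parse_as_jsonl, has_text, and classify are hypothetical stand-ins, not the real load_jsonl or normalize code.

```python
import json


def parse_as_jsonl(text):
    """Decode each non-empty line on its own, as a JSONL reader would."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def has_text(record):
    # Hypothetical stand-in for the normalize filter; it assumes dict records.
    return bool(record.get("text"))


def classify(payload):
    """Report what happens when a single JSON document is read as JSONL."""
    try:
        records = parse_as_jsonl(payload)
        kept = [r for r in records if has_text(r)]
    except AttributeError:
        return "AttributeError"
    except json.JSONDecodeError:
        return "DecodeError"
    return f"{len(kept)} record(s) kept"


outcomes = {
    "null": classify("null"),
    "pretty": classify('{\n"x": 1\n}'),
    "dict": classify(json.dumps({"text": "phantom"})),
}
print(outcomes)
```

The null payload decodes to None and fails on the first attribute access, the pretty-printed payload fails as soon as its first line is decoded in isolation, and the single-line dict passes through as if it were a real record.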
To Reproduce
import os, tempfile, json
import pyarrow as pa, pyarrow.parquet as pq
from marin.datakit.normalize import normalize_to_parquet

with tempfile.TemporaryDirectory() as d:
    pq.write_table(
        pa.table({"id": ["a"], "text": ["hello"]}),
        os.path.join(d, "data-00000-of-00001.parquet"),
    )
    with open(os.path.join(d, "artifact.json"), "w") as f:
        f.write("null")  # or '{\n"x":1\n}' or json.dumps({"text": "phantom"})
    normalize_to_parquet(input_path=d, output_path=os.path.join(d, "out"))
Expected behavior
_discover_files should not surface marin executor sidecars
(artifact.json, provenance.json) as data files. Skipping by exact
basename — the same way dotfiles are skipped today — is the natural
fix.
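A minimal sketch of what a basename skip could look like. Everything here is an assumption for illustration: SIDECAR_BASENAMES, is_data_file, and the extension tuple are hypothetical names and values, and the real _discover_files may be structured quite differently. In an actual patch the sidecar names would presumably come from the existing constants rather than being repeated as literals.

```python
import os

# Assumed sidecar names, copied from the issue text (hypothetical constant).
SIDECAR_BASENAMES = frozenset({"artifact.json", "provenance.json"})


def is_data_file(path, extensions=(".parquet", ".jsonl", ".json")):
    """Return True if path looks like a data file rather than a sidecar."""
    name = os.path.basename(path)
    if name.startswith("."):
        # Dotfiles are already skipped today.
        return False
    if name in SIDECAR_BASENAMES:
        # Exact basename match, so "my-artifact.json" is still treated as data.
        return False
    return name.endswith(extensions)


paths = [
    "out/data-00000-of-00001.parquet",
    "out/artifact.json",
    "out/provenance.json",
    "out/.artifact",
    "out/my-artifact.json",
]
data_files = [p for p in paths if is_data_file(p)]
print(data_files)
```

Matching on the exact basename rather than a substring or suffix keeps user data files with similar names discoverable.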
Additional context
- Discovery / extension filter: lib/marin/src/marin/datakit/normalize.py:198-208
- Artifact filename constant: lib/marin/src/marin/execution/artifact.py:18
- Same class of bug in decon._discover_eval_files (lib/marin/src/marin/datakit/decon.py:179-197): the docstring claims provenance.json is skipped, but the code does not skip it.
- Broader cleanup option: drop .json from zephyr.readers.SUPPORTED_EXTENSIONS so single-document JSON can't masquerade as JSONL.