[marin] _discover_files treats artifact.json sidecar as JSONL data #5864

**TL;DR**

`_discover_files` matches `.json`, so it sweeps the executor sidecar
`artifact.json` in as a data file, and `load_jsonl` then parses it as
JSONL. Production symptom today: transforms whose `fn` returns `None`
write the literal `null` to `artifact.json`, and normalize crashes
with `AttributeError: 'NoneType' object has no attribute 'get'`. The
deeper bug is type mis-detection: a single-document JSON file is
treated as a JSONL stream, so other sidecar shapes fail differently
(see table below). Regressed by #5732, which renamed `.artifact`
(dotfile, skipped) to `artifact.json` (not skipped). Fix: skip marin
sidecars by basename in `_discover_files`.

**Failure modes (all same root cause)**

| `artifact.json` payload | Outcome |
|---|---|
| `null` — what `Artifact.save(None, ...)` writes | `AttributeError` in `has_text` *(verified in production)* |
| Multi-line pretty-printed JSON | `msgspec.DecodeError: Input data was truncated` *(synthetic)* |
| Single-line JSON dict with a `text` field | Silent phantom record in output *(synthetic)* |
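
The three rows can be illustrated without marin installed. The sketch below is only an approximation: it uses the stdlib `json` module in place of `msgspec` (so the decode error type and message differ from the table), and `parse_as_jsonl`, `has_text`, and `outcome` are hypothetical stand-ins, not the real `load_jsonl` or normalize code.

```python
import json


def parse_as_jsonl(text):
    """Decode each non-empty line as its own JSON document, as a JSONL reader would."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def has_text(record):
    # Hypothetical stand-in for the real filter: assumes every record is a dict.
    return bool(record.get("text"))


def outcome(payload):
    """Classify what happens when a sidecar payload is read as JSONL."""
    try:
        records = parse_as_jsonl(payload)
        kept = [r for r in records if has_text(r)]
    except AttributeError:
        # A bare `null` decodes to None, which has no .get()
        return "AttributeError"
    except json.JSONDecodeError:
        # A pretty-printed document is cut off at the first newline
        return "decode error"
    return f"{len(kept)} phantom record(s)"


print(outcome("null"))
print(outcome('{\n"x": 1\n}'))
print(outcome(json.dumps({"text": "phantom"})))
```

Each payload is valid JSON on its own; the failures come only from reading it line by line as a stream of records.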

**To Reproduce**

```python
import os, tempfile, json
import pyarrow as pa, pyarrow.parquet as pq
from marin.datakit.normalize import normalize_to_parquet

with tempfile.TemporaryDirectory() as d:
    pq.write_table(
        pa.table({"id": ["a"], "text": ["hello"]}),
        os.path.join(d, "data-00000-of-00001.parquet"),
    )
    with open(os.path.join(d, "artifact.json"), "w") as f:
        f.write("null")  # or '{\n"x":1\n}' or json.dumps({"text": "phantom"})
    normalize_to_parquet(input_path=d, output_path=os.path.join(d, "out"))
```

**Expected behavior**

`_discover_files` should not surface marin executor sidecars
(`artifact.json`, `provenance.json`) as data files. Skipping by exact
basename — the same way dotfiles are skipped today — is the natural
fix.
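
A minimal sketch of what the basename skip could look like, assuming discovery reduces to a per-path predicate. `SIDECAR_BASENAMES`, `DATA_EXTENSIONS`, and `is_data_file` are hypothetical names for illustration, and the extension tuple is an assumption rather than the actual contents of `zephyr.readers.SUPPORTED_EXTENSIONS`.

```python
import posixpath

# Hypothetical constant: the executor sidecars named in this issue.
SIDECAR_BASENAMES = frozenset({"artifact.json", "provenance.json"})

# Assumed extension list for illustration only.
DATA_EXTENSIONS = (".parquet", ".jsonl", ".json")


def is_data_file(path):
    """Return True if `path` should be surfaced as a data file."""
    name = posixpath.basename(path)
    if name.startswith("."):
        # Dotfiles are already skipped today
        return False
    if name in SIDECAR_BASENAMES:
        # Proposed: skip sidecars by exact basename
        return False
    return name.endswith(DATA_EXTENSIONS)


paths = [
    "out/data-00000-of-00001.parquet",
    "out/artifact.json",
    "out/provenance.json",
    "out/.artifact",
    "out/records.json",
]
print([p for p in paths if is_data_file(p)])
```

Matching on the exact basename, rather than on a pattern, keeps ordinary `.json` data files such as `records.json` discoverable.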

**Additional context**

- Discovery / extension filter: `lib/marin/src/marin/datakit/normalize.py:198-208`
- Artifact filename constant: `lib/marin/src/marin/execution/artifact.py:18`
- Same class of bug in `decon._discover_eval_files`
  (`lib/marin/src/marin/datakit/decon.py:179-197`): the docstring claims
  `provenance.json` is skipped, but the code does not skip it.
- Broader cleanup option: drop `.json` from
  `zephyr.readers.SUPPORTED_EXTENSIONS` so single-document JSON can't
  masquerade as JSONL.

