[nightshift] 20260611 multi-cleanup by claude-nightshift[bot] · Pull Request #6327 · marin-community/marin

claude-nightshift · 2026-06-11T14:12:25Z

seed 75357b1d
Dead lines fall away —
one helper holds the contract,
clean diffs greet the dawn.

Four independent, behavior-preserving cleanups, one per subproject.

marin/datakit: normalize._make_split_writer duplicated the
part-NNNNN-of-MMMMM.parquet shard-name format string inline twice instead of
using datakit.partition_filename, the canonical helper written for exactly this
purpose. Both the main and dups output paths now route through the helper, so
the naming contract that consolidate's filename-based join depends on lives in
one place. Output paths are identical; tests/datakit/test_normalize.py (15)
passes.
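
A minimal sketch of the idea, not the actual datakit code: the signature of partition_filename, the output_paths helper, and the main/dups directory layout are assumptions made for illustration. Only the part-NNNNN-of-MMMMM.parquet pattern comes from the description above.

```python
def partition_filename(index: int, total: int) -> str:
    """Hypothetical stand-in for datakit.partition_filename.

    Assumes five-digit zero padding, matching part-NNNNN-of-MMMMM.parquet.
    """
    return f"part-{index:05d}-of-{total:05d}.parquet"


def output_paths(base_dir: str, index: int, total: int) -> tuple[str, str]:
    """Build the main and dups shard paths from a single helper call.

    The directory names are illustrative; the point is that neither path
    formats the shard name inline.
    """
    name = partition_filename(index, total)
    return f"{base_dir}/main/{name}", f"{base_dir}/dups/{name}"


main_path, dups_path = output_paths("out", 3, 128)
print(main_path)  # out/main/part-00003-of-00128.parquet
print(dups_path)  # out/dups/part-00003-of-00128.parquet
```

Because both paths derive from one call, a change to the naming scheme cannot leave the two outputs out of sync with a filename-based join.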

levanter/data: Removed a closed cluster of dead code from _preprocessor.py
(_construct_composite_batch_processor, _CompositeBatchProcessor,
as_record_batch — 116 lines). This composite-transform pathway has been
unreferenced since it was added in 2023; the three symbols only referenced one
another and nothing in the repo or downstream marin consumed them. The
transform machinery still used by sharded_datasource (_MapTransform,
_BatchMapTransform, _TransformedDataset) and the still-imported BatchResult /
dict_from_record_batch are untouched.

iris/k8s: Five k8s call sites independently parsed Kubernetes RFC3339
timestamps with datetime.fromisoformat(s.replace("Z", "+00:00")), and the
kubectl log-line parser additionally carried a manual fractional-second
truncation block working around a pre-3.11 fromisoformat limitation. On the
supported Python range (>=3.11,<3.13) fromisoformat truncates sub-microsecond
fractions natively, so that block was dead. Added parse_k8s_timestamp in
k8s/types.py alongside parse_k8s_quantity/parse_k8s_cpu, routed all call sites
through it, dropped the dead truncation block and a now-unused datetime import,
and added parametrized tests (Z suffix, explicit offset, microsecond,
nanosecond truncation, malformed-input rejection).
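
A sketch of what such a helper might look like, assuming it simply wraps the expression the call sites previously repeated; the real parse_k8s_timestamp in k8s/types.py may differ in signature and error handling.

```python
from datetime import datetime


def parse_k8s_timestamp(value: str) -> datetime:
    """Parse a Kubernetes RFC3339 timestamp into an aware datetime.

    Illustrative only. Rewrites a trailing "Z" as an explicit UTC offset and
    delegates to datetime.fromisoformat. Per the description above, no manual
    fractional-second handling is needed on the supported Python range.
    Malformed input raises ValueError from fromisoformat.
    """
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


print(parse_k8s_timestamp("2026-06-11T14:12:25Z").isoformat())
# 2026-06-11T14:12:25+00:00
```

Keeping the Z replacement in one function means any future change to the parsing rules touches a single place instead of five call sites.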

zephyr/plan: The physical Write op was the lone violator of the plan module's
design that physical ops encapsulate execution as callables — it carried a
stringly-typed writer_type plus a schema field and forced run_stage into a
4-way if/elif that imported every writer function. A _writer_for(writer_type,
schema) factory now binds the writer to its callable at plan-build time
(functools.partial binds schema for parquet/vortex); Write carries a single
write_fn and run_stage just calls op.write_fn(stream, output_path). The
user-facing WriteOp in dataset.py keeps its Literal-typed writer_type.

Affected test suites pass: iris k8s parsers (18), zephyr plan/dataset/backends/
execution/groupby/writers/optimization (189), datakit normalize (15), levanter
sharded_dataset/newdataset (11).

…me helper normalize._make_split_writer duplicated the part-NNNNN-of-MMMMM.parquet format string inline twice instead of using datakit.partition_filename, the canonical helper written for exactly this purpose. Route both the main and dups output paths through the helper so the naming contract lives in one place.

Drop _construct_composite_batch_processor, _CompositeBatchProcessor, and as_record_batch from data/_preprocessor.py. This composite-transform pathway has been unreferenced since it was added in 2023; the three symbols only referenced one another and nothing in the repo (or downstream marin) consumed them. The transform classes still used by sharded_datasource are untouched.

Five k8s call sites independently parsed Kubernetes RFC3339 timestamps with datetime.fromisoformat(s.replace("Z", "+00:00")). One of them (the kubectl log-line parser) also carried a manual fractional-second truncation block that worked around a pre-3.11 fromisoformat limitation. On the supported Python range (>=3.11,<3.13) fromisoformat truncates sub-microsecond fractions itself, so that block is dead. Centralize the parse into parse_k8s_timestamp in k8s/types.py alongside the existing parse_k8s_quantity/parse_k8s_cpu helpers and route all call sites through it.

The physical Write op was the only one carrying a stringly-typed writer_type (plus a schema field) and forcing run_stage to branch on it and import every writer function, contradicting the plan module's stated design that physical ops encapsulate execution as callables. Resolve the writer at plan time via _writer_for() so run_stage just calls op.write_fn, decoupling it from concrete output formats.

Nightshift Agent added 4 commits June 11, 2026 14:04

claude-nightshift Bot added the agent-generated (Created by automation/agent) and nightshift (Automated nightshift fixes) labels Jun 11, 2026

claude-nightshift Bot requested a review from rjpower June 11, 2026 14:12

claude-nightshift Bot enabled auto-merge (squash) June 11, 2026 14:12

[nightshift] re-trigger CI (flaky levanter-tpu-tests timeout)

0e0e032

rjpower approved these changes Jun 12, 2026

claude-nightshift Bot merged commit 92d9542 into main Jun 12, 2026
33 checks passed

claude-nightshift Bot deleted the nightshift/cleanup-20260611 branch June 12, 2026 00:45

claude-nightshift[bot] merged 5 commits into main from nightshift/cleanup-20260611
