[nightshift] 20260611 multi-cleanup#6327
Merged
Merged
Conversation
added 4 commits
June 11, 2026 14:04
…me helper normalize._make_split_writer duplicated the part-NNNNN-of-MMMMM.parquet format string inline twice instead of using datakit.partition_filename, the canonical helper written for exactly this purpose. Route both the main and dups output paths through the helper so the naming contract lives in one place.
Drop _construct_composite_batch_processor, _CompositeBatchProcessor, and as_record_batch from data/_preprocessor.py. This composite-transform pathway has been unreferenced since it was added in 2023; the three symbols only referenced one another and nothing in the repo (or downstream marin) consumed them. The transform classes still used by sharded_datasource are untouched.
Five k8s call sites independently parsed Kubernetes RFC3339 timestamps with
datetime.fromisoformat(s.replace("Z", "+00:00")). One of them (the kubectl
log-line parser) also carried a manual fractional-second truncation block that
worked around a pre-3.11 fromisoformat limitation. On the supported Python
range (>=3.11,<3.13) fromisoformat truncates sub-microsecond fractions itself,
so that block is dead.
Centralize the parse into parse_k8s_timestamp in k8s/types.py alongside the
existing parse_k8s_quantity/parse_k8s_cpu helpers and route all call sites
through it.
The physical Write op was the only one carrying a stringly-typed writer_type (plus a schema field) and forcing run_stage to branch on it and import every writer function, contradicting the plan module's stated design that physical ops encapsulate execution as callables. Resolve the writer at plan time via _writer_for() so run_stage just calls op.write_fn, decoupling it from concrete output formats.
rjpower
approved these changes
Jun 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Four independent, behavior-preserving cleanups, one per subproject.
marin/datakit: normalize._make_split_writer duplicated the
part-NNNNN-of-MMMMM.parquet shard-name format string inline twice instead of
using datakit.partition_filename, the canonical helper written for exactly this
purpose. Both the main and dups output paths now route through the helper, so
the naming contract that consolidate's filename-based join depends on lives in
one place. Output paths are identical; tests/datakit/test_normalize.py (15)
passes.
levanter/data: Removed a closed cluster of dead code from _preprocessor.py
(_construct_composite_batch_processor, _CompositeBatchProcessor,
as_record_batch — 116 lines). This composite-transform pathway has been
unreferenced since it was added in 2023; the three symbols only referenced one
another and nothing in the repo or downstream marin consumed them. The
transform machinery still used by sharded_datasource (_MapTransform,
_BatchMapTransform, _TransformedDataset) and the still-imported BatchResult /
dict_from_record_batch are untouched.
iris/k8s: Five k8s call sites independently parsed Kubernetes RFC3339
timestamps with datetime.fromisoformat(s.replace("Z", "+00:00")), and the
kubectl log-line parser additionally carried a manual fractional-second
truncation block working around a pre-3.11 fromisoformat limitation. On the
supported Python range (>=3.11,<3.13) fromisoformat truncates sub-microsecond
fractions natively, so that block was dead. Added parse_k8s_timestamp in
k8s/types.py alongside parse_k8s_quantity/parse_k8s_cpu, routed all call sites
through it, dropped the dead truncation block and a now-unused datetime import,
and added parametrized tests (Z suffix, explicit offset, microsecond,
nanosecond truncation, malformed-input rejection).
zephyr/plan: The physical Write op was the lone violator of the plan module's
design that physical ops encapsulate execution as callables — it carried a
stringly-typed writer_type plus a schema field and forced run_stage into a
4-way if/elif that imported every writer function. A _writer_for(writer_type,
schema) factory now binds the writer to its callable at plan-build time
(functools.partial binds schema for parquet/vortex); Write carries a single
write_fn and run_stage just calls op.write_fn(stream, output_path). The
user-facing WriteOp in dataset.py keeps its Literal-typed writer_type.
Affected test suites pass: iris k8s parsers (18), zephyr plan/dataset/backends/
execution/groupby/writers/optimization (189), datakit normalize (15), levanter
sharded_dataset/newdataset (11).