Add --stream mode for in-place file processing #2

Open
joehoyle wants to merge 6 commits into trunk from add/stream-mode

joehoyle commented Feb 10, 2026

Summary

  • Adds a --stream flag (accepts NO_STREAM or NO_STREAM_AND_NO_DELETE) that reads mydumper marker lines from stdin and performs search-and-replace in place on the data files mydumper has already written
  • Adds --forward flag to re-emit stdin lines to stdout after processing, enabling piping to myloader
  • Refactors existing logic into runSplitMode() with no behavior changes

Usage

# Process files in-place only
mydumper --stream NO_STREAM -o /data | \
  go-search-replace-mydumper --stream=NO_STREAM /data from1 to1

# Process and forward to myloader
mydumper --stream NO_STREAM_AND_NO_DELETE -o /data | \
  go-search-replace-mydumper --stream=NO_STREAM_AND_NO_DELETE --forward /data from1 to1 | \
  myloader --stream

Test plan

  • go vet and go build pass
  • Stream mode processes data files in-place correctly
  • --forward re-emits markers to stdout
  • Non-data files (schema files) are skipped
  • Invalid --stream values are rejected
  • Split mode (default) still works unchanged

🤖 Generated with Claude Code

When mydumper runs with --stream NO_STREAM or NO_STREAM_AND_NO_DELETE,
it writes files to disk and outputs markers to stdout. This new mode
reads those markers from stdin, performs search-and-replace on the
data files in-place, and optionally forwards the marker stream to
stdout for piping to myloader via --forward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

joehoyle commented Mar 2, 2026

Let's add a benchmark here to compare streaming output against in-place file replacement.

joehoyle and others added 5 commits March 2, 2026 13:27
The containsRegex.Match() call on every line was expensive for the
common case of literal string replacements. bytes.Contains is much
cheaper and produces identical results since replacement "from" values
are always literal byte sequences, not regex patterns.

Also makes runStreamMode testable by accepting an io.Reader parameter
instead of reading os.Stdin directly, and adds benchmarks with cached
test data generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
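
To illustrate the literal pre-check described in the commit above, here is a rough sketch. The Replacement type and replaceLine function are hypothetical names, not code from this PR, and the sketch ignores any extra handling the real tool may apply to a line (for example, fixing up serialized string lengths).

```go
package main

import (
	"bytes"
	"fmt"
)

// Replacement is a hypothetical pair of literal byte sequences.
type Replacement struct {
	From, To []byte
}

// replaceLine applies each replacement whose From value occurs in line.
// bytes.Contains is a plain substring scan, so lines without a match are
// returned untouched and without allocating a new slice.
func replaceLine(line []byte, reps []Replacement) []byte {
	for _, r := range reps {
		// Skip empty patterns and lines that do not contain the pattern.
		if len(r.From) == 0 || !bytes.Contains(line, r.From) {
			continue
		}
		line = bytes.ReplaceAll(line, r.From, r.To)
	}
	return line
}

func main() {
	reps := []Replacement{{From: []byte("from1"), To: []byte("to1")}}
	fmt.Printf("%s\n", replaceLine([]byte("INSERT INTO t VALUES ('from1');"), reps))
}
```

Because the "from" values are literal bytes, the substring check selects exactly the lines a regex built from those same literals would have selected, which is why the swap should not change results.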
Stream mode now processes files concurrently using a worker pool.
Defaults to runtime.NumCPU() workers, configurable via --workers flag.
Workers pick files from a channel as markers arrive on stdin, so
multiple files can be processed simultaneously on multi-core systems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, markers were forwarded to stdout immediately when read
from stdin, before the worker finished processing the file. This
could cause downstream consumers (myloader) to read a file still
being modified. Now workers send markers to a forwarding goroutine
only after processFileInPlace completes successfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend dataFileRegex to match both .sql and .dat extensions so files
produced by mydumper's LOAD_DATA format are also processed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
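
As an illustration of the extension check, a pattern along these lines would accept both suffixes. The exact expression used in this PR is not shown in the conversation, so the pattern below is an assumption: it expects a numeric chunk segment before the extension and does not handle compressed files.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataFileRegex here is a hypothetical pattern, not the one from the PR.
// It matches names ending in ".<digits>.sql" or ".<digits>.dat".
var dataFileRegex = regexp.MustCompile(`\.\d+\.(sql|dat)$`)

// isDataFile reports whether path looks like a data chunk file.
func isDataFile(path string) bool {
	return dataFileRegex.MatchString(path)
}

func main() {
	for _, name := range []string{
		"db.t1.00000.sql",
		"db.t1.00000.dat",
		"db.t1-schema.sql",
	} {
		fmt.Println(name, isDataFile(name))
	}
}
```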
The forwarding goroutine now emits lines in the original stdin order
by waiting on per-item done channels. Workers still process files in
parallel, but the forwarder blocks until each item (in order) is
complete before writing to stdout. Non-data lines are forwarded
immediately via a pre-closed channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>