feat(output)!: unify compression behind a Codec abstraction (zstd/gzip/none)

The three sinks that compress (file, http, s3) each constructed their own
zstd::Encoder, hardcoded `.zst`, and hardcoded `Content-Encoding: zstd`.
That blocked gzip/plaintext output and meant a user changing the codec
would also have to fix the file extension and the wire header by hand.

This change pulls compression behind one `Codec` enum that owns format,
level, file extension, and wire encoding name. The three per-sink
`compression_level: Option<i32>` fields are replaced by one shared
`compression: { format, level }` block; the s3 sink's
`{prefix_hash}-{seq}` key template and the file sink's hardcoded
`{prefix}.zst` are unified around a `path_template` / `key_template`
that supports `{ext}` (codec-derived) plus the existing placeholders.
Templates that don't carry `{prefix}` or `{prefix_hash}` (and `{seq}`
when `batch_max_mb` is set on s3) are rejected at startup; a runtime
guard catches residual collisions (fatal for the file sink, warn for
s3).

The s3 batching docs are rewritten to describe the actual lifecycle
(per-prefix encoder, threshold-driven mid-run flush, end-of-run flush)
and the buffering/concurrency caveats that make `batch_max_mb` behave
unintuitively.

Breaking: bare `compression_level` in YAML and `--s3-output-compression-level`
on the CLI are removed; use the new `compression` block /
`--compression-format` + `--compression-level` instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
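
The commit message describes a `Codec` enum that owns format, level, file extension, and wire encoding name. A minimal sketch of that shape follows, assuming hypothetical variant and method names (`Identity`, `extension`, `content_encoding`); the actual definition in the repository may differ.

```rust
/// Hypothetical sketch of the codec abstraction; names are assumptions,
/// not the repository's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Zstd { level: Option<i32> },
    Gzip { level: Option<i32> },
    /// Corresponds to `compression.format: none` (plaintext output).
    Identity,
}

impl Codec {
    /// File extension without the leading dot; `None` for plaintext output.
    pub fn extension(&self) -> Option<&'static str> {
        match self {
            Codec::Zstd { .. } => Some("zst"),
            Codec::Gzip { .. } => Some("gz"),
            Codec::Identity => None,
        }
    }

    /// Value for the `Content-Encoding` header; `None` means omit the header.
    pub fn content_encoding(&self) -> Option<&'static str> {
        match self {
            Codec::Zstd { .. } => Some("zstd"),
            Codec::Gzip { .. } => Some("gzip"),
            Codec::Identity => None,
        }
    }
}
```

Deriving both the extension and the header from one value is what keeps them from drifting apart when a user switches codecs.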
CLAUDE.md: 7 additions & 5 deletions
@@ -1,6 +1,6 @@
 # bucket-scrapper

-S3 bucket content searcher. Downloads compressed objects from S3, stream-decompresses, filters lines by regex, and routes matches to one of four pluggable output sinks: local zstd files, an HTTP API (NDJSON), an S3 bucket, or `/dev/null` (void).
+S3 bucket content searcher. Downloads compressed objects from S3, stream-decompresses, filters lines by regex, and routes matches to one of four pluggable output sinks: local files, an HTTP API (NDJSON), an S3 bucket, or `/dev/null` (void). The file/http/s3 sinks share a single `Codec` abstraction (zstd / gzip / none) so the on-disk extension, the wire `Content-Encoding`, and the encoder bytes stay in lockstep.
-NDJSON-zstd POSTs to an HTTP endpoint, with adaptive (AIMD) throttling and 429 back-off:
+NDJSON POSTs to an HTTP endpoint with adaptive (AIMD) throttling and 429 back-off. `Content-Encoding` is set automatically from the codec (`zstd` / `gzip`) or omitted entirely when `compression.format = none`:
-Per-prefix zstd objects written to a destination S3 bucket (works with non-AWS backends such as Garage and MinIO via `endpoint_url`). By default each source prefix collapses to exactly one output object (N:1, same shape as the file sink). Set `batch_max_mb` to opt into size-based rollover within a prefix:
+Per-prefix objects written to a destination S3 bucket. Works with non-AWS backends (Garage, MinIO) via `endpoint_url`.
 # batch_max_mb: 16 # opt into batched uploads (see below)
+# compression: { format: zstd, level: 3 }
 ```

-Output mapping summary (file and S3 sinks):
+#### Batching model
+
+The s3 sink uploads **batches**. One batch is one `PutObject` call. Every batch carries lines from a single source prefix; there is no cross-prefix mixing or consolidation. The configurable axis is *how many batches per source prefix*.
+
+**Per-prefix encoder lifecycle.** When the first matched line for a source prefix arrives, the sink lazily creates an in-memory encoder for that prefix (zstd / gzip / identity, per `compression.format`). Subsequent matched lines for the same prefix are written into that encoder. A prefix that produces no matches has no encoder and emits no object. Different prefixes have independent encoders and never block each other.
+
+**Default mode (`batch_max_mb` unset).** Each prefix's encoder is finalized exactly once, at end-of-run. The finalized buffer is enqueued for upload. Result: one output object per source prefix, with `{seq}` always rendering to `00000`. This is the same N:1 shape as the file sink.
+
+**Batched mode (`batch_max_mb` set).** After every line ingested for a prefix, the sink measures the **compressed bytes already emitted into that prefix's encoder output buffer** and compares that size to `batch_max_mb`. When the buffer crosses the threshold:
+
+1. The encoder is finalized in place, producing a finished compressed frame.
+2. A destination key is rendered for that frame (`{seq}` is substituted, then incremented for the next batch in this prefix), and the frame is pushed onto the bounded upload queue.
+3. A fresh encoder is constructed for the same prefix; subsequent lines start filling it.
+
+At end-of-run, the sink unconditionally flushes every prefix that still has a non-empty encoder, producing a final batch (the trailing partial). Each prefix's `{seq}` therefore yields a contiguous `00000`, `00001`, … sequence whose length is the number of threshold crossings plus the closing flush.
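
The lifecycle above can be sketched as a small state machine. This is an illustration under stated assumptions, not the sink's implementation: `PrefixBatcher` and `Batch` are hypothetical names, the encoder is the identity codec (`compression.format: none`) so the output buffer grows with every line, the threshold is expressed in bytes, and "crosses" is taken to mean `>=`.

```rust
/// One finished batch: the `{seq}` it was assigned and its bytes.
#[derive(Debug, PartialEq)]
pub struct Batch {
    pub seq: u32,
    pub bytes: Vec<u8>,
}

/// Hypothetical per-prefix batcher using identity encoding, so the
/// output buffer length equals the bytes ingested so far.
pub struct PrefixBatcher {
    threshold_bytes: Option<usize>, // `None` models `batch_max_mb` unset
    buffer: Vec<u8>,
    next_seq: u32,
}

impl PrefixBatcher {
    pub fn new(threshold_bytes: Option<usize>) -> Self {
        Self { threshold_bytes, buffer: Vec::new(), next_seq: 0 }
    }

    /// Ingest one line, then check the threshold. Returns a finalized
    /// batch when the buffer has reached the limit (assumed `>=`).
    pub fn push_line(&mut self, line: &str) -> Option<Batch> {
        self.buffer.extend_from_slice(line.as_bytes());
        self.buffer.push(b'\n');
        match self.threshold_bytes {
            Some(limit) if self.buffer.len() >= limit => Some(self.finalize()),
            _ => None,
        }
    }

    /// End-of-run flush: emits the trailing partial batch, if any.
    pub fn finish(mut self) -> Option<Batch> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(self.finalize())
        }
    }

    /// Hand off the current buffer, assign the current `seq`, and start
    /// over with an empty buffer (the "fresh encoder").
    fn finalize(&mut self) -> Batch {
        let bytes = std::mem::take(&mut self.buffer);
        let seq = self.next_seq;
        self.next_seq += 1;
        Batch { seq, bytes }
    }
}
```

With a 10-byte threshold and three 6-byte lines, the second line finalizes a 12-byte batch with `seq` 0, and `finish` emits the remaining 6 bytes as `seq` 1. Because the check runs only after a whole line is appended, a batch can overshoot the threshold.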
+
+**Why upload size can exceed `batch_max_mb`.** The threshold check sees only the compressed bytes the encoder has already flushed to its output buffer. It doesn't include lines still buffered inside the codec, nor does it preemptively split a line that pushes the buffer over the threshold. In practice each batch lands a little above the threshold rather than at it.
+
+**Why a small `batch_max_mb` may not produce extra batches.** zstd and gzip both buffer internally and only emit compressed bytes when they have enough data to encode efficiently. A trickle of small, highly-compressible lines can keep the encoder's *output* buffer at zero for a long time even though many lines have been ingested, so no threshold crossing fires and end-of-run produces a single batch. To exercise batched mode you need either enough input volume per prefix to force several internal block flushes, or `compression.format: none` (where the output buffer grows monotonically with input size).
+
+**Concurrency and ordering.** Batches go onto a bounded queue drained by N uploader tasks (`upload_tasks`, defaults to ~`cpu/4`). Uploaders run concurrently, so batches within a prefix can land out of order on S3. `{seq}` reflects the order in which the sink *finalized* the batch, not the order S3 finished receiving it. End-of-run waits for the queue to drain before reporting completion.
+
+**Failure model.** Recoverable upload errors (transient 5xx, throttling) surface as `error!` logs but don't stop the pipeline; the run continues with the next batch. Non-recoverable errors (malformed request, auth failure) flip the sink's fatal flag, the download coordinator stops feeding new work, and the run aborts. Lost batches are counted in `OutputStats.lines_dropped`. Multipart uploads are not yet implemented: every batch is a single `PutObject`, so a batch must fit AWS's 5 GB single-PUT cap (the `multipart_threshold_mb` / `multipart_part_mb` config fields are accepted but currently informational, with a startup warning if you set them away from defaults).
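
The recoverable versus non-recoverable split can be sketched as follows. All names here (`UploadError`, `SinkState`, `record_failure`) are hypothetical, and the sketch assumes that the lines of any failed batch are added to the dropped count regardless of error class; the text only says that lost batches are counted.

```rust
/// Hypothetical error classes mirroring the categories named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UploadError {
    ServerError,      // transient 5xx
    Throttled,        // throttling response
    MalformedRequest, // non-recoverable
    AuthFailure,      // non-recoverable
}

impl UploadError {
    pub fn is_recoverable(self) -> bool {
        matches!(self, UploadError::ServerError | UploadError::Throttled)
    }
}

/// Hypothetical sink-side bookkeeping: a fatal flag plus a dropped-line counter.
#[derive(Debug, Default)]
pub struct SinkState {
    pub fatal: bool,
    pub lines_dropped: u64,
}

impl SinkState {
    /// Record a failed batch upload. Returns `true` if the pipeline may
    /// continue with the next batch, `false` once the sink is fatal.
    pub fn record_failure(&mut self, err: UploadError, lines_in_batch: u64) -> bool {
        // Assumption: every failed batch counts toward dropped lines.
        self.lines_dropped += lines_in_batch;
        if !err.is_recoverable() {
            self.fatal = true;
        }
        !self.fatal
    }
}
```

A coordinator polling `fatal` before scheduling more downloads would give the "stops feeding new work" behavior described above.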
+
+**`{seq}` is required when `batch_max_mb` is set.** Without it, every batch within a prefix would render to the same key and silently overwrite the previous one. Startup rejects such configs.
+
+**`{seq}` and reruns.** `{seq}` is an in-memory counter that resets to 0 on every process start. Two runs that hit the same prefix and template produce overlapping `{seq}` values; rely on `{run_id}` (unique per process invocation) to disambiguate them in the destination key.
+
+#### Output mapping summary

 | Sink | Default | With `batch_max_mb` set |
 |------|---------|-------------------------|
-| `file` | one `.zst` file per source prefix (always) | n/a (file sink has no rollover) |
-| `s3` | one object per source prefix | one or more objects per prefix, rolling over at the threshold |
+| `file` | one file per source prefix (always) | n/a (file sink has no batched uploads) |
+| `s3` | one object per source prefix, `{seq}=00000` | one or more objects per prefix; new batch finalized whenever the encoder's compressed output buffer crosses `batch_max_mb`, plus one trailing batch at end-of-run |

 Cross-prefix consolidation (e.g. one daily file across all hours) is not currently supported.

@@ -171,6 +202,42 @@ outputs:
 - type: void
 ```

+### Codecs
+
+The `file`, `http`, and `s3` sinks share a single compression block:
+
+```yaml
+compression:
+  format: zstd | gzip | none # default: zstd
+  level: <int> # codec-specific; omit for the codec default
+| `{prefix_hash}` | 8-char hex hash of the prefix; useful when the raw prefix has awkward characters |
+| `{run_id}` | 8-char hex unique to this process invocation |
+| `{seq}` | Per-prefix zero-padded sequence number (**s3 only**), incremented on each batch finalize when `batch_max_mb` is set |
+| `{ext}` | Codec extension (`zst`, `gz`, or empty); the leading `.` is dropped when empty |
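
The placeholder rules in the table can be illustrated with a small renderer. This is a sketch under assumptions: `KeyParts` and `render_template` are hypothetical names, the prefix hash and run id are passed in rather than computed (the hash function is not specified here), and `{seq}` is padded to five digits to match the `00000` shown above.

```rust
/// Hypothetical inputs for rendering one destination path or key.
pub struct KeyParts<'a> {
    pub prefix: &'a str,
    pub prefix_hash: &'a str, // assumed precomputed 8-char hex
    pub run_id: &'a str,      // assumed precomputed 8-char hex
    pub seq: u32,
    pub ext: &'a str,         // "zst", "gz", or "" for format: none
}

/// Substitute the placeholders in `template`.
pub fn render_template(template: &str, p: &KeyParts<'_>) -> String {
    // Handle ".{ext}" first so an empty extension also drops its leading dot.
    let dotted = if p.ext.is_empty() {
        String::new()
    } else {
        format!(".{}", p.ext)
    };
    template
        .replace(".{ext}", &dotted)
        .replace("{ext}", p.ext)
        .replace("{prefix_hash}", p.prefix_hash)
        .replace("{prefix}", p.prefix)
        .replace("{run_id}", p.run_id)
        .replace("{seq}", &format!("{:05}", p.seq))
}
```

For instance, `{prefix}/{run_id}-{seq}.{ext}` with prefix `logs/2024`, run id `deadbeef`, `seq` 0 and the zstd extension renders to `logs/2024/deadbeef-00000.zst`; with an empty extension the trailing `.` disappears. Sequential `replace` calls are a simplification: a prefix that itself contains a placeholder-like string would be substituted again.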
234
+
235
+
The template **must contain `{prefix}` or `{prefix_hash}`** — without one, every source prefix renders to the same destination. For the file sink this is fatal (two encoders writing to one file would corrupt the output) and is rejected at startup. For the s3 sink it's also rejected at startup, with a runtime warn-and-overwrite as defence in depth against `{prefix_hash}` collisions.
236
+
237
+
When `batch_max_mb` is set on the s3 sink, the template must additionally contain `{seq}`; otherwise every batch within a prefix would render to the same key and overwrite the previous — also a startup error.
238
+
239
+
Anti-pattern (rejected): `path_template: "results.{ext}"`— same path for every prefix.
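
The startup checks described above amount to two substring tests. A minimal sketch, assuming hypothetical names (`SinkKind`, `validate_template`) and plain `String` error messages; the real validator's signature and wording are not shown in this diff.

```rust
/// Hypothetical sink discriminator for validation purposes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SinkKind {
    File,
    S3,
}

/// Reject templates that would map distinct outputs to the same destination.
pub fn validate_template(
    template: &str,
    sink: SinkKind,
    batch_max_mb: Option<u64>,
) -> Result<(), String> {
    // Every source prefix must render to its own destination.
    if !template.contains("{prefix}") && !template.contains("{prefix_hash}") {
        return Err(format!(
            "template {template:?} must contain {{prefix}} or {{prefix_hash}}"
        ));
    }
    // With batching on s3, every batch within a prefix needs its own key.
    if sink == SinkKind::S3 && batch_max_mb.is_some() && !template.contains("{seq}") {
        return Err(format!(
            "template {template:?} must contain {{seq}} when batch_max_mb is set"
        ));
    }
    Ok(())
}
```

Under these rules `results.{ext}` fails for either sink, and `{prefix_hash}.{ext}` passes for s3 only while `batch_max_mb` is unset.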
+
 ## AWS Authentication

 Standard AWS SDK credential chain: environment variables, `~/.aws/credentials`, IAM role, or `aws sso login`. Custom CA bundles via `AWS_CA_BUNDLE`.