feat(gcp_cloud_storage sink): support Parquet batch encoding #25590
Open
dshmatov wants to merge 6 commits into
Add Apache Parquet columnar batch encoding to the `gcp_cloud_storage` sink, matching the existing `aws_s3` sink capability. Enable it via `batch_encoding.codec = "parquet"`. When `batch_encoding` is set, events are encoded together as a batch in the columnar Parquet format instead of the standard per-event, framing-based encoding. Parquet handles its own compression internally (configurable via `batch_encoding.compression`), so the top-level `compression` setting is bypassed (with a warning if set), the object `Content-Type` defaults to `application/vnd.apache.parquet`, and the filename extension defaults to `parquet`. This reuses the shared codec/batch-encoder infrastructure already used by the `aws_s3` sink; the change is purely the sink-side wiring plus the `GcsBatchEncoding` config enum.
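The defaulting rules described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `BatchCodec`, `ObjectSettings`, and `resolve_object_settings` are hypothetical names, the actual `GcsBatchEncoding` enum is not reproduced, and the non-batch fallback values (`text/plain`, `log`) are placeholders chosen for the example.

```rust
/// Hypothetical stand-in for the sink's batch encoding choice.
/// The PR's real `GcsBatchEncoding` enum is not reproduced here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchCodec {
    Parquet,
}

/// Hypothetical summary of the per-object settings the sink ends up using.
#[derive(Debug, PartialEq)]
struct ObjectSettings {
    content_type: String,
    extension: String,
    /// Whether the top-level `compression` setting is applied to the object.
    apply_top_level_compression: bool,
    /// Whether a "compression is bypassed" warning should be logged.
    warn_compression_bypassed: bool,
}

fn resolve_object_settings(
    batch_codec: Option<BatchCodec>,
    top_level_compression_set: bool,
    user_content_type: Option<&str>,
    user_extension: Option<&str>,
) -> ObjectSettings {
    match batch_codec {
        // Parquet compresses internally, so the top-level setting is bypassed
        // and the Parquet-specific defaults apply unless the user overrides them.
        Some(BatchCodec::Parquet) => ObjectSettings {
            content_type: user_content_type
                .unwrap_or("application/vnd.apache.parquet")
                .to_string(),
            extension: user_extension.unwrap_or("parquet").to_string(),
            apply_top_level_compression: false,
            warn_compression_bypassed: top_level_compression_set,
        },
        // Per-event path: the fallback values here are placeholders for the sketch.
        None => ObjectSettings {
            content_type: user_content_type.unwrap_or("text/plain").to_string(),
            extension: user_extension.unwrap_or("log").to_string(),
            apply_top_level_compression: top_level_compression_set,
            warn_compression_bypassed: false,
        },
    }
}
```

Keeping the decision in one pure function makes each rule (default, user override, bypass with warning) easy to assert on in isolation.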
…ct list - 'vnd' is introduced by this PR (the application/vnd.apache.parquet MIME type) - 'deser' comes from the upstream 'deser_failed' buffer metric in internal_metrics.cue, which the PR's merge with master surfaces
drichards-87
approved these changes
Jun 8, 2026
Left a couple of suggestions from Docs and approved the PR.
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
Summary
Adds Apache Parquet columnar batch encoding to the `gcp_cloud_storage` sink, matching the capability the `aws_s3` sink already has. Enable it with `batch_encoding.codec = "parquet"`.
When `batch_encoding` is set, events are encoded together as a batch into a single Parquet file per batch (instead of the standard per-event, framing-based encoding). Parquet handles its own compression internally, so:
- the top-level `compression` setting is bypassed (a warning is logged if it was set),
- the object `Content-Type` defaults to `application/vnd.apache.parquet`,
- the filename extension defaults to `parquet`.
This reuses the shared codec/batch-encoder infrastructure introduced for `aws_s3` (the `BatchEncoder` / `BatchSerializerConfig` / `EncoderKind` machinery in `lib/codecs` and `src/sinks/util`). The change is purely the sink-side wiring plus a `GcsBatchEncoding` config enum, with no new encoding logic.
Vector configuration
Schema-file mode is also supported:
How did you test this PR?
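As a rough illustration of the `PAR1` magic-byte assertion mentioned in the testing notes for this PR, the check can be sketched as below. The helper name `has_parquet_magic` is hypothetical and is not taken from the PR; it only inspects the framing bytes, whereas the PR's test also reads the file back with the `parquet` reader to check row count and columns.

```rust
/// Parquet files begin and end with the 4-byte magic `PAR1`.
const PARQUET_MAGIC: &[u8] = b"PAR1";

/// Hypothetical helper: a cheap framing check, not a full validation.
fn has_parquet_magic(bytes: &[u8]) -> bool {
    // Smallest possible framing: leading magic (4 bytes),
    // footer length field (4 bytes), trailing magic (4 bytes).
    bytes.len() >= 12
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}
```

A check like this catches the case where the sink silently falls back to per-event framing, but it cannot detect a corrupt footer, so it complements a real reader rather than replacing it.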
- `cargo test --features "sinks-gcp,codecs-parquet" --lib gcp::cloud_storage`: 9 new Parquet tests + the 15 existing GCS sink tests pass.
- Covered: `schema_mode` parsing/defaults, content-type auto-detection + user override, `.parquet` extension default + override, top-level-compression bypass, rejection of non-parquet codecs.
- `parquet_encodes_valid_file` runs a real batch through the sink's request builder and asserts the output is a valid Parquet file (`PAR1` magic bytes, correct row count, inferred `message`/`host` columns) using the `parquet` reader.
- Checked with `codecs-parquet` both enabled and disabled (the feature-gated paths).
- `make check-clippy`, `make fmt`, and `make check-generated-docs` (regenerated `gcp_cloud_storage.cue`) pass.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References