feat(gcp_cloud_storage sink): support Parquet batch encoding #25590

Open
dshmatov wants to merge 6 commits into vectordotdev:master from dshmatov:feat/gcs-parquet-batch-encoding

@dshmatov dshmatov commented Jun 7, 2026

Summary

Adds Apache Parquet columnar batch encoding to the gcp_cloud_storage sink, matching the capability the aws_s3 sink already has. Enable it with batch_encoding.codec = "parquet".

When batch_encoding is set, each batch of events is encoded into a single Parquet file (instead of the standard per-event, framing-based encoding). Parquet handles its own compression internally, so:

  • the top-level compression setting is bypassed (a warning is logged if it was set),
  • the object Content-Type defaults to application/vnd.apache.parquet,
  • the filename extension defaults to parquet.
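
The three defaults above can be sketched as a small resolution function. This is a minimal illustration, not the sink's actual code: `BatchCodec`, `ObjectSettings`, `resolve_object_settings`, and the two `PLACEHOLDER_*` constants are hypothetical names, and the non-batch defaults are placeholders chosen only to make the example complete.

```rust
/// Hypothetical stand-in for the batch codec selection (assumption, not Vector's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BatchCodec {
    Parquet,
}

/// Placeholder defaults for the non-batch path; illustrative values only.
const PLACEHOLDER_CONTENT_TYPE: &str = "text/plain";
const PLACEHOLDER_EXTENSION: &str = "log";

/// Resolved per-object settings (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct ObjectSettings {
    pub content_type: String,
    pub extension: String,
    pub use_top_level_compression: bool,
    pub warning: Option<String>,
}

/// Applies the rules from the summary: with Parquet batch encoding the
/// top-level compression is bypassed (with a warning if it was set), and the
/// content type and extension get Parquet defaults unless the user overrides them.
pub fn resolve_object_settings(
    batch_codec: Option<BatchCodec>,
    top_level_compression: Option<&str>,
    content_type_override: Option<&str>,
    extension_override: Option<&str>,
) -> ObjectSettings {
    let (default_content_type, default_extension, use_top_level_compression, warning) =
        match batch_codec {
            Some(BatchCodec::Parquet) => {
                // Parquet compresses internally, so the outer setting is ignored.
                let warning = top_level_compression.map(|name| {
                    format!("top-level compression `{}` is ignored for Parquet", name)
                });
                ("application/vnd.apache.parquet", "parquet", false, warning)
            }
            None => (
                PLACEHOLDER_CONTENT_TYPE,
                PLACEHOLDER_EXTENSION,
                top_level_compression.is_some(),
                None,
            ),
        };

    ObjectSettings {
        // Explicit user settings always win over the defaults.
        content_type: content_type_override.unwrap_or(default_content_type).to_string(),
        extension: extension_override.unwrap_or(default_extension).to_string(),
        use_top_level_compression,
        warning,
    }
}
```

Calling `resolve_object_settings(Some(BatchCodec::Parquet), Some("gzip"), None, None)` in this sketch yields the Parquet content type and extension, disables the outer compression, and carries a warning string.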

This reuses the shared codec/batch-encoder infrastructure introduced for aws_s3 (the BatchEncoder / BatchSerializerConfig / EncoderKind machinery in lib/codecs and src/sinks/util). The change is purely the sink-side wiring plus a GcsBatchEncoding config enum — no new encoding logic.
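
To illustrate the difference between per-event framing and batch encoding, here is a toy sketch. The `EncoderKind` enum below only borrows the name mentioned above; its variants, the `Event` struct, and `toy_batch_serializer` are simplified assumptions and do not reflect the real types in lib/codecs or src/sinks/util. The toy serializer does not produce Parquet.

```rust
/// Hypothetical, simplified event (assumption for illustration).
pub struct Event {
    pub message: String,
}

/// Toy dispatch between the two encoding styles; not Vector's real enum.
pub enum EncoderKind {
    /// Per-event path: each event is serialized, then followed by a delimiter.
    Framed { delimiter: u8 },
    /// Batch path: the whole slice of events is serialized in one call.
    Batch(fn(&[Event]) -> Vec<u8>),
}

/// Encodes a batch of events into one payload using the selected style.
pub fn encode(kind: &EncoderKind, events: &[Event]) -> Vec<u8> {
    match kind {
        EncoderKind::Framed { delimiter } => {
            let mut out = Vec::new();
            for event in events {
                out.extend_from_slice(event.message.as_bytes());
                out.push(*delimiter);
            }
            out
        }
        // The serializer sees every event at once, which is what a columnar
        // format needs in order to lay out values column by column.
        EncoderKind::Batch(serialize) => serialize(events),
    }
}

/// Toy batch serializer: a row-count header followed by the rows.
/// It stands in for a real columnar writer and is NOT Parquet.
pub fn toy_batch_serializer(events: &[Event]) -> Vec<u8> {
    let mut out = format!("rows={}\n", events.len()).into_bytes();
    for event in events {
        out.extend_from_slice(event.message.as_bytes());
        out.push(b'\n');
    }
    out
}
```

The point of the sketch is the shape of the batch variant: it receives the full slice, so it can emit file-level structure (here a row count, in Parquet a footer with metadata) that a per-event encoder cannot.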

Vector configuration

sources:
  demo:
    type: demo_logs
    format: json

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: [demo]
    bucket: my-bucket
    encoding:
      codec: text
    batch_encoding:
      codec: parquet
      schema_mode: auto_infer
      compression:
        algorithm: snappy

Schema-file mode is also supported:

    batch_encoding:
      codec: parquet
      schema_mode: strict
      schema_file: /etc/vector/schema.parquet
      compression:
        algorithm: zstd
        level: 10

How did you test this PR?

  • cargo test --features "sinks-gcp,codecs-parquet" --lib gcp::cloud_storage — 9 new Parquet tests + the 15 existing GCS sink tests pass.
    • Config-level: TOML/YAML shape, schema_mode parsing/defaults, content-type auto-detection + user override, .parquet extension default + override, top-level-compression bypass, rejection of non-parquet codecs.
    • End-to-end encoding: parquet_encodes_valid_file runs a real batch through the sink's request builder and asserts the output is a valid Parquet file (PAR1 magic bytes, correct row count, inferred message/host columns) using the parquet reader.
  • Verified the crate compiles with codecs-parquet both enabled and disabled (the feature-gated paths).
  • make check-clippy, make fmt, and make check-generated-docs (regenerated gcp_cloud_storage.cue) pass.
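
The PAR1 magic-byte assertion mentioned for the end-to-end test can be sketched with the standard library alone. This is not the PR's test code: `has_parquet_magic` and `footer_len` are hypothetical helpers, and they only inspect the file envelope (a Parquet file starts and ends with the 4-byte magic `PAR1`, with a 4-byte little-endian footer length just before the trailing magic). Row counts and column names would still need a real Parquet reader.

```rust
/// Magic bytes that open and close an unencrypted Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Returns true when `bytes` carries the Parquet magic at both ends.
/// 12 bytes is a lower bound: magic + footer length + magic.
pub fn has_parquet_magic(bytes: &[u8]) -> bool {
    bytes.len() >= 12
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}

/// Reads the little-endian footer (file metadata) length stored in the
/// four bytes immediately before the trailing magic.
pub fn footer_len(bytes: &[u8]) -> Option<u32> {
    if !has_parquet_magic(bytes) {
        return None;
    }
    let n = bytes.len();
    Some(u32::from_le_bytes([
        bytes[n - 8],
        bytes[n - 7],
        bytes[n - 6],
        bytes[n - 5],
    ]))
}
```

A check like this is a cheap first assertion before handing the bytes to a full reader, since it fails fast on truncated or non-Parquet output.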

Note: this repo currently has no integration-test harness for the gcp_cloud_storage sink (no fake-gcs-server service under scripts/integration/), so Parquet output is validated at the unit level through the real encode path rather than against a live backend. Happy to add an integration harness if maintainers would prefer one.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Add Apache Parquet columnar batch encoding to the `gcp_cloud_storage`
sink, matching the existing `aws_s3` sink capability. Enable it via
`batch_encoding.codec = "parquet"`.

When `batch_encoding` is set, events are encoded together as a batch in
the columnar Parquet format instead of the standard per-event,
framing-based encoding. Parquet handles its own compression internally
(configurable via `batch_encoding.compression`), so the top-level
`compression` setting is bypassed (with a warning if set), the object
`Content-Type` defaults to `application/vnd.apache.parquet`, and the
filename extension defaults to `parquet`.

This reuses the shared codec/batch-encoder infrastructure already used
by the `aws_s3` sink; the change is purely the sink-side wiring plus the
`GcsBatchEncoding` config enum.
@dshmatov dshmatov requested review from a team as code owners June 7, 2026 23:37
@github-actions github-actions Bot added docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. domain: sinks Anything related to the Vector's sinks domain: external docs Anything related to Vector's external, public documentation and removed docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. labels Jun 7, 2026
@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@dshmatov

dshmatov commented Jun 7, 2026

Author

I have read the CLA Document and I hereby sign the CLA

dshmatov added 2 commits June 8, 2026 01:54
…ct list

- 'vnd' is introduced by this PR (the application/vnd.apache.parquet MIME type)
- 'deser' comes from the upstream 'deser_failed' buffer metric in
  internal_metrics.cue, which the PR's merge with master surfaces
@github-actions github-actions Bot added docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. domain: ci Anything related to Vector's CI environment and removed docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. labels Jun 7, 2026
@drichards-87 drichards-87 self-assigned this Jun 8, 2026

@drichards-87 drichards-87 left a comment

Contributor

Left a couple of suggestions from Docs and approved the PR.

Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
@drichards-87 drichards-87 removed their assignment Jun 8, 2026
dshmatov and others added 2 commits June 8, 2026 19:24
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
@github-actions github-actions Bot added the docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. label Jun 8, 2026
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>

Labels

docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. domain: ci Anything related to Vector's CI environment domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks
