
feat(azure_blob sink): add append blob support via blob_type option #25627

Draft
Danielku15 wants to merge 3 commits into vectordotdev:master from Danielku15:feature/append-blob

Conversation

@Danielku15


Summary

Note

This PR overlaps with #25545.
I am keeping this PR as a draft so that I can add tag support once the other PR is merged, since append blobs will also need tagging/metadata support. Feel free to review this PR in the meantime.

Adds blob_type: append support to the azure_blob sink, implementing #19397.

The default behavior (blob_type: block) is unchanged. When blob_type: append is set, each flush appends to a stable-named Azure Append Blob rather than creating a new uniquely-named blob per batch. This is the natural model for continuous log streaming where you want a single growing file per time window.

Key design decisions:

  • Type-aware defaults: blob_time_format defaults to %Y-%m-%d (daily rotation) and blob_append_uuid defaults to false for append type, matching the expected append use case. Both can still be overridden.
  • EAFP append flow: try append_block first (hot path = 1 API call for an existing blob), create the blob on 404, retry. A 409 Conflict on create is swallowed — it means a concurrent writer created the blob first.
  • Azure hard limit enforcement: batch.max_bytes is automatically defaulted to 4 MiB when blob_type: append is configured and the setting is not explicitly set. Values above 4 MiB are rejected at startup with a clear error.
  • Compressed append blobs produce concatenated compressed frames (one per batch). Decompressors that support multi-stream formats (gunzip, zstd -d) handle these correctly.
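The EAFP append flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: FakeStore, StoreError, and append_eafp are hypothetical names, and an in-memory map stands in for the Azure service.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for the HTTP statuses mentioned above.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum StoreError {
    NotFound,   // models a 404 from append_block
    Conflict,   // models a 409 from create
    Other(u16), // any other status
}

/// Hypothetical in-memory stand-in for an append blob service.
#[derive(Default)]
struct FakeStore {
    blobs: HashMap<String, Vec<u8>>,
    calls: usize, // counts API calls so the hot path can be observed
}

impl FakeStore {
    fn append_block(&mut self, name: &str, data: &[u8]) -> Result<(), StoreError> {
        self.calls += 1;
        match self.blobs.get_mut(name) {
            Some(blob) => {
                blob.extend_from_slice(data);
                Ok(())
            }
            None => Err(StoreError::NotFound),
        }
    }

    fn create(&mut self, name: &str) -> Result<(), StoreError> {
        self.calls += 1;
        if self.blobs.contains_key(name) {
            return Err(StoreError::Conflict);
        }
        self.blobs.insert(name.to_string(), Vec::new());
        Ok(())
    }
}

/// Try to append first; on "not found", create the blob and retry once.
fn append_eafp(store: &mut FakeStore, name: &str, data: &[u8]) -> Result<(), StoreError> {
    match store.append_block(name, data) {
        Err(StoreError::NotFound) => {
            match store.create(name) {
                // A conflict means another writer created the blob first.
                Ok(()) | Err(StoreError::Conflict) => {}
                Err(e) => return Err(e),
            }
            store.append_block(name, data)
        }
        other => other,
    }
}
```

In this model the first flush to a new blob costs three calls (failed append, create, append) and every later flush costs one.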

Vector configuration

Minimal append blob configuration:

sinks:
  my_append_logs:
    type: azure_blob
    inputs: [...]
    connection_string: "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
    container_name: "logs"
    blob_prefix: "app/%F/"
    blob_type: append
    encoding:
      codec: json

This produces a single blob per day (e.g. app/2024-07-18/2024-07-18.log) and appends each batch to it. Defaults are: blob_time_format: "%Y-%m-%d", blob_append_uuid: false, batch.max_bytes: 4194304.
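To illustrate why these defaults give a stable name, here is a small sketch of how such a key could be assembled. The helper blob_key, its parameters, and the "-" separator before the UUID are assumptions made for illustration; the prefix and time part are assumed to be already rendered.

```rust
/// Hypothetical helper: joins an already-rendered prefix and time part,
/// optionally adds a UUID suffix, and appends the file extension.
fn blob_key(prefix: &str, time_part: &str, uuid: Option<&str>, extension: &str) -> String {
    let name = match uuid {
        // With a UUID every batch gets a unique name (the block-blob style).
        Some(id) => format!("{}-{}", time_part, id),
        // Without a UUID the name is stable within the time window.
        None => time_part.to_string(),
    };
    format!("{}{}.{}", prefix, name, extension)
}
```

With no UUID, two batches flushed on the same day map to the same key, which is what lets the second batch append to the first batch's blob.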

Explicit batch size and custom rotation:

sinks:
  my_append_logs:
    type: azure_blob
    inputs: [...]
    connection_string: "..."
    container_name: "logs"
    blob_prefix: "app/"
    blob_type: append
    blob_time_format: "%Y-%m-%d-%H"   # hourly rotation
    batch:
      max_bytes: 2097152              # 2 MiB per block
    encoding:
      codec: json

How did you test this PR?

  • Unit tests (cargo test --no-default-features --features sinks-azure_blob sinks::azure_blob): 27 tests pass, including new tests for:
    • Append blob request building with daily time format and no UUID
    • Compressed append blob request building
    • Stable key generation (no UUID, empty time format)
    • UUID override in append mode
    • Hourly rotation via custom blob_time_format
    • Default blob_type is block
    • Config parsing of blob_type = "append"
    • blob_type: append with no explicit batch.max_bytes succeeds at startup (C1 regression test)
    • blob_type: append with batch.max_bytes > 4 MiB fails at startup with a max_bytes exceeds error
    • Direct validate().limit_max_bytes() rejection for oversized values
  • Integration tests (cargo vdev int test azure): tested against Azurite (local Azure emulator) covering:
    • Append blob reuses the same blob across multiple batches
    • Content ordering is preserved across flushes
    • JSON encoding with content-type verification
    • Default daily rotation (blob name changes at day boundary)
    • Multiple forced flushes all land in a single blob

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

@github-actions (bot) added the label "domain: sinks" (Anything related to the Vector's sinks) on Jun 15, 2026
@github-actions (bot) added the labels "domain: external docs" (Anything related to Vector's external, public documentation) and "docs review on hold" (The documentation team reviews PRs only after a PR is approved by the COSE team.) on Jun 15, 2026
@Danielku15

Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5734e634c


Comment thread src/sinks/azure_blob/config.rs Outdated
Comment on lines +319 to +320
if self.blob_type == AzureBlobType::Append && batch.max_bytes.is_none() {
    batch.max_bytes = Some(APPEND_BLOB_MAX_BLOCK_BYTES);


P2: Respect omitted append batch max_bytes

For append configs that set any other [batch] field but omit max_bytes (for example batch.timeout_secs), BatchConfig's per-field serde default has already populated batch.max_bytes with the bulk default of 10_000_000. This condition is therefore false, and the later limit_max_bytes(4 MiB) rejects the config at startup even though the user did not explicitly configure max_bytes, contradicting the new append-mode default.
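One way to address this is to keep the user's value as an Option until the blob type is known, and only then apply a type-aware default. The sketch below is a hypothetical illustration: resolve_max_bytes and BULK_DEFAULT_MAX_BYTES are invented names, and the local AzureBlobType enum is a stand-in for the real config type.

```rust
/// 4 MiB, the append block limit stated in the PR description.
const APPEND_BLOB_MAX_BLOCK_BYTES: usize = 4 * 1024 * 1024;
/// The bulk default quoted in the review comment (name is an assumption).
const BULK_DEFAULT_MAX_BYTES: usize = 10_000_000;

/// Local stand-in for the sink's blob type option.
#[derive(Clone, Copy, Debug)]
enum AzureBlobType {
    Block,
    Append,
}

/// Resolve the effective max_bytes from what the user actually wrote.
/// `None` means "not set by the user", not "already filled with a default".
fn resolve_max_bytes(
    blob_type: AzureBlobType,
    user_max_bytes: Option<usize>,
) -> Result<usize, String> {
    match (blob_type, user_max_bytes) {
        // Omitted in append mode: fall back to the append-specific default.
        (AzureBlobType::Append, None) => Ok(APPEND_BLOB_MAX_BLOCK_BYTES),
        // Explicit but too large in append mode: reject at startup.
        (AzureBlobType::Append, Some(n)) if n > APPEND_BLOB_MAX_BLOCK_BYTES => Err(format!(
            "max_bytes exceeds the append block limit: {} > {}",
            n, APPEND_BLOB_MAX_BLOCK_BYTES
        )),
        // Any other explicit value is taken as-is.
        (_, Some(n)) => Ok(n),
        // Omitted in block mode: keep the existing bulk default.
        (AzureBlobType::Block, None) => Ok(BULK_DEFAULT_MAX_BYTES),
    }
}
```

The key point is that the decision is made on the unset/explicit distinction, so setting only batch.timeout_secs would not trip the 4 MiB check.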


Comment on lines +64 to +71
AzureBlobType::Append => {
    append_blob(
        &blob_client.append_blob_client(),
        request.blob_data,
        request.content_type,
        request.content_encoding,
    )
    .await


P1: Serialize append writes per blob

When blob_type = append, these calls still go through the existing sink driver with configurable/adaptive request concurrency, so two batches targeting the same partition can execute append_blob at the same time. Azure orders appended blocks by the order the service receives them, not the original event order, so under request.concurrency > 1 (or after adaptive concurrency ramps up) flushes to the same blob can be persisted out of order; append mode needs per-blob serialization or an equivalent ordering guard.
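As a rough illustration of per-blob serialization, the sketch below keeps one lock per blob name so that two appends to the same blob never overlap, while appends to different blobs stay independent. PerBlobLocks and its methods are hypothetical names, the code is synchronous and uses only std, and a plain lock only prevents overlap: it does not by itself guarantee that batches acquire the lock in event order.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical registry handing out one lock per blob name.
#[derive(Default)]
struct PerBlobLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl PerBlobLocks {
    /// Return the lock for `blob`, creating it on first use.
    fn lock_for(&self, blob: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        Arc::clone(map.entry(blob.to_string()).or_default())
    }

    /// Run `f` while holding the lock for `blob`.
    fn with_blob<T>(&self, blob: &str, f: impl FnOnce() -> T) -> T {
        let lock = self.lock_for(blob);
        let _guard = lock.lock().unwrap();
        f()
    }
}
```

An async sink would need an async-aware lock or a per-partition queue instead; this sketch only shows the keying idea.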


Comment on lines +125 to +128
match append_client
    .append_block(RequestContent::from(data.to_vec()), data_len, None)
    .instrument(info_span!("request").or_current())
    .await


P1: Avoid retrying non-idempotent appends

This append mutates the blob, but the request is still wrapped in the existing retry policy, which retries timeouts and Azure 5xx/429 errors by cloning and replaying the same AzureBlobRequest. If Azure commits the block and the client then observes a timeout or transient response error, the retry path appends the same batch again; append mode needs an append-position/ETag condition or retries disabled for unsafe cases.

Author


Overstated. Vector is explicitly at-least-once (guarantees.md: duplicates "possible," exactly-once unsupported). No Vector sink implements append-position/ETag idempotency — CloudWatch's sequence token is service-provided, not a replicable pattern. Implementing Azure append_position needs stateful per-blob offset tracking, incompatible with the stateless service + partitioned stream driver, and inconsistent with every peer sink. Also the retry logic only retries 5xx/429 (not None/timeouts — config.rs:474), so the window is narrow (Azure commits, then returns 5xx).

Plan: document in the blob_type field docs + changelog that append is at-least-once and a retried flush may re-append on rare server errors; note request.retry_attempts = 0 for at-most-once. I'd push back on building idempotency here as out of scope.
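The retry rule described in this reply (retry on 5xx and 429, not when there is no status such as a timeout) can be written as a small predicate. This is a sketch of the stated behavior only; should_retry is a hypothetical name and not the function in config.rs.

```rust
/// Hypothetical predicate mirroring the rule described above.
/// `None` models a request that produced no HTTP status (for example a timeout).
fn should_retry(status: Option<u16>) -> bool {
    match status {
        // Throttling is retried.
        Some(429) => true,
        // Server errors are retried; this is the window where a committed
        // append could be replayed and produce a duplicate.
        Some(code) => (500..=599).contains(&code),
        // No status at all: not retried.
        None => false,
    }
}
```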

    content_type: &str,
    content_encoding: Option<&str>,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let data_len = data.len() as u64;


P2: Enforce the limit on encoded append payloads

The startup check limits the batcher's pre-encoding event size, but Azure enforces the append-block limit on the serialized/compressed body length used here. With blob_type: append and JSON logs containing many escapable characters (or gzip overhead around an incompressible near-limit batch), Vector can produce data.len() > 4 MiB and send it to append_block, causing Azure to reject the whole batch despite the config validation.

Author


Consistent with the dominant pattern. Kinesis firehose (4 MB), streams (5 MB), pubsub (10 MB), loki (1 MB), cloudwatch batch (1 MB) all rely solely on pre-encoding batch.max_bytes. With the default gzip, encoded ≤ pre-encoding, so 4 MiB pre-encoding is headroom. Risk only with compression disabled + escape-heavy JSON.
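The uncompressed, escape-heavy case mentioned in this thread can be made concrete with a little arithmetic. The sketch below uses a simplified model of JSON string encoding (only the double quote and backslash are counted as two bytes; control characters are ignored), and both function names are hypothetical.

```rust
/// 4 MiB, the append block limit stated in the PR description.
const APPEND_BLOB_MAX_BLOCK_BYTES: usize = 4 * 1024 * 1024;

/// Simplified length of `raw` once written as a JSON string:
/// `"` and `\` take two bytes each, plus the two surrounding quotes.
fn json_string_len(raw: &str) -> usize {
    let body: usize = raw
        .chars()
        .map(|c| match c {
            '"' | '\\' => 2,
            other => other.len_utf8(),
        })
        .sum();
    body + 2
}

/// Hypothetical guard applied to the encoded body length before sending.
fn check_encoded_len(encoded_len: usize) -> Result<u64, String> {
    if encoded_len > APPEND_BLOB_MAX_BLOCK_BYTES {
        Err(format!(
            "encoded payload is {} bytes, above the {} byte append block limit",
            encoded_len, APPEND_BLOB_MAX_BLOCK_BYTES
        ))
    } else {
        Ok(encoded_len as u64)
    }
}
```

Under this model, a 3 MiB value made entirely of double quotes passes a 4 MiB pre-encoding check but encodes to just over 6 MiB, so a post-encoding guard would reject it.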

Comment on lines +149 to +151
/// - `append`: each batch appends to the same blob.
/// `blob_append_uuid` defaults to `false`; `blob_time_format` defaults to `%Y-%m-%d`.
/// Multiple batches within the same time window write to the same blob.


P2: Document SAS permissions for appends

For SAS-authenticated configs, the existing connection-string guidance only tells users to grant Read/Create permissions, which can pass healthcheck and blob creation but is insufficient for Append Block. This new append option should document that SAS tokens also need Add or Write permission; otherwise users following the component docs get 403s on every flush in append mode.


Comment thread src/sinks/azure_blob/test.rs Outdated
let request_metadata = request_metadata_builder.build(&payload);
let request = request_options.build_request(metadata, request_metadata, payload);

let expected_date = Utc::now().format("%Y-%m-%d").to_string();


P3: Capture one timestamp in date-sensitive tests

This assertion calls Utc::now() after build_request, which already formatted its key using an earlier Utc::now(). If the test runs across a UTC date boundary between those calls, the generated key legitimately contains the previous day while the expected value uses the new day, creating a rare but avoidable flaky failure; capture the time window once or inject a clock into the request builder test.
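The "capture the time once" idea can be sketched without the sink's types. In this std-only illustration the key builder takes the timestamp as a parameter instead of reading the clock itself; day_key and now_unix_secs are hypothetical names, and a day number stands in for the formatted date.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical key builder that receives the time instead of reading the clock.
/// The day number since the Unix epoch stands in for a "%Y-%m-%d" string.
fn day_key(unix_secs: u64) -> String {
    format!("day-{}", unix_secs / 86_400)
}

/// Read the clock once; callers pass the result to everything that needs it.
fn now_unix_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_secs()
}
```

A test would call now_unix_secs() once and derive both the generated key and the expected value from that single reading, so a date boundary between two clock reads can no longer cause a mismatch.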


Labels

  • docs review on hold: The documentation team reviews PRs only after a PR is approved by the COSE team.
  • domain: external docs: Anything related to Vector's external, public documentation
  • domain: sinks: Anything related to the Vector's sinks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AppendBlob support for azure_blob sink

1 participant