Improve CBOR request marshalling performance: pooled buffer, zero-copy blobs, and reduced validation overhead #4450

Open
muhammad-othman wants to merge 1 commit into development from muhamoth/DOTNET-8587-optimize-cbor-serilaization

@muhammad-othman muhammad-othman commented Jun 25, 2026

Description

Encode directly into a pooled buffer. Generated CBOR marshallers previously called request.Content = writer.Encode(), which allocates a fresh byte[] per request. They now encode into a PooledContentStream (backed by a pooled ArrayPoolBufferWriter<byte>) and set request.ContentStream, matching the JSON (#4409) and XML pattern. The downstream pipeline already understands PooledContentStream, and the buffer is returned to the pool after the request completes.
Pre-size the pooled buffer. The buffer is rented at writer.BytesWritten (the exact encoded size, known up front) so it's allocated at the right size in one shot, avoiding a default-size rent followed by a resize-and-return.
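
The pre-sized pooled encode described above can be sketched roughly as follows. This is a minimal illustration and not the SDK's code: it rents directly from ArrayPool<byte>.Shared in place of PooledContentStream and ArrayPoolBufferWriter<byte>, and the names PooledEncodeSketch, EncodePooled, and consume are hypothetical. It assumes the System.Formats.Cbor package is referenced.

```csharp
using System;
using System.Buffers;
using System.Formats.Cbor;

public static class PooledEncodeSketch
{
    // Hypothetical helper: encodes the writer's content into a rented buffer,
    // passes the encoded slice to `consume`, then returns the buffer to the pool.
    public static int EncodePooled(CborWriter writer, Action<ReadOnlyMemory<byte>> consume)
    {
        // The exact encoded size is known before encoding, so one rent is enough.
        int size = writer.BytesWritten;
        byte[] rented = ArrayPool<byte>.Shared.Rent(size);
        try
        {
            // Encode(Span<byte>) writes into the caller's buffer instead of
            // allocating a fresh byte[] the way the parameterless Encode() does.
            int written = writer.Encode(rented.AsSpan());

            // The rented array may be larger than requested; expose only the used part.
            consume(new ReadOnlyMemory<byte>(rented, 0, written));
            return written;
        }
        finally
        {
            // The slice must not be used after this point.
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

In the PR the buffer outlives the marshaller and is returned only after the request completes; the callback here just keeps the rent/return pairing visible in one place.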

Zero-copy binary blobs. Added CborWriterExtensions.WriteByteString(MemoryStream), which writes the stream's underlying buffer directly via MemoryStream.TryGetBuffer instead of allocating a copy with .ToArray(), falling back to .ToArray() when the buffer isn't publicly visible. This is the main win for large binary payloads.
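
A minimal sketch of the TryGetBuffer-with-fallback pattern this paragraph describes. It is not the actual CborWriterExtensions.WriteByteString(MemoryStream) implementation; the class and method names are hypothetical, and it assumes the System.Formats.Cbor package is referenced.

```csharp
using System;
using System.Formats.Cbor;
using System.IO;

public static class ByteStringSketch
{
    // Hypothetical helper: writes the full contents of `stream` as a CBOR byte string.
    public static void WriteByteStringFrom(CborWriter writer, MemoryStream stream)
    {
        if (stream.TryGetBuffer(out ArraySegment<byte> segment))
        {
            // The segment covers the stream's content regardless of Position,
            // so the writer reads the underlying buffer with no intermediate copy.
            writer.WriteByteString(segment.AsSpan());
        }
        else
        {
            // The buffer is not publicly visible (for example new MemoryStream(byte[])),
            // so fall back to copying the content out.
            writer.WriteByteString(stream.ToArray());
        }
    }
}
```

A stream created with `new MemoryStream()` takes the first branch, while one wrapping a caller-supplied array through `new MemoryStream(byte[])` takes the fallback; both should produce the same encoded bytes.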

Lower CBOR validation overhead. CborWriterPool now creates writers with CborConformanceMode.Lax instead of Canonical. AWS CBOR services do not require canonical (sorted) map-key ordering, so Lax is safe and skips the per-request key-sorting work. Verified against AWS services and the RPC v2 CBOR protocol tests.
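
To illustrate the difference between the two modes, the sketch below encodes the same map with keys written in a deliberately non-sorted order. ConformanceSketch and EncodeMap are hypothetical names used only for this example; it assumes the System.Formats.Cbor package is referenced.

```csharp
using System.Formats.Cbor;

public static class ConformanceSketch
{
    // Hypothetical helper: writes the map {"bb": 1, "a": 2} with "bb" first.
    public static byte[] EncodeMap(CborConformanceMode mode)
    {
        var writer = new CborWriter(mode);
        writer.WriteStartMap(2);       // definite-length map, accepted by both modes
        writer.WriteTextString("bb");
        writer.WriteInt32(1);
        writer.WriteTextString("a");
        writer.WriteInt32(2);
        writer.WriteEndMap();          // Canonical sorts the key/value pairs here; Lax does not
        return writer.Encode();
    }
}
```

With CborConformanceMode.Lax the keys stay in insertion order; with CborConformanceMode.Canonical the writer reorders them when the map is closed. Both outputs have the same length and decode to the same map, which is why the choice only matters if the receiving service requires canonical ordering.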

Allocation-free float encoding. WriteOptimizedNumber(float) now encodes the float32 form directly into a stackalloc buffer via BinaryPrimitives.WriteSingleBigEndian on .NET 8+, removing the per-call byte[5] allocation.
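
A minimal sketch of the stackalloc approach, assuming the 5-byte float32 form is an initial byte 0xFA followed by the IEEE 754 value in big-endian order, and assuming the bytes are handed to the writer through CborWriter.WriteEncodedValue. The real WriteOptimizedNumber(float) may choose between several encodings; Float32Sketch and WriteFloat32 are hypothetical names, and the System.Formats.Cbor package is assumed to be referenced.

```csharp
using System;
using System.Buffers.Binary;
using System.Formats.Cbor;

public static class Float32Sketch
{
    // Hypothetical helper: writes `value` as a CBOR float32 without a heap allocation.
    public static void WriteFloat32(CborWriter writer, float value)
    {
        // 1 initial byte + 4 payload bytes, placed on the stack instead of a byte[5].
        Span<byte> buffer = stackalloc byte[5];
        buffer[0] = 0xFA; // major type 7, additional info 26: single-precision float
        BinaryPrimitives.WriteSingleBigEndian(buffer.Slice(1), value);

        // Append the pre-encoded item to the writer's output.
        writer.WriteEncodedValue(buffer);
    }
}
```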

Allocation improvements

  • Per-request allocation is now constant (~1.27 KB) regardless of payload size for mixed-item, nested, and shallow-map requests, where before it grew with the payload (e.g. ShallowMap_L 11.91 KB → 1.27 KB, MixedItem_L 4.30 KB → 1.27 KB).
    The reason: the old path allocated a fresh byte[] from writer.Encode() whose size scaled with the request body. The new path encodes into a pooled ArrayPoolBufferWriter<byte> that is rented and returned per request, so the body buffer doesn't count as a per-request allocation. That leaves only the fixed marshalling overhead (request object, headers, the stream wrapper), which is independent of body size.
  • For large binary payloads the win is in eliminating the LOH-bound byte[] for the encoded body: BinaryData_L allocation dropped ~46% (513.72 KB → 274.78 KB), and its Gen2 column fell from 89.29 to 48.95 rather than to zero. (Residual allocation there is the blob copy itself, only taken when the source stream's buffer isn't directly accessible.)

Mean latency improvements

  • Isolated marshaller: ~20% faster for small/baseline requests, rising to ~42–51% for medium/large shallow-map and mixed payloads (e.g. ShallowMap_M −50.7%, MixedItem_L −46.7%, ShallowMap_L −47.7%) and ~30–37% for nested payloads (Nested_L −37.2%). Large binary requests improved ~23% (BinaryData_L 64.3 µs → 49.2 µs).
  • End-to-end (PutItem): mean latency improved more as payload size grew, by ~5% (S), ~18% (M), and ~21% (L) (PutItem_L 38.1 µs → 30.2 µs), with allocation also down (35.62 KB → 32.54 KB). The relative gain is smaller than the isolated numbers because E2E time is dominated by transport and response handling, which this change doesn't touch.

(Full before/after tables are in the testing section.)

Motivation and Context

DOTNET-8587

Testing

  • Added WriteByteStringTests covering the zero-copy path, the non-exposable fallback, empty streams, position-independence, and large payloads (output is byte-identical to .ToArray()).
  • Ran the performance tests, with the following results:
    Before

| Method | Mean | Error | StdDev | Max | P50 | P90 | P95 | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rpcv2Cbor_PutItemRequest_Baseline | 809.6 ns | 54.46 ns | 53.49 ns | 925.7 ns | 807.1 ns | 863.1 ns | 879.2 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_S | 949.2 ns | 26.89 ns | 30.96 ns | 1,013.2 ns | 942.9 ns | 998.4 ns | 999.3 ns | 0.1459 | - | - | 1.8 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_M | 6,054.9 ns | 211.00 ns | 242.98 ns | 6,510.2 ns | 6,042.2 ns | 6,389.7 ns | 6,414.5 ns | 5.3177 | - | - | 65.3 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_L | 64,340.2 ns | 3,952.61 ns | 4,229.24 ns | 74,048.0 ns | 62,927.9 ns | 69,070.7 ns | 71,332.5 ns | 89.4775 | 89.2944 | 89.2944 | 513.72 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_S | 1,321.0 ns | 22.71 ns | 10.08 ns | 1,341.4 ns | 1,319.9 ns | 1,330.3 ns | 1,335.8 ns | 0.1068 | - | - | 1.33 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_M | 4,855.6 ns | 84.43 ns | 30.11 ns | 4,889.4 ns | 4,859.3 ns | 4,884.9 ns | 4,887.1 ns | 0.1526 | - | - | 1.94 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_L | 13,281.4 ns | 261.93 ns | 93.41 ns | 13,404.8 ns | 13,242.1 ns | 13,398.5 ns | 13,401.7 ns | 0.3510 | - | - | 4.3 KB |
| rpcv2Cbor_PutItemRequest_Nested_M | 2,445.0 ns | 33.60 ns | 8.73 ns | 2,456.5 ns | 2,446.9 ns | 2,452.9 ns | 2,454.7 ns | 0.1183 | - | - | 1.47 KB |
| rpcv2Cbor_PutItemRequest_Nested_L | 6,057.6 ns | 87.88 ns | 31.34 ns | 6,081.9 ns | 6,068.9 ns | 6,081.4 ns | 6,081.6 ns | 0.1526 | - | - | 1.88 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_S | 1,634.3 ns | 46.98 ns | 52.22 ns | 1,753.4 ns | 1,637.0 ns | 1,695.8 ns | 1,703.9 ns | 0.1106 | - | - | 1.37 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_M | 9,987.8 ns | 1,105.66 ns | 1,273.28 ns | 12,807.0 ns | 9,339.1 ns | 11,990.0 ns | 12,301.1 ns | 0.1831 | - | - | 2.29 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_L | 79,139.4 ns | 1,827.05 ns | 2,030.76 ns | 82,464.2 ns | 79,047.4 ns | 82,074.0 ns | 82,193.6 ns | 0.8545 | - | - | 11.91 KB |

| Method | Mean | Error | StdDev | Max | P50 | P90 | P95 | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| rpcV2Cbor_e2e_PutItem_S | 22,955.3 ns | 312.9 ns | 417.8 ns | 23,962.9 ns | 22,879.4 ns | 23,404.3 ns | 23,437.2 ns | 2.5635 | 0.8545 | 32.61 KB |
| rpcV2Cbor_e2e_PutItem_M | 30,175.0 ns | 809.8 ns | 2,349.5 ns | 36,925.2 ns | 29,700.8 ns | 33,414.5 ns | 34,824.5 ns | 2.6855 | 0.8850 | 33.25 KB |
| rpcV2Cbor_e2e_PutItem_L | 38,136.6 ns | 336.8 ns | 315.0 ns | 38,597.9 ns | 38,170.1 ns | 38,458.5 ns | 38,506.3 ns | 2.8687 | 0.9155 | 35.62 KB |

After

| Method | Mean | Error | StdDev | Max | P50 | P90 | P95 | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rpcv2Cbor_PutItemRequest_Baseline | 648.9 ns | 47.28 ns | 50.59 ns | 755.7 ns | 633.1 ns | 731.2 ns | 751.9 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_S | 728.1 ns | 15.31 ns | 2.37 ns | 730.0 ns | 728.8 ns | 729.9 ns | 729.9 ns | 0.1259 | - | - | 1.55 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_M | 4,433.8 ns | 99.44 ns | 106.40 ns | 4,693.1 ns | 4,398.4 ns | 4,539.1 ns | 4,565.4 ns | 2.7084 | 0.0076 | - | 33.3 KB |
| rpcv2Cbor_PutItemRequest_BinaryData_L | 49,243.1 ns | 7,186.55 ns | 7,987.83 ns | 69,506.8 ns | 48,660.4 ns | 58,092.2 ns | 60,346.7 ns | 49.0723 | 48.9502 | 48.9502 | 274.78 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_S | 1,032.1 ns | 29.58 ns | 31.65 ns | 1,109.5 ns | 1,018.6 ns | 1,079.6 ns | 1,088.6 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_M | 2,822.5 ns | 44.91 ns | 11.66 ns | 2,832.1 ns | 2,827.2 ns | 2,830.3 ns | 2,831.2 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_MixedItem_L | 7,073.4 ns | 133.26 ns | 96.35 ns | 7,232.0 ns | 7,065.9 ns | 7,216.5 ns | 7,224.2 ns | 0.0992 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_Nested_M | 1,711.8 ns | 33.08 ns | 27.62 ns | 1,755.6 ns | 1,724.1 ns | 1,735.9 ns | 1,743.8 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_Nested_L | 3,803.0 ns | 75.55 ns | 80.84 ns | 3,933.6 ns | 3,774.6 ns | 3,928.7 ns | 3,931.9 ns | 0.0992 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_S | 1,123.2 ns | 21.12 ns | 9.38 ns | 1,133.8 ns | 1,124.4 ns | 1,133.2 ns | 1,133.5 ns | 0.1030 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_M | 4,927.0 ns | 95.07 ns | 79.38 ns | 5,080.3 ns | 4,901.4 ns | 5,034.8 ns | 5,057.7 ns | 0.0992 | - | - | 1.27 KB |
| rpcv2Cbor_PutItemRequest_ShallowMap_L | 41,417.6 ns | 442.74 ns | 196.58 ns | 41,716.3 ns | 41,429.7 ns | 41,640.0 ns | 41,678.2 ns | 0.0610 | - | - | 1.27 KB |

| Method | Mean | Error | StdDev | Max | P50 | P90 | P95 | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| rpcV2Cbor_e2e_PutItem_S | 21,735.0 ns | 327.3 ns | 321.5 ns | 22,273.1 ns | 21,834.6 ns | 22,107.0 ns | 22,187.7 ns | 2.5635 | 0.8545 | 32.51 KB |
| rpcV2Cbor_e2e_PutItem_M | 24,658.3 ns | 450.0 ns | 552.7 ns | 25,585.7 ns | 24,639.3 ns | 25,424.4 ns | 25,516.9 ns | 2.6550 | 0.8850 | 32.54 KB |
| rpcV2Cbor_e2e_PutItem_L | 30,158.3 ns | 259.6 ns | 216.8 ns | 30,635.1 ns | 30,201.4 ns | 30,338.7 ns | 30,465.5 ns | 2.6550 | 0.8850 | 32.54 KB |

Dry-runs

  • DotNet Dry-run ID: ac3704dc-e14c-4f17-bc5e-7c9af24d1bdc
    • Pending
    • Completed successfully
    • Failed
  • PowerShell Dry-run ID: d5e6c7b7-f95b-45ca-8f3e-a7f6567093ff
    • Pending
    • Completed successfully
    • Failed

Breaking Changes Assessment

  1. Identify all breaking changes including the following details:
    • What functionality was changed?
    • How will this impact customers?
    • Why does this need to be a breaking change and what are the most notable non-breaking alternatives?
    • Are best practices being followed?
    • How have you tested this breaking change?
  2. Has a senior/+ engineer been assigned to review this PR?

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed

License

  • I confirm that this pull request can be released under the Apache 2 license

@muhammad-othman muhammad-othman requested review from a team as code owners June 25, 2026 01:01
@muhammad-othman muhammad-othman requested review from AlexDaines, Copilot and dscpinheiro and removed request for a team June 25, 2026 01:01

Copilot AI left a comment

Pull request overview

Improves CBOR request marshalling performance across generated marshallers and the CBOR extensions/runtime by reducing per-request allocations (pooled request body buffer, avoiding MemoryStream.ToArray() copies), lowering CBOR conformance overhead, and optimizing float encoding.

Changes:

  • Add an initialCapacity constructor to PooledContentStream to rent a right-sized pooled buffer when the encoded size is known up front.
  • Update CBOR request/structure marshaller templates to (a) encode into PooledContentStream (non-.NET Framework) and (b) write MemoryStream blobs without allocating a copy.
  • Switch CborWriterPool to CborConformanceMode.Lax, add a WriteByteString(MemoryStream) overload + tests, and optimize float32 encoding for .NET 8+.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| sdk/src/Core/Amazon.Runtime/Internal/PooledContentStream.cs | Adds a pre-sized pooled-stream constructor to support right-sized buffer renting. |
| generator/ServiceClientGeneratorLib/Generators/Marshallers/CborStructureMarshaller.tt | Updates generated CBOR structure marshalling for MemoryStream blobs to avoid .ToArray() allocations. |
| generator/ServiceClientGeneratorLib/Generators/Marshallers/CborStructureMarshaller.cs | Regenerated T4 output reflecting MemoryStream blob marshalling changes. |
| generator/ServiceClientGeneratorLib/Generators/Marshallers/CborRequestMarshaller.tt | Updates generated CBOR request marshalling to encode into PooledContentStream on non-.NET Framework TFMs. |
| generator/ServiceClientGeneratorLib/Generators/Marshallers/CborRequestMarshaller.cs | Regenerated T4 output reflecting pooled-buffer encoding changes. |
| generator/.DevConfigs/917b930a-b60b-4e06-909f-586826f59759.json | DevConfig entries for Core + Extensions.CborProtocol patch releases and changelog messages. |
| extensions/test/CborProtocol.Tests/WriteByteStringTests.cs | Adds test coverage for the WriteByteString(MemoryStream) zero-copy/fallback behavior. |
| extensions/src/AWSSDK.Extensions.CborProtocol/Internal/CborWriterPool.cs | Switches the pool to create writers using CborConformanceMode.Lax. |
| extensions/src/AWSSDK.Extensions.CborProtocol/CborWriterExtensions.cs | Adds WriteByteString(MemoryStream) overload and allocation-free float32 encoding on .NET 8+. |

Comment thread sdk/src/Core/Amazon.Runtime/Internal/PooledContentStream.cs
@muhammad-othman muhammad-othman force-pushed the muhamoth/DOTNET-8587-optimize-cbor-serilaization branch from e8b31b5 to d348a90 Compare June 25, 2026 01:20
@boblodgett boblodgett self-requested a review June 25, 2026 04:39