Skip to content

Json: Fixes unbounded recursion in binary value-length decoder to prevent DoS#5909

Open
tvaron3 wants to merge 6 commits into
mainfrom
tvaron3/fix-binary-json-recursion
Open

Json: Fixes unbounded recursion in binary value-length decoder to prevent DoS#5909
tvaron3 wants to merge 6 commits into
mainfrom
tvaron3/fix-binary-json-recursion

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented May 28, 2026

Description

Hardens the Microsoft.Azure.Cosmos binary JSON code paths against malformed / hostile payloads that could crash the host process with an unrecoverable StackOverflowException. The originally reported attack is JsonBinaryEncoding.ValueLengths.GetValueLength recursively decoding Arr1 (0xE1) and Obj1 (0xE9) type markers with no depth check — a ~78 KB payload starting with 0x80 was enough to abort the process with SIGABRT.

Note: this requires the attacker to control the response bytes (MitM, rogue / hijacked endpoint, misconfigured emulator URL), so it only matters after auth. Severity is correspondingly low, but the fix is straightforward and brings the binary decoder in line with the streaming reader's existing 256-level nesting policy.

While auditing the rest of the binary path for the same shape of bug, a few related gaps were found and fixed in the same PR:

  1. JsonBinaryWriter.ForceRewriteRawJsonValue recursively re-serializes nested arrays / objects with no stack-budget check, and follows resolved-reference-string (StrR1StrR4) offsets read straight from the input buffer without validating that the target is in range or is itself a plain string. A crafted payload could either stack-overflow the writer or cause an out-of-range / cycle traversal.

  2. CosmosElement walker stack-budget guards (defense-in-depth). The decoder cap is 256 nested levels, which is the right cap to let legitimate payloads through. But on a small-stack worker thread (ASP.NET request thread, Azure Functions, container default ~512KB, etc.), a tree the decoder correctly lets through at depth 256 can still exhaust the stack inside a downstream recursive walker, which is also an unrecoverable crash. RuntimeHelpers.EnsureSufficientExecutionStack() is budget-based rather than depth-based, so adding it at the top of each recursive Visit / Equals / GetHashCode converts the crash into a catchable InsufficientExecutionStackException without imposing a hard depth limit that would break customers with legitimately deep data.

    Of the walkers covered: DeserializationVisitor (e.g. ReadItemAsync<T> / query-result materialization) is the only one with a realistic standalone MitM attack chain — it walks attacker-influenced bytes on the caller's thread. The rest (CosmosElementToQueryLiteral, CosmosElementToSqlScalarExpressionVisitor, DistinctHash, CosmosArray / CosmosObject Equals / GetHashCode) are defense-in-depth: they're cheap (a handful of nanoseconds per call) and they compose — the walkers call into each other, so guarding only the entry point would leave gaps as the code evolves.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

Decoder (original report)

  • JsonBinaryEncoding.ValueLengths.cs: add a depth parameter to the private GetValueLength overload and throw JsonMaxNestingExceededException when depth >= JsonObjectState.JsonMaxNestingDepth (256). The public single-arg entry point delegates with depth: 0; the Arr1 / Obj1 recursive calls pass depth + 1.
  • JsonObjectState.cs: promote JsonMaxNestingDepth from private const to internal const and have Push() throw JsonMaxNestingExceededException instead of InvalidOperationException, so both depth-cap paths share one internal exception type. JsonMaxNestingExceededException already derives from JsonParseException, so external callers still see the normal SDK exception hierarchy.
  • JsonBinaryEncoding.cs::TryGetValueLength: widen the catch from JsonParseException to Exception so malformed payloads honor the Try-pattern contract by returning false instead of leaking IndexOutOfRangeException / ArgumentOutOfRangeException / etc. GetValueLength is a pure function over a ReadOnlySpan<byte> with no side effects, so swallowing is safe. (Also removed the now-misleading [MethodImpl(AggressiveInlining)] — methods with try/catch can't be inlined by the JIT.)
  • JsonNavigator.cs / JsonReader.cs: document the new exception on the public Create overloads via <exception cref="JsonMaxNestingExceededException">.

Binary writer / StrR reference-string hardening

  • JsonWriter.JsonBinaryWriter.cs::ForceRewriteRawJsonValue: call RuntimeHelpers.EnsureSufficientExecutionStack() at the top so deep nested arrays / objects throw a catchable InsufficientExecutionStackException instead of crashing the process. (EnsureSufficientExecutionStack is budget-based; the decoder-side JsonMaxNestingExceededException cap at 256 remains the primary depth guard.)
  • New helper RewriteResolvedReferenceString replaces the four open-coded StrR1 / StrR2 / StrR3 / StrR4 call sites. It does an unsigned in-range bounds check on the resolved offset and rejects any target whose type marker is not IsString && !IsReferenceString — matching the writer's own invariant (StrR redirects to a plain string, never to another StrR). This prevents both out-of-range follows and StrR→StrR cycles in attacker-crafted payloads.

CosmosElement walker stack-budget guards

RuntimeHelpers.EnsureSufficientExecutionStack() at the top of each recursive Visit / Equals / GetHashCode so a deep CosmosArray / CosmosObject tree converts to a catchable exception instead of a process abort on small-stack worker threads:

  • JsonSerializer.cs::DeserializationVisitor.Visit(CosmosArray, Type) and Visit(CosmosObject, Type) — realistic MitM chain via ReadItemAsync<T> / query-result deserialization.
  • CosmosElementToQueryLiteral.cs — both Visit overloads (continuation tokens, diagnostic log strings). Defense-in-depth.
  • CosmosElementToSqlScalarExpressionVisitor.cs — both Visit overloads (LINQ → SQL translation of in-memory constants). Defense-in-depth.
  • DistinctHash.cs::Visit(CosmosArray) and Visit(CosmosObject) — DISTINCT query hashing over result items. Defense-in-depth.
  • CosmosArray.cs and CosmosObject.csEquals and GetHashCode. Indirect; called by DistinctHash, partition-key equality, hash-based caches.

Test coverage

  • Microsoft.Azure.Cosmos.Tests FullyQualifiedName~Json filter: 424 passed, 0 failed, 2 skipped locally after merge with upstream/main.
  • Regression tests added in JsonBinaryEncodingValueLengthsTests.cs:
    • Deeply nested Arr1 / Obj1 / ArrL4 payloads throw JsonMaxNestingExceededException / InsufficientExecutionStackException (instead of stack-overflowing) — exercised through JsonBinaryEncoding.GetValueLength, JsonNavigator.Create, JsonReader.Create, CosmosArray.GetHashCode, CosmosElementToQueryLiteral, and DeserializationVisitor.
    • Payload at exactly JsonMaxNestingDepth throws; payload at JsonMaxNestingDepth - 1 still decodes (boundary).
    • TryGetValueLength returns false for deep / empty / truncated payloads.
    • Object-shaped CosmosElementToQueryLiteral deep-nesting case.
    • Deep-walker tests pin a maxStackSize: 256 * 1024 thread so the budget-based guard is deterministic regardless of host main-thread stack size.
  • New JsonBinaryReferenceStringTests.cs (9 tests): each StrR1–StrR4 variant with bad offset, target-is-StrR, and target-out-of-bounds — all surface as catchable parse exceptions.
  • JsonWriterTests.cs: push tests at the nesting cap to confirm the writer still accepts depth-256 nesting.

tvaron3 and others added 2 commits May 28, 2026 10:18
JsonBinaryEncoding.ValueLengths.GetValueLength recursively decoded
Arr1 (0xE1) and Obj1 (0xE9) type markers with no depth check. A
crafted binary payload (first byte 0x80) of ~78 KB nesting Arr1 or
Obj1 markers could exhaust the CLR stack and crash the host with an
unrecoverable StackOverflowException.

Changes:
- Add a depth counter to the internal GetValueLength overload and
  throw JsonMaxNestingExceededException when depth reaches
  JsonObjectState.JsonMaxNestingDepth (256), matching the streaming
  reader's existing nesting policy.
- Have JsonObjectState.Push also throw JsonMaxNestingExceededException
  (was InvalidOperationException) so both depth-cap paths can be
  caught with a single internal exception type.
- Widen JsonBinaryEncoding.TryGetValueLength's catch from
  JsonParseException to Exception so malformed payloads honor the
  Try-pattern contract (returns false) instead of leaking
  IndexOutOfRangeException / ArgumentOutOfRangeException / etc.
- Document the new exception on JsonNavigator.Create and both
  JsonReader.Create overloads.
- Add regression tests that build deeply nested Arr1 / Obj1 payloads
  and verify the depth guard fires through JsonBinaryEncoding,
  JsonNavigator.Create and JsonReader.Create.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Microsoft.Azure.Cosmos/src/Json/JsonReader.cs
#	changelog.md
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented May 28, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@tvaron3 tvaron3 changed the title Json: cap nesting in binary value-length decoder to prevent DoS Json: Fixes unbounded recursion in binary value-length decoder to prevent DoS May 29, 2026
@tvaron3 tvaron3 marked this pull request as ready for review May 29, 2026 06:48
@kushagraThapar
Copy link
Copy Markdown
Member

Test-coverage follow-ups

Going back through this with the test-coverage lens — fix is correct, security-clean, and the binary Arr1/Obj1 decoder path is well-tested at the boundary. Two test gaps + one residual code-path worth flagging:


🟡 Major — JsonObjectState.Push() exception-type swap has no streaming-path test

The PR makes two behavioral changes to Push(): promotes JsonMaxNestingDepth to internal const and swaps InvalidOperationException(RMResources.JsonMaxNestingExceeded)JsonMaxNestingExceededException. The changelog explicitly markets the swap as customer-visible ("callers can handle both depth-cap failures with a single catch").

JsonBinaryEncodingValueLengthsTests.cs pins this for the binary value-length decoder path only. The streaming-writer / streaming-reader Push() call sites (text writer, text reader, binary writer, binary reader) have no test pinning the new exception type:

$ grep -rn "JsonMaxNestingExceeded\|JsonMaxNestingDepth" \
    Microsoft.Azure.Cosmos/tests --include="*.cs" \
    | grep -v JsonBinaryEncodingValueLengthsTests
# (empty — no streaming-path tests reference the new type)

A future refactor that regresses Push() to InvalidOperationException (or to a different JsonParseException subclass) would silently break the "single catch" promise.

Suggested test (~15 LOC):

[TestMethod]
[Owner("tvaron")]
public void StreamingPushThrowsJsonMaxNestingExceededExceptionAtCap()
{
    // Pin the changelog's "single catch type" contract on the Push() path.
    IJsonWriter writer = JsonWriter.Create(JsonSerializationFormat.Text);
    for (int i = 0; i < JsonObjectState.JsonMaxNestingDepth; i++)
    {
        writer.WriteArrayStart();
    }
    Assert.ThrowsException<JsonMaxNestingExceededException>(
        () => writer.WriteArrayStart());
}

Mirror it for JsonSerializationFormat.Binary to cover the binary writer's Push call too.


🟡 Major — Residual same-threat-model DoS via length-prefixed markers + DistinctHash / CosmosObject.GetHashCode recursion

This PR closes the decode-time vector for Arr1 (0xE1) / Obj1 (0xE9) — the single-element markers. The length-prefixed markers (ArrL1, ArrL2, ArrL4, plus L1C/L2C/L4C count+length variants, and Object equivalents) skip the new guard because GetValueLength reads the length prefix directly and doesn't recurse (ValueLengths.cs:213).

Attack path (same hostile-binary-response threat model):

  1. Hostile endpoint returns ~600 KB payload as ArrL2 0xE3 [len=N] ArrL2 0xE3 [len=N−3] … Arr0 0xE0 (≈3 bytes per level → ~200,000 levels in 600 KB).
  2. CosmosElement.CreateFromBuffer(payload) passes the new guard cleanly (root reads 2-byte length, returns).
  3. App calls DISTINCT, OR puts the element in a HashSet, OR compares two such results → DistinctHash.GetHash / CosmosObject.GetHashCode / CosmosObject.Equals recurses ~200,000 deep → StackOverflowException, uncatchable SIGABRT — same outcome as the original report.

Verified call sites:

  • Microsoft.Azure.Cosmos/src/Query/Core/Pipeline/Distinct/DistinctHash.cs:62,80,135Visit(CosmosArray ...) recurses via arrayItem.Accept(this, …).
  • Microsoft.Azure.Cosmos/src/CosmosElements/CosmosObject.cs:106,125GetHashCode / Equals recurse.

No test today covers the post-materialization recursive walkers with deep length-prefixed input.

Two options:

  • (a) Widen scope: add RuntimeHelpers.EnsureSufficientExecutionStack() at each Visit in CosmosElementHasher and at CosmosObject.Equals / GetHashCode. Cheap (1 line each) and surfaces a catchable InsufficientExecutionStackException instead of SIGABRT. Add a test mirroring DeeplyNestedObj1PayloadThrowsInsteadOfStackOverflow but with ArrL2-chain payload exercising DistinctHash.GetHash.
  • (b) Document partial scope: explicitly note in the PR description / changelog that the fix closes the Arr1/Obj1 decode-time vector, and file a follow-up issue for the post-materialization recursive walkers. Current changelog framing reads as a complete DoS fix.

(Related but distinct: StrR1-4 cyclic-reference branches in JsonWriter.JsonBinaryWriter.cs:1700-1731 are also recursive — security-review's out-of-scope-observation. Worth a separate follow-up issue.)


🔵 Nit — Obj1 boundary-at-cap test missing (mirror of Arr1)

Arr1 has both DepthAtCap (256 → throw) and DepthJustBelowCap (255 → parses cleanly). Obj1 only has DeeplyNestedObj1PayloadThrowsInsteadOfStackOverflow at 80k depth — doesn't pin the cap exactly.

$ grep -n "DepthAtCap\|DepthJustBelow" \
    Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Json/JsonBinaryEncodingValueLengthsTests.cs
# Both boundary tests use BuildArr1ChainPayload; no Obj1 counterpart.

Add DepthAtCapObj1 mirroring DepthAtCap to lock in the symmetric boundary for the Obj1 decoder path.

tvaron3 and others added 2 commits June 2, 2026 22:55
…of decoder depth guard

Layers additional defenses on top of the GetValueLength depth cap landed
earlier in this PR, addressing review feedback:

- JsonBinaryWriter.ForceRewriteRawJsonValue: new RewriteResolvedReferenceString
  helper bounds-checks the resolved target offset (unsigned compare so a
  negative StrR4 offset is rejected) and asserts the target byte is a real
  own FixReferenceStringOffsets invariant and rejects malformed
  reference-to-reference chains, cycles, and non-string targets at hop 1
  with JsonInvalidTokenException. Plus EnsureSufficientExecutionStack as a
  layered backstop at the top of ForceRewriteRawJsonValue.

- Recursive walkers on materialized CosmosElement graphs now call
  RuntimeHelpers.EnsureSufficientExecutionStack so deeply-nested
  length-prefixed payloads (ArrL1/L2/L4, ObjL1/L2/L4 -- not covered by the
  decoder depth guard) raise a catchable InsufficientExecutionStackException
  instead of an unrecoverable StackOverflowException:
    * DistinctHash.CosmosElementHasher.Visit(CosmosArray|CosmosObject)
    * CosmosArray.Equals/GetHashCode, CosmosObject.Equals/GetHashCode
    * CosmosElementToQueryLiteral.Visit(CosmosArray|CosmosObject)
    * JsonSerializer.DeserializationVisitor.Visit(CosmosArray|CosmosObject)
    * CosmosElementToSqlScalarExpressionVisitor.Visit(CosmosArray|CosmosObject)

Tests: 9 new StrR coverage cases in JsonBinaryReferenceStringTests, plus
DeeplyNested* regression tests for each walker that exercises the
end-to-end attack chain (hand-crafted ArrL4 binary payload --> deserializer
--> guard fires) on an explicitly small-stack (256 KB) thread for
cross-host determinism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented Jun 3, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Adds 7 tests exercising the EnsureSufficientExecutionStack guards on:
  - DistinctHash.Visit(CosmosObject)
  - CosmosArray.GetHashCode / CosmosObject.GetHashCode
  - CosmosArray.Equals / CosmosObject.Equals
  - CosmosElementToSqlScalarExpressionVisitor.Visit (array + object)

All run on a small-stack worker thread so the budget guard triggers
without depending on host stack size.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented Jun 4, 2026

/azp run dotnet-v3-ci

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3
Copy link
Copy Markdown
Member Author

tvaron3 commented Jun 4, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Member

@ananth7592 ananth7592 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall,LGTM, Tomas — well-crafted, easy to verify. One small note on the changelog wording below.

One small thing on the changelog wording. The current entry reads as if every new exception type replaces a process-fatal StackOverflowException — true for the new decoder/walker/StrR guards, but slightly off for the JsonObjectState.Push change. That path was already catchable today as InvalidOperationException; it now throws JsonMaxNestingExceededException. Realistic blast radius is essentially zero (almost nobody writes 256+ nested levels), but a customer who only reads "DoS fix… instead of terminating the host" won't realize their existing catch (InvalidOperationException) on a writer call site just stopped catching. Would something like work?:

Note: the existing 256-deep writer nesting check (JsonObjectState.Push) now throws JsonMaxNestingExceededException instead of InvalidOperationException, so a single catch covers both reader and writer paths.

POSITIVES

Commenting throughout is excellent. The RewriteResolvedReferenceString XML doc, in particular, does real work — it spells out the writer invariant, the attack shape, and explains why the type check is strictly cheaper than a depth counter. The TryGetValueLength comment enumerating the five expected exception types and explicitly justifying the [AggressiveInlining] removal is the kind of "future-me thank-you" that prevents the inevitable review nit next year. And the BuildObj1ChainPayload helper's "do NOT reuse this for shallow Obj1 tests" warning is the kind of small detail that prevents the next person from misusing it.

Test coverage is impeccable. A few specifics worth calling out:

Boundary tests at exactly the cap and cap−1 — the contract is precise, not vague.
The dedicated maxStackSize: 256 * 1024 thread pattern, with the Linux-main vs ThreadPool vs macOS-pthread comment, makes the walker tests deterministic across hosts. Would have been easy to ship a flaky CI-only version instead.
The end-to-end DeeplyNestedDeserializationVisitorThrowsCatchableException going through JsonSerializer.Monadic.Deserialize proves the actual MitM attack chain is blocked, not just the unit-level guard.
Exception-type pinning (rather than just "any exception") means a future regression that surfaces a different type will fail loudly.

Per Ananth's review feedback, calls out that the existing
256-deep writer nesting check now throws
JsonMaxNestingExceededException instead of
InvalidOperationException so customers can use a single catch
across reader + writer paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@ananth7592 ananth7592 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT, Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants