Skip to content

fix(csharp): always send CloseOperation from DatabricksCompositeReader.Dispose#360

Merged
eric-wang-1990 merged 3 commits intomainfrom
fix/close-operation-on-cloudfetch-dispose
Mar 18, 2026
Merged

fix(csharp): always send CloseOperation from DatabricksCompositeReader.Dispose#360
eric-wang-1990 merged 3 commits intomainfrom
fix/close-operation-on-cloudfetch-dispose

Conversation

@eric-wang-1990
Copy link
Copy Markdown
Collaborator

@eric-wang-1990 eric-wang-1990 commented Mar 18, 2026

Problem

DatabricksCompositeReader.Dispose only called CloseOperation when _activeReader was null. When a CloudFetchReader was active, it delegated Dispose to the reader — but CloudFetchReader is protocol-agnostic (it downloads from cloud storage over HTTP) and never sends CloseOperation to the Thrift endpoint.

Result: every CloudFetch query orphaned the server-side operation for ~1 hour, until SQL Gateway's driver-router detected inactivity and fired CommandInactivityTimeout. This produced thriftOperationCloseReason=CommandInactivityTimeout in usage logs for all CloudFetch queries in connection-pooled scenarios (where CloseSession is never sent).

Root cause

The bug was introduced in commit 0039ee3"refactor(csharp): make CloudFetch pipeline protocol-agnostic (#14)".

Before that commit:

  • BaseDatabricksReader had a public CloseOperationAsync() method that sent the Thrift RPC
  • DatabricksCompositeReader.Dispose called _activeReader.CloseOperationAsync() — correct for both DatabricksReader and CloudFetchReader (both inherited from BaseDatabricksReader)

The protocol-agnostic refactor made three changes that together caused the regression:

  1. Removed CloseOperationAsync() from BaseDatabricksReader
  2. Changed DatabricksCompositeReader.Dispose to call _activeReader.Dispose() instead of _activeReader.CloseOperationAsync(), with the comment "Have the contained reader close the operation to avoid duplicate calls"
  3. Made CloudFetchReader protocol-agnostic (no Thrift dependency) — its Dispose only cleans up HTTP/download resources and never sends CloseOperation

DatabricksReader got its own CloseOperationAsync() called from Dispose, so inline results remained correct. But the CloudFetch path silently lost its cleanup.

Fix

Move CloseOperation ownership entirely to DatabricksCompositeReader.Dispose, which holds both _statement (Thrift client) and _response (operation handle). HiveServer2Reader.CloseOperationAsync already handles the DirectResults case correctly — it is a no-op when the server already closed the operation inline.

Remove CloseOperation from DatabricksReader.Dispose to avoid a duplicate call; DatabricksReader is only ever constructed from DatabricksCompositeReader.CreateDatabricksReader.

Behavior after fix

Result delivery path CloseOperation sent?
Inline + DirectResults enabled No (server closed inline — CloseOperationAsync is a no-op)
Inline + DirectResults disabled Yes (explicit Thrift RPC from composite reader)
CloudFetch Yes (explicit Thrift RPC from composite reader — was previously missing)

Testing

Validated with 4 new proxy-based regression tests in databricks/databricks-driver-test (CLOUDFETCH-013 through 016) that simulate connection pooling (reader disposed without closing the connection) and assert the correct number of CloseOperation Thrift calls for each path.

…r.Dispose

Previously, DatabricksCompositeReader.Dispose only called CloseOperation
when _activeReader was null. When a CloudFetchReader was active, it
delegated Dispose to the reader, but CloudFetchReader is protocol-agnostic
and never sends CloseOperation. This orphaned every CloudFetch server
operation for ~1 hour until SQL Gateway fired CommandInactivityTimeout,
producing thriftOperationCloseReason=CommandInactivityTimeout in usage logs.

Move CloseOperation ownership entirely to DatabricksCompositeReader.Dispose,
which holds both _statement (Thrift client) and _response (operation handle).
HiveServer2Reader.CloseOperationAsync is already a no-op when DirectResults
already closed the operation server-side, so all three result paths are safe:
- Inline + DirectResults enabled: CloseOperation is a no-op (already closed)
- Inline + DirectResults disabled: CloseOperation sent explicitly
- CloudFetch: CloseOperation sent explicitly (was previously missing)

Remove CloseOperation from DatabricksReader.Dispose to avoid duplicate calls;
DatabricksReader is only ever constructed from DatabricksCompositeReader.
@eric-wang-1990 eric-wang-1990 requested review from jadewang-db and msrathore-db and removed request for jadewang-db March 18, 2026 21:53
else
// Always close the operation here at the composite level.
// CloudFetchReader is protocol-agnostic and does not send CloseOperation,
// so we must not rely on the contained reader to do it.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call dispose?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — added CloseOperationE2ETest in csharp/test/E2E/CloseOperationE2ETest.cs (latest commit). It covers all three result delivery modes (Inline+DirectResults, Inline+NoDirectResults, CloudFetch) using an ActivityListener to assert the composite_reader.close_operation trace event is emitted during Dispose. Verified: all 3 tests fail when the fix is reverted, and pass with the fix in place.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — _activeReader.Dispose() is still called (lines 251-252), after CloseOperationAsync. The change is that CloseOperation is now always called unconditionally at the composite level first, regardless of whether _activeReader is null. Then the active reader is disposed for its own resource cleanup (HTTP connections, download buffers, etc.). This separates the Thrift protocol cleanup from the reader's internal resource cleanup.

…ksCompositeReader.Dispose

Add CloseOperationE2ETest with 3 theory cases covering all result delivery
modes (Inline+DirectResults, Inline+NoDirectResults, CloudFetch). Each test
disposes only the reader — not the connection — to simulate connection pooling,
then uses ActivityListener to assert the composite_reader.close_operation trace
event is emitted inside DatabricksCompositeReader.Dispose.

This event is only present after the fix. Without it:
- Inline paths: event missing because _activeReader != null caused the old code
  to delegate to _activeReader.Dispose() without calling CloseOperation.
- CloudFetch: same delegation but CloseOperationAsync is never called at all,
  orphaning the server operation for ~1 hour
  (thriftOperationCloseReason=CommandInactivityTimeout).

For Thrift wire-level assertions, see proxy-based tests in
databricks/databricks-driver-test: CLOUDFETCH-013 through CLOUDFETCH-016.
…oseOperationE2ETest

- Replace "modified" license header with plain Apache 2.0 header (RAT check)
- Replace ConcurrentBag with locked List<T> for .NET 4.7.2 compatibility
  (ConcurrentBag.Clear() was added in .NET 5; Windows CI builds net472)
@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Mar 18, 2026
Merged via the queue into main with commit 8ec57f0 Mar 18, 2026
16 checks passed
@eric-wang-1990 eric-wang-1990 deleted the fix/close-operation-on-cloudfetch-dispose branch March 18, 2026 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants