feat(csharp): fix Phase 1 Thrift telemetry gaps with E2E test coverage#345
jadewang-db wants to merge 24 commits into main
Code Review: PR #345

Executive Summary
This PR fills telemetry gaps in the Thrift code path with comprehensive E2E test coverage. The telemetry field additions are well-structured, but there are concerns around test isolation with the static […]

Severity Summary
Positive Observations

Critical
1. The previous implementation used […]
Suggestion: Restore […]

High
2. GetObjects/GetTableTypes emit ResultsConsumed telemetry before results are consumed
3. If […]
4. Initial chunk latency uses first-to-complete rather than first-chunk semantics
5. Tests silently pass when no telemetry is captured. When […]
6. All other test files use […]
7. Blocking […]

Medium
8. Despite the name […]
9. Missing […] The […]
10. ~120 lines of duplicated telemetry boilerplate. GetObjects and GetTableTypes contain nearly identical telemetry code. Extract into a shared helper like […]
11. Allocates a […]
12. Unlike […]
13. Reflection-based testing is fragile. Property names accessed via reflection will throw at runtime if renamed. Use […]
14. Reader resource leak on assertion failure
15. C# REST integration tests no longer auto-triggered. SEA/REST dispatch logic removed. Confirm REST testing is covered elsewhere.
16. Six […] Tests for iteration, error propagation, and schema error handling were removed with […]
17. Stored solely for workspace ID extraction during telemetry init. After init, the reference keeps the entire session response alive. Set to […]

Low

Recommendations (prioritized)
Comprehensive gap analysis of telemetry proto field coverage including:
- SEA connections have zero telemetry (highest priority)
- ChunkDetails.SetChunkDetails() defined but never called
- Missing fields: auth_type, WorkspaceId, runtime_vendor, client_app_name
- Composition via TelemetryHelper chosen over abstract base class
- E2E test strategy for all proto fields across both protocols
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…A second) Co-authored-by: Isaac
… ID: task-1.1-e2e-test-infrastructure
…ion\n\nTask ID: task-1.2-system-config-missing-fields
…task-1.5-connection-params-extended
…-1.6-chunk-metrics-aggregation
…sk-1.7-expose-chunk-metrics-reader
… ID: task-1.8-call-set-chunk-details
…track-internal-calls
…Task ID: task-1.11-metadata-operation-telemetry
The demo directory was tracked as a git submodule in the index but had no entry in .gitmodules, causing all CI jobs to fail at checkout. Co-authored-by: Isaac
…remove scratch files
- Remove scratch/design docs not meant for PR (PHASE1_TEST_RESULTS.md, TELEMETRY_TIMING_ISSUE.md, fix-telemetry-gaps-design.md)
- Remove accidentally committed backup file (DatabricksConnection.cs.backup)
- Fix trailing whitespace in test and doc files
- Fix license header in ChunkMetrics.cs (was using "modified" header for new file)
Co-authored-by: Isaac
Co-authored-by: Isaac
… resource cleanup
- Restore AsyncLocal for ExporterOverride to prevent parallel test interference
- Add missing IsInternalCall assertion in InternalCallTests
- Replace silent-pass with Skip.If when no telemetry captured in baseline tests
- Use await instead of .Result for async calls in MetadataOperationTests
- Add TimestampMillis to metadata operation telemetry Context
- Cache Process.GetCurrentProcess() call in BuildSystemConfiguration
- Move reader disposal to finally blocks in ChunkMetricsReaderTests
Co-authored-by: Isaac
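The first item in this commit, keeping the exporter override in an AsyncLocal rather than a plain static field, can be sketched roughly as below. This is an illustration only: `ITelemetryExporter`, `FakeExporter`, and `TelemetryTestHooks` are hypothetical names, and only the idea of an `AsyncLocal`-backed `ExporterOverride` comes from the commit message.

```csharp
using System.Threading;

// Hypothetical exporter abstraction; the driver's real interface may differ.
internal interface ITelemetryExporter { }

// Hypothetical stand-in for a test exporter.
internal sealed class FakeExporter : ITelemetryExporter { }

internal static class TelemetryTestHooks
{
    // AsyncLocal gives each async flow (and so each parallel test) its own value,
    // whereas a plain static field would be shared by every test in the process.
    private static readonly AsyncLocal<ITelemetryExporter?> s_exporterOverride =
        new AsyncLocal<ITelemetryExporter?>();

    public static ITelemetryExporter? ExporterOverride
    {
        get => s_exporterOverride.Value;
        set => s_exporterOverride.Value = value;
    }
}
```

A value assigned inside one test's async flow is not visible to a test running concurrently in another flow, which is the isolation property the commit restores.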
Range-diff: stack/fix-telemetry-gaps-design (ebb8f6c -> d504b0a)
csharp/src/DatabricksConnection.cs (outdated)
{
    try
    {
        telemetryContext.RecordFirstBatchReady();
Same here: this is a metadata query.
✅ Done — Same fix applied. Both GetObjects and GetTableTypes now use the shared ExecuteWithMetadataTelemetry<T>() helper with no batch/consumption timing.
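The shared-helper pattern described in this reply can be sketched as follows. This is an assumption-laden illustration, not the PR's actual `ExecuteWithMetadataTelemetry<T>()`: the `MetadataTelemetry` class, the `Execute` signature, and the `emit` callback are all hypothetical. It records only operation latency and any error, with no batch or consumption timing.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; names and signature are illustrative only.
internal static class MetadataTelemetry
{
    // Runs a metadata operation, then reports its type, latency, and error (if any).
    public static T Execute<T>(
        string operationType,
        Func<T> operation,
        Action<string, long, Exception?> emit)
    {
        var stopwatch = Stopwatch.StartNew();
        Exception? error = null;
        try
        {
            return operation();
        }
        catch (Exception ex)
        {
            error = ex;
            throw; // preserve the original exception for the caller
        }
        finally
        {
            stopwatch.Stop();
            try
            {
                emit(operationType, stopwatch.ElapsedMilliseconds, error);
            }
            catch
            {
                // A telemetry failure must never break the metadata call itself.
            }
        }
    }
}
```

With a helper of this shape, GetObjects and GetTableTypes would each reduce to a single call that passes their operation type and a delegate for the actual Thrift request.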
csharp/src/DatabricksConnection.cs (outdated)
}

// Strategy 2: Check connection property as fallback
if (workspaceId == 0 && Properties.TryGetValue("adbc.databricks.workspace_id", out string? workspaceIdProp))
I do not think we ever supported this parameter: adbc.databricks.workspace_id
✅ Done — Removed the adbc.databricks.workspace_id property lookup and the config-based extraction. Now using PropertyHelper.ParseOrgIdFromProperties(Properties) which extracts the org ID from the HTTP path query string (e.g., ?o=12345). This is the same org ID used for the x-databricks-org-id header elsewhere in the driver.
Why does this lookup still exist? adbc.databricks.workspace_id
Extracting the org ID from the HTTP path would only work for SPOG URLs, which contain the org ID, not for the current legacy URLs.
I think the best option is still to extract the org ID from the OpenSession response HTTP header.
In telemetry reporting there is no need for the workspace ID; we should remove this logic.
✅ Done — Good point about SPOG vs legacy URLs. Replaced PropertyHelper.ParseOrgIdFromProperties() with a new OrgIdCaptureHandler (DelegatingHandler) that captures the x-databricks-org-id header from the first successful HTTP response. This works for both SPOG and legacy URLs since the header is always present in Thrift call responses. Also removed the test that used the unsupported adbc.databricks.workspace_id connection property.
This comment was generated with GitHub MCP.
csharp/src/DatabricksConnection.cs (outdated)
/// <returns>The batch size value.</returns>
private int GetBatchSize()
{
    const int DefaultBatchSize = 50000; // HiveServer2Connection.BatchSizeDefault
Databricks default is 2M:
private const long DatabricksBatchSizeDefault = 2000000;
✅ Done — Fixed default to 2000000 to match DatabricksStatement.DatabricksBatchSizeDefault.
jadewang-db left a comment
Submitting pending review to unblock comment replies.
Range-diff: stack/fix-telemetry-gaps-design (d504b0a -> 6a928e7)
Range-diff: stack/fix-telemetry-gaps-design (6a928e7 -> 490d943)
csharp/src/DatabricksConnection.cs (outdated)
private string GetClientAppName(string processName)
{
    // Check connection property first, fall back to process name
    Properties.TryGetValue("adbc.databricks.client_app_name", out string? appName);
I don't see adbc.databricks.client_app_name as a valid Databricks parameter?
This method is not needed anymore; it should be removed.
✅ Already removed — GetClientAppName method and adbc.databricks.client_app_name property lookup were removed in the previous round. ClientAppName is now set directly to processName in BuildSystemConfiguration().
This comment was generated with GitHub MCP.
…ID from response header
- Use DatabricksStatement.DatabricksBatchSizeDefault directly (changed to internal)
- Use ConnectTimeoutMilliseconds from base class instead of duplicate local const
- Extract workspace ID from x-databricks-org-id response header via new OrgIdCaptureHandler, replacing HTTP path parsing (works for SPOG + legacy URLs)
- Remove test using unsupported adbc.databricks.workspace_id property
Co-authored-by: Isaac
Range-diff: stack/fix-telemetry-gaps-design (490d943 -> 19ae1c0)
…data commands
Statement-level metadata commands (getcatalogs, gettables, getcolumns, etc.) executed via DatabricksStatement.ExecuteQuery were incorrectly tagged as StatementType.Query/OperationType.ExecuteStatement. This fix correctly emits StatementType.Metadata with the appropriate OperationType (ListCatalogs, ListTables, ListColumns, etc.), aligning with the connection-level GetObjects telemetry. The two paths remain distinguishable via sql_statement_id (populated for statement path, empty for GetObjects path).
Co-authored-by: Isaac
Range-diff: stack/fix-telemetry-gaps-design (19ae1c0 -> 8da5457)
eric-wang-1990 left a comment
Overall the approach is solid — deferred telemetry emission to Dispose() is the right call for chunk metrics. A few issues worth fixing before merge.
/// This org ID is used for telemetry workspace identification.
/// </summary>
internal class OrgIdCaptureHandler : DelegatingHandler
{
Thread safety: _capturedOrgId is read and written without synchronization. Two concurrent requests could both pass the == null check and both write. In practice they'd write the same value, but volatile would make the intent clear and avoid any compiler/JIT reordering issues:
private volatile string? _capturedOrgId;
We don't need to capture the org ID for telemetry; let's remove it.
csharp/src/DatabricksStatement.cs (outdated)

// Extract retry count from Activity if available
if (Activity.Current != null)
{
Activity.Current is thread-local. If EmitTelemetry is called from Dispose() on a different thread than where the query executed, Activity.Current will be null and retry count will silently be 0. The retry count should be captured at execute time (when the activity is still current) and stored on the context, not read lazily at emit time.
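A minimal sketch of the suggested fix, assuming a hypothetical `StatementTelemetryContext` class and a hypothetical `"retry.count"` activity tag (neither name comes from the PR). `Activity.Current` is ambient state (in current .NET it follows the async execution context), so the sketch reads it once while the query's activity is still current and stores the value on the context:

```csharp
using System;
using System.Diagnostics;

// Hypothetical per-statement context; the driver's real type may differ.
internal sealed class StatementTelemetryContext
{
    // Captured eagerly at execute time; stays 0 when no activity or tag is present.
    public int RetryCount { get; private set; }

    // Call this while the query's Activity is still current.
    // The default tag name is an assumption made for illustration.
    public void CaptureRetryCount(string tagName = "retry.count")
    {
        object? tag = Activity.Current?.GetTagItem(tagName);
        RetryCount = tag switch
        {
            int count => count,
            string text when int.TryParse(text, out int parsed) => parsed,
            _ => 0,
        };
    }
}
```

The emit path would then read `RetryCount` from the context, so it no longer matters which thread or context calls `Dispose()`.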
…re, harden metrics
- Remove OrgIdCaptureHandler and workspace ID extraction logic (not needed for telemetry)
- Capture retry count at execute time instead of lazily at Dispose time (Activity.Current is thread-local)
- Use -1 sentinel for _initialChunkLatencyMs to handle genuine 0ms downloads
- Change ChunkMetrics to internal setters to prevent accidental mutation after construction
- Remove WorkspaceIdTests (tested removed functionality)
Co-authored-by: Isaac
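The -1 sentinel mentioned in this commit can be illustrated with the sketch below. Only the field name `_initialChunkLatencyMs` and the choice of -1 come from the commit message; the `ChunkLatencyTracker` class and its members are hypothetical. The sketch keeps whichever latency is recorded first and does not try to settle the first-chunk versus first-to-complete question raised in the review.

```csharp
using System.Threading;

// Hypothetical tracker; class and method names are illustrative only.
internal sealed class ChunkLatencyTracker
{
    private const long NotRecorded = -1;
    private long _initialChunkLatencyMs = NotRecorded;

    // Stores the latency only if nothing has been recorded yet. Because the
    // "unset" state is -1 rather than 0, a genuine 0 ms download is preserved.
    public void RecordChunkLatency(long latencyMs)
    {
        Interlocked.CompareExchange(ref _initialChunkLatencyMs, latencyMs, NotRecorded);
    }

    // Null means no chunk latency has been recorded.
    public long? InitialChunkLatencyMs
    {
        get
        {
            long value = Interlocked.Read(ref _initialChunkLatencyMs);
            return value == NotRecorded ? (long?)null : value;
        }
    }
}
```

With 0 as the "unset" marker, a download that really took 0 ms would be overwritten by the next chunk's latency; the sentinel avoids that.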
Range-diff: stack/fix-telemetry-gaps-design (8da5457 -> e85c9e2)

🥞 Stacked PR
Use this link to review incremental changes.
Summary
Closes telemetry gaps in the Thrift (HiveServer2) code path by populating missing fields in telemetry events and adding comprehensive E2E test coverage.
Telemetry field fixes
- runtime_vendor and client_app_name in DriverSystemConfiguration
- auth_type on the root telemetry log
- WorkspaceId in TelemetrySessionContext
- DriverConnectionParameters with additional fields
- ChunkMetrics aggregation in CloudFetchDownloader, expose via CloudFetchReader interface, and call SetChunkDetails() in DatabricksStatement.EmitTelemetry()
- retry_count in SqlExecutionEvent
- is_internal_call flag
- GetObjects and GetTableTypes
E2E test infrastructure & coverage
- CapturingTelemetryExporter test infrastructure for intercepting and asserting on telemetry events
Test plan