feat(ingestion): Added response time telemetry utility and add telemetry for Looker client #14970

askumar27 · 2025-10-10T04:07:32Z

[WIP] UTs for response_time_telemetry.py pending design review

Summary

This PR introduces a comprehensive response time telemetry system for DataHub's ingestion framework, enabling efficient tracking and percentile calculation of API response times using the t-digest algorithm. The implementation is memory-efficient, performant, and provides detailed insights into API performance patterns.

Initial integration is demonstrated with the Looker source connector, tracking response times across all Looker API operations.

Sample Output

{
  "dashboard": {
    "min": { "time_in_secs": 0.226, "context": { "dashboard_id": "54" } },
    "max": { "time_in_secs": 1.471, "context": { "dashboard_id": "1" } },
    "mean": 0.487,
    "count": 54,
    "total_time_in_secs": 26.271,
    "percentiles_in_secs": { "50": 0.389, "90": 0.902, "95": 1.03, "99": 1.465 }
  },
  "folder_ancestors": {
    "min": { "time_in_secs": 0.15, "context": { "folder_id": "34" } },
    "max": { "time_in_secs": 0.221, "context": { "folder_id": "10" } },
    "mean": 0.178,
    "count": 53,
    "total_time_in_secs": 9.46,
    "percentiles_in_secs": { "50": 0.177, "90": 0.201, "95": 0.204, "99": 0.22 }
  }
}

🎯 Motivation

Problem Statement

No visibility into API performance during ingestion runs
Difficult to diagnose slow data source connections
Cannot identify performance bottlenecks or problematic API endpoints
Lack of metrics for percentile-based analysis (P50, P90, P95, P99)

Business Value

📊 Performance insights: Understand which API calls are slow and why
🔍 Debugging support: Context-aware tracking helps identify problematic requests
🎯 SLA monitoring: Track percentile-based metrics for API performance
📈 Trend analysis: Monitor API performance over time and across sources

🏗️ Architecture

Core Components

1. Response Time Telemetry Utility (`response_time_telemetry.py`)

New file: src/datahub/telemetry/response_time_telemetry.py (372 lines)

Key Classes:

ResponseTimeTelemetry: Tracks metrics for a single API type using t-digest
ResponseTimeMetrics: Aggregates metrics across multiple API types
ResponseTimeTracker: Context manager for automatic time tracking

Features:

✅ T-digest algorithm for memory-efficient percentile calculation
✅ Configurable percentiles (default: P50, P90, P95, P99)
✅ Min/max tracking with context information
✅ Recent contexts window (configurable, default: 10)
✅ Thread-safe operations
✅ Context manager for clean API usage

2. Telemetry Configuration (`telemetry_config.py`)

Modified file: src/datahub/configuration/telemetry_config.py

New Configuration Options:

class TelemetryConfig(ConfigModel):
    disable_response_time_collection: bool = False
    capture_response_times_pattern: AllowDenyPattern = AllowDenyPattern.allow_all()

Allows users to:

Disable telemetry entirely
Filter which API types to track (regex patterns)

3. Looker Integration (`looker_lib_wrapper.py`)

Modified file: src/datahub/ingestion/source/looker/looker_lib_wrapper.py

Changes:

Integrated response time tracking across all Looker API methods
Tracks 15+ API operation types (folders, dashboards, looks, queries, etc.)
Provides context for each call (e.g., folder ID, dashboard ID)
Removed old latency tracking code (simplified)

📊 Performance Analysis

T-Digest vs Naive Approach

We evaluated the t-digest approach against a naive implementation that stores all data points:

Test Results: 50,000 API Response Times

Metric	Naive Approach	T-Digest	Improvement
Memory Usage	+11.52 MB	-6.12 MB	153% reduction 💾
Add Performance	99.5 sec	44.5 sec	2.24x faster ⚡
Percentile Calc	4.27 ms	0.93 ms	4.62x faster 🚀
P50 Accuracy	baseline	0.01% error	Excellent ✅
P95 Accuracy	baseline	0.00% error	Perfect ✅
P99 Accuracy	baseline	0.08% error	Excellent ✅

Test Results: 100,000 API Response Times

Metric	Naive Approach	T-Digest	Improvement
Memory Usage	+7.41 MB	-14.36 MB	294% reduction 💾
Add Performance	436.1 sec	95.0 sec	4.59x faster ⚡
Percentile Calc	8.97 ms	0.99 ms	9.11x faster 🚀
P50 Accuracy	baseline	0.01% error	Excellent ✅
P95 Accuracy	baseline	0.05% error	Excellent ✅
P99 Accuracy	baseline	0.01% error	Perfect ✅

Key Insights:

Memory savings are critical for long-running ingestion jobs (up to 294% reduction)
Performance gains scale superlinearly with data size (2.24x → 4.59x speedup)
Percentile calculation becomes 9x faster at 100k points
Accuracy remains exceptional (<1% error for P50, P95, P99) even at scale
The naive approach takes 7+ minutes for 100k points vs 1.5 minutes for t-digest

New Dependencies

tdigest (>=2.0.0): T-digest algorithm implementation
- License: MIT
- Mature library with 1.3k+ stars on GitHub
- Used by production systems at scale

Compatibility

✅ Python 3.8+
✅ No breaking changes to existing APIs
✅ Backward compatible with existing configurations

📚 References

T-Digest Algorithm Paper - "Computing Extremely Accurate Quantiles Using t-Digests" by Ted Dunning
tdigest Python Library

…r efficient percentile calculation and configuration options

…ds for enhanced performance tracking

… fields from LookerDashboardSourceReport and BaseStatGenerator

…thods to enforce pre-data constraints and improve usability

…d enhance recent contexts handling with deque

…handle optional TelemetryConfig

codecov · 2025-10-10T04:10:34Z

Codecov Report

❌ Patch coverage is 76.96970% with 38 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ahub/ingestion/source/looker/looker_lib_wrapper.py	64.40%	21 Missing ⚠️
...n/src/datahub/telemetry/response_time_telemetry.py	83.16%	17 Missing ⚠️

📢 Thoughts on this report? Let us know!

alwaysmeticulous · 2025-10-10T04:12:42Z

✅ Meticulous spotted 0 visual differences across 950 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

_{Expected differences? Click here. Last updated for commit a04961f. This comment will update as new commits are pushed.}

…al list of percentiles

askumar27 added 6 commits October 9, 2025 14:26

feat(telemetry): add response time telemetry utility with t-digest fo…

8820eea

…r efficient percentile calculation and configuration options

feat(looker): integrate response time telemetry into Looker API metho…

595e627

…ds for enhanced performance tracking

refactor(looker): remove unused latency reporting methods and related…

e8992d5

… fields from LookerDashboardSourceReport and BaseStatGenerator

refactor(telemetry): enhance response time telemetry configuration me…

596d936

…thods to enforce pre-data constraints and improve usability

refactor(telemetry): improve response time telemetry documentation an…

5a5932c

…d enhance recent contexts handling with deque

refactor(telemetry): update response time telemetry configuration to …

2f32c43

…handle optional TelemetryConfig

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 10, 2025

github-actions bot deployed to datahub-wheels (Preview) October 10, 2025 04:09 View deployment

github-actions bot deployed to datahub-project-web-react (Preview) October 10, 2025 04:11 View deployment

vercel bot deployed to Preview October 10, 2025 04:13 View deployment

lint(telemetry): update configure_percentiles method to accept option…

a04961f

…al list of percentiles

github-actions bot deployed to datahub-wheels (Preview) October 10, 2025 04:18 View deployment

github-actions bot deployed to datahub-project-web-react (Preview) October 10, 2025 04:20 View deployment

vercel bot deployed to Preview October 10, 2025 04:33 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(ingestion): Added response time telemetry utility and add telemetry for Looker client #14970

feat(ingestion): Added response time telemetry utility and add telemetry for Looker client #14970

Uh oh!

askumar27 commented Oct 10, 2025 •

edited

Loading

Uh oh!

codecov bot commented Oct 10, 2025 •

edited

Loading

Uh oh!

alwaysmeticulous bot commented Oct 10, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat(ingestion): Added response time telemetry utility and add telemetry for Looker client #14970

Are you sure you want to change the base?

feat(ingestion): Added response time telemetry utility and add telemetry for Looker client #14970

Uh oh!

Conversation

askumar27 commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Sample Output

🎯 Motivation

Problem Statement

Business Value

🏗️ Architecture

Core Components

1. Response Time Telemetry Utility (response_time_telemetry.py)

2. Telemetry Configuration (telemetry_config.py)

3. Looker Integration (looker_lib_wrapper.py)

📊 Performance Analysis

T-Digest vs Naive Approach

Test Results: 50,000 API Response Times

Test Results: 100,000 API Response Times

New Dependencies

Compatibility

📚 References

Uh oh!

codecov bot commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alwaysmeticulous bot commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

askumar27 commented Oct 10, 2025 •

edited

Loading

1. Response Time Telemetry Utility (`response_time_telemetry.py`)

2. Telemetry Configuration (`telemetry_config.py`)

3. Looker Integration (`looker_lib_wrapper.py`)

codecov bot commented Oct 10, 2025 •

edited

Loading

alwaysmeticulous bot commented Oct 10, 2025 •

edited

Loading