Skip to content

Conversation

askumar27
Copy link
Contributor

@askumar27 askumar27 commented Oct 10, 2025

[WIP] UTs for response_time_telemetry.py pending design review

Summary

This PR introduces a comprehensive response time telemetry system for DataHub's ingestion framework, enabling efficient tracking and percentile calculation of API response times using the t-digest algorithm. The implementation is memory-efficient, performant, and provides detailed insights into API performance patterns.

Initial integration is demonstrated with the Looker source connector, tracking response times across all Looker API operations.


Sample Output

{
  "dashboard": {
    "min": { "time_in_secs": 0.226, "context": { "dashboard_id": "54" } },
    "max": { "time_in_secs": 1.471, "context": { "dashboard_id": "1" } },
    "mean": 0.487,
    "count": 54,
    "total_time_in_secs": 26.271,
    "percentiles_in_secs": { "50": 0.389, "90": 0.902, "95": 1.03, "99": 1.465 }
  },
  "folder_ancestors": {
    "min": { "time_in_secs": 0.15, "context": { "folder_id": "34" } },
    "max": { "time_in_secs": 0.221, "context": { "folder_id": "10" } },
    "mean": 0.178,
    "count": 53,
    "total_time_in_secs": 9.46,
    "percentiles_in_secs": { "50": 0.177, "90": 0.201, "95": 0.204, "99": 0.22 }
  }
}

🎯 Motivation

Problem Statement

  • No visibility into API performance during ingestion runs
  • Difficult to diagnose slow data source connections
  • Cannot identify performance bottlenecks or problematic API endpoints
  • Lack of metrics for percentile-based analysis (P50, P90, P95, P99)

Business Value

  • 📊 Performance insights: Understand which API calls are slow and why
  • 🔍 Debugging support: Context-aware tracking helps identify problematic requests
  • 🎯 SLA monitoring: Track percentile-based metrics for API performance
  • 📈 Trend analysis: Monitor API performance over time and across sources

🏗️ Architecture

Core Components

1. Response Time Telemetry Utility (response_time_telemetry.py)

New file: src/datahub/telemetry/response_time_telemetry.py (372 lines)

Key Classes:

  • ResponseTimeTelemetry: Tracks metrics for a single API type using t-digest
  • ResponseTimeMetrics: Aggregates metrics across multiple API types
  • ResponseTimeTracker: Context manager for automatic time tracking

Features:

  • ✅ T-digest algorithm for memory-efficient percentile calculation
  • ✅ Configurable percentiles (default: P50, P90, P95, P99)
  • ✅ Min/max tracking with context information
  • ✅ Recent contexts window (configurable, default: 10)
  • ✅ Thread-safe operations
  • ✅ Context manager for clean API usage

2. Telemetry Configuration (telemetry_config.py)

Modified file: src/datahub/configuration/telemetry_config.py

New Configuration Options:

class TelemetryConfig(ConfigModel):
    disable_response_time_collection: bool = False
    capture_response_times_pattern: AllowDenyPattern = AllowDenyPattern.allow_all()

Allows users to:

  • Disable telemetry entirely
  • Filter which API types to track (regex patterns)

3. Looker Integration (looker_lib_wrapper.py)

Modified file: src/datahub/ingestion/source/looker/looker_lib_wrapper.py

Changes:

  • Integrated response time tracking across all Looker API methods
  • Tracks 15+ API operation types (folders, dashboards, looks, queries, etc.)
  • Provides context for each call (e.g., folder ID, dashboard ID)
  • Removed old latency tracking code (simplified)

📊 Performance Analysis

T-Digest vs Naive Approach

We evaluated the t-digest approach against a naive implementation that stores all data points:

Test Results: 50,000 API Response Times

Metric Naive Approach T-Digest Improvement
Memory Usage +11.52 MB -6.12 MB 153% reduction 💾
Add Performance 99.5 sec 44.5 sec 2.24x faster
Percentile Calc 4.27 ms 0.93 ms 4.62x faster 🚀
P50 Accuracy baseline 0.01% error Excellent
P95 Accuracy baseline 0.00% error Perfect
P99 Accuracy baseline 0.08% error Excellent

Test Results: 100,000 API Response Times

Metric Naive Approach T-Digest Improvement
Memory Usage +7.41 MB -14.36 MB 294% reduction 💾
Add Performance 436.1 sec 95.0 sec 4.59x faster
Percentile Calc 8.97 ms 0.99 ms 9.11x faster 🚀
P50 Accuracy baseline 0.01% error Excellent
P95 Accuracy baseline 0.05% error Excellent
P99 Accuracy baseline 0.01% error Perfect

Key Insights:

  • Memory savings are critical for long-running ingestion jobs (up to 294% reduction)
  • Performance gains scale superlinearly with data size (2.24x → 4.59x speedup)
  • Percentile calculation becomes 9x faster at 100k points
  • Accuracy remains exceptional (<1% error for P50, P95, P99) even at scale
  • The naive approach takes 7+ minutes for 100k points vs 1.5 minutes for t-digest

New Dependencies

  • tdigest (>=2.0.0): T-digest algorithm implementation
    • License: MIT
    • Mature library with 1.3k+ stars on GitHub
    • Used by production systems at scale

Compatibility

  • ✅ Python 3.8+
  • ✅ No breaking changes to existing APIs
  • ✅ Backward compatible with existing configurations

📚 References

…r efficient percentile calculation and configuration options
… fields from LookerDashboardSourceReport and BaseStatGenerator
…thods to enforce pre-data constraints and improve usability
…d enhance recent contexts handling with deque
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 10, 2025
Copy link

codecov bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 76.96970% with 38 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ahub/ingestion/source/looker/looker_lib_wrapper.py 64.40% 21 Missing ⚠️
...n/src/datahub/telemetry/response_time_telemetry.py 83.16% 17 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link

alwaysmeticulous bot commented Oct 10, 2025

✅ Meticulous spotted 0 visual differences across 950 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Expected differences? Click here. Last updated for commit a04961f. This comment will update as new commits are pushed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant