Skip to content

Conversation

@dinesh-verma-datahub
Copy link
Contributor

@dinesh-verma-datahub dinesh-verma-datahub commented Dec 17, 2025

Summary

This PR adds a new pushdown_user_filter configuration option for BigQuery that pushes user_email_pattern filtering directly to BigQuery's SQL query, reducing data transfer and improving performance for large query volumes.

Changes

New Configuration Option

  • Added pushdown_user_filter: bool (default: False) to BigQueryV2Config and BigQueryQueriesExtractorConfig
  • When enabled, converts user_email_pattern (Python regex) to BigQuery SQL using REGEXP_CONTAINS()
  • Both modes (pushdown and client-side) produce identical filtering results

Implementation Details

  • _build_user_filter_from_pattern(): Converts AllowDenyPattern to BigQuery-compatible SQL WHERE clause
  • _escape_for_bigquery_string(): SQL-injection-safe escaping (backslashes then quotes)
  • _is_allow_all_pattern(): Detects common "match all" patterns to skip unnecessary filtering
  • Case-insensitive support: Uses BigQuery's (?i) regex flag when ignoreCase=True

Security

  • Proper two-step escaping prevents SQL injection attacks
  • Uses regular strings (not raw strings) with proper escape sequences
  • Comprehensive security tests for quote breakout and backslash-quote bypass attempts

Documentation

  • Added comprehensive "User Email Filtering Pushdown" section to bigquery_pre.md
  • Documents when to use, example configuration, behavior, and prerequisites

Example Configuration

source:
type: bigquery
config:
use_queries_v2: true # Required for pushdown
pushdown_user_filter: true # Enable pushdown optimization
user_email_pattern:
allow:
- "analyst_.@example\.com"
deny:
- "bot_.
"### Testing

  • 55+ unit tests covering:
    • Core filter building logic (27 tests)
    • SQL-injection-safe escaping (6 tests)
    • Allow-all pattern detection (8 tests)
    • Query builder integration (4 tests)
    • End-to-end flow tests (4 tests)
    • Integration with extractor (2 tests)
    • Security tests for SQL injection prevention (4 tests)

Checklist

  • The PR conforms to DataHub's Contributing Guideline
  • Tests for the changes have been added/updated
  • Documentation has been updated
  • Code passes linting (ruff check, ruff format)
  • Prettier formatting applied to markdown files

…iltering to BigQuery

Add a new `pushdown_user_filter` configuration option that enables pushing
the existing `user_email_pattern` filtering to BigQuery's INFORMATION_SCHEMA.JOBS
query using REGEXP_CONTAINS for improved performance.

Changes:
- Add `pushdown_user_filter` boolean config (default: false)
- Add `_build_user_filter_from_pattern()` to convert AllowDenyPattern to SQL
- Update query builder to accept user_filter parameter
- Wire config from BigQueryV2Config to the extractor
- Add comprehensive unit tests (30+ test cases)

Benefits:
- Single source of truth: reuses existing `user_email_pattern` config
- Backward compatible: disabled by default
- Full regex support via BigQuery REGEXP_CONTAINS()
- Improved performance for large query volumes

This follows the same pattern as Snowflake's pushdown filtering.
@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Dec 17, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 17, 2025
@codecov
Copy link

codecov bot commented Dec 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

Address security review feedback:

1. SQL Injection Prevention:
   - Switch from raw strings (r'...') to regular string literals
   - Properly escape backslashes first, then single quotes
   - This prevents quote breakout attacks like: test') OR 1=1 --

2. Improved Allow-All Pattern Detection:
   - Add _is_allow_all_pattern() helper function
   - Recognize common allow-all patterns: .*, .+, ^.*$, ^.+$
   - Reduces unnecessary filtering overhead

3. Add Security Tests:
   - Quote breakout SQL injection attempts
   - Backslash-quote escape bypass attempts
   - Multiple backslash edge cases
   - Full integration security test

4. Add Helper Function Tests:
   - TestEscapeForBigQueryString class
   - TestIsAllowAllPattern class
…00% code coverage

Address security review feedback and improve code quality:

Security Fixes:
- Switch from raw strings (r'...') to regular string literals
- Implement two-step escaping: backslashes first, then quotes
- Add comprehensive security tests for SQL injection prevention

Code Improvements:
- Add _is_allow_all_pattern() helper for pattern detection
- Use List[str] type hints instead of bare list
- Add detailed security notes in docstrings
- Enhance module-level docstring with test organization

Test Coverage (100%):
- Add TestFetchRegionQueryLogWithPushdown for integration tests
- Cover pushdown_user_filter=True path (lines 410-413)
- Cover pushdown_user_filter=False path (lines 414-416)
- 55+ test cases across 6 test classes
…er_filter

1. Missing Test Coverage:
   - Add test_whitespace_nonwhitespace_star_is_allow_all for [\s\S]*
   - Add test_whitespace_nonwhitespace_plus_is_allow_all for [\s\S]+
   - All 6 patterns in _is_allow_all_pattern() now have test coverage

2. User Documentation Enhancement:
   - Add comprehensive 'User Email Filtering Pushdown' section to bigquery_pre.md
   - Document when to use, example configuration, behavior, and prerequisites
   - Link from features list to new detailed section

3. Python 3.9 Compatibility Fix:
   - Fix parenthesized with statement syntax (Python 3.10+ only)
   - Use traditional 'with a, b:' syntax for Python 3.9 compatibility
   - This ensures TestFetchRegionQueryLogWithPushdown tests run on CI
Address CI test failures:

1. Fix failing tests:
   - Add ignoreCase=False to tests that check pattern translation logic
   - AllowDenyPattern defaults to ignoreCase=True which adds (?i) prefix
   - Tests now explicitly test pattern translation in isolation

2. Improve _is_allow_all_pattern() docstring:
   - List all 6 recognized 'allow all' patterns with descriptions
   - Document why multiple patterns are never considered 'allow all'

3. Add debug logging in _build_user_filter_from_pattern():
   - Log input patterns at translation start
   - Log each pattern's escape transformation
   - Log when 'allow all' patterns are detected and skipped
   - Log final generated SQL filter

4. Add documentation note to test file:
   - Explain why most tests use ignoreCase=False
   - Reference dedicated case-sensitivity tests for maintainers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants