feat(bigquery): Add pushdown_user_filter option to push user_email_pattern filtering to BigQuery SQL for improved performance #15699
Open
dinesh-verma-datahub wants to merge 8 commits into master from feature/bigquery-pushdown-user-filter
…iltering to BigQuery

Add a new `pushdown_user_filter` configuration option that enables pushing the existing `user_email_pattern` filtering to BigQuery's INFORMATION_SCHEMA.JOBS query using REGEXP_CONTAINS for improved performance.

Changes:
- Add `pushdown_user_filter` boolean config (default: false)
- Add `_build_user_filter_from_pattern()` to convert AllowDenyPattern to SQL
- Update query builder to accept user_filter parameter
- Wire config from BigQueryV2Config to the extractor
- Add comprehensive unit tests (30+ test cases)

Benefits:
- Single source of truth: reuses existing `user_email_pattern` config
- Backward compatible: disabled by default
- Full regex support via BigQuery REGEXP_CONTAINS()
- Improved performance for large query volumes

This follows the same pattern as Snowflake's pushdown filtering.
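As a rough illustration of the translation this commit describes, the sketch below turns allow/deny regex lists into a SQL boolean expression built from `REGEXP_CONTAINS`. The function name `build_user_filter`, its signature, the `TRUE` fallback, and the use of the `user_email` column are assumptions made for the example, not the PR's actual `_build_user_filter_from_pattern()`. It uses raw string literals (`r'...'`) as this first commit did and does not escape quotes, so it assumes trusted patterns.

```python
from typing import List


def build_user_filter(
    allow: List[str], deny: List[str], column: str = "user_email"
) -> str:
    """Hypothetical helper: build a SQL boolean expression from regex lists.

    A row passes when it matches at least one allow pattern and no deny
    pattern. Patterns are interpolated without escaping (trusted input only).
    """
    clauses = []
    if allow:
        # Any single allow pattern is enough, so allow patterns are OR-ed.
        allow_sql = " OR ".join(
            f"REGEXP_CONTAINS({column}, r'{p}')" for p in allow
        )
        clauses.append(f"({allow_sql})")
    for p in deny:
        # Every deny pattern must fail to match, so deny clauses are AND-ed.
        clauses.append(f"NOT REGEXP_CONTAINS({column}, r'{p}')")
    # With no patterns at all, fall back to a no-op predicate.
    return " AND ".join(clauses) if clauses else "TRUE"
```

The resulting string would be appended to the `WHERE` clause of the query-log query, so rows for excluded users are dropped by BigQuery rather than after transfer.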
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Address security review feedback:

1. SQL Injection Prevention:
   - Switch from raw strings (r'...') to regular string literals
   - Properly escape backslashes first, then single quotes
   - This prevents quote breakout attacks like: test') OR 1=1 --

2. Improved Allow-All Pattern Detection:
   - Add _is_allow_all_pattern() helper function
   - Recognize common allow-all patterns: .*, .+, ^.*$, ^.+$
   - Reduces unnecessary filtering overhead

3. Add Security Tests:
   - Quote breakout SQL injection attempts
   - Backslash-quote escape bypass attempts
   - Multiple backslash edge cases
   - Full integration security test

4. Add Helper Function Tests:
   - TestEscapeForBigQueryString class
   - TestIsAllowAllPattern class
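The two-step escaping described in this commit can be sketched as follows. The function name `escape_for_sql_string` is a stand-in chosen for the example, not the PR's actual helper; the only behaviour taken from the commit message is the order of operations (backslashes first, then single quotes).

```python
def escape_for_sql_string(pattern: str) -> str:
    """Hypothetical helper: escape a regex for a regular '...' SQL literal."""
    # Step 1: double every backslash. Doing this first means the backslashes
    # introduced in step 2 are not themselves doubled.
    escaped = pattern.replace("\\", "\\\\")
    # Step 2: escape single quotes so the value cannot close the literal.
    return escaped.replace("'", "\\'")


malicious = "test') OR 1=1 --"
literal = f"'{escape_for_sql_string(malicious)}'"
print(literal)  # prints 'test\') OR 1=1 --'
```

If the order were reversed, an input such as `\'` would end up as `\\\\'`, where the quote is left unescaped after two literal backslashes; escaping backslashes first avoids that bypass.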
…00% code coverage

Address security review feedback and improve code quality:

Security Fixes:
- Switch from raw strings (r'...') to regular string literals
- Implement two-step escaping: backslashes first, then quotes
- Add comprehensive security tests for SQL injection prevention

Code Improvements:
- Add _is_allow_all_pattern() helper for pattern detection
- Use List[str] type hints instead of bare list
- Add detailed security notes in docstrings
- Enhance module-level docstring with test organization

Test Coverage (100%):
- Add TestFetchRegionQueryLogWithPushdown for integration tests
- Cover pushdown_user_filter=True path (lines 410-413)
- Cover pushdown_user_filter=False path (lines 414-416)
- 55+ test cases across 6 test classes
…er_filter

1. Missing Test Coverage:
   - Add test_whitespace_nonwhitespace_star_is_allow_all for [\s\S]*
   - Add test_whitespace_nonwhitespace_plus_is_allow_all for [\s\S]+
   - All 6 patterns in _is_allow_all_pattern() now have test coverage

2. User Documentation Enhancement:
   - Add comprehensive 'User Email Filtering Pushdown' section to bigquery_pre.md
   - Document when to use, example configuration, behavior, and prerequisites
   - Link from features list to new detailed section

3. Python 3.9 Compatibility Fix:
   - Fix parenthesized with statement syntax (Python 3.10+ only)
   - Use traditional 'with a, b:' syntax for Python 3.9 compatibility
   - This ensures TestFetchRegionQueryLogWithPushdown tests run on CI
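The Python 3.9 compatibility fix in item 3 comes down to how multiple context managers are written. The sketch below shows the comma-separated form; the patched targets (`os.getcwd`, `os.getpid`) and the function name are placeholders for the example and are unrelated to what the PR's tests actually patch.

```python
import os
from unittest.mock import patch


def run_with_two_patches():
    # Comma-separated form: works on Python 3.9 and later. A backslash
    # continues the line because parentheses cannot be relied on here.
    with patch("os.getcwd", return_value="/tmp/a"), \
            patch("os.getpid", return_value=1234):
        return os.getcwd(), os.getpid()

    # The parenthesized form below is only officially supported from 3.10:
    #
    #     with (
    #         patch("os.getcwd", return_value="/tmp/a"),
    #         patch("os.getpid", return_value=1234),
    #     ):
    #         ...
```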
Address CI test failures:

1. Fix failing tests:
   - Add ignoreCase=False to tests that check pattern translation logic
   - AllowDenyPattern defaults to ignoreCase=True which adds (?i) prefix
   - Tests now explicitly test pattern translation in isolation

2. Improve _is_allow_all_pattern() docstring:
   - List all 6 recognized 'allow all' patterns with descriptions
   - Document why multiple patterns are never considered 'allow all'

3. Add debug logging in _build_user_filter_from_pattern():
   - Log input patterns at translation start
   - Log each pattern's escape transformation
   - Log when 'allow all' patterns are detected and skipped
   - Log final generated SQL filter

4. Add documentation note to test file:
   - Explain why most tests use ignoreCase=False
   - Reference dedicated case-sensitivity tests for maintainers
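A minimal sketch of the allow-all check, using the six patterns named across the commits above (`.*`, `.+`, `^.*$`, `^.+$`, `[\s\S]*`, `[\s\S]+`). The function name, the list-based signature, and the exact-string comparison are assumptions for illustration; the PR's `_is_allow_all_pattern()` may be structured differently.

```python
from typing import List

# The six "allow all" patterns mentioned in the commit messages.
ALLOW_ALL_PATTERNS = frozenset(
    {".*", ".+", "^.*$", "^.+$", r"[\s\S]*", r"[\s\S]+"}
)


def is_allow_all(patterns: List[str]) -> bool:
    """Hypothetical helper: True only for a single known match-all pattern.

    A list with several patterns is never treated as allow-all, mirroring
    the rule documented in the commit message.
    """
    return len(patterns) == 1 and patterns[0] in ALLOW_ALL_PATTERNS
```

When the check returns true, the allow side of the filter can be skipped, so no `REGEXP_CONTAINS` clause is emitted for it.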
Labels
ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.
## Summary

This PR adds a new `pushdown_user_filter` configuration option for BigQuery that pushes `user_email_pattern` filtering directly to BigQuery's SQL query, reducing data transfer and improving performance for large query volumes.

## Changes

### New Configuration Option

- `pushdown_user_filter: bool` (default: `False`) on `BigQueryV2Config` and `BigQueryQueriesExtractorConfig`
- `user_email_pattern` (Python regex) translated to BigQuery SQL using `REGEXP_CONTAINS()`

### Implementation Details

- `_build_user_filter_from_pattern()`: Converts `AllowDenyPattern` to a BigQuery-compatible SQL WHERE clause
- `_escape_for_bigquery_string()`: SQL-injection-safe escaping (backslashes, then quotes)
- `_is_allow_all_pattern()`: Detects common "match all" patterns to skip unnecessary filtering
- `(?i)` regex flag when `ignoreCase=True`

### Security

### Documentation

- `bigquery_pre.md`

### Example Configuration

```yaml
source:
  type: bigquery
  config:
    use_queries_v2: true # Required for pushdown
    pushdown_user_filter: true # Enable pushdown optimization
    user_email_pattern:
      allow:
        # Single quotes keep the regex backslash literal in YAML
        - 'analyst_.*@example\.com'
      deny:
        - 'bot_.*'
```

### Testing

### Checklist

- (`ruff check`, `ruff format`)