perf(ingestion/mode): Parallelize report processing across all spaces #16408
Merged
askumar27 merged 11 commits on Mar 12, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract inline dataset processing from _emit_workunits_for_space() into its own method with the same error-isolation pattern as _process_report(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sequential space processing with a global thread pool. Reports from all spaces are now processed in a single ThreadedIteratorExecutor, eliminating the straggler problem where threads sat idle waiting for slow reports in one space before moving to the next. Also removes _clear_sql_parsing_caches() since the underlying LRU caches are already bounded at maxsize=1000 via functools.lru_cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
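The claim that a bounded functools.lru_cache does not need manual clearing can be checked with a small sketch. The parse_sql function below is a hypothetical stand-in for the real SQL parsing call, not code from this PR; only functools.lru_cache and its cache_info() method are standard library behavior.

```python
import functools


@functools.lru_cache(maxsize=1000)
def parse_sql(query: str) -> str:
    # Hypothetical stand-in for an expensive SQL parse.
    return " ".join(query.split()).upper()


# Feed far more distinct queries than the cache can hold.
for i in range(5000):
    parse_sql(f"select {i} from t")

info = parse_sql.cache_info()
# The cache evicts least-recently-used entries instead of growing.
print(info.currsize, info.maxsize)
```

Because the entry count never exceeds maxsize, memory held by the cache stays bounded for the lifetime of the process, which is the stated reason an explicit cache-clearing step becomes unnecessary.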
…itecture Update TestDatasetErrorIsolation to call _process_dataset() directly instead of the removed _emit_workunits_for_space() method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Linear: ING-1803
Member
Connector Tests Results
Connector tests failed for commit
Autogenerated by the connector-tests CI pipeline.
…ecutor Offload CPU-bound sqlglot parsing to a process pool while keeping I/O (API calls) in threads. Each worker process creates its own DataHubGraph connection and SchemaResolver cache. Gracefully degrades to in-thread parsing if pool creation fails or a worker crashes (BrokenProcessPool). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
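A minimal sketch of the degradation path described in this commit, assuming hypothetical helper names (parse_sql, create_pool, parse_with_fallback) that are not taken from mode.py. It shows only the control flow: try the process pool, and fall back to in-thread parsing when the pool is missing or broken.

```python
import multiprocessing
from concurrent.futures import Executor, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from typing import Optional


def parse_sql(query: str) -> str:
    # Hypothetical stand-in for CPU-bound sqlglot parsing.
    return " ".join(query.split()).upper()


def create_pool(max_workers: int) -> Optional[ProcessPoolExecutor]:
    # Returns None when the pool cannot be created, so callers can
    # fall back to parsing in the calling thread.
    try:
        ctx = multiprocessing.get_context("spawn")
        return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)
    except (OSError, ValueError):
        return None


def parse_with_fallback(pool: Optional[Executor], query: str) -> str:
    if pool is not None:
        try:
            return pool.submit(parse_sql, query).result()
        except BrokenProcessPool:
            # A worker crashed; degrade to in-thread parsing below.
            pass
    return parse_sql(query)
```

With pool=None the call simply parses inline, so the same code path covers both "pool creation failed" and "worker crashed".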
- Import BrokenProcessPool from concurrent.futures.process
- Declare _sql_parse_pool attribute type in __init__ to avoid no-redef
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…copy=False The copy=False optimization (commit d5a1311) eliminated deepcopy calls, but the only cooperate() call site was inside the __deepcopy__ wrapper. This made the 10-second SQL lineage timeout completely dead. Add explicit cooperate() calls at key iteration points in sqlglot_lineage.py and add a timeout to future.result() in mode.py for the ProcessPoolExecutor path where contextvars-based cooperative timeout doesn't cross process boundaries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_list_joins used Expression.find_all(Join) which descends into CTE definitions, causing joins to be re-processed in the outer scope after already being handled in their own CTE scopes. This triggered redundant to_node() calls with expensive AST deep-copy/serialize/re-parse cycles. Replace with find_all_in_scope() which stops at scope boundaries. For a query with 4 CTEs and 11+ JOINs this eliminates 79% of to_node() calls (7,206 -> 1,538) and gives a 5x speedup in _list_joins. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
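The difference between a full descent and a scope-bounded search can be shown on a toy tree. Node, find_all_descendants, and find_all_within_scope below are hypothetical illustrations and do not use sqlglot; the toy treats "cte" nodes as scope boundaries.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    kind: str  # e.g. "select", "join", "cte"
    children: List["Node"] = field(default_factory=list)


def find_all_descendants(node: Node, kind: str) -> Iterator[Node]:
    # Visits every descendant, crossing into nested scopes.
    for child in node.children:
        if child.kind == kind:
            yield child
        yield from find_all_descendants(child, kind)


def find_all_within_scope(node: Node, kind: str) -> Iterator[Node]:
    # Same search, but stops at scope boundaries (here: CTE nodes).
    for child in node.children:
        if child.kind == "cte":
            continue
        if child.kind == kind:
            yield child
        yield from find_all_within_scope(child, kind)


# Outer query with one CTE containing two joins, plus one outer join.
query = Node("select", [
    Node("cte", [Node("select", [Node("join"), Node("join")])]),
    Node("join"),
])

all_joins = len(list(find_all_descendants(query, "join")))
scoped_joins = len(list(find_all_within_scope(query, "join")))
print(all_joins, scoped_joins)  # prints "3 1"
```

In the toy, the full descent reports the CTE's joins again from the outer query, while the scoped search leaves them to be handled when the CTE's own scope is processed.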
…_tables Replace expensive to_node() calls in _get_join_side_tables with direct scope.sources resolution. This avoids the costly deepcopy -> sql() -> parse_one() cycle for each join side, while keeping the precise to_node() approach for ON clause column resolution via intersection filtering.
Key changes:
- New _collect_tables_from_scope() walks scope.sources recursively through SUBQUERY/DERIVED_TABLE scopes to find physical Table nodes
- Smart CTE handling: only follows CTE sources actually referenced in FROM/JOIN clauses (avoids sibling CTE over-inclusion from sqlglot)
- Hoists FROM clause resolution outside the per-join loop
- Skips UDTF scopes (LATERAL/UNNEST) which include correlated references
Benchmark on production 27-min query: 44.4s -> 7.5s (5.9x speedup).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Your PR has been assigned to tamas for review (ING-1803).
treff7es approved these changes on Mar 6, 2026
…oolExecutor Each worker process re-initialized its own DataHubGraph connection and SchemaResolver, wasting memory without proportional benefit. Revert to inline SQL parsing using the main process's shared SchemaResolver, which benefits from LRU cache hits across queries. Thread-based cross-space parallelization (ThreadPoolExecutor for I/O) and cooperative timeout fixes in sqlglot_lineage.py are preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
askumar27 added a commit that referenced this pull request on Mar 12, 2026
Summary
Major performance overhaul of the Mode ingestion connector: cross-space parallelized report processing, multi-process SQL parsing, rate limiting, and numerous bug fixes for definition expansion, lineage extraction, and memory usage.
Motivation
The Mode connector had several performance and correctness issues for large workspaces:
- lru_cache(maxsize=None) on _get_request_json cached every API response forever, causing OOM on large workspaces.
- sqlglot lineage computation held the GIL, preventing concurrent API calls.
- copy=False optimization (commit d5a1311a50) eliminated all deepcopy calls, but cooperate(), which enforces the 10-second SQL lineage timeout, only ran inside the __deepcopy__ wrapper. SQL lineage could run indefinitely.
- "--" comments swallowing closing parens.
Changes Overview
Architecture: Cross-space parallelization
Three-phase processing model:
- (space_token, report/dataset) tuples into flat lists.
- ThreadedIteratorExecutor pool; no more per-space thread pools.
This eliminates the straggler problem: threads are always busy processing the next available report regardless of which space it belongs to.
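The flatten-then-process idea can be sketched with the standard library. This uses concurrent.futures.ThreadPoolExecutor as a stand-in for ThreadedIteratorExecutor, and the spaces data, process_report, and safe_process are hypothetical names for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical workspace layout: space token -> report tokens.
spaces: Dict[str, List[str]] = {
    "space_a": ["r1", "bad", "r3"],
    "space_b": ["r4"],
}


def process_report(space_token: str, report: str) -> str:
    # Stand-in for the API calls and work unit emission of one report.
    if report == "bad":
        raise ValueError("simulated failure")
    return f"{space_token}/{report}"


def safe_process(item: Tuple[str, str]) -> str:
    # Error isolation: one failing report must not abort the others.
    space_token, report = item
    try:
        return process_report(space_token, report)
    except Exception as exc:
        return f"{space_token}/{report}: failed ({exc})"


# Flatten all spaces into a single list of (space_token, report) tuples.
work: List[Tuple[str, str]] = [
    (space, report) for space, reports in spaces.items() for report in reports
]

# One pool for everything; a free thread picks up the next item
# regardless of which space it came from.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe_process, work))
```

Because the work list is flat, a slow report in one space only occupies one thread while the rest keep draining items from other spaces.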
Multi-process SQL parsing
- sqlglot lineage parsing offloaded to a ProcessPoolExecutor (spawn context) to bypass the GIL
- SchemaResolver and DataHubGraph connections via _init_sql_parse_worker
- cpu_count - 1, capped at max_threads
Rate limiting
- RateLimiter (api_options.requests_per_minute, default 180)
- RateLimiter sleep moved outside the lock to reduce thread contention
SQL parsing performance (sqlglot_lineage.py)
- _get_join_side_tables: Direct scope source walking instead of expensive to_node() column-lineage approach
- find_all_in_scope instead of find_all to avoid descending into subqueries
- _get_raw_col_upstreams_for_expression
- copy=False in to_node(): Avoids full AST deep copy during column-level lineage
- cooperate() calls at 4 hot-loop points (column loop, tree walk, to_node() call, scope traversal)
Bug fixes
- "--" comments from swallowing it
- @lru_cache(maxsize=None) on _get_request_json; replaced with bounded caches and manual caching
- threading.Lock to ModeSourceReport for counter increments and report_warning/report_failure
- _is_http_404 helper checks error.response for None before accessing .status_code
- HTTPError429/HTTPError504 now pass response= kwarg
Other improvements
- explorations_count == 0, chart API calls are skipped entirely
- exclude_personal_collections config (default True): Server-side filtering via Mode's ?filter=custom
- items_per_page validation: Range expanded from 1-30 to 1-1000 (Mode API supports up to 1000)
- _get_last_query_run method
- emit_dashboard_mces/emit_chart_mces/emit_dataset_mces into _process_report and _process_dataset
ThreadedIteratorExecutor improvements
- stop_event for cooperative shutdown
Dependencies
- cachetools to Mode connector's dependency set
Impact Assessment
- lru_cache on all API responses)
Affected Components: Mode ingestion source, sqlglot_lineage.py, ThreadedIteratorExecutor, RateLimiter
Breaking Changes: None. max_threads=1 (default) preserves sequential behavior. exclude_personal_collections=True (new default) changes the API filter from ?filter=all to ?filter=custom; set to False to restore old behavior.
Risk Level: Medium. Substantial refactoring of the processing loop, but guarded by fallback paths (single-thread mode, in-process SQL parsing) and comprehensive test coverage.
Test plan
- ./gradlew :metadata-ingestion:lintFix: ruff + mypy pass
- tests/unit/sql_parsing/), including new CTE chain join tables test
- tests/unit/test_mode_source.py): definition expansion, _is_http_404, thread-safe report, timestamp parsing
- tests/integration/mode/test_mode_threading.py): cross-space parallelization, single-threaded fallback, dataset error isolation, chart skip optimization
🤖 Generated with Claude Code