fix: migrate to AdapterLogger, separate extract/delta columns, fix INSERT order, parallelize ADLS walk#15

Merged
mdrakiburrahman merged 6 commits into main from
dev/mdrrahman/perf-improvements
Apr 8, 2026

mdrakiburrahman commented Apr 8, 2026

Why this change is needed

Several issues were discovered during production pipeline testing:

  1. Noisy logging — The adapter used logging.getLogger(__name__) which bypassed dbt's log routing, flooding stdout with internal debug noise. Operators couldn't distinguish adapter messages from dbt's own output.
  2. Extract vs. Delta column conflation — A single scope_columns list (with an extract: bool flag) was used for both CREATE TABLE and EXTRACT clauses. This breaks when source file columns differ from the final Delta table schema (e.g. computed/derived columns, renamed columns).
  3. INSERT column-order mismatch: INSERT INTO @target SELECT * FROM @batch_data assumed positional column alignment between the user's SELECT and the Delta table schema. Any reordering caused silent data corruption or runtime failures.
  4. Slow ADLS Gen1 directory listing — Recursive _walk() was serial, blocking the adapter for minutes on deep directory trees.
  5. Per-model AU/priority not wired — Model-level au and priority config was parsed but never forwarded to the ADLA job submission.

How

1. Migrate logging to AdapterLogger (all adapter modules)

Replaced logging.getLogger(__name__) with AdapterLogger("scope") across all adapter modules (impl.py, connections.py, script_builder.py, checkpoint.py, adls_gen1_client.py, delta_lake.py, file_tracker.py, _file_lock.py, sqlglot_parser.py). Demoted most log.info() calls to log.debug() so routine operational detail only appears at debug verbosity. This funnels all adapter output through dbt's event system.

Added _pretty_print_file_batch() in impl.py — uses pandas to render a human-readable table of file metadata (timestamps → ISO-8601, sizes → human-readable) for debug-level batch logging.
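
The formatting step described above can be sketched roughly as follows. This is a standard-library illustration only: the PR's _pretty_print_file_batch() uses pandas, and the helper names, the dict keys (path, modified_ms, length), and the assumption that timestamps arrive as epoch milliseconds are all hypothetical.

```python
from datetime import datetime, timezone


def human_size(num_bytes):
    """Render a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} {units[-1]}"


def iso_utc(epoch_ms):
    """Convert epoch milliseconds (assumed input unit) to an ISO-8601 UTC string."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).isoformat()


def format_file_batch(files):
    """Build a fixed-width text table from a list of file-metadata dicts."""
    header = ("path", "modified", "size")
    rows = [
        (f["path"], iso_utc(f["modified_ms"]), human_size(f["length"]))
        for f in files
    ]
    table = [header, *rows]
    # Column width is the longest cell in that column, header included.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


if __name__ == "__main__":
    print(format_file_batch([{"path": "/in/a.csv", "modified_ms": 0, "length": 1536}]))
```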

2. Separate Delta table columns from Extract columns

Config rename: scope_columns → split into delta_table_columns (CREATE TABLE schema) + extract_columns (EXTRACT column list).

ScriptConfig: replaced columns: list[ColumnDef] with delta_columns and extract_columns. Removed ColumnDef.extract flag — no longer needed since the two column sets are explicit.

ScriptBuilder: _create_table() uses delta_columns, _extract_from_files() uses extract_columns, and _model_transform_and_insert() receives delta_columns for the explicit INSERT column list.

Macros: table.sql and incremental.sql read both config keys and pass them through separately. utils.sql helper macros renamed accordingly.

impl.py (build_script_config): parses delta_table_columns and extract_columns from model config independently.
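
As a rough sketch of the independent parsing described above: the two config keys and the ScriptConfig field names come from this PR, but the entry shape (dicts with name and type), the parse_columns helper, and the error handling are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDef:
    name: str
    type: str  # assumed field; the PR only confirms that the extract flag was removed


@dataclass
class ScriptConfig:
    delta_columns: list[ColumnDef]    # drives CREATE TABLE and the INSERT column list
    extract_columns: list[ColumnDef]  # drives the EXTRACT clause


def parse_columns(raw, key):
    """Turn a list of {'name': ..., 'type': ...} dicts into ColumnDef objects."""
    columns = []
    for entry in raw or []:
        if "name" not in entry or "type" not in entry:
            raise ValueError(f"{key}: each column needs 'name' and 'type', got {entry!r}")
        columns.append(ColumnDef(entry["name"], entry["type"]))
    return columns


def build_script_config(model_config):
    """Parse the two column lists separately so they are free to differ."""
    return ScriptConfig(
        delta_columns=parse_columns(
            model_config.get("delta_table_columns"), "delta_table_columns"
        ),
        extract_columns=parse_columns(
            model_config.get("extract_columns"), "extract_columns"
        ),
    )
```

With this shape, a derived column can appear in delta_table_columns without having to exist in the source files listed under extract_columns.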

3. Fix INSERT/SELECT column-order mismatch

_model_transform_and_insert() now takes delta_columns and emits:

INSERT INTO @target
SELECT col1, col2, col3 FROM @batch_data;

instead of SELECT *, guaranteeing column alignment with the Delta table schema regardless of the user's SELECT order.

The Jinja fallback path in table.sql also changed from SELECT * to SELECT {{ delta_table_columns | map(attribute='name') | join(', ') }}.
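
A minimal Python sketch of the statement generation, mirroring the snippet shown above. The function name build_insert and its default arguments are hypothetical; only the emitted SQL shape is taken from this description.

```python
def build_insert(delta_column_names, target="@target", source="@batch_data"):
    """Emit an INSERT whose SELECT lists columns in Delta table order.

    Projecting by name (instead of SELECT *) makes the positional mapping
    independent of the column order in the user's model SQL.
    """
    if not delta_column_names:
        raise ValueError("delta_column_names must not be empty")
    columns = ", ".join(delta_column_names)
    return f"INSERT INTO {target}\nSELECT {columns} FROM {source};"


if __name__ == "__main__":
    print(build_insert(["col1", "col2", "col3"]))
```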

4. Parallel ADLS Gen1 directory walk

AdlsGen1Client._walk() rewritten to use ThreadPoolExecutor with concurrent.futures.wait(FIRST_COMPLETED). Each directory is listed in parallel (default max_workers=8), with per-directory timing logged at debug level. Zero-length files are now skipped. FileInfo gains a raw: dict field preserving the original ADLS entry for debug display.
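
The walk pattern can be sketched as below. This is not the adapter's code: list_dir stands in for the real ADLS listing call, the entry keys (type, path, length) are assumed, and per-directory timing is left out for brevity.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    length: int
    raw: dict  # original listing entry, kept for debug display


def parallel_walk(list_dir, root, max_workers=8):
    """List a directory tree, fanning out one listing task per directory."""
    files = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            # Wake up as soon as any listing finishes so that newly found
            # subdirectories are submitted without waiting for slower siblings.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for entry in future.result():
                    if entry["type"] == "DIRECTORY":
                        pending.add(pool.submit(list_dir, entry["path"]))
                    elif entry["length"] > 0:  # skip zero-length files
                        files.append(FileInfo(entry["path"], entry["length"], entry))
    return files


# In-memory stand-in for a remote directory tree (hypothetical data).
TREE = {
    "/": [
        {"type": "DIRECTORY", "path": "/a", "length": 0},
        {"type": "FILE", "path": "/root.csv", "length": 10},
    ],
    "/a": [
        {"type": "FILE", "path": "/a/x.csv", "length": 5},
        {"type": "FILE", "path": "/a/empty.csv", "length": 0},
    ],
}


def fake_list_dir(path):
    return TREE.get(path, [])
```

Results are collected only on the calling thread, so the files list needs no lock; worker threads do nothing but run list_dir.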

list_relations_without_caching in impl.py also gains per-step timing instrumentation.

5. Per-model AU and priority support

  • ScopeConnectionHandle gains _next_job_au and _next_job_priority fields.
  • ScopeAdapter exposes set_next_job_au() and set_next_job_priority() as @available methods.
  • ScopeConnectionManager.execute() reads and clears these per-call overrides.
  • Both table.sql and incremental.sql macros call the new setters when the model config specifies au or priority.
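
The read-and-clear behaviour in the list above can be illustrated with a small sketch. It collapses the adapter, handle, and connection manager into one hypothetical class; the default_au and default_priority fields and the take_job_settings() name are assumptions, while the _next_job_* fields and setter names come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionHandle:
    default_au: int
    default_priority: int
    _next_job_au: Optional[int] = None
    _next_job_priority: Optional[int] = None

    def set_next_job_au(self, au):
        self._next_job_au = au

    def set_next_job_priority(self, priority):
        self._next_job_priority = priority

    def take_job_settings(self):
        """Return (au, priority) for the next job, then clear the overrides.

        Clearing ensures one model's settings never leak into the next job
        submitted on the same connection.
        """
        au = self.default_au if self._next_job_au is None else self._next_job_au
        priority = (
            self.default_priority
            if self._next_job_priority is None
            else self._next_job_priority
        )
        self._next_job_au = None
        self._next_job_priority = None
        return au, priority
```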

6. README and integration test updates

  • README examples updated to use delta_table_columns / extract_columns syntax with partition_column_in_extract flag.
  • Integration test models (append_no_delete.sql, filtered_edition.sql) updated to use the new config keys.

Test

  • All existing unit tests pass with updated fixtures reflecting the new column config shape.
  • New unit tests added for AdlsGen1Client parallel walk (test_adls_gen1_client.py).
  • New unit tests for explicit INSERT column list generation (test_script_builder.py).
  • Integration tests updated to use the new delta_table_columns / extract_columns config.

mdrakiburrahman and others added 3 commits April 8, 2026 02:33
The previous commit (6e4a398) fixed _model_transform_and_insert in
script_builder.py, but the materializations use the Jinja macro
scope__build_file_based_script in table.sql — not the Python
ScriptBuilder. Apply the same explicit column list fix to the Jinja
template so INSERT INTO @target includes column names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e order

The previous fix added an explicit column list to INSERT but kept
SELECT *, which still emits columns in model SELECT order. Replace
SELECT * with SELECT col1, col2, ... in delta_columns (table definition)
order so the positional mapping is correct regardless of model SQL order.

Fixes both script_builder.py and the Jinja macro in table.sql.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mdrakiburrahman changed the title from "fix: Improve logging by using adapter and several bug fixes" to "fix: migrate to AdapterLogger, separate extract/delta columns, fix INSERT order, parallelize ADLS walk" Apr 8, 2026
mdrakiburrahman marked this pull request as ready for review April 8, 2026 03:20
mdrakiburrahman merged commit 8d4c89c into main Apr 8, 2026
2 checks passed
mdrakiburrahman deleted the dev/mdrrahman/perf-improvements branch April 8, 2026 15:23

Development

Successfully merging this pull request may close these issues.

bug: SCOPE Extractor tries to pull out columns with the same name as Delta
bug: Use adapter logging not Python logging