feat: max_bytes_per_trigger, SSv5/v6 byte estimation, and batch naming#18

Merged
mdrakiburrahman merged 11 commits into main from dev/mdrrahman/more-fixes on Apr 10, 2026
Conversation

Contributor

@mdrakiburrahman mdrakiburrahman commented Apr 9, 2026

Why this change is needed

The adapter previously batched source files by count only (max_files_per_trigger). This is insufficient because SSv5/v6 structured stream files can vary wildly in actual data size — a small .ss manifest may reference a sibling folder with gigabytes of .du data files. Without byte-aware batching, a single batch could inadvertently submit far more data than intended to an ADLA job, causing timeouts or OOM failures.

Additionally, ADLA job names were not descriptive enough to tell at a glance which batch you were looking at and how many remained.

How

1. max_bytes_per_trigger config (default: 10 TB)

A new config option that works alongside max_files_per_trigger — whichever limit is hit first stops the batch. The file tracker now accepts both limits in get_next_batch(), and at least one file is always included to prevent infinite stalls when a single file exceeds the byte limit.
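The dual-limit rule can be sketched as follows. This is a minimal illustration, assuming files arrive as dicts carrying the `estimated_bytes` value described in the next section; the actual `FileTracker.get_next_batch()` signature may differ.

```python
def get_next_batch(files, max_files_per_trigger, max_bytes_per_trigger):
    """Return the next batch, stopping at whichever limit is hit first.

    At least one file is always included, so a single file larger than
    max_bytes_per_trigger cannot stall the stream forever.
    """
    batch, total_bytes = [], 0
    for f in files:
        if batch and (
            len(batch) >= max_files_per_trigger
            or total_bytes + f["estimated_bytes"] > max_bytes_per_trigger
        ):
            break
        batch.append(f)
        total_bytes += f["estimated_bytes"]
    return batch
```

Note the `if batch and ...` guard: the limits are only checked once the batch is non-empty, which is what guarantees forward progress on oversized files.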

2. SSv5/v6 byte estimation

Before batching, each .ss file is enriched with an estimated_bytes value:

  • SSv3/v4: single .ss file — file size = data size
  • SSv5/v6: .ss manifest + sibling folder with .du files — adapter sums manifest + all .du file sizes

Detection is purely filesystem-based (checking whether the sibling directory exists); no binary parsing is required. The AdlsGen1Client now exposes estimate_bytes() and enrich_with_estimates() methods.
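The estimation rule can be sketched as below, using the local filesystem in place of the ADLS Gen1 client. The sibling-folder naming convention (`foo.ss` next to a `foo/` directory of `.du` files) is an assumption for illustration; the real `AdlsGen1Client.estimate_bytes()` works against remote listings.

```python
import os

def estimate_bytes(ss_path: str) -> int:
    """Estimate the data size behind a .ss file.

    SSv5/v6: the .ss file is a manifest with a sibling directory of .du
    data files, so the estimate is manifest size + sum of .du sizes.
    SSv3/v4: the .ss file alone holds the data, so its size is the estimate.
    """
    size = os.path.getsize(ss_path)
    sibling_dir = os.path.splitext(ss_path)[0]  # assumed layout: foo.ss -> foo/
    if os.path.isdir(sibling_dir):  # purely filesystem-based detection
        for name in os.listdir(sibling_dir):
            if name.endswith(".du"):
                size += os.path.getsize(os.path.join(sibling_dir, name))
    return size
```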

3. Improved ADLA job naming

Job names now include batch_N_of_total format. The total batch count is pre-computed by simulating get_next_batch() across all unprocessed files before the batching loop starts.

Before: mon_sql_rg_history_incremental_batch5_1files
After: mon_sql_rg_history_incremental_batch_5_of_10_files_1
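A sketch of the naming scheme and the pre-computation step. `count_batches` and `job_name` are illustrative helpers, not the adapter's actual functions; the batching function is passed in as a callable to show the "simulate get_next_batch over the unprocessed files" idea.

```python
def count_batches(files, max_files, max_bytes, batcher):
    """Pre-compute the total batch count by replaying the batching
    function over a copy of the unprocessed file list."""
    remaining, total = list(files), 0
    while remaining:
        batch = batcher(remaining, max_files, max_bytes)
        remaining = remaining[len(batch):]
        total += 1
    return total

def job_name(prefix: str, batch_index: int, total_batches: int, file_count: int) -> str:
    """Format an ADLA job name in the batch_N_of_total style shown above."""
    return f"{prefix}_batch_{batch_index}_of_{total_batches}_files_{file_count}"
```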

4. Constants and defaults refactoring

Extracted shared defaults into constants.py (Python) and defaults.sql (Jinja macro), eliminating magic numbers scattered across files.
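A hypothetical shape for the Python side of this refactor. Only the 10 TB default comes from this PR; the constant names and the binary-TB interpretation are assumptions.

```python
# constants.py -- shared defaults, replacing magic numbers scattered
# across files. Names are illustrative; 10 TB assumed binary (1024**4).
DEFAULT_MAX_BYTES_PER_TRIGGER = 10 * 1024**4  # 10 TB
```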

5. Log noise reduction

  • Demoted FileNotFoundError during ADLS listings from warning to debug
  • Added a logging.Filter to suppress Azure SDK's verbose 404 ERROR logs (adapter already handles these)
  • Enriched debug output with estimatedBytes and contributingFiles columns in batch metadata tables
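The 404-suppression filter can be sketched as below. The logger name and the message match are assumptions about which Azure SDK records the adapter suppresses; the point is that a `logging.Filter` drops the noisy ERROR records before they reach the handlers.

```python
import logging

class Suppress404Errors(logging.Filter):
    """Drop ERROR records that merely report a 404 the adapter already
    handles (e.g. listing a path that does not exist yet)."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Return False to drop the record, True to keep it.
        return not (record.levelno >= logging.ERROR and "404" in record.getMessage())

# Hypothetical attachment point; the real logger name may differ.
logging.getLogger("azure").addFilter(Suppress404Errors())
```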

Test

  • 242 unit tests pass (uv run pytest tests/unit/ -v)
  • New test coverage for AdlsGen1Client.estimate_bytes(), enrich_with_estimates(), and byte-limited batching in FileTracker

@mdrakiburrahman mdrakiburrahman changed the title feat: maxBytesPerTrigger support fix: CI failures that popped up out of nowhere Apr 9, 2026
@mdrakiburrahman mdrakiburrahman changed the title fix: CI failures that popped up out of nowhere feat: maxBytesPerTrigger Apr 9, 2026
@mdrakiburrahman mdrakiburrahman changed the title feat: maxBytesPerTrigger feat: max_bytes_per_trigger, SSv5/v6 byte estimation, and batch naming Apr 9, 2026
@mdrakiburrahman mdrakiburrahman linked an issue Apr 10, 2026 that may be closed by this pull request
@mdrakiburrahman mdrakiburrahman marked this pull request as ready for review April 10, 2026 01:20
@mdrakiburrahman mdrakiburrahman merged commit cdd7e43 into main Apr 10, 2026
2 checks passed
@mdrakiburrahman mdrakiburrahman deleted the dev/mdrrahman/more-fixes branch April 10, 2026 01:20


Development

Successfully merging this pull request may close these issues.

feat: Add maxBytesPerTrigger support