feat: max_bytes_per_trigger, SSv5/v6 byte estimation, and batch naming #18
Merged
mdrakiburrahman merged 11 commits into main · Apr 10, 2026
## Why this change is needed

The adapter previously batched source files by count only (`max_files_per_trigger`). This is insufficient because SSv5/v6 structured stream files can vary wildly in actual data size: a small `.ss` manifest may reference a sibling folder with gigabytes of `.du` data files. Without byte-aware batching, a single batch could inadvertently submit far more data than intended to an ADLA job, causing timeouts or OOM failures.

Additionally, ADLA job names were not descriptive enough to tell at a glance which batch you were looking at and how many remained.
## How

### 1. `max_bytes_per_trigger` config (default: 10 TB)

A new config option that works alongside `max_files_per_trigger`: whichever limit is hit first stops the batch. The file tracker now accepts both limits in `get_next_batch()`, and at least one file is always included to prevent infinite stalls when a single file exceeds the byte limit.

### 2. SSv5/v6 byte estimation
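A minimal standalone sketch of the estimation rules this section describes, using the local filesystem via `pathlib` in place of ADLS Gen1. The sibling-folder naming convention (`foo.ss` → `foo/`) and all function shapes here are assumptions for illustration, not the adapter's actual `AdlsGen1Client` code:

```python
from pathlib import Path


def estimate_bytes(ss_path: Path) -> int:
    """Estimate the data size behind a .ss file.

    Rule of thumb (local-filesystem analogue of the adapter's
    AdlsGen1Client.estimate_bytes()):
      - plain .ss file: its own size is the data size
      - .ss manifest with a sibling data folder: manifest size
        plus the size of every .du file in that folder
    """
    size = ss_path.stat().st_size
    # Detection is purely filesystem-based (does a sibling directory
    # exist?), with no binary parsing of the manifest. The naming
    # convention foo.ss -> foo/ is an assumption.
    sibling_dir = ss_path.with_suffix("")
    if sibling_dir.is_dir():
        size += sum(p.stat().st_size for p in sibling_dir.glob("*.du"))
    return size


def enrich_with_estimates(files: list[dict]) -> list[dict]:
    """Attach an estimated_bytes key to each file record."""
    for f in files:
        f["estimated_bytes"] = estimate_bytes(Path(f["path"]))
    return files
```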
Before batching, each `.ss` file is enriched with an `estimated_bytes` value:

- Plain `.ss` file: the file size is the data size.
- `.ss` manifest + sibling folder with `.du` files: the adapter sums the manifest size plus all `.du` file sizes.

Detection is purely filesystem-based (check whether the sibling directory exists); no binary parsing. `AdlsGen1Client` now has `estimate_bytes()` and `enrich_with_estimates()` methods.

### 3. Improved ADLA job naming
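A hypothetical helper showing the naming format this section describes (the function name and signature are illustrative; only the format itself comes from this PR):

```python
def job_name(prefix: str, batch_index: int, total_batches: int,
             file_count: int) -> str:
    """Compose an ADLA job name in the batch_N_of_total format.

    Illustrative sketch: the adapter derives total_batches up front
    by simulating get_next_batch() over all unprocessed files.
    """
    return f"{prefix}_batch_{batch_index}_of_{total_batches}_files_{file_count}"


# Reproduces the "After" example below:
# job_name("mon_sql_rg_history_incremental", 5, 10, 1)
# -> "mon_sql_rg_history_incremental_batch_5_of_10_files_1"
```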
Job names now use a `batch_N_of_total` format. The total batch count is pre-computed by simulating `get_next_batch()` across all unprocessed files before the batching loop starts.

Before: `mon_sql_rg_history_incremental_batch5_1files`

After: `mon_sql_rg_history_incremental_batch_5_of_10_files_1`

### 4. Constants and defaults refactoring
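As an illustration only, the shared-defaults module might look like the following. The 10 TB byte default comes from this PR; the constant names and the file-count value are assumptions:

```python
# constants.py -- hypothetical sketch of the shared defaults module.
MAX_BYTES_PER_TRIGGER_DEFAULT: int = 10 * 1024**4  # 10 TB (binary, i.e. TiB)
MAX_FILES_PER_TRIGGER_DEFAULT: int = 1000          # assumed value, not from the PR
```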
Extracted shared defaults into `constants.py` (Python) and `defaults.sql` (a Jinja macro), eliminating magic numbers scattered across files.

### 5. Log noise reduction
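A minimal sketch of the kind of logging `Filter` this section describes. The matching rule and the logger namespace here are guesses; the adapter's actual filter may key on different fields:

```python
import logging


class AzureNotFoundFilter(logging.Filter):
    """Drop Azure SDK ERROR records for 404s the adapter already handles.

    Hypothetical sketch; the real filter's matching rule may differ.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; keep everything that does
        # not look like a 404 "not found" error.
        message = record.getMessage()
        return not (record.levelno >= logging.ERROR and "404" in message)


# Attach to the (assumed) Azure SDK logger namespace.
logging.getLogger("azure").addFilter(AzureNotFoundFilter())
```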
- Demoted `FileNotFoundError` from ADLS listings from `warning` to `debug` logging.
- Added a logging `Filter` to suppress the Azure SDK's verbose 404 ERROR logs (the adapter already handles these).
- Added `estimatedBytes` and `contributingFiles` columns to the batch metadata tables.

## Test

- Unit tests pass (`uv run pytest tests/unit/ -v`).
- New tests cover `AdlsGen1Client.estimate_bytes()`, `enrich_with_estimates()`, and byte-limited batching in `FileTracker`.
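For reference, the byte-or-count cutoff described in (1) can be sketched as follows. This is a standalone illustration with hypothetical names, not the adapter's actual `FileTracker` code:

```python
def get_next_batch(files: list[dict], max_files: int, max_bytes: int) -> list[dict]:
    """Select files until either limit is hit (hypothetical sketch).

    Mirrors the rule in (1): whichever limit is reached first stops
    the batch, and at least one file is always included so a single
    oversized file cannot stall the stream forever.
    """
    batch: list[dict] = []
    total_bytes = 0
    for f in files:
        size = f["estimated_bytes"]
        # Stop before adding this file if either limit would be violated,
        # but never stop while the batch is still empty.
        if batch and (len(batch) >= max_files or total_bytes + size > max_bytes):
            break
        batch.append(f)
        total_bytes += size
    return batch
```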