feat: max_bytes_per_trigger, SSv5/v6 byte estimation, and batch naming#18

Merged
mdrakiburrahman merged 11 commits into main from dev/mdrrahman/more-fixes on Apr 10, 2026
Conversation

Contributor

@mdrakiburrahman mdrakiburrahman commented Apr 9, 2026

Why this change is needed

The adapter previously batched source files by count only (max_files_per_trigger). This is insufficient because SSv5/v6 structured stream files can vary wildly in actual data size — a small .ss manifest may reference a sibling folder with gigabytes of .du data files. Without byte-aware batching, a single batch could inadvertently submit far more data than intended to an ADLA job, causing timeouts or OOM failures.

Additionally, ADLA job names were not descriptive enough to tell at a glance which batch you were looking at and how many remained.

How

1. max_bytes_per_trigger config (default: 10 TB)

A new config option that works alongside max_files_per_trigger — whichever limit is hit first stops the batch. The file tracker now accepts both limits in get_next_batch(), and at least one file is always included to prevent infinite stalls when a single file exceeds the byte limit.
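The dual-limit rule can be sketched as follows. This is a minimal illustration, assuming files arrive as dicts carrying the `estimated_bytes` value described in the next section; the actual `FileTracker.get_next_batch()` signature may differ.

```python
def get_next_batch(files, max_files_per_trigger, max_bytes_per_trigger):
    """Return the next batch, stopping at whichever limit is hit first.

    At least one file is always included, so a single file larger than
    max_bytes_per_trigger cannot stall the stream forever.
    """
    batch, total_bytes = [], 0
    for f in files:
        if batch and (
            len(batch) >= max_files_per_trigger
            or total_bytes + f["estimated_bytes"] > max_bytes_per_trigger
        ):
            break
        batch.append(f)
        total_bytes += f["estimated_bytes"]
    return batch
```

Note the `if batch and ...` guard: the limits are only checked once the batch is non-empty, which is what guarantees forward progress on oversized files.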

2. SSv5/v6 byte estimation

Before batching, each .ss file is enriched with an estimated_bytes value:

  • SSv3/v4: single .ss file — file size = data size
  • SSv5/v6: .ss manifest + sibling folder with .du files — adapter sums manifest + all .du file sizes

Detection is purely filesystem-based (checking whether the sibling directory exists); no binary parsing is required. The AdlsGen1Client now exposes estimate_bytes() and enrich_with_estimates() methods.
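The estimation rule can be sketched as below, using the local filesystem in place of the ADLS Gen1 client. The sibling-folder naming convention (`foo.ss` next to a `foo/` directory of `.du` files) is an assumption for illustration; the real `AdlsGen1Client.estimate_bytes()` works against remote listings.

```python
import os

def estimate_bytes(ss_path: str) -> int:
    """Estimate the data size behind a .ss file.

    SSv5/v6: the .ss file is a manifest with a sibling directory of .du
    data files, so the estimate is manifest size + sum of .du sizes.
    SSv3/v4: the .ss file alone holds the data, so its size is the estimate.
    """
    size = os.path.getsize(ss_path)
    sibling_dir = os.path.splitext(ss_path)[0]  # assumed layout: foo.ss -> foo/
    if os.path.isdir(sibling_dir):  # purely filesystem-based detection
        for name in os.listdir(sibling_dir):
            if name.endswith(".du"):
                size += os.path.getsize(os.path.join(sibling_dir, name))
    return size
```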

3. Improved ADLA job naming

Job names now include batch_N_of_total format. The total batch count is pre-computed by simulating get_next_batch() across all unprocessed files before the batching loop starts.

Before: mon_sql_rg_history_incremental_batch5_1files
After: mon_sql_rg_history_incremental_batch_5_of_10_files_1
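A sketch of the naming scheme and the pre-computation step. `count_batches` and `job_name` are illustrative helpers, not the adapter's actual functions; the batching function is passed in as a callable to show the "simulate get_next_batch over the unprocessed files" idea.

```python
def count_batches(files, max_files, max_bytes, batcher):
    """Pre-compute the total batch count by replaying the batching
    function over a copy of the unprocessed file list."""
    remaining, total = list(files), 0
    while remaining:
        batch = batcher(remaining, max_files, max_bytes)
        remaining = remaining[len(batch):]
        total += 1
    return total

def job_name(prefix: str, batch_index: int, total_batches: int, file_count: int) -> str:
    """Format an ADLA job name in the batch_N_of_total style shown above."""
    return f"{prefix}_batch_{batch_index}_of_{total_batches}_files_{file_count}"
```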

4. Constants and defaults refactoring

Extracted shared defaults into constants.py (Python) and defaults.sql (Jinja macro), eliminating magic numbers scattered across files.
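A hypothetical shape for the Python side of this refactor. Only the 10 TB default comes from this PR; the constant names and the binary-TB interpretation are assumptions.

```python
# constants.py -- shared defaults, replacing magic numbers scattered
# across files. Names are illustrative; 10 TB assumed binary (1024**4).
DEFAULT_MAX_BYTES_PER_TRIGGER = 10 * 1024**4  # 10 TB
```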

5. Log noise reduction

  • Demoted FileNotFoundError during ADLS listings from warning to debug
  • Added a logging.Filter to suppress Azure SDK's verbose 404 ERROR logs (adapter already handles these)
  • Enriched debug output with estimatedBytes and contributingFiles columns in batch metadata tables
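The 404-suppression filter can be sketched as below. The logger name and the message match are assumptions about which Azure SDK records the adapter suppresses; the point is that a `logging.Filter` drops the noisy ERROR records before they reach the handlers.

```python
import logging

class Suppress404Errors(logging.Filter):
    """Drop ERROR records that merely report a 404 the adapter already
    handles (e.g. listing a path that does not exist yet)."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Return False to drop the record, True to keep it.
        return not (record.levelno >= logging.ERROR and "404" in record.getMessage())

# Hypothetical attachment point; the real logger name may differ.
logging.getLogger("azure").addFilter(Suppress404Errors())
```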

Test

  • 242 unit tests pass (uv run pytest tests/unit/ -v)
  • New test coverage for AdlsGen1Client.estimate_bytes(), enrich_with_estimates(), and byte-limited batching in FileTracker

@mdrakiburrahman mdrakiburrahman changed the title feat: maxBytesPerTrigger support fix: CI failures that popped up out of nowhere Apr 9, 2026
@mdrakiburrahman mdrakiburrahman changed the title fix: CI failures that popped up out of nowhere feat: maxBytesPerTrigger Apr 9, 2026
@mdrakiburrahman mdrakiburrahman changed the title feat: maxBytesPerTrigger feat: max_bytes_per_trigger, SSv5/v6 byte estimation, and batch naming Apr 9, 2026
@mdrakiburrahman mdrakiburrahman linked an issue Apr 10, 2026 that may be closed by this pull request
@mdrakiburrahman mdrakiburrahman marked this pull request as ready for review April 10, 2026 01:20
@mdrakiburrahman mdrakiburrahman merged commit cdd7e43 into main Apr 10, 2026
2 checks passed
@mdrakiburrahman mdrakiburrahman deleted the dev/mdrrahman/more-fixes branch April 10, 2026 01:20


Development

Successfully merging this pull request may close these issues.

feat: Add maxBytesPerTrigger support