Description
Environment
- Snakemake version: 9.12.0
- Executor: snakemake-executor-plugin-googlebatch 0.6.0 (https://github.com/mboulton-fathom/snakemake-executor-plugin-googlebatch/tree/feat/allow-custom-containers)
- Storage: snakemake-storage-plugin-gcs 1.1.4
- Workflow scale: 14,368 jobs
- All data on GCS: input and output files at gs://scale1/qinling/
Question
I'm running a large-scale image analysis workflow (14,368 jobs) on Google Batch with all data stored on GCS. I'm experiencing severe overhead from file existence checks during both DAG building and job scheduling phases that makes the workflow impractical to run.
Current observations at full scale (14K jobs):
- DAG building: takes on the order of an hour (estimated ~60 minutes based on file check patterns), versus under 30 seconds when the same files are stored locally on a single VM
- Job scheduling overhead: continuous file existence checks throughout execution; selecting a batch of 500 jobs, for example, takes __ minutes
- Individual jobs: Complete quickly (minutes), but the workflow spends most time on overhead
- All file paths are GCS URLs: Every file check requires a GCS API call (~100-200ms each)
- Thousands of files: Each job has 8-32 input files, all requiring existence verification
The bottleneck: With thousands of jobs each checking multiple GCS files during both DAG construction and job selection, the file existence check overhead dominates workflow execution time.
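To illustrate the cost difference I have in mind, here is a rough sketch (not my actual workflow code, and I don't know whether the storage plugin can be driven this way) comparing per-object existence checks with a single prefix listing, using the google-cloud-storage client; the bucket name and path are placeholders mirroring my layout:

```python
# Rough sketch only: compares one-API-call-per-file existence checks
# against a single prefix listing that is then tested locally.
# Assumes the google-cloud-storage client library.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("scale1")

paths = [
    "qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff",
    # ... thousands more entries from the sample CSVs
]

# Per-file pattern: one HTTPS round trip (~100-200 ms) per path.
missing = [p for p in paths if not bucket.blob(p).exists()]

# Batched alternative: enumerate everything under the prefix once,
# then test membership in memory.
existing = {blob.name for blob in client.list_blobs("scale1", prefix="qinling/")}
missing = [p for p in paths if p not in existing]
```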
My configuration:

```yaml
default-storage-provider: gcs
default-storage-prefix: gs://scale1
not-retrieve-storage: false  # Check GCS during DAG build
shared-fs-usage:
  - persistence
  - source-cache
  - sources
```

Path structure question: My sample CSV files contain paths like:
qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff
Which get combined with default-storage-prefix to create:
gs://scale1/qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff
Question: Should I instead use:
- CSV paths: input_ph/... (relative from the project root)
- Config: default-storage-prefix: gs://scale1/qinling
- Result: the same final paths, but a cleaner separation between project prefix and file structure
Does the current approach (including qinling/ in every CSV path) cause any issues or contribute to overhead?
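To make the comparison concrete, here is a minimal sketch of how I understand the two layouts to resolve, assuming the prefix and the CSV path are simply joined:

```python
from posixpath import join

# Current layout: prefix holds only the bucket, CSV rows carry qinling/.
current = join("gs://scale1",
               "qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff")

# Proposed layout: qinling/ moves into the prefix, CSV rows become relative.
proposed = join("gs://scale1/qinling",
                "input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff")

assert current == proposed  # both resolve to the same GCS object
```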
Specific Questions
- Is `not-retrieve-storage: false` the right choice for GCS-heavy workflows?
  - Does this cause double-checking (once during DAG build, once during job scheduling)?
  - For workflows with 10K+ jobs, is it better to skip DAG-time checks and fail fast at runtime?
- Should `shared-fs-usage` be configured differently for Google Batch?
  - Batch uses ephemeral containers with no shared filesystem between jobs
  - Is listing `source-cache` and `sources` causing unnecessary checks?
- Have others experienced file existence check overhead at scale?
  - With thousands of GCS files, is there a recommended approach to minimize API calls?
  - Are there configuration settings specifically for large-scale remote storage workflows?
- What are the recommended settings for `--latency-wait`, `--max-inventory-time`, `--max-status-checks-per-second`, and `--seconds-between-status-checks` when all files are on remote storage? (See the example profile block after this list.)
- Is my path structure setup correct?
  - CSV files include the project prefix: qinling/input_ph/...
  - Config has: default-storage-prefix: gs://scale1
  - Result: gs://scale1/qinling/input_ph/...
  - Should I move qinling/ to the config instead (gs://scale1/qinling) and use purely relative paths in the CSV files?
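For the settings question above, this is the kind of profile block I mean; the values are placeholders I'm currently guessing at, not recommendations:

```yaml
# Placeholder values for illustration only; asking what is sensible
# when every input/output lives on GCS.
latency-wait: 30
max-inventory-time: 20
max-status-checks-per-second: 1
seconds-between-status-checks: 30
```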