
Best practices for large-scale workflows with GCS storage and Google Batch executor #64

@mat10d

Description

Environment

Question

I'm running a large-scale image analysis workflow (14,368 jobs) on Google Batch with all data stored on GCS. I'm experiencing severe overhead from file existence checks during both the DAG-building and job-scheduling phases, which makes the workflow impractical to run.

Current observations at full scale (14K jobs):

  • DAG building: takes an estimated ~60 minutes based on the observed file-check patterns. The same DAG builds in under 30 seconds on a single VM when the files are stored locally on that VM.
  • Job scheduling overhead: continuous file existence checks throughout execution; selecting 500 jobs, for example, takes __ minutes.
  • Individual jobs: complete quickly (minutes), but the workflow spends most of its time on overhead.
  • All file paths are GCS URLs: every file check requires a GCS API call (~100–200 ms each).
  • Thousands of files: each job has 8–32 input files, all requiring existence verification.

The bottleneck: With thousands of jobs each checking multiple GCS files during both DAG construction and job selection, the file existence check overhead dominates workflow execution time.
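
As a rough back-of-envelope (my own estimate, assuming one serial GCS API call per check at ~150 ms): even a single check per job comes to 14,368 × 0.15 s ≈ 36 minutes, the same order as the DAG-build estimate above, while a check per input file would be 14,368 × ~20 × 0.15 s ≈ 43,000 s, i.e. on the order of half a day if the checks were fully serialized.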

My configuration:

default-storage-provider: gcs
default-storage-prefix: gs://scale1
not-retrieve-storage: false  # Check GCS during DAG build
shared-fs-usage:
  - persistence
  - source-cache
  - sources

Path structure question: My sample CSV files contain paths like:

qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Which get combined with default-storage-prefix to create:

gs://scale1/qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Question: Should I instead use:

  • CSV paths: input_ph/... (relative from project root)
  • Config: default-storage-prefix: gs://scale1/qinling
  • Result: Same final paths, but cleaner separation of project prefix and file structure

Does the current approach (including qinling/ in every CSV path) cause any issues or contribute to overhead?
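
For concreteness, here is a side-by-side sketch of the two layouts using the example file above. As I understand it, the prefix is simply prepended to each path, so both layouts resolve to the same object and the choice is about organization rather than correctness:

Current layout:
  default-storage-prefix: gs://scale1
  CSV path: qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Alternative layout:
  default-storage-prefix: gs://scale1/qinling
  CSV path: input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Resolved URL in both cases:
  gs://scale1/qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff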

Specific Questions

  1. Is not-retrieve-storage: false the right choice for GCS-heavy workflows?

    • Does this cause double-checking (once during DAG build, once during job scheduling)?
    • For workflows with 10K+ jobs, is it better to skip DAG-time checks and fail fast at runtime?
  2. Should shared-fs-usage be configured differently for Google Batch?

    • Batch uses ephemeral containers with no shared filesystem between jobs
    • Is listing source-cache and sources causing unnecessary checks?
  3. Have others experienced file existence check overhead at scale?

    • With thousands of GCS files, is there a recommended approach to minimize API calls?
    • Are there configuration settings specifically for large-scale remote storage workflows?
  4. What are the recommended settings for:

    • --latency-wait
    • --max-inventory-time
    • --max-status-checks-per-second
    • seconds-between-status-checks

    ...when all files are on remote storage? (See the profile sketch after this list for where these options would live.)

  5. Is my path structure setup correct?

    • CSV files include project prefix: qinling/input_ph/...
    • Config has: default-storage-prefix: gs://scale1
    • Result: gs://scale1/qinling/input_ph/...

    Should I move qinling/ to the config instead (gs://scale1/qinling) and use purely relative paths in CSV files?
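
For reference on question 4, here is a minimal sketch of the profile I have in mind, showing where those options would sit alongside my current storage settings. The numeric values are placeholders for illustration only, not recommendations; suitable values are exactly what I'm asking about.

executor: googlebatch
default-storage-provider: gcs
default-storage-prefix: gs://scale1
latency-wait: 60                     # placeholder: seconds to wait for output files on remote storage
max-inventory-time: 20               # placeholder: cap on time spent building the file inventory
max-status-checks-per-second: 10     # placeholder: throttle status polling against Google Batch
seconds-between-status-checks: 30    # placeholder: pause between polling rounds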
