
Best practices for large-scale workflows with GCS storage and Google Batch executor #64

@mat10d

Description

Environment

Question

I'm running a large-scale image analysis workflow (14,368 jobs) on Google Batch with all data stored on GCS. I'm experiencing severe overhead from file existence checks during both the DAG-building and job-scheduling phases, which makes the workflow impractical to run.

Current observations at full scale (14K jobs):

  • DAG building: takes an estimated ~60 minutes based on the observed file-check patterns. The same DAG builds in under 30 seconds on a single VM when the files are stored locally on that VM.
  • Job scheduling overhead: continuous file existence checks throughout execution; selecting 500 jobs, for example, takes __ minutes.
  • Individual jobs: complete quickly (minutes), but the workflow spends most of its time on overhead.
  • All file paths are GCS URLs: every file check requires a GCS API call (~100–200 ms each).
  • Thousands of files: each job has 8–32 input files, all requiring existence verification.

The bottleneck: With thousands of jobs each checking multiple GCS files during both DAG construction and job selection, the file existence check overhead dominates workflow execution time.
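
As a rough back-of-envelope (my own estimate, assuming one serial GCS API call per check at ~150 ms): even a single check per job comes to 14,368 × 0.15 s ≈ 36 minutes, the same order as the DAG-build estimate above, while a check per input file would be 14,368 × ~20 × 0.15 s ≈ 43,000 s, i.e. on the order of half a day if the checks were fully serialized.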

My configuration:

default-storage-provider: gcs
default-storage-prefix: gs://scale1
not-retrieve-storage: false  # Check GCS during DAG build
shared-fs-usage:
  - persistence
  - source-cache
  - sources

Path structure question: My sample CSV files contain paths like:

qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Which get combined with default-storage-prefix to create:

gs://scale1/qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Question: Should I instead use:

  • CSV paths: input_ph/... (relative from project root)
  • Config: default-storage-prefix: gs://scale1/qinling
  • Result: Same final paths, but cleaner separation of project prefix and file structure

Does the current approach (including qinling/ in every CSV path) cause any issues or contribute to overhead?
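
For concreteness, here is a side-by-side sketch of the two layouts using the example file above. As I understand it, the prefix is simply prepended to each path, so both layouts resolve to the same object and the choice is about organization rather than correctness:

Current layout:
  default-storage-prefix: gs://scale1
  CSV path: qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Alternative layout:
  default-storage-prefix: gs://scale1/qinling
  CSV path: input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff

Resolved URL in both cases:
  gs://scale1/qinling/input_ph/plate_1/round_1/0/A1_0_0_Fluorescence_405_nm_-_Penta.tiff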

Specific Questions

  1. Is not-retrieve-storage: false the right choice for GCS-heavy workflows?

    • Does this cause double-checking (once during DAG build, once during job scheduling)?
    • For workflows with 10K+ jobs, is it better to skip DAG-time checks and fail fast at runtime?
  2. Should shared-fs-usage be configured differently for Google Batch?

    • Batch uses ephemeral containers with no shared filesystem between jobs
    • Is listing source-cache and sources causing unnecessary checks?
  3. Have others experienced file existence check overhead at scale?

    • With thousands of GCS files, is there a recommended approach to minimize API calls?
    • Are there configuration settings specifically for large-scale remote storage workflows?
  4. What are the recommended settings for:

    • --latency-wait
    • --max-inventory-time
    • --max-status-checks-per-second
    • seconds-between-status-checks

    ...when all files are on remote storage? (See the profile sketch after this list for where these options would live.)

  5. Is my path structure setup correct?

    • CSV files include project prefix: qinling/input_ph/...
    • Config has: default-storage-prefix: gs://scale1
    • Result: gs://scale1/qinling/input_ph/...

    Should I move qinling/ to the config instead (gs://scale1/qinling) and use purely relative paths in CSV files?
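
For reference on question 4, here is a minimal sketch of the profile I have in mind, showing where those options would sit alongside my current storage settings. The numeric values are placeholders for illustration only, not recommendations; suitable values are exactly what I'm asking about.

executor: googlebatch
default-storage-provider: gcs
default-storage-prefix: gs://scale1
latency-wait: 60                     # placeholder: seconds to wait for output files on remote storage
max-inventory-time: 20               # placeholder: cap on time spent building the file inventory
max-status-checks-per-second: 10     # placeholder: throttle status polling against Google Batch
seconds-between-status-checks: 30    # placeholder: pause between polling rounds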
