Alphafold in nextflow using azure batch #6843
venkatt007 asked this question in Q&A
Azure Batch + nf-core/proteinfold: AlphaFold DB Files Always Staging (Even with blobfuse2 Mounts)
Hi all,
I’m running nf-core/proteinfold (v1.1.1) on Azure Batch with the azurebatch executor in a closed private network, and I’m trying to prevent the multi-terabyte AlphaFold database from being staged into the Azure Blob work directory on every run.
Despite using blobfuse2 mounts and setting stageInMode = 'symlink', the pipeline keeps staging DB files into az://.../work/stage-*. The files are correctly mounted on the Batch nodes as well.
I’m looking for confirmation of expected behavior and/or best practice for this architecture.
Environment
Nextflow: 25.10.x
nf-core/proteinfold: 1.1.1
Executor: azurebatch
Containers: Docker
Azure Storage: Blob storage mounted on compute nodes via blobfuse2
Database size: Multi-terabyte AlphaFold reference DB
Azure Batch Setup
Each Batch node has blobfuse2 mounts configured at:
/mnt/batch/tasks/fsmounts/input
/mnt/batch/tasks/fsmounts/results
/mnt/batch/tasks/fsmounts/work
Verified with:
```
blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/input
blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/results
blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/work
```
The AlphaFold DB is located under:
/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb
Goal
Avoid staging/copying the AlphaFold DB into:
az:///work/stage-/...
The DB already exists on mounted storage accessible to all nodes.
Configuration Attempt
nextflow config (simplified):

```groovy
process {
    executor     = 'azurebatch'
    stageInMode  = 'symlink'
    stageOutMode = 'rsync'
}

workDir = '/mnt/batch/tasks/fsmounts/work/work'

fusion.enabled = false
wave.enabled   = false
tower.enabled  = false
docker.enabled = true
```
params:

```yaml
input: "/mnt/batch/tasks/fsmounts/input/samplesheet.csv"
outdir: "/mnt/batch/tasks/fsmounts/results/test1"
alphafold2_db: "/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb"
bfd_path: "/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb/bfd/*"
...
```
Observed Behavior
Even with:
Local POSIX paths only (no az://)
stageInMode = 'symlink'
Mounted storage on all nodes
The log still shows:
```
FilePorter - Copying foreign file /mnt/batch/tasks/fsmounts/work/alphafolddb/...
  to work dir: az:///work/stage-/...
```
And on interruption:
```
port 4: (value) bound ; channel: bfd/*
port 5: (value) bound ; channel: small_bfd/*
port 6: (value) bound ; channel: mgnify/*
...
```
So it appears that the pipeline is materializing DB glob paths as path inputs, which forces Azure Batch localization via object storage staging.
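To make the distinction concrete, here is a minimal, purely illustrative process pair (not actual proteinfold code; names are hypothetical) contrasting a `path` input, which Nextflow localizes, with a `val` input, which it does not:

```nextflow
// Hypothetical sketch. A `path` input is a managed input: Nextflow
// localizes it into the task work dir, which on Azure Batch means
// staging it through az:// object storage.
process WITH_PATH_INPUT {
    input:
    path bfd_db          // copied into az://.../work/stage-*

    script:
    """
    run_search --db ${bfd_db}
    """
}

// A `val` input is an opaque string: Nextflow performs no staging,
// so the task simply reads the path from the node-local blobfuse2 mount.
process WITH_VAL_INPUT {
    input:
    val bfd_db_path      // e.g. /mnt/batch/tasks/fsmounts/work/alphafolddb/...

    script:
    """
    run_search --db ${bfd_db_path}
    """
}
```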
What I’ve Tried
Using blobfuse2 mounts only (no Fusion)
Using Fusion instead of blobfuse
Mounting DB inside container with containerOptions
Overriding RUN_ALPHAFOLD2 module to use DB root directly
Using both az:// and POSIX-only configurations
The staging persists as long as DB-related parameters are passed as path inputs.
My Understanding (Please Confirm)
It seems that:
Azure Batch executor requires inputs to be localized into the remote workDir.
If a process declares path inputs (e.g., path('bfd/*')), Nextflow treats them as managed inputs.
On Azure Batch, this results in uploading those files into the az://work/stage- area.
blobfuse mounts do not prevent this behavior.
The only way to avoid DB staging is:
Use Fusion with az:// paths, or
Refactor the pipeline so the DB is passed as a val string (not path inputs).
Is that correct?
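If the `val`-based refactor is indeed the way to go, the override I have in mind looks roughly like this (a sketch only, assuming the module can be overridden locally and that params.alphafold2_db points at the blobfuse2 mount on every node; the script invocation is abbreviated, not the module's real command line):

```nextflow
// Assumed to resolve on every Batch node via the blobfuse2 mount.
params.alphafold2_db = '/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb'

process RUN_ALPHAFOLD2 {
    // Bind-mount the host DB path into the container instead of
    // declaring it as a `path` input, so nothing goes through az:// staging.
    containerOptions "-v ${params.alphafold2_db}:${params.alphafold2_db}:ro"

    input:
    val db_root          // plain string; Nextflow performs no localization

    script:
    """
    run_alphafold.py --data_dir ${db_root} ...
    """
}
```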
Questions
Is there any supported way to prevent localization of large path inputs on Azure Batch when using mounted blob storage?
Has anyone successfully run nf-core/proteinfold on private Azure Batch with multi-TB AlphaFold DBs without massive staging overhead?
Any clarification or architectural recommendations would be greatly appreciated.