Failing pipeline-level test on CI with Singularity #27

@CaroAMN

Description of the bug

The pipeline-level test tests/default.nf.test fails when using Singularity. The test runs the main workflow with a minimal test dataset. Only one module in the main workflow, numorph3dunet, requires a GPU. The process starts, but when it requests a GPU, none is available. The process then hangs until it reaches the time limit (1h).

  • The test runs with the profiles test,singularity,gpu.
  • The CI test requests GPU runners.
  • The process cannot fall back to CPU: that would require a different TensorFlow version, and CPU execution is impractical for a real dataset because of the runtime.
  • The module-level test modules/local/numorph3dunet/test/main.nf.test (name: Numorph3DUnet test - tif) passes with Singularity on CI.
  • All GPU tests with Singularity pass locally with the same nf-test command as used on CI.
  • All GPU tests with Docker pass locally and on CI, using the same CI setup as the Singularity CI tests.
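
Since the Docker tests pass on the same runners, the GPU is presumably lost between the host and the Singularity container rather than on the runner itself. A minimal sketch of a check that separates the two cases is given below. `classify_gpu_access` is a hypothetical helper, not part of the pipeline; it takes two probe commands as strings and reports which of them fails.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/lsmquant): report where GPU access is lost.
# Usage: classify_gpu_access "<host probe>" "<container probe>"
# Each argument is a command string that exits 0 when a GPU is visible.
classify_gpu_access() {
    local host_probe="$1"
    local container_probe="$2"

    if ! bash -c "$host_probe" >/dev/null 2>&1; then
        # The runner itself cannot see a GPU.
        echo "host-no-gpu"
    elif ! bash -c "$container_probe" >/dev/null 2>&1; then
        # The runner sees a GPU, but the container does not.
        echo "container-no-gpu"
    else
        echo "gpu-ok"
    fi
}
```

On a CI runner this could be called as `classify_gpu_access "nvidia-smi" "apptainer exec --nv $IMAGE nvidia-smi"`, where `$IMAGE` is an assumed variable holding the path of the cached numorph-3dunet image. `--nv` is Apptainer's flag for NVIDIA support; whether the gpu profile passes it to the container is not shown in this issue.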

Command used and terminal output

Test pipeline

  Test [597de9c9] 'default test gpu' 
    > Nextflow 25.04.6 is available - Please consider updating your version to it
    > N E X T F L O W  ~  version 24.10.5
    > Launching `/home/runner/_work/lsmquant/lsmquant/tests/../main.nf` [intergalactic_bose] DSL2 - revision: 3dca561480
    > Downloading plugin nf-schema@2.4.2
    > 
    > ------------------------------------------------------
    >                                         ,--./,-.
    >         ___     __   __   __   ___     /,-._.--~'
    >   |\ | |__  __ /  ` /  \ |__) |__         }  {
    >   | \| |       \__, \__/ |  \ |___     \`-._,-`-,
    >                                         `._,._,'
    >   nf-core/lsmquant 1.0dev
    > ------------------------------------------------------
    > Input/output options
    >   input                       : https://raw.githubusercontent.com/nf-core/test-datasets/lsmquant//test_data/samplesheets/sample_sheet.csv
    >   outdir                      : /home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1/output
    >   stage                       : full
    >   model_file                  : https://zenodo.org/records/16893708/files/075_121_model.h5
    > 
    > Institutional config options
    >   config_profile_name         : Test profile
    >   config_profile_description  : Minimal test dataset to check pipeline function
    > 
    > Generic options
    >   pipelines_testdata_base_path: https://raw.githubusercontent.com/nf-core/test-datasets/refs/heads/lsmquant
    >   trace_report_suffix         : 2025-09-05_10-42-02
    > 
    > Core Nextflow options
    >   runName                     : intergalactic_bose
    >   containerEngine             : singularity
    >   launchDir                   : /home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1
    >   workDir                     : /home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1/work
    >   projectDir                  : /home/runner/_work/lsmquant/lsmquant
    >   userName                    : runner
    >   profile                     : test,singularity,gpu
    >   configFiles                 : /home/runner/_work/lsmquant/lsmquant/nextflow.config, /home/runner/_work/lsmquant/lsmquant/nextflow.config, /home/runner/_work/lsmquant/lsmquant/tests/nextflow.config
    > 
    > !! Only displaying parameters that differ from the pipeline defaults !!
    > ------------------------------------------------------
    > * The nf-core framework
    >     https://doi.org/10.1038/s41587-020-0439-x
    > 
    > * Software dependencies
    >     https://github.com/nf-core/lsmquant/blob/master/CITATIONS.md
    > 
    > WARN: The following invalid input values have been detected:
    > 
    > * --modules_testdata_base_path: https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/
    > 
    > 
    > WARN: Access to undefined parameter `genomes` -- Initialise it to a default value eg. `params.genomes = some_value`
    > Staging foreign file: https://zenodo.org/records/14916478/files/ctip2_topro.zip
    > Pulling Singularity image https://depot.galaxyproject.org/singularity/p7zip:16.02 [cache /home/runner/_work/lsmquant/lsmquant/.singularity/depot.galaxyproject.org-singularity-p7zip-16.02.img]
    > [66/0321e7] Submitted process > NFCORE_LSMQUANT:LSMQUANT:UNZIP (ctip2_topro.zip)
    > Pulling Singularity image docker://quay.io/carolinschwitalla/numorph_preprocessing:0.9.0 [cache /home/runner/_work/lsmquant/lsmquant/.singularity/quay.io-carolinschwitalla-numorph_preprocessing-0.9.0.img]
    > [50/e1f123] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:NUMORPHINTENSITY (TEST1)
    > Pulling Singularity image docker://quay.io/carolinschwitalla/mat2json:latest [cache /home/runner/_work/lsmquant/lsmquant/.singularity/quay.io-carolinschwitalla-mat2json-latest.img]
    > [b5/10209e] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:NUMORPHALIGN (TEST1)
    > [61/e6793f] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:MAT2JSON_INT (TEST1)
    > [a8/4e70ed] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:MAT2JSON_ALIGN (TEST1)
    > [3e/7be5d3] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:NUMORPHSTITCH (TEST1)
    > [fe/3a63d4] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH_PREPROCESSING:MAT2JSON_STITCH (TEST1)
    > Staging foreign file: https://zenodo.org/records/16893708/files/075_121_model.h5
    > Pulling Singularity image docker://quay.io/carolinschwitalla/numorph-3dunet:latest [cache /home/runner/_work/lsmquant/lsmquant/.singularity/quay.io-carolinschwitalla-numorph-3dunet-latest.img]
    > [66/725962] Submitted process > NFCORE_LSMQUANT:LSMQUANT:NUMORPH3DUNET (TEST1)
    > ERROR ~ Error executing process > 'NFCORE_LSMQUANT:LSMQUANT:NUMORPH3DUNET (TEST1)'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   source /opt/conda/etc/profile.d/conda.sh
    >   conda activate 3dunet
    >   
    >   echo "GPU devices:"
    >   ls -lha /dev/nvidia* || echo "No nvidia devices found"
    >   
    >   echo "Checking GPU access:"
    >   nvidia-smi || echo "No nvidia-smi found"
    >   
    >   mkdir -p ./results
    >   mkdir -p ./images
    >   
    >   # move images to images directory
    >   mv TEST1_0001_C1_topro_stitched.tif TEST1_0001_C2_ctip2_stitched.tif TEST1_0002_C1_topro_stitched.tif TEST1_0002_C2_ctip2_stitched.tif TEST1_0003_C1_topro_stitched.tif TEST1_0003_C2_ctip2_stitched.tif TEST1_0004_C1_topro_stitched.tif TEST1_0004_C2_ctip2_stitched.tif TEST1_0005_C1_topro_stitched.tif TEST1_0005_C2_ctip2_stitched.tif TEST1_0006_C1_topro_stitched.tif TEST1_0006_C2_ctip2_stitched.tif TEST1_0007_C1_topro_stitched.tif TEST1_0007_C2_ctip2_stitched.tif TEST1_0008_C1_topro_stitched.tif TEST1_0008_C2_ctip2_stitched.tif TEST1_0009_C1_topro_stitched.tif TEST1_0009_C2_ctip2_stitched.tif TEST1_0010_C1_topro_stitched.tif TEST1_0010_C2_ctip2_stitched.tif TEST1_0011_C1_topro_stitched.tif TEST1_0011_C2_ctip2_stitched.tif TEST1_0012_C1_topro_stitched.tif TEST1_0012_C2_ctip2_stitched.tif TEST1_0013_C1_topro_stitched.tif TEST1_0013_C2_ctip2_stitched.tif TEST1_0014_C1_topro_stitched.tif TEST1_0014_C2_ctip2_stitched.tif TEST1_0015_C1_topro_stitched.tif TEST1_0015_C2_ctip2_stitched.tif TEST1_0016_C1_topro_stitched.tif TEST1_0016_C2_ctip2_stitched.tif TEST1_0017_C1_topro_stitched.tif TEST1_0017_C2_ctip2_stitched.tif TEST1_0018_C1_topro_stitched.tif TEST1_0018_C2_ctip2_stitched.tif ./images/
    >   
    >   numorph_3dunet.predict \
    >       -i images/ \
    >       -o results \
    >       --model_file 075_121_model.h5 \
    >       --sample_id TEST1 \
    >   
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_LSMQUANT:LSMQUANT:NUMORPH3DUNET":
    >       numorph3dunet: 1.0
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   GPU devices:
    >   No nvidia devices found
    >   Checking GPU access:
    >   NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    >   
    >   No nvidia-smi found
    >   Namespace(acquired_img_resolution=[0.75, 0.75, 4], chunk_overlap=[16, 16, 8], chunk_size=[112, 112, 32], gpu=0, i='images/', int_threshold=200, mask_file='', measure_coloc=False, model_file='075_121_model.h5', n_channels=None, normalize_intensity=True, o='results', p='', pred_threshold=0.5, resample_chunks=False, resample_resolution=25, sample_id='TEST1', trained_img_resolution=[0.75, 0.75, 2.5], tree_radius=2, use_mask=False)
    >   Saving results to:  results/TEST1.csv
    >   Loading pre-trained model
    >   Namespace(acquired_img_resolution=[0.75, 0.75, 4], chunk_overlap=[16, 16, 8], chunk_size=[112, 112, 32], gpu=0, i='images/', int_threshold=200, mask_file='', measure_coloc=False, model_file='075_121_model.h5', n_channels=None, normalize_intensity=True, o='results', p='', pred_threshold=0.5, resample_chunks=False, resample_resolution=25, sample_id='TEST1', trained_img_resolution=[0.75, 0.75, 2.5], tree_radius=2, use_mask=False)
    >   Mask resolution:  [25 25 25]
    >   Working on chunk 1 out of 1
    >   Reading slices 0 through 18
    >   Padding Chunk End...
    >   Rescaling Intensity...
    >   Images prepared in:  0:00:14.000365
    > 
    > Command error:
    >   INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
    >   INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
    >   ls: cannot access '/dev/nvidia*': No such file or directory
    >   Using TensorFlow backend.
    >   2025-09-05 11:10:35.713262: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    >   2025-09-05 11:10:35.749317: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
    >   2025-09-05 11:10:35.750561: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:157] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
    > 
    > Work dir:
    >   /home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1/work/66/7259629abe57c02931f64d6ed95d74
    > 
    > Container:
    >   /home/runner/_work/lsmquant/lsmquant/.singularity/quay.io-carolinschwitalla-numorph-3dunet-latest.img
    > 
    > Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
    > 
    >  -- Check '/home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
    > 
    >  -- Check '/home/runner/_work/lsmquant/lsmquant/.nf-test/tests/597de9c99610fb6f57a51dd4f5e75fd1/meta/nextflow.log' file for details
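
In the command shown in the log, both GPU probes are non-fatal (`ls ... || echo ...` and `nvidia-smi || echo ...`), so the task continues without a GPU and only stops at the 1h time limit. A minimal sketch of a fail-fast variant is given below. `require_gpu` is a hypothetical function, not part of the module; it uses `nvidia-smi` as the default probe and accepts another command so it can be exercised on machines without a GPU.

```bash
#!/usr/bin/env bash
# Hypothetical fail-fast GPU check (not part of the numorph3dunet module).
# Usage: require_gpu [probe-command ...]
# Runs the probe (default: nvidia-smi) and returns non-zero if it fails.
require_gpu() {
    # Fall back to nvidia-smi when no probe command is given.
    if [ "$#" -eq 0 ]; then
        set -- nvidia-smi
    fi

    if "$@" >/dev/null 2>&1; then
        echo "GPU check passed"
        return 0
    fi

    # Report the failing probe on stderr so it ends up in .command.err.
    echo "GPU check failed: '$*' returned non-zero" >&2
    return 1
}
```

Calling `require_gpu || exit 1` before the prediction step would, under these assumptions, make the task fail within seconds with a clear message instead of running until the time limit.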

Relevant files

System information

  • NXF_VER: 24.10.5
  • Apptainer: 1.3.6/x64
  • NFT_VER: 0.9.2
  • RUNS_ON_AMI_NAME: runs-on-v2.2-ubuntu24-gpu-x64-20250829135318
  • RUNS_ON_AWS_REGION: eu-west-1
  • RUNS_ON_AWS_AZ: eu-west-1a

logs:
https://github.com/nf-core/lsmquant/actions/runs/17490890020/job/49680288340?pr=26

PR:
#26

Labels: bug
Status: Done