Thanks for providing STELLAR.
I am currently trying to run STELLAR on the Hubmap demo dataset on our cluster. Although the demo is supposed to finish quickly, my run has been going for more than 24 hours. I can see that the GPU is in use, but only about 2.3 GiB of its memory (2371 MiB of 80 GB) is allocated and the reported utilization is 0%. I am not sure what is wrong. The loss is being printed.
My environment:
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
anndata 0.7.6 pypi_0 pypi
blas 1.0 mkl
blosc2 2.0.0 pypi_0 pypi
bottleneck 1.3.7 py38ha9d4c09_0
brotli-python 1.0.9 py38h6a678d5_8
ca-certificates 2024.9.24 h06a4308_0
certifi 2024.8.30 py38h06a4308_0
charset-normalizer 3.3.2 pyhd3eb1b0_0
contourpy 1.1.1 pypi_0 pypi
cudatoolkit 11.3.1 h2bc3f7f_2
cycler 0.12.1 pypi_0 pypi
cython 3.0.11 pypi_0 pypi
fonttools 4.54.1 pypi_0 pypi
h5py 3.11.0 pypi_0 pypi
idna 3.7 py38h06a4308_0
igraph 0.9.10 pypi_0 pypi
imageio 2.35.1 pypi_0 pypi
importlib-metadata 8.5.0 pypi_0 pypi
importlib-resources 6.4.5 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
jinja2 3.1.4 py38h06a4308_0
joblib 1.4.2 py38h06a4308_0
kiwisolver 1.4.7 pypi_0 pypi
ld_impl_linux-64 2.40 h12ee557_0
legacy-api-wrap 1.4 pypi_0 pypi
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuv 1.48.0 h5eee18b_0
llvmlite 0.41.1 pypi_0 pypi
louvain 0.7.1 pypi_0 pypi
markupsafe 2.1.3 py38h5eee18b_0
matplotlib 3.6.3 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py38h5eee18b_1
mkl_fft 1.3.8 py38h5eee18b_0
mkl_random 1.2.4 py38hdb19cb5_0
msgpack 1.1.0 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 py38h06a4308_0
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
numba 0.58.1 pypi_0 pypi
numexpr 2.8.4 py38hc78ab66_1
numpy 1.22.4 pypi_0 pypi
openssl 3.0.15 h5eee18b_0
packaging 24.1 py38h06a4308_0
pandas 1.3.0 pypi_0 pypi
patsy 0.5.6 pypi_0 pypi
pillow 10.4.0 pypi_0 pypi
pip 24.2 py38h06a4308_0
platformdirs 3.10.0 py38h06a4308_0
pooch 1.7.0 py38h06a4308_0
py-cpuinfo 9.0.0 pypi_0 pypi
pyg 2.0.4 py38_torch_1.10.0_cu113 pyg
pynndescent 0.5.13 pypi_0 pypi
pyparsing 3.1.2 py38h06a4308_0
pysocks 1.7.1 py38h06a4308_0
python 3.8.20 he870216_0
python-dateutil 2.9.0post0 py38h06a4308_2
python-louvain 0.1 pypi_0 pypi
python-tzdata 2023.3 pyhd3eb1b0_0
pytorch 1.10.2 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-cluster 1.6.0 py38_torch_1.10.0_cu113 pyg
pytorch-mutex 1.0 cuda pytorch
pytorch-scatter 2.0.9 py38_torch_1.10.0_cu113 pyg
pytorch-sparse 0.6.13 py38_torch_1.10.0_cu113 pyg
pytorch-spline-conv 1.2.1 py38_torch_1.10.0_cu113 pyg
pytz 2024.1 py38h06a4308_0
pywavelets 1.4.1 pypi_0 pypi
pyyaml 6.0.1 py38h5eee18b_0
readline 8.2 h5eee18b_0
requests 2.32.3 py38h06a4308_0
scanpy 1.8.0 pypi_0 pypi
scikit-image 0.18.0 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.0 pypi_0 pypi
seaborn 0.13.2 pypi_0 pypi
setuptools 75.1.0 py38h06a4308_0
sinfo 0.3.4 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
sqlite 3.45.3 h5eee18b_0
statsmodels 0.14.1 pypi_0 pypi
stdlib-list 0.10.0 pypi_0 pypi
tables 3.8.0 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
texttable 1.7.0 pypi_0 pypi
threadpoolctl 3.5.0 py38h2f386ee_0
tifffile 2023.7.10 pypi_0 pypi
tk 8.6.14 h39e8969_0
tqdm 4.66.5 py38h2f386ee_0
typing_extensions 4.11.0 py38h06a4308_0
umap-learn 0.5.6 pypi_0 pypi
urllib3 2.2.3 py38h06a4308_0
wheel 0.44.0 py38h06a4308_0
xlrd 1.2.0 pypi_0 pypi
xz 5.4.6 h5eee18b_1
yacs 0.1.6 pyhd3eb1b0_1
yaml 0.2.5 h7b6447c_0
zipp 3.20.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
My Slurm file:
#!/bin/sh
#SBATCH --job-name="STELLAR_demo_2_241002"
#SBATCH --partition=gpu-single
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=350gb
module load devel/cuda
module load devel/miniconda/3
source $MINICONDA_HOME/etc/profile.d/conda.sh
conda activate stellar
cd /gpfs/bwfor/work/ws/hd_bm327-phenotyping_benchmark/stellar/
conda run -n stellar python STELLAR_run.py --dataset Hubmap --num-heads 23
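One thing that may be worth checking from inside the job is how many CPU cores the Python process can actually use, because `--ntasks=1` combined with `--ntasks-per-node=16` may not reserve 16 cores for a single task. The following is a minimal sketch using only the standard library. The helper name `report_cpu_allocation` is made up for illustration, and the environment variables are the ones Slurm normally exports, which may be absent depending on the cluster configuration.

```python
import os


def report_cpu_allocation(env=None):
    """Return a dict describing the CPU resources visible to this process."""
    env = os.environ if env is None else env
    try:
        # Cores this process is allowed to run on (available on Linux).
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms that lack sched_getaffinity.
        usable = os.cpu_count() or 1
    return {
        "usable_cores": usable,
        # Values set by Slurm inside a job; None if the variable is missing.
        "SLURM_NTASKS": env.get("SLURM_NTASKS"),
        "SLURM_CPUS_PER_TASK": env.get("SLURM_CPUS_PER_TASK"),
        "SLURM_CPUS_ON_NODE": env.get("SLURM_CPUS_ON_NODE"),
    }


if __name__ == "__main__":
    for key, value in report_cpu_allocation().items():
        print(f"{key}: {value}")
```

If `usable_cores` comes back as 1, CPU-side work would be limited to a single core, which could be one explanation for a slow run with an idle GPU.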
This is the GPU usage:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0             71W /  400W |    2371MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3981110      C   python                                       2362MiB |
+-----------------------------------------------------------------------------------------+
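To read the numbers off the summary row without guessing at units, the memory and utilization fields can be pulled out with a small parser. This is only an illustrative sketch: the function name `parse_gpu_row` is hypothetical, and the regular expressions assume the row layout shown in the output above.

```python
import re


def parse_gpu_row(row):
    """Extract memory use (MiB) and utilization (%) from an nvidia-smi summary row."""
    mem = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if mem is None:
        raise ValueError("no 'used MiB / total MiB' field found")
    # Search for the utilization only after the memory field, so that a
    # percentage in the fan column cannot be mistaken for GPU-Util.
    util = re.search(r"(\d+)%", row[mem.end():])
    if util is None:
        raise ValueError("no utilization percentage found")
    used, total = int(mem.group(1)), int(mem.group(2))
    return {
        "used_mib": used,
        "total_mib": total,
        "used_fraction": used / total,
        "util_percent": int(util.group(1)),
    }


row = "| N/A 31C P0 71W / 400W | 2371MiB / 81920MiB | 0% Default |"
stats = parse_gpu_row(row)
print(f"{stats['used_mib']} MiB of {stats['total_mib']} MiB "
      f"({stats['used_fraction']:.1%}), utilization {stats['util_percent']}%")
```

For the row above this gives 2371 MiB used, roughly 3% of the card's memory, at 0% utilization. That suggests the process holds a CUDA context on the GPU but was not computing on it at the moment of the snapshot.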
I have not changed any of the scripts. Does anyone have a suggestion?