Extreme memory use during training & included training script failed #170

@fgdfgfthgr-fox

Here is a minimal training script in pure PyTorch and its corresponding datasplit.csv:
minimal_script_pure_pytorch.py
datasplit.csv

It trains on the ecs class only, with an array size of 128x128x128, a batch size of 8, and 8 workers. During training, CPU memory use increases rapidly with each training step and easily climbs past 50GB, which seems far too much (the whole dataset is only 37GB in size). I also recall that previous versions did not use this much memory.
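For scale, here is a rough back-of-the-envelope estimate of how much memory the batches themselves should need. It is only a sketch under assumptions that are not stated above: float32 voxels, one input array plus one target array per sample, and PyTorch's default `prefetch_factor` of 2 batches per worker. The helper names `batch_bytes` and `prefetch_bytes` are made up for illustration.

```python
# Rough estimate of raw batch memory for the settings described above.
# Assumptions (not confirmed by the script): float32 data, one input and
# one target array per sample, DataLoader prefetch_factor left at 2.

MIB = 1024 ** 2
GIB = 1024 ** 3


def batch_bytes(shape, batch_size, bytes_per_voxel=4, arrays_per_sample=2):
    """Bytes needed to hold one batch of input + target arrays."""
    voxels = 1
    for dim in shape:
        voxels *= dim
    return voxels * bytes_per_voxel * arrays_per_sample * batch_size


def prefetch_bytes(shape, batch_size, num_workers, prefetch_factor=2):
    """Bytes held by all batches the workers may have prefetched at once."""
    return batch_bytes(shape, batch_size) * num_workers * prefetch_factor


per_batch = batch_bytes((128, 128, 128), batch_size=8)
in_flight = prefetch_bytes((128, 128, 128), batch_size=8, num_workers=8)

print(f"one batch:          {per_batch / MIB:.0f} MiB")
print(f"prefetched batches: {in_flight / GIB:.0f} GiB")
```

Under these assumptions the prefetched batches account for about 2 GiB, so the batch data alone would not explain 50GB of CPU memory.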

Similarly, when I try to run the included train_3D.py example script, I get this:

...
Processing D:\DanielHuang\cellmap\cellmap-segmentation-challenge\data\jrc_mus-kidney\jrc_mus-kidney.zarr\recon-1\labels\groundtruth\crop221
Number of datasets: 245
Number of training datasets: 209 (85.31%)
Number of validation datasets: 36 (14.69%)
CSV written to datasplit.csv
100%|██████████| 289/289 [00:00<00:00, 1545.89it/s, No classes found]
Training datasets: 100%|██████████| 209/209 [00:04<00:00, 46.70it/s]
Validation datasets: 100%|██████████| 36/36 [00:00<00:00, 43.89it/s]
Training datasets: 100%|██████████| 209/209 [00:46<00:00,  4.46it/s]
Validation datasets: 100%|██████████| 36/36 [00:03<00:00, 11.64it/s]
100%|██████████| 208/208 [01:39<00:00,  2.09it/s]
Training 3d_vnet for 1000 epochs, starting at epoch 1, iteration 0...
Training:   0%|          | 0/1000 [00:00<?, ?it/s]
Process finished with exit code -1073740791 (0xC0000409)

From a quick search, 0xC0000409 seems to mean STATUS_STACK_BUFFER_OVERRUN, which I suspect is related to memory use as well.
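For reference, the negative exit code and the hex code in the log are the same 32-bit value; the signed number is just reinterpreted as unsigned. A small sketch of the conversion (the helper name `to_ntstatus` is made up for illustration):

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as an unsigned hex code."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073740791))  # prints 0xC0000409
```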
If I reduce the shape of input_array_info and target_array_info to 64x64x64 (from 128x128x128) and reduce the batch size to 2 (from 8), the script no longer fails immediately with 0xC0000409, but I still observe an extreme memory spike.

I am using the latest commit of the repository (b618dfd).
My environment:

pip list
Package                        Version        Editable project location
------------------------------ -------------- -----------------------------------------------------
absl-py                        2.2.2
aiobotocore                    2.17.0
aiofiles                       24.1.0
aiohappyeyeballs               2.6.1
aiohttp                        3.11.18
aioitertools                   0.12.0
aiosignal                      1.3.2
annotated-types                0.7.0
anyio                          4.9.0
asciitree                      0.3.3
asttokens                      3.0.0
atomicwrites                   1.4.1
attrs                          25.3.0
blinker                        1.9.0
boto3                          1.35.81
botocore                       1.35.93
cachetools                     5.5.2
cellmap-data                   2025.7.24.1615
cellmap-flow                   0.1.3
cellmap-segmentation-challenge 0.0.1          D:\DanielHuang\cellmap\cellmap-segmentation-challenge
cellpose                       4.0.2
certifi                        2025.4.26
charset-normalizer             3.4.2
click                          8.2.0
cloudpickle                    3.1.1
cmake                          3.31.6
colorama                       0.4.6
contourpy                      1.3.2
cycler                         0.12.1
daisy                          1.2.2
dask                           2025.4.1
decorator                      5.2.1
Deprecated                     1.2.18
dill                           0.4.0
eval-type-backport             0.1.3
executing                      2.2.0
fastapi                        0.115.13
fasteners                      0.19
fastremap                      1.16.1
ffmpy                          0.6.0
filelock                       3.13.1
fill_voids                     2.0.8
flasgger                       0.9.7.1
Flask                          3.1.0
flask-cors                     5.0.1
flexcache                      0.3
flexparser                     0.4
fonttools                      4.58.0
frozenlist                     1.6.0
fsspec                         2024.6.1
funlib.geometry                0.3.0
funlib.math                    0.1
google-apitools                0.5.32
google-auth                    2.40.1
gradio                         5.34.2
gradio_client                  1.10.3
groovy                         0.1.2
grpcio                         1.71.0
gunicorn                       23.0.0
h11                            0.16.0
h5py                           3.13.0
httpcore                       1.0.9
httplib2                       0.22.0
httpx                          0.28.1
huggingface-hub                0.33.1
idna                           3.10
imagecodecs                    2025.3.30
imageio                        2.37.0
importlib_metadata             8.7.0
ipython                        9.3.0
ipython_pygments_lexers        1.1.1
itsdangerous                   2.2.0
jedi                           0.19.2
Jinja2                         3.1.4
jmespath                       1.0.1
joblib                         1.5.0
jsonschema                     4.23.0
jsonschema-specifications      2025.4.1
kiwisolver                     1.4.8
lazy_loader                    0.4
lightning                      2.5.1.post0
lightning-utilities            0.14.3
lit                            18.1.8
llvmlite                       0.44.0
locket                         1.0.0
Markdown                       3.8
markdown-it-py                 3.0.0
MarkupSafe                     2.1.5
marshmallow                    4.0.0
matplotlib                     3.10.3
matplotlib-inline              0.1.7
mdurl                          0.1.2
mistune                        3.1.3
ml_collections                 1.1.0
ml_dtypes                      0.5.1
mpmath                         1.3.0
multidict                      6.4.3
natsort                        8.4.0
networkx                       3.3
neuroglancer                   2.40.1
ninja                          1.11.1.4
numba                          0.61.2
numcodecs                      0.15.1
numpy                          2.1.2
oauth2client                   4.1.3
opencv-python-headless         4.11.0.86
orjson                         3.10.18
packaging                      24.2
pandas                         2.2.3
parso                          0.8.4
partd                          1.4.2
pillow                         11.0.0
Pint                           0.24.4
pip                            25.1.1
platformdirs                   4.3.8
prompt_toolkit                 3.0.51
propcache                      0.3.1
protobuf                       6.30.2
pure_eval                      0.2.3
pyasn1                         0.6.1
pyasn1_modules                 0.4.2
pybind11                       2.13.6
pydantic                       2.11.4
pydantic_core                  2.33.2
pydantic-ome-ngff              0.6.0
pydantic-zarr                  0.7.0
pydub                          0.25.1
Pygments                       2.19.2
pykdtree                       1.4.1
pyparsing                      3.2.3
pyreadline3                    3.5.4
python-dateutil                2.9.0.post0
python-dotenv                  1.1.0
python-multipart               0.0.20
pytorch-lightning              2.5.1.post0
pytz                           2025.2
PyYAML                         6.0.2
referencing                    0.36.2
requests                       2.32.3
rich                           14.0.0
roifile                        2025.5.10
rpds-py                        0.24.0
rsa                            4.9.1
ruff                           0.12.0
s3fs                           2024.6.1
s3transfer                     0.10.4
safehttpx                      0.1.6
safetensors                    0.5.3
scikit-dimension               0.3.4
scikit-image                   0.25.2
scikit-learn                   1.6.1
scipy                          1.15.3
segment-anything               1.0
semantic-version               2.10.0
setuptools                     65.5.0
shellingham                    1.5.4
six                            1.17.0
sniffio                        1.3.1
stack-data                     0.6.3
starlette                      0.46.2
structlog                      25.3.0
sympy                          1.13.3
tabulate                       0.9.0
tensorboard                    2.19.0
tensorboard-data-server        0.7.2
tensorboardX                   2.6.2.2
tensorstore                    0.1.74
threadpoolctl                  3.6.0
tifffile                       2025.5.10
tomlkit                        0.13.3
toolz                          1.0.0
torch                          2.7.1+cu128
torchaudio                     2.7.1+cu128
torchmetrics                   1.7.1
torchvision                    0.22.1+cu128
tornado                        6.4.2
tqdm                           4.67.1
traitlets                      5.14.3
triton-windows                 3.3.1.post19
typer                          0.16.0
typing_extensions              4.12.2
typing-inspection              0.4.0
tzdata                         2025.2
universal_pathlib              0.2.6
urllib3                        2.4.0
uvicorn                        0.34.3
wcwidth                        0.2.13
websockets                     15.0.1
Werkzeug                       3.1.3
wheel                          0.45.1
wrapt                          1.17.2
xarray                         2025.4.0
xarray-ome-ngff                3.1.1
xarray-tensorstore             0.1.5
yarl                           1.20.0
zarr                           2.18.4
zipp                           3.21.0
