Hello,
Apologies for opening a ticket that pertains more to the AlphaFold NIM; I don't see any way to send feedback to its developers directly, but I am guessing you may have some insight.
We are trying to set up your pipeline in a multi-user HPC cluster environment, using Apptainer. It works for the most part, but I notice that the AlphaFold NIMs take about half an hour to start, even after all the databases have been downloaded. The debug log level does not help much; the delay falls between the following log lines (note the timestamps):
2025-04-30T14:24:34.034042Z INFO nim_hub_ngc::api::tokio::public: Skipping download, using cached copy of file: params/params_model_5.npz at path: "/opt/nim/.cache/ngc/hub/models--nim--deepmind--alphafold2-data/snapshots/1.2.0/params/params_model_5.npz"
"timestamp": "2025-04-30 08:24:34,034", "level": "INFO", "message": "Using the workspace specified during init: /opt/nim/workspace"
"timestamp": "2025-04-30 09:05:41,933", "level": "INFO", "message": "Starting NIM inference server"
******** main: environment variable and model artifact du report, begin
message=before create interface
NIM_CACHE_PATH=/opt/nim/.cache
Any idea what may be going on in there? The start_server process generates some CPU load, but only in the tens of percent. This may be I/O from loading the databases, which are on an InfiniBand-attached VAST file system, but the node shows only a few Mbit/s of network traffic, far below the available bandwidth.
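Since the log mentions a "model artifact du report" right after the gap, one thing I can check is how long a du-style walk of the cache takes on this file system. Below is a minimal sketch of that check. It is only a guess at what start_server might be doing during the delay, not something the logs confirm; `timed_du` is a hypothetical helper name, and the fallback path is simply the `NIM_CACHE_PATH` value from the log above.

```python
import os
import time


def timed_du(root):
    """Walk `root`, sum file sizes (like `du`), and time the traversal.

    Returns (file_count, total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat avoids following symlinks, similar to plain `du`
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
                file_count += 1
            except OSError:
                # File vanished or is unreadable; skip it
                pass
    elapsed = time.perf_counter() - start
    return file_count, total_bytes, elapsed


if __name__ == "__main__":
    # Assumption: the cache lives where NIM_CACHE_PATH points (value taken from the log)
    cache = os.environ.get("NIM_CACHE_PATH", "/opt/nim/.cache")
    files, size, secs = timed_du(cache)
    print(f"{files} files, {size / 1e9:.1f} GB, walked in {secs:.1f} s")
```

If this walk itself takes tens of minutes, that would point at metadata operations over many small files rather than bulk reads, which would be consistent with the low network throughput I am seeing.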
Again, this is after all the NIM setup should be cached. RFDiffusion and ProteinMPNN NIM servers start up right away in the same setting.
Any thoughts/ideas are welcome.
Thanks.