| Author | Joongi Kim (joongi@lablup.com) |
|---|---|
| Status | Draft |
| Created | 2025-07-09 |
| Created-Version | |
| Target-Version | |
| Implemented-Version | |

In Backend.AI, model deployments load the model parameters (checkpoints) and launch configurations from a remote filesystem mounted as "vfolders". This is convenient for organizing model vfolders and centralizing their RBAC, but it may become a source of instability due to high-volume I/O when multiple containers load models at the same time. We have also observed many mountpoint losses in production caused by various network and filesystem issues.
Model deployments must be very stable because their availability directly affects the customer's sales. For this reason, @inureyes has proposed having a local cache for the model vfolders used in model deployments.
- Explicit file copy: let the Backend.AI agent copy the model vfolder contents upon container startup.
  - It incurs an explicit delay on model service launches. We may need to add some kind of progress tracking before marking the model deployment as "healthy".
- Caching FUSE: insert a FUSE filesystem like catfs or mcachefs to transparently copy the target files into a local directory when they are read for the first time.
  - Since there are multiple such "caching" FUSE implementations, we need to investigate which one would work better for us in terms of performance and stability.
  - It would be simpler to implement, but requires updates to the APC team's deployment processes and installation tools.
- Currently we have a recommended setting for fsc following the cachefilesd configuration of NVIDIA's DGX OS.
  Note that cachefilesd is a user-level service that interacts with the kernel-level fsc layer.
  However, it only works for remote filesystems like NFS mounts and cannot be used to cache FUSE-based filesystems like s3fs.
In combination with the object storage support (BA-255 or #665), I'd prefer the FUSE-based approach using catfs.
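For comparison, the explicit file copy option with progress tracking could be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function name `copy_tree_with_progress` and the `on_progress` callback are hypothetical, and the sketch assumes that both the mounted vfolder and the local cache directory are reachable as ordinary paths.

```python
import shutil
from pathlib import Path
from typing import Callable


def copy_tree_with_progress(
    src: Path,
    dst: Path,
    on_progress: Callable[[int, int], None],
) -> int:
    """Copy every regular file under src into dst.

    Calls on_progress(copied_bytes, total_bytes) after each file and
    returns the number of bytes copied. Empty directories are skipped.
    """
    files = [p for p in src.rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    copied = 0
    for path in files:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copies contents and metadata
        copied += path.stat().st_size
        on_progress(copied, total)
    return copied
```

Under this sketch, a caller would mark the deployment as "healthy" only after the callback reports that the copied byte count equals the total.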
- Do we need to pre-load all files by trying to `open()` them when starting the container, or just do on-demand loading?
- Does `catfs` ensure cleaning up of cached data when unmounting?

`catfs` has a configuration option to keep a percentage of the cache volume free. This means that `catfs` may evict some files when it reaches the space limit. Because of this possible eviction, it DOES NOT guarantee that all model vfolder mounts are fully cached at all times.
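To make the pre-loading question and the eviction caveat concrete, here is a hedged sketch of what a pre-warm step might look like. The function names `fits_in_cache` and `prewarm` and the `reserved_free_percent` parameter are hypothetical and do not correspond to any `catfs` option name. Whether `open()` alone populates the cache or the file must be read in full depends on the caching filesystem, so this sketch conservatively reads every file to the end.

```python
import shutil
from pathlib import Path


def fits_in_cache(
    model_dir: Path,
    cache_dir: Path,
    reserved_free_percent: float,
) -> bool:
    """Check whether all files under model_dir fit into the volume that
    holds cache_dir while keeping reserved_free_percent of it free."""
    needed = sum(p.stat().st_size for p in model_dir.rglob("*") if p.is_file())
    usage = shutil.disk_usage(cache_dir)
    reserved = usage.total * reserved_free_percent / 100.0
    return needed <= usage.free - reserved


def prewarm(mount_dir: Path, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Read every regular file under mount_dir once, in chunks, so that a
    read-through cache sees each file; return the number of bytes read."""
    total = 0
    for path in sorted(mount_dir.rglob("*")):
        if not path.is_file():
            continue
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                total += len(chunk)
    return total
```

If `fits_in_cache` returns `False`, pre-warming cannot keep the whole model cached, because the cache would have to evict some of the files it has just loaded.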
- By default, mount caching should only be enabled for the main model vfolder of a model deployment container.
- We could add a per-user vfolder-access record attribute to control whether to apply local caching (with warnings and descriptions about non-POSIX I/O behaviors).
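The two rules above could combine into a single decision, sketched below. All names (`MountSpec`, `cache_opt_in`, `should_cache`) are hypothetical and are not existing Backend.AI fields. The sketch also assumes that an explicit per-user setting overrides the default, which the proposal does not yet specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MountSpec:
    vfolder_name: str
    is_main_model_vfolder: bool
    # Hypothetical per-user vfolder-access attribute; None means "not set".
    cache_opt_in: Optional[bool] = None


def should_cache(mount: MountSpec) -> bool:
    """Decide whether a mount gets local caching."""
    if mount.cache_opt_in is not None:
        # Assumption: an explicit per-user setting wins over the default.
        return mount.cache_opt_in
    # Default: only the main model vfolder is cached.
    return mount.is_main_model_vfolder
```

With these assumptions, an extra vfolder mounted next to the model is cached only if the user explicitly opts in, and the main model vfolder is cached unless the user opts out.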