I believe this behaves like vLLM, which reserves physical resources at startup. According to the paper, however, no physical resources should be allocated at startup; the CUDA interface should be invoked to allocate them only when an inference request is actually being processed.
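A minimal sketch of the lazy pattern the paper describes, in Python. All names here are hypothetical illustrations, not the paper's or vLLM's actual API: `backend_alloc` stands in for the real CUDA allocation call (e.g. the driver API's `cuMemCreate`/`cuMemMap`), and the point is simply that it is never invoked from the constructor, only from request handling.

```python
from typing import Callable, Optional


class LazyKVCache:
    """Hypothetical KV-cache pool that defers physical allocation until
    the first inference request, instead of reserving memory at startup.
    """

    def __init__(self, capacity_bytes: int,
                 backend_alloc: Callable[[int], bytearray]) -> None:
        # Startup: record the planned size only; no allocation happens here.
        self.capacity_bytes = capacity_bytes
        self._backend_alloc = backend_alloc  # stand-in for the CUDA interface
        self._buffer: Optional[bytearray] = None

    @property
    def allocated(self) -> bool:
        return self._buffer is not None

    def handle_request(self, prompt: str) -> str:
        # Allocate lazily: the backend is called only when a request
        # actually needs the cache, per the paper's requirement.
        if self._buffer is None:
            self._buffer = self._backend_alloc(self.capacity_bytes)
        return f"processed {len(prompt)} chars with {len(self._buffer)}-byte cache"


# Usage: verify nothing is allocated at construction ("startup") time.
calls = []


def fake_cuda_alloc(n: int) -> bytearray:
    calls.append(n)  # record each backend allocation
    return bytearray(n)


cache = LazyKVCache(1024, fake_cuda_alloc)
assert not cache.allocated and calls == []   # startup: no allocation yet
cache.handle_request("hello")
assert cache.allocated and calls == [1024]   # first request triggers it
cache.handle_request("world")
assert calls == [1024]                       # subsequent requests reuse it
```

In a real system the same structure would hold, with the callback replaced by the CUDA driver calls and the buffer by mapped device memory.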
