LUPINE is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
LUPINE lets you spin up a container with a virtual GPU, like connecting a Mac to a Linux GPU server.
% uname -mors
Darwin 25.5.0 arm64
% docker run --rm -it --network=host \
ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04-slim \
bash -c 'apt-get update -qq && apt-get install -qq -y curl ca-certificates && curl -LsSf https://astral.sh/uv/install.sh | sh && ~/.local/bin/uv run https://raw.githubusercontent.com/lupinemachines/lupine/main/python/examples/tensor.py'
...
cuda available: True
device: cuda:0
count: 1
gpu: NVIDIA GeForce RTX 4090
result: [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]Use the published GHCR images. The examples below pin CUDA 13.1.0 on
Ubuntu 24.04; other published tags use the same
cuda-<cuda-version>-ubuntu<ubuntu-version> format.
Run the server on the GPU machine:
docker run --rm --gpus all -p 14833:14833 \
ghcr.io/lupinemachines/lupine-server:cuda-13.1.0-ubuntu24.04Run the client pointing at that server:
docker run --rm -it \
-e LUPINE_SERVER=<server>:14833 \
ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04 \
nvidia-smiExample output from a real run against a remote RTX 4090:
Mon May 18 15:40:46 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.288.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 30% 52C P8 22W / 450W | 8MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Inside the client container, LD_LIBRARY_PATH=/opt/lupine/lib is already set,
so CUDA driver users pick up the LUPINE libcuda.so.1 shim and NVML users such
as nvidia-smi pick up the LUPINE libnvidia-ml.so.1 shim automatically.
The client accepts a comma-separated LUPINE_SERVER list. Devices are exposed as
one local ordinal list in server order: all GPUs from the first server, then all
GPUs from the next server, and so on.
Run a server on each GPU machine:
# on gpu-host-a
docker run --rm --gpus all -p 14833:14833 \
ghcr.io/lupinemachines/lupine-server:cuda-13.1.0-ubuntu24.04
# on gpu-host-b
docker run --rm --gpus all -p 14833:14833 \
ghcr.io/lupinemachines/lupine-server:cuda-13.1.0-ubuntu24.04Point the client at both servers:
docker run --rm --network host \
-e LUPINE_SERVER=gpu-host-a:14833,gpu-host-b:14833 \
ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04 \
nvidia-smi -LExpected output lists both remote GPUs:
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-...)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-...)
CUDA driver applications use the same LUPINE_SERVER value:
docker run --rm --network host \
-e LUPINE_SERVER=gpu-host-a:14833,gpu-host-b:14833 \
ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04 \
./your_cuda_programCross-server peer access and device-to-device copies are not implemented yet. Same-server operations route by handle ownership.
For a specific CUDA version:
docker pull ghcr.io/lupinemachines/lupine-client:cuda-12.4.1-ubuntu22.04
docker pull ghcr.io/lupinemachines/lupine-server:cuda-12.4.1-ubuntu22.04Client images are also published with a -slim tag, for example
ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04-slim. The
default client tag keeps the CUDA runtime libraries for applications that link
against them; the slim tag includes only the LUPINE shims, their runtime
dependencies, and nvidia-smi.
This path derives a small PyTorch client image from the published LUPINE client
image and runs the microgpt_train test against a remote GPU. It is
intentionally explicit so it is easy to see which side is the CPU-only client
and which side owns the GPU.
Create a PyTorch client Dockerfile in the repo root:
# Dockerfile.pytorch-lupine
FROM ghcr.io/lupinemachines/lupine-client:cuda-13.1.0-ubuntu24.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
RUN pip3 install --break-system-packages \
--index-url https://download.pytorch.org/whl/cu130 \
torch
COPY test/pytorch_lupine_tests.py /opt/lupine/test/pytorch_lupine_tests.py
ENV LD_LIBRARY_PATH=/opt/lupine/lib:${LD_LIBRARY_PATH}
CMD ["python3", "/opt/lupine/test/pytorch_lupine_tests.py", "microgpt_train"]Build it:
docker build -f Dockerfile.pytorch-lupine -t lupine-pytorch:cuda-13.1 .Run the server on the GPU machine:
docker run --rm --gpus all -p 14833:14833 \
ghcr.io/lupinemachines/lupine-server:cuda-13.1.0-ubuntu24.04Run the PyTorch client from the CPU-only machine:
docker run --rm \
-e LUPINE_SERVER=<server>:14833 \
lupine-pytorch:cuda-13.1Expected success looks like:
microgpt first_loss=... last_loss=...
microgpt_train: PASS
Building the binaries requires running codegen first. Lupine codegen reads the cuda dependency header files in order to generate rpc calls.
To ensure codegen works properly, the proper cuda packages need to be installed on your OS. Take a look at our Dockerfile to see an example.
Take a look here to install CUDA Toolkit (choose your system)
Codegen requires cuBLAS, cuDNN, NVML, etc:
cudnn_graph_header = find_header_file("cudnn_graph.h")
cudnn_ops_header = find_header_file("cudnn_ops.h")
cuda_header = find_header_file("cuda.h")
cublas_header = find_header_file("cublas_api.h")
cudart_header = find_header_file("cuda_runtime_api.h")
annotations_header = find_header_file("annotations.h")cd codegen && python3 ./codegen.pyEnsure there are no errors in the output of the codegen.
cmake -S . -B build
cmake --build buildCMake builds the CUDA driver shim at build/libcuda.so.1, the NVML shim at
build/libnvidia-ml.so.1, and the server at build/lupine_driver_server.
The Lupine server must be running before initiating client commands.
./local.sh serverIf successful, the server will start:
Server listening on port 14833...For local development, preload the built libcuda.so.1 before executing CUDA
commands. The published client image sets LD_LIBRARY_PATH for you instead.
Once the server above is running:
# update to your desired IP/port
export LUPINE_SERVER=<server>:14833
LD_PRELOAD=./build/libcuda.so.1 python3 -c "import torch; print(torch.cuda.is_available())"
# or
LD_PRELOAD=./build/libcuda.so.1 nvidia-smiYou can also use the local shell script to run your commands.
./local.sh run
- What does LUPINE stand for? Nothing, it just looks cool in all caps.
- Does this support authentication? TLS? Indirectly, yes. It's a plain HTTP/2 server, so you can front it with whatever TLS/auth server you want.
- Was this repo AI-generated? A chunk of it, yes. I mean, would you want to hand write hundreds of tedious API stubs? No? Me neither.
- Doesn't this incur a lot of latency? Surprisingly, no! You will see device transfers get slower because this is basically bottlenecking a PCIe link over the network, but there is very little overhead besides that. For things like model training and inference, once the model is on the GPU very little data transfer happens to the host. As a result, it might be faster than you expect.
- Can I do remote video encoding/decoding? This is probably one use case we wouldn't recommend because that's a lot heavier on the PCIe link. It works in theory though, so if you do have access to a 1 Tbps link it might work for you.
This project is inspired by some existing proprietary solutions: