Low-Latency GPU-Aware Inference Data Plane
This project implements a high-performance TCP-based inference data plane in modern C++, designed to serve CUDA-accelerated LLM inference using llama.cpp as the backend.
The goal is to explore low-level networking, concurrency, and GPU integration challenges that arise when building inference infrastructure under tight hardware constraints, such as limited VRAM and legacy GPUs.
Rather than relying on high-level frameworks, this project focuses on explicit control over networking, memory usage, and request scheduling, mirroring real-world inference serving environments.
Core components:
- Process lifecycle management
- Server configuration
- Backend selection (llama.cpp CUDA backend)
- Custom TCP binary protocol
- epoll-based non-blocking I/O
- Concurrent request handling
- Explicit backpressure control
Inference backend:
- CUDA-backed llama.cpp
- GGUF quantized models
- Optimized for low-VRAM environments (≤1 GB)
Wire protocol:
- Minimal binary framing for inference requests/responses (one possible header layout is sketched below)
- Designed for predictable latency and low overhead
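The exact frame layout is an implementation detail of the server; as a rough illustration, a length-prefixed header along the following lines would fit the goals above. Field names and sizes here are assumptions, not the project's actual wire format:

```cpp
#include <cstdint>

// Hypothetical fixed-size frame header, sent before a variable-length payload.
// A real protocol would also pin the byte order of multi-byte fields.
#pragma pack(push, 1)
struct FrameHeader {
    uint32_t magic;        // constant tag to detect desynchronized streams
    uint16_t version;      // protocol version for forward compatibility
    uint16_t type;         // e.g. 0 = inference request, 1 = response chunk
    uint64_t request_id;   // correlates responses with pipelined requests
    uint32_t payload_len;  // bytes of payload following this header
};
#pragma pack(pop)

static_assert(sizeof(FrameHeader) == 20, "header must be fixed-size");
```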
I/O model:
- Non-blocking socket I/O (see the epoll loop sketch after this list)
- Scales concurrent clients without thread-per-connection overhead
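The repository's event loop is more involved, but the core epoll pattern looks roughly like this; error handling is trimmed and the per-connection read handler is left as a hypothetical callback:

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>

// Minimal level-triggered epoll loop: one thread multiplexes the listening
// socket and all client connections without blocking on any single fd.
int run_event_loop(int listen_fd) {
    int ep = epoll_create1(0);
    if (ep < 0) return -1;

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                // Accept and register a new client in non-blocking mode.
                int client = accept(listen_fd, nullptr, nullptr);
                if (client < 0) continue;
                fcntl(client, F_SETFL, fcntl(client, F_GETFL) | O_NONBLOCK);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {
                // Hypothetical handler: drains the socket into a frame
                // parser and closes the connection on EOF or error.
                // handle_readable(fd);
            }
        }
    }
}
```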
GPU integration:
- Direct integration with CUDA-enabled llama.cpp (model loading sketched below)
- Optimized for constrained GPUs (Compute Capability 5.0)
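Partial GPU offload is the key lever on a ~1 GB card: only some transformer layers fit in VRAM, and the rest run on the CPU. A minimal sketch of loading a GGUF model with offload, assuming the llama.cpp C API (these symbols have been renamed across llama.cpp versions, so treat this as illustrative rather than the project's exact code):

```cpp
#include "llama.h"

// Load a GGUF model with a bounded number of layers offloaded to the GPU.
// n_gpu_layers is tuned so the offloaded weights stay within the VRAM budget.
llama_model* load_model(const char* path, int gpu_layers) {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = gpu_layers;  // e.g. a handful of layers on a 1 GB GPU

    return llama_load_model_from_file(path, params);
}
```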
Benchmarking:
- Latency tracking (p50 / p95 / p99, computed as sketched below)
- Throughput measurement under concurrent load
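Tail percentiles are computed from recorded per-request latencies. The standard approach is a sort-and-index over the sample window, roughly:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over a sample of per-request latencies.
// Sorting a copy is O(n log n), which is fine for offline benchmark reporting.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    size_t rank = static_cast<size_t>(p / 100.0 * (samples.size() - 1));
    return samples[rank];
}

// Usage: percentile(latencies, 50), percentile(latencies, 95),
//        percentile(latencies, 99)
```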
TCP was intentionally chosen as the baseline transport:
- Universally available
- Easy to debug and profile
- Provides a reference point for evaluating RDMA / UCX trade-offs
The design cleanly separates transport logic from inference execution, enabling future extension to RDMA or GPU-direct transports without rewriting the inference core.
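As an illustration of that boundary, the transport can be hidden behind a small interface so the inference core never touches sockets. The names here are illustrative, not the repository's actual classes:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical seam between the data plane and the inference core: the
// transport delivers opaque request payloads and accepts response bytes,
// so an RDMA or GPU-direct implementation can be swapped in later.
struct Transport {
    using RequestHandler =
        std::function<void(uint64_t request_id, std::string payload)>;

    virtual ~Transport() = default;
    virtual void set_request_handler(RequestHandler handler) = 0;
    virtual void send_response(uint64_t request_id,
                               const void* data, size_t len) = 0;
    virtual void run() = 0;  // blocks, driving the event loop
};

// The inference core depends only on this interface; an epoll-based TCP
// transport implements it today, an RDMA transport could tomorrow.
```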
Build:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Run:

```bash
./build/bin/llama_tcp_server \
  --model /path/to/model.gguf \
  --listen 0.0.0.0:8080
```

Design principles:
- Explicit control over data movement
- Predictable latency under load
- Minimal abstraction overhead
- Clear separation between networking and inference logic
Although runnable on a single node, this project implements distributed-systems primitives:
- Network-based request routing
- Stateless serving
- Backpressure and flow control (one mechanism is sketched after this list)
- Separation of control plane and data plane
It is designed to serve as a foundation layer for multi-node inference systems using MPI, RDMA, or GPU-aware transports.
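One concrete backpressure mechanism (a sketch under assumed watermarks, not necessarily what the repository does): stop polling a client's socket for reads while its response queue is full, so the kernel's TCP receive window throttles the sender.

```cpp
#include <sys/epoll.h>
#include <deque>
#include <string>

// Hypothetical per-connection state: pause reads when too many responses
// are queued, resume once the queue drains. With reads paused, the socket
// buffer fills up and TCP flow control pushes back on the client.
struct Connection {
    int fd = -1;
    std::deque<std::string> pending_responses;
    bool reads_paused = false;
};

constexpr size_t kHighWatermark = 64;  // assumed queue limits
constexpr size_t kLowWatermark  = 16;

void update_backpressure(int epfd, Connection& c) {
    epoll_event ev{};
    ev.data.fd = c.fd;
    if (!c.reads_paused && c.pending_responses.size() >= kHighWatermark) {
        ev.events = EPOLLOUT;            // stop watching for readability
        epoll_ctl(epfd, EPOLL_CTL_MOD, c.fd, &ev);
        c.reads_paused = true;
    } else if (c.reads_paused && c.pending_responses.size() <= kLowWatermark) {
        ev.events = EPOLLIN | EPOLLOUT;  // resume reads
        epoll_ctl(epfd, EPOLL_CTL_MOD, c.fd, &ev);
        c.reads_paused = false;
    }
}
```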
Tech stack:
- C++17 / C++20
- CUDA
- epoll / POSIX sockets
- llama.cpp (GGUF models)
- CMake
Author: Mohammad Waqas (GitHub: https://github.com/waqasm86)