
Conversation


@kpouget kpouget commented Jan 9, 2026

This is a follow-up to #17072

The API Remoting backend/frontend pair allows escaping the VM isolation, with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation, which intercepts the GGML API calls and forwards them to the virt-gpu virtual device (see the sketch below)
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.
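
The forwarding pattern looks roughly like this (a sketch only, not the PR's actual code: the helper names and the command value are hypothetical, while the `virtgpu`/`apir_*` type names do appear in this PR):

```cpp
#include <cstdint>

// Opaque types from the PR (forward declarations only).
struct virtgpu;
struct apir_encoder;
struct apir_decoder;

// Hypothetical helpers standing in for the PR's generated RPC stubs.
apir_encoder * begin_command(virtgpu * gpu, uint32_t cmd);
apir_decoder * submit_and_wait(virtgpu * gpu, apir_encoder * enc);
int32_t        decode_int32(apir_decoder * dec);

constexpr uint32_t APIR_COMMAND_DEVICE_GET_COUNT = 1;  // value hypothetical

// Shape of a forwarded call: serialize the command, submit it to the
// virt-gpu device (virglrenderer dispatches it to ggml-remotingbackend on
// the host, which invokes the real GGML backend), then decode the reply.
int32_t frontend_device_get_count(virtgpu * gpu) {
    apir_encoder * enc = begin_command(gpu, APIR_COMMAND_DEVICE_GET_COUNT);
    apir_decoder * dec = submit_and_wait(gpu, enc);
    return decode_int32(dec);
}
```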

Here is the context behind this PR:

[image: API Remoting architecture overview]

See the Virglrenderer PR which enables the API Remoting trampoline required in Virglrenderer:
https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590

  • this work focused on macOS, where in-VM/container inference performance is tied to the remoting stack

  • the code works on Linux, but I didn't thoroughly evaluate the performance there.

  • containers/libkrun#508 (Add support for the APIR capset) --> the libkrun VMM patch that routes the APIR capset to Virglrenderer

Disclaimer: I got help from Claude Code to finalize this PR, mostly through pre-submit reviews (no automated C code generation involved). Claude Code did generate the Python code generator (see the *.gen.h and *.gen.c files) used for the backend/frontend RPC; it was generated based on the C/H files I had manually written.

@kpouget kpouget requested a review from ggerganov as a code owner January 9, 2026 13:29
@kpouget kpouget changed the title from "ggml: new backend for Virglrenderer API Remoting" to "ggml: new backend for Virglrenderer API Remoting (v2)" Jan 9, 2026
@kpouget kpouget changed the title from "ggml: new backend for Virglrenderer API Remoting (v2)" to "ggml: new backend for Virglrenderer API Remoting acceleration (v2)" Jan 9, 2026
@github-actions github-actions bot added the build (Compilation issues), python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 9, 2026
This flag allows disabling the ggml-vulkan backend at runtime.

This is necessary for the API Remoting support, as the API Remoting
frontend (`ggml-remotingfrontend`) relies on the same device file as
`ggml-vulkan` when running inside a Virtual Machine.

This runtime disable flag allows compiling both `ggml-vulkan` and
`ggml-remotingfrontend`, while selecting at runtime which one is
activated.
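
A minimal sketch of the runtime-disable pattern described above (the environment-variable name is an assumption, not necessarily the flag this commit introduces):

```cpp
#include <cstdlib>

// When the flag is set, the Vulkan backend can report zero devices,
// leaving the virtio-gpu device file to ggml-remotingfrontend.
static bool ggml_vk_runtime_disabled() {
    const char * v = std::getenv("GGML_VK_DISABLE");  // variable name hypothetical
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```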
@taronaeo
Collaborator

I'll review this in a while. If we were to merge this, we will need a named maintainer for the backend for maintainability reasons. Will it be you? :)


@taronaeo taronaeo left a comment


  1. Spacing across the PR is very inconsistent. Please use 4-space indentation and keep it consistent.
  2. The vendor files within ggml-remotingfrontend/include - can they be discovered/downloaded separately from the codebase? See:
    - Avoid adding third-party dependencies, extra files, extra headers, etc.
  3. Inconsistent styling:
__attribute__((unused))
static inline const char *apir_command_name(ApirCommandType type)
{

vs.

static ggml_status ggml_backend_remoting_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {

Please follow CONTRIBUTING.md: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

struct timer_data graph_compute_timer = {0, 0, 0, "compute_timer"};

uint32_t
backend_backend_graph_compute(struct apir_encoder *enc, struct apir_decoder *dec, struct virgl_apir_context *ctx) {
Collaborator


Suggested change
backend_backend_graph_compute(struct apir_encoder *enc, struct apir_decoder *dec, struct virgl_apir_context *ctx) {
backend_backend_graph_compute(apir_encoder * enc, apir_decoder * dec, virgl_apir_context * ctx) {

See:

- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`

Likewise for the rest of the codebase.

Author


done

Collaborator


Sorry, I may not have been clear. This change should only affect C++ files, while C source and header files can continue to use struct ...

- Declare structs with `struct foo {}` instead of `typedef struct foo {} foo`
- In C++ code omit optional `struct` and `enum` keyword whenever they are not necessary
```cpp
// OK
llama_context * ctx;
const llama_rope_type rope_type;
// not OK
struct llama_context * ctx;
const enum llama_rope_type rope_type;
```
_(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline.)_

I'm not sure if omitting them in C source and header files would break anything for consumers using your backend... If it doesn't break anything then feel free to ignore this comment :)


@kpouget kpouget Jan 13, 2026


I think all the files are C++, and indeed nothing broke, so I think it's safe as is :)


kpouget commented Jan 12, 2026

thanks for the review @taronaeo, I think I followed and fixed all the suggestions

If we were to merge this, we will need a named maintainer for the backend for maintainability reasons. Will it be you? :)

yes, would be me indeed :)


@taronaeo taronaeo left a comment


Looks a lot better now, thank you for cleaning the code.

  1. I'm still wondering: are the 3rd-party vendor files required to be part of GGML/llama.cpp? (Can they be downloaded separately at development time via a script?)
  2. I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?
  3. #18718 (comment)

I'm also interested in testing this PR out on my MacBook. Do you have any guides/steps for me to follow to test it?

#include <ostream>
#include <thread>

int ggml_backend_remoting_get_device_count();
Collaborator


Was this file supposed to be a header file? Looks like the implementation is missing.

Author


was a stale file, removed


#include "virtgpu-forward-impl.h"
#include "virtgpu-shm.h"

int apir_device_get_count(virtgpu * gpu) {
Collaborator


I believe this function is public-facing, right?

- Use sized integer types such as `int32_t` in the public API, e.g. `size_t` may also be appropriate for allocation sizes or byte offsets

Author


no, it's not public-facing; it's only consumed internally.

overall, nothing is public-facing, apart from the GGML entry points and function tables. So AFAIU, the prototypes of all the functions called externally are imposed by the function tables.

The only one not imposed by the tables might be this one:

GGML_BACKEND_API ggml_backend_reg_t ggml_backend_remoting_frontend_reg();
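
To illustrate what "imposed by the function tables" means, here is a generic sketch (hypothetical names, not ggml's actual interface): the backend fills a struct of function pointers, so each implementation must match the table's signatures exactly.

```cpp
#include <cstdint>

// Hypothetical function table: the struct fixes every signature, so a
// backend filling it has no freedom over the prototypes of its
// externally called functions.
struct backend_device_iface {
    const char * (*get_name)(void * dev);
    int32_t      (*get_device_count)(void * ctx);
};

static const char * remoting_get_name(void * /*dev*/)         { return "RemotingFrontend"; }
static int32_t      remoting_get_device_count(void * /*ctx*/) { return 1; }

// Only the registration entry point is exported; everything else is
// reached through the table.
static const backend_device_iface remoting_iface = {
    /* .get_name         = */ remoting_get_name,
    /* .get_device_count = */ remoting_get_device_count,
};
```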


kpouget commented Jan 13, 2026

I'm also interested in testing this PR out on my MacBook. Do you have any guides/steps for me to follow to test it?

sure :)

the blog post has the steps to reproduce it with pre-compiled binaries:
https://developers.redhat.com/articles/2025/09/18/reach-native-speed-macos-llamacpp-container-inference#try_api_remoting_with_ramalama

actually, you should be able to follow the INSTALL steps from my release page:
https://github.com/crc-org/llama.cpp/releases/tag/b7356-remoting-0.3.0

(I'll try to regenerate the binaries before the end of the week)

and this document has the steps to rebuild the different sources; you can request access

happy to discuss it on IBM-RH slack if you need help


kpouget commented Jan 14, 2026

For information, I'll be at FOSDEM at the end of the month to present the work behind this PR:
https://fosdem.org/2026/schedule/event/C9NF8K-api_remoting_for_llama_cpp_near-native_gpu_speed_in_macos_containers/


kpouget commented Jan 14, 2026

I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?

indeed, I'm not using it at the moment (and everything works fine); I'll review tomorrow how it should be used
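
For reference, dynamically loadable ggml backends typically invoke the macro once at the end of their source file; a sketch of that pattern, using the reg function quoted earlier in this thread (its exact placement in this PR is an assumption):

```cpp
// At the end of the backend source file:
#include "ggml-backend-impl.h"

// Exports the ggml_backend_init() entry point that ggml's backend
// registry resolves when it dlopens the backend as a dynamic library.
GGML_BACKEND_DL_IMPL(ggml_backend_remoting_frontend_reg)
```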
