Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose stable syscall interface via DeviceIoControl #3700

Closed
lmb opened this issue Jul 9, 2024 · 3 comments
Closed

Expose stable syscall interface via DeviceIoControl #3700

lmb opened this issue Jul 9, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request P2 triaged Discussed in a triage meeting
Milestone

Comments

@lmb
Copy link
Contributor

lmb commented Jul 9, 2024

Describe the feature you'd like supported

As I understand, the stable API boundary for this project is currently a (1) libbpf compatible API and a (2) bpf() compat layer built on top of (1). Both of these are implemented in a .dll which calls out to the kernel component via DeviceIoControl.

I maintain github.com/cilium/ebpf, which is a popular Go library for interacting with Linux's eBPF subsystem. It would be great if we could offer Windows support in some fashion as well. Early on we made the decision that the library can not rely on CGo: software that relies on it is notoriously hard to build / package. CGo also has a noticeable performance impact, although this has gotten better over time. This means that on Linux we do not call into libbpf, and instead directly perform bpf syscalls. This works because the syscall interface on Linux is considered a stable API which mustn't be broken.

To be able to add Windows support under the same constraints we need to be able to perform direct syscalls on Windows, which in turn would require the project to commit to keeping the kernel to user space API stable. Therefore I proposed to make DeviceIoControl calls a stable / supported API to interact with eBPF on Windows.

Proposed solution

  • Document that invocation via DeviceIoControl is considered a stable API.
  • Publish a machine readable description of the API as part of the eBPF for Windows repository. This allows us to automatically generate Go syscall bindings, as we do on Linux (currently using BTF). This could be JSON or maybe C header files (needs some experimentation).

Additional context

No response

@lmb lmb added the enhancement New feature or request label Jul 9, 2024
@dahavey dahavey added the triaged Discussed in a triage meeting label Jul 15, 2024
@dahavey dahavey added this to the 2407 milestone Jul 15, 2024
@dahavey dahavey added the P2 label Jul 15, 2024
@dahavey dahavey modified the milestones: 2407, 2408 Jul 15, 2024
@lmb
Copy link
Contributor Author

lmb commented Jul 22, 2024

Feedback from last week's triage call:

  1. Technically, there is no stable API at the moment. But libbpf API surface has tests, etc.
  2. There are actually two ways programs are loaded:
    1. Via an RPC mechanism defined in IDL that talks to the verifier running in user space. This is similar to the Linux API in that it allows loading arbitrary programs.
    2. Via DeviceIoControl as defined in ebpf_protocol.h. This only allows loading signed eBPF / native images / Windows Drivers. The API is significantly more restrictive than what is available on Linux.
  3. The current thinking is that native images is the way to go due to MS security stance / the desire to push for HVCI everywhere.
  4. Other common operations like map lookups etc. go via DeviceIoControl.

I've been experimenting with building an API for the DeviceIoControl (aka Native Image) path in Go. The results are promising, but there are some small snags I've hit so far.

  • ebpf_protocol.h makes use of uint8_t data[1] fields to indicate trailing variable length data after a request or reply header. This really messes with the ability to automatically generate bindings in Go, since arrays are a real thing there (aka they don't just desugar into a pointer). Would it be possible to drop these in favour of trailing variable-length arrays, for example uint8_t data[]? Alternatively, controlling these via an overridable macro could also work, for example EBPF_VARIABLE_DATA(uint8t, data).
  • It would be great to be able to define a custom type for GUID, to ease code generation. This could also go via an overridable macro. Similar for other commonly used types like the handle abstraction.
  • The way the syscall interface is designed makes it difficult to do zero allocation syscalls. For example, when doing a map lookup we need to prepare a buffer which contains the request struct and the key. This means allocating another buffer. On Linux we can avoid the overhead because the request buffer simply contains a pointer to the key memory. Is this deliberate? Is there a way to avoid this? P.S: Created ioctl: allow common operations without allocations #3726
  • Why does OPERATION_CLOSE_HANDLE exist? Do we need to use it instead of the normal CloseHandle call?

@lmb
Copy link
Contributor Author

lmb commented Jul 23, 2024

I think there are two unavoidable requirements to make this work:

  1. A new version of the library needs to work on an old version of eBPF for Windows.
  2. An old version of the library needs to work on a new version of eBPF for Windows.

How do we achieve this without freezing the API forever?

One approach is to say that we carry all that burden on the library side. For
each released version of eBPF for Windows we generate a new set of bindings.
We then use the appropriate binding for the version we are running against.
This poses several problems:

  • We need to invent a mechanism that ties a version into an API, otherwise
    we can't choose the right bindings. This is actually really hard and easy to
    mess up.
  • Since detection can happen only at runtime we impose overhead for each syscall.
  • Massively inflates the code base, since each version duplicates the bindings.

I mention his approach more as a strawman, because I don't think it is feasible.
It satisfies requirement (1), but for (2) we'd need to keep the same versioned
API in the eBPF for Windows project. That really seems unworkable to me.

We could adopt an approach which is similar to the Linux syscall layer. This means
that the current approach has to be adapted a little.
Feel free to skip to the proposal if you are familiar with the DeviceIoControl interface.

DeviceIoControl request and response model

We pass a variable length buffer to the driver via DeviceIoControl, which replies
in another variable length buffer. Both buffers contain a structure that looks
roughly like this (ebpf_operation_map_find_element_request_t and so on):

 request_t / reply_t
 ┌──────────────────────┐ ┐
 │ header_t             │ │
 │ ┌─────────────────┐  │ │
 │ │ length          │◄─┼─┤
 │ └─────────────────┘  │ │
 │                      │ │
 │ fixed size fields    │ │
 │                      │ │
 ├──────────────────────┤ │
 │                      │ │
 │ variable size fields │ │
 │                      │ │
 └──────────────────────┘ ┘

Each request and reply starts with the same header, which contains an operation
id (omitted here) and a length. This is followed by fixed size fields like flags
or handles. The remainder of the buffer is used for variable sized fields, such
as strings.

The length field in the header covers the entire size of the buffer. This makes
it impossible to extend the fixed part. For example, here is the request buffer to create a map:

typedef struct _ebpf_operation_create_map_request
{
    struct _ebpf_operation_header header;
    ebpf_map_definition_in_memory_t ebpf_map_definition;
    ebpf_handle_t inner_map_handle;
    uint8_t data[1];
} ebpf_operation_create_map_request_t;

The data[1] field indicates that there is a variable length buffer containing
the name of the map. Note that there is no field to indicate the length of the
name, instead it is inferred from header.length.

Adding a field after inner_map_handle would break old versions of ebpf-go
because the driver would interpret the beginning of the name as that new field.
Conversely, we'd also have problems if a new version of ebpf-go tried to use the
extended struct on an old version of eBPF for Windows. In this case the new field
would be interpreted as the name.

Proposal

To allow extending the fixed fields of a request we need to encode the length
of the fixed part somewhere. I propose to reuse header.length and to recover
the size of the variable part from the nInBufferSize parameter passed to DeviceIoControl:

 request_t / reply_t
 ┌──────────────────────┐ ┐  ┐
 │ header_t             │ │  │
 │ ┌─────────────────┐  │ │  │
 │ │ length          │◄─┼─┤  ├───► DeviceIoControl(nInBufferSize)
 │ └─────────────────┘  │ │  │
 │                      │ │  │
 │ fixed size fields    │ │  │
 │                      │ │  │
 ├──────────────────────┤ ┘  │
 │                      │    │
 │ variable size fields │    │
 │                      │    │
 └──────────────────────┘    ┘

On the driver side we now know whether user space provided us with the new field
or not, and can change the logic accordingly. This satisfies requirement (1): old
versions of the library continue to function.

It doesn't help with the inverse case though. A new version of the library may
pass a request which is larger than what the driver expects:

 User space         │ Driver
                    │
 struct request {   │ struct request {
   header_t header; │   header_t header;
   t1 a;            │   t1 a;
   t2 b;            │ };
 };                 │

The Linux bpf() syscall solves this by accepting a larger request buffer than
expected, provided that the "unknown" tail is all zero. This relies on all
new fields being defined in a way where zero keeps the old behaviour.
This works pretty well and also provides a convenient way to do
feature probing. The exception are fields containing file descriptors.
On Linux, zero is a valid file descriptor but it is not possible to use it in
the syscall.

Windows doesn't seem suffer from this exact same problem (zero is an invalid handle) but there might be other fields where it applies. In any case,
it's possible to work around this on the driver side by checking whether a field was provided:

if (request_len >= offsetofend(request, b)) {
	// use request->b
}

Testing

How can we make sure that modifications to the API really are backwards compatible?
From the ebpf-go side we already run the full testsuite against stable kernel.org
releases. We can do something similar for the Windows support:

  • Each PR is run against each supported release of eBPF for Windows. This
    ensures that ebpf-go stays backwards compatible.
  • We schedule the ebpf-go testsuite against the last successful eBPF for Windows build from
    main. If we do this daily we can quickly provide feedback when something breaks.

On the eBPF for Windows side we can investigate adding static_assert to encode
more of the assumptions made by the ioctl layer.

@shankarseal can you please remove the triage label and maybe float this amongst your colleagues for Monday's meeting?

@lmb
Copy link
Contributor Author

lmb commented Jul 29, 2024

Closing in favour of #3729. This proposal might still make sense in the future if the other approach doesn't have the performance we need.

@lmb lmb closed this as completed Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request P2 triaged Discussed in a triage meeting
Projects
None yet
Development

No branches or pull requests

2 participants