
Introduce support for tight bounds on the kernel stack. #2573

Open
qwattash wants to merge 10 commits into dev from bounded-kstack

Conversation

@qwattash
Contributor

@qwattash qwattash commented Feb 26, 2026

The implementation differs slightly depending on the architecture.

RISC-V uses kstack for the pcb, kernframe and trapframe structures.
These patches set tight bounds for the pcb, kernframe and remainder of the kernel stack.
The trapframe is left together with the kernel stack, given that it is part of the normal arithmetic on the kernel stack pointer in the trap handlers at the moment.
The sscratchc register is now used to hold a pointer to struct kernframe instead of the full kstack capability.
The kernframe is expanded to hold a bounded pointer to the kernel stack region and a scratch pointer.
The scratch pointer is used in the trap handler to swap register contents with constrained use of CPU registers.

Edit:

  • Changed approach. Do not disrupt sscratchc or the existing stashed stack pointer. The following applies to all architectures:
    1. The stashed / banked stack pointer covers the whole kstack except for the struct pcb region. Representability is guaranteed.
    2. struct pcb has separate bounds that never overlap with the stack pointer.
    3. td_kstack is considered the root capability for all kernel stack sub-allocations.
    4. Upon entering the trap handler from userland, the trap handler accesses td_frame (and kernframe on RISC-V) through csp. It then shrinks csp to exclude the td_frame (and kernframe) regions. Again, representability is guaranteed by construction of td_frame.
    5. Upon exiting the trap handler to userland, the trap handler recovers the csp bounds to include td_frame (and kernframe) prior to loading registers and returning to userland.
  • Implemented Morello
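
The numbered scheme above can be sketched with a toy model in plain C. This is not code from the patch: `struct cap`, `cap_set_bounds()`, the `kstack_*` helpers, the sizes, and the layout (pcb at the top of the allocation, trapframe directly below it) are all assumptions made for illustration. The model tracks bounds only and ignores representability rounding. It also assumes that step 5 works by re-deriving from the root capability, since bounds can never grow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a CHERI capability: only the bounds are modelled. */
struct cap {
	uintptr_t base;	/* lowest address covered */
	size_t len;	/* number of bytes covered */
};

/* Hypothetical sizes, chosen only for illustration. */
#define	KSTACK_SIZE	0x4000u
#define	PCB_SIZE	0x200u
#define	FRAME_SIZE	0x300u

/* Derive a narrower capability; bounds may only shrink (monotonicity). */
static struct cap
cap_set_bounds(struct cap parent, uintptr_t base, size_t len)
{
	assert(base >= parent.base);
	assert(len <= parent.len);
	assert(base - parent.base <= parent.len - len);
	return ((struct cap){ .base = base, .len = len });
}

/* Report whether two capabilities cover any common byte. */
static int
cap_overlap(struct cap a, struct cap b)
{
	return (a.base < b.base + b.len && b.base < a.base + a.len);
}

/* Item 2: the pcb gets its own bounds (assumed at the top of kstack). */
static struct cap
kstack_pcb(struct cap td_kstack)
{
	return (cap_set_bounds(td_kstack,
	    td_kstack.base + td_kstack.len - PCB_SIZE, PCB_SIZE));
}

/* Item 1: the stashed stack pointer covers everything except the pcb. */
static struct cap
kstack_stashed(struct cap td_kstack)
{
	return (cap_set_bounds(td_kstack, td_kstack.base,
	    td_kstack.len - PCB_SIZE));
}

/* Item 4: the trapframe is assumed to sit at the top of the stashed region. */
static struct cap
kstack_frame(struct cap stashed)
{
	return (cap_set_bounds(stashed,
	    stashed.base + stashed.len - FRAME_SIZE, FRAME_SIZE));
}

/* Item 4: on trap entry, shrink csp so that it excludes the trapframe. */
static struct cap
kstack_trap_enter(struct cap stashed)
{
	return (cap_set_bounds(stashed, stashed.base,
	    stashed.len - FRAME_SIZE));
}

/* Item 5: bounds cannot grow, so the wide csp is re-derived from the root. */
static struct cap
kstack_trap_exit(struct cap td_kstack)
{
	return (kstack_stashed(td_kstack));
}
```

In this model the pcb and the stashed stack pointer never overlap, and the csp used inside the handler never overlaps the trapframe. Those are the two separation properties the list describes.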

Note that this is all gated by the CHERI_BOUNDED_KSTACK option, because I'd like to be able to measure the difference for dissertation purposes. Also, this is an intermediate patch for the use of local/global for kernel capability flow enforcement, which is currently WIP.

@jrtc27
Member

jrtc27 commented Feb 26, 2026

Hm, I'm not immediately convinced the kernframe stuff is worth it

@qwattash
Contributor Author

> Hm, I'm not immediately convinced the kernframe stuff is worth it

I asked myself a similar question, hence the kernel option to enable/disable it.

However, this is partially necessary for my local/global patches, where I'd like td_frame to not have STORE_LOCAL_CAP permission.
The current patch still does not separate td_frame, but should make it easier to split out. The alternative is to do some bounds-setting in the trap handler and avoid some of this trouble.
I'm not entirely sure whether this is the best way to handle this, but I'd like to avoid installing a kernel stack with full bounds in some form or another.
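
To illustrate why dropping STORE_LOCAL_CAP from td_frame matters, here is a toy model of the CHERI store check in plain C. The bit assignments and the `cap_store_allowed()` helper are hypothetical and are not taken from the patch or from any architecture header. The sketch only encodes the general rule that storing a capability without the Global permission requires Store Local Capability on the authorizing capability.

```c
#include <stdbool.h>

/* Illustrative bit assignments only; real encodings are per-architecture. */
enum {
	PERM_GLOBAL = 1u << 0,
	PERM_STORE = 1u << 1,
	PERM_STORE_CAP = 1u << 2,
	PERM_STORE_LOCAL_CAP = 1u << 3,
};

/*
 * Model of the check applied when a valid capability with permissions
 * value_perms is stored through a capability with permissions auth_perms.
 */
static bool
cap_store_allowed(unsigned auth_perms, unsigned value_perms)
{
	if ((auth_perms & PERM_STORE) == 0 ||
	    (auth_perms & PERM_STORE_CAP) == 0)
		return (false);
	/* Local (non-global) values need STORE_LOCAL_CAP on the target. */
	if ((value_perms & PERM_GLOBAL) == 0 &&
	    (auth_perms & PERM_STORE_LOCAL_CAP) == 0)
		return (false);
	return (true);
}
```

Under this model, a td_frame capability without STORE_LOCAL_CAP still accepts global capabilities but rejects local ones, while a stack capability that keeps the permission accepts both.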

@qwattash qwattash force-pushed the bounded-kstack branch 2 times, most recently from 862de4d to cb57508 Compare March 6, 2026 18:44
@qwattash qwattash marked this pull request as ready for review March 6, 2026 18:50
qwattash added 10 commits March 23, 2026 15:43
This is used to enable tight bounds on data structures that share the
kernel stack allocation, for example, struct pcb.

This enables tight bounds on the kernel stack sub-allocations.
In particular, the pcb, kernframe and actual kernel stack are now fully
separated. The td_kstack capability retains full bounds.

This modifies the trap handlers to stash the kernframe structure in sscratchc,
instead of the kernel stack pointer.
The kernel stack and the pcpu capabilities are recovered from the kernframe
structure, without assuming that out-of-bounds access is possible.

This simplifies the management of sscratch, leaving it unchanged.
In user mode, sscratch still holds the full unbounded kstack (without the pcb)
and the trap handler can use it to access kernframe and trapframe.
Before entering the C exception handler, set narrower bounds on trapframe and
the stack pointer installed in csp.

This is perhaps not the optimal place for these assertions, however these
should hold every time we enter the kernel and check that the thread
kstack context is in a consistent state.

This enables tight bounds on the kernel stack sub-allocations.
In particular, the pcb and actual kernel stack are now fully
separated. The td_kstack capability retains full bounds.

The exception handlers are modified to re-derive the trapframe
and kernel stack capabilities from the root td_kstack capability.
Collaborator

@bsdjhb bsdjhb left a comment

Did you consider doing the upstream approach for struct pcb used on amd64 where it is now just part of struct mdthread as a md_pcb field? That could be upstreamed to FreeBSD as well which might reduce our diff and reduce the complexity of this change a bit by not having to worry about the pcb anymore?

* user stack pointer while we keep kernelframe in sscratchc.
*/
.if \mode == 0
/* Stash user ctp in kframe stash and place kframe ptr in ctp */

Why ctp rather than ct0 similar to the block of code above for hybrid kernels when deriving csp from ddc?

/* Stash user ctp in kframe stash and place kframe ptr in ctp */
csc ctp, (KF_SCRATCH)(csp)
cmove ctp, csp
/* Fetch real kstack from kframe */

Normal style(9) is a blank line before comments. We don't use them when there is already an effective break in the flow due to C preprocessor or assembly macro conditionals.

KASSERT((uintptr_t)get_pcpu() >= VM_MIN_KERNEL_ADDRESS,
("Invalid pcpu address from userland: %p (tpidr 0x%lx)",
get_pcpu(), READ_SPECIALREG(tpidr_el1)));
#ifdef CHERI_BOUNDED_KSTACK

Is this commit separate so that reviewers can evaluate both approaches? If so, if you end up choosing this version, please squash this down into the previous commit and revise the log message to reflect the end result. I'm not sure it's worth having the in-between stage in the history as-is.

.macro load_registers mode
#ifdef CHERI_BOUNDED_KSTACK
.if \mode == 0
/*

Did you consider just saving the full kstack cap you need in a new field in td_md that you can reload here so you don't have to do all this computation on each syscall exit, only when creating a new thread?
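
A rough sketch of this suggestion, as a plain-C toy model rather than kernel code: the wide exit capability is derived once when the thread is created and only reloaded on each trap exit. `struct thread_model`, the `md_kstack_exit` field (standing in for a new td_md member), `PCB_SIZE`, and the helper names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy capability: only the bounds are modelled. */
struct cap {
	uintptr_t base;
	size_t len;
};

#define	PCB_SIZE	0x200u	/* hypothetical size */

struct thread_model {
	struct cap td_kstack;		/* root capability */
	struct cap md_kstack_exit;	/* hypothetical cached field */
};

/* Counts derivations so that the saving is observable. */
static unsigned derive_count;

/* Derive the capability installed on exit: the whole kstack minus the pcb. */
static struct cap
derive_exit_csp(struct cap root)
{
	derive_count++;
	return ((struct cap){ .base = root.base,
	    .len = root.len - PCB_SIZE });
}

/* Thread creation: compute the exit capability once and cache it. */
static void
thread_create_model(struct thread_model *td, uintptr_t base, size_t len)
{
	td->td_kstack = (struct cap){ .base = base, .len = len };
	td->md_kstack_exit = derive_exit_csp(td->td_kstack);
}

/* Trap exit: a single load replaces the per-exit bounds computation. */
static struct cap
trap_exit_csp(const struct thread_model *td)
{
	return (td->md_kstack_exit);
}
```

The trade-off in this model is one extra capability-sized field per thread in exchange for a shorter exit path.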

#include <machine/trap.h>
#include <machine/riscvreg.h>

.macro save_registers mode

I would add a comment here above the start of the macro to state that in the CHERI_BOUNDED_KSTACK case it intentionally returns a bounded pointer to the created trapframe in cs0 for use by callers instead of documenting that in the callers.

#if __has_feature(capabilities)
p2->p_md.md_sigcode = td1->td_proc->p_md.md_sigcode;
#endif
#ifdef __CHERI_PURE_CAPABILITY__

Should this be part of the earlier commit that cleared the permissions? (And can we add a similar assert on Morello?)

3 participants