Commit 3d97454

Alfredo Mazzinghi (qwattash) authored and committed
Add experimental section on Two-Phase revoker strategy.
This is a proposed mechanism to reduce the use of PTE bits for hardware revocation assistance.
1 parent a1638d9 commit 3d97454

2 files changed (+307 -0 lines changed)

app-experimental.tex

Lines changed: 279 additions & 0 deletions
@@ -108,6 +108,285 @@ \subsection{Non-Temporal (Streaming) CLC} % <<<
on the LL/SC link flag. In some toy examples, this seems to make full SMR
hazards unnecessary. Unclear that it is worth pursuing.}

% >>>
\section{Reduced PTE Usage of Architectural Revocation} % <<<
\label{app:exp:dirtycrg}

\subsection{Motivation} % <<<
The load-barrier revocation model in the style of Cornucopia Reloaded
\cite{cornucopia-reloaded} has been demonstrated with the CHERI PTE extensions
described in \cref{subsection:riscv:pagetables}, as well as with the Morello
PTE extensions \cite{arm-morello}.
These solutions can be considered superset implementations: the bits they
allocate in the page-table entries allow a rich set of behaviours for
experimentation, beyond what a Cornucopia Reloaded-style revoker strictly needs.
This comes at the cost of consuming additional PTE bits, which are scarce.

As part of the CHERI RISC-V standardisation effort, it becomes necessary to
implement hardware that efficiently supports a Cornucopia Reloaded-style revoker
while minimising PTE bit usage. To this end, the CHERI RISC-V standard draft
0.9.3 \cite{riscv-cheri-draft-0-9-3} (henceforth referred to as Zcheri)
introduces a 2-bit PTE extension that is meant to enable Cornucopia Reloaded
revocation.
This is a significant saving with respect to both the 5-bit extension introduced
in \cref{subsection:riscv:pagetables} and the 4-bit extension in Morello,
both of which focus on enabling experimentation.
\amnote{Since presumably the details of previous sections may change, should I
refer here to a specific CHERI ISA version? E.g. v9?}

\subsubsection{RISC-V Zcheri 0.9.3 PTE Extensions}
The 2-bit PTE extension specified in the CHERI RISC-V standard draft 0.9.3
defines the CW and CRG bits. The CW bit controls both capability loads and
capability stores. The CRG bit behaves similarly to the CHERI ISAv9 CRG bit;
however, when the CW bit is clear, the CRG bit is overloaded to encode the
capability-dirty tracking state as follows:

\begin{center}
%
\begin{tabular}{ccl}
\textbf{CW} & \textbf{CRG} & \textbf{Load Behavior} \\
0 & 0 & Capability loads strip tags on loaded result \\
0 & 1 & Capability loads strip tags on loaded result \\
1 & X & Generational load barrier, trap on load if Sstatus.UCRG $\neq$ CRG \\
\end{tabular}
\\[1ex]
\begin{tabular}{ccl}
\textbf{CW} & \textbf{CRG} & \textbf{Store Behavior} \\
0 & 0 & Trap on capability stores \\
0 & 1 & Track capability dirty, CW is set, CRG is set to Sstatus.UCRG \\
1 & X & Capability stores are unaltered \\
\end{tabular}
%
\end{center}

For this discussion, we associate a name with each combination of the \{CW, CRG\}
bits, according to the semantics of the Cornucopia Reloaded model.
A PTE entry is in the \textit{Dirty} state when CW=1, in the
\textit{Dirtiable} state when CW=0 and CRG=1, and in the \textit{Clean} state when
both CW=0 and CRG=0.

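The following Python sketch encodes the two Zcheri 0.9.3 tables above together with the state names just introduced, as an aid to reading the rest of this section. It is an illustrative model, not part of the specification: the \texttt{Pte} class, the function names and the outcome strings are hypothetical, and only capability loads and stores are modelled.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    """Hypothetical model of the two Zcheri 0.9.3 PTE bits."""
    cw: int   # CW: gates capability stores and capability loads
    crg: int  # CRG: load-side generation, or the Dirtiable marker when CW=0


def state_name(pte: Pte) -> str:
    """Name the {CW, CRG} combination as in the text."""
    if pte.cw == 1:
        return "Dirty"
    return "Dirtiable" if pte.crg == 1 else "Clean"


def cap_load(pte: Pte, ucrg: int) -> str:
    """Outcome of a capability load through this PTE."""
    if pte.cw == 0:
        return "strip_tags"  # Clean and Dirtiable: loaded tags are cleared
    # Dirty: generational load barrier.
    return "trap" if pte.crg != ucrg else "ok"


def cap_store(pte: Pte, ucrg: int) -> str:
    """Outcome of a capability store; may update the PTE in place."""
    if pte.cw == 1:
        return "ok"  # Dirty: capability stores are unaltered
    if pte.crg == 0:
        return "trap"  # Clean: trap on capability stores
    # Dirtiable: track capability-dirty. CW is set and CRG is
    # overwritten with Sstatus.UCRG.
    pte.cw, pte.crg = 1, ucrg
    return "ok"


if __name__ == "__main__":
    p = Pte(cw=0, crg=1)
    print(state_name(p), cap_load(p, ucrg=0))  # Dirtiable strip_tags
    cap_store(p, ucrg=0)
    print(state_name(p), p.crg)                # Dirty 0
```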
\subsubsection{Limitations of the Zcheri 0.9.3 PTE Extensions}
The 2-bit PTE extension as specified in the CHERI RISC-V standard draft 0.9.3
has some limitations that prevent the Cornucopia Reloaded revoker model
from being implemented unchanged. This is a consequence both of how the Reloaded
revoker is tuned to reduce TLB shootdowns and of fundamental design choices in
the Zcheri PTE bits extension.

The Zcheri PTE extension has two fundamental limitations:
\begin{itemize}
\item The CRG bit is overloaded to encode both the load-side generation and the
capability-dirty tracking PTE state (also referred to as the Dirtiable
state). This makes it impossible to retain the load-side generation
information when a PTE entry is in the Dirtiable state.
\item The Dirtiable PTE state interacts poorly with the CW bit
semantics when dealing with aliasing pages. In particular, it is impossible
to use the capability-dirty tracking state for aliasing pages that may
contain capabilities.
\end{itemize}

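To make the aliasing limitation concrete, the following Python sketch (illustrative only; all names are hypothetical) models one physical page mapped through two PTEs, one Dirty and one Dirtiable. Under the Zcheri 0.9.3 load rule, a capability stored through the Dirty alias is read back through the Dirtiable alias with its tag stripped, even though memory still holds a valid capability. The sketch assumes that the load barrier does not trap, that is, that the CRG of the Dirty alias equals Sstatus.UCRG.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    cw: int
    crg: int


@dataclass
class PhysicalPage:
    tagged: bool = False  # does the page hold a valid (tagged) capability?


def store_cap(page: PhysicalPage, pte: Pte) -> None:
    # Only stores through a Dirty PTE are modelled: they proceed unaltered.
    assert pte.cw == 1, "sketch only models stores through a Dirty PTE"
    page.tagged = True


def load_cap_tag(page: PhysicalPage, pte: Pte, ucrg: int) -> bool:
    """Tag seen by a capability load under the Zcheri 0.9.3 rules."""
    if pte.cw == 0:
        return False  # Clean/Dirtiable: loaded tags are stripped
    assert pte.crg == ucrg, "sketch assumes the load barrier does not trap"
    return page.tagged


ucrg = 0
page = PhysicalPage()
alias_dirty = Pte(cw=1, crg=ucrg)   # first mapping of the page: Dirty
alias_dirtiable = Pte(cw=0, crg=1)  # second mapping of the same page: Dirtiable

store_cap(page, alias_dirty)
print(load_cap_tag(page, alias_dirty, ucrg))      # True
print(load_cap_tag(page, alias_dirtiable, ucrg))  # False: tag lost via the alias
```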
In particular, the Dirtiable state is problematic because the
Reloaded revoker relies on two important properties:
\begin{enumerate}
\item The PTE can be configured to have load-side barrier semantics while in
the Dirtiable state.
\item A capability store that transitions the PTE from Dirtiable to Dirty
leaves the CRG bit unchanged. The CRG bit therefore continues to encode the
generation of the entry, which depends on whether the revoker scan has
reached it or not.
\end{enumerate}

The Reloaded revoker uses the Dirtiable state as an intermediate state
while transitioning a (presumed) capability-clean page from
\textit{occupied}\footnote{A page is occupied if it holds or may recently have
held capabilities.} to \textit{idle}\footnote{A page is idle if it does not
contain any capabilities. Idle pages can be skipped by the revoker when scanning
memory and play an important part in reducing the number of pages scanned.}.
Because of the limitations outlined above, the revoker has to accept some
trade-offs to use the Zcheri 0.9.3 PTE extension.

\begin{itemize}
\item The Dirtiable state cannot be used for aliasing pages.
This is because of both the tag-clearing semantics on loads and the
CRG update rule.
\item Transitions from Dirty to Dirtiable break the
ability of the revoker to leave the TLB slightly cap-dirtier than
the PTE entries. This means that additional TLB invalidations are needed.
\end{itemize}

These trade-offs stem from an analysis of possible races between the program
and the revoker that would otherwise violate the Cornucopia Reloaded
invariants.

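One such race can be sketched in Python as follows (an illustrative model with hypothetical names, not revoker code). It assumes that the background scan treats a Dirty PTE as still to be visited when its CRG differs from Sstatus.UCRG, and it contrasts property 2 above, where a store through a Dirtiable PTE keeps the page's generation, with the Zcheri 0.9.3 rule, where CRG is overwritten with Sstatus.UCRG. If the page still holds unrevoked capabilities (for instance, because the TLB was left cap-dirtier than the PTE), the Zcheri rule makes the page look as if it had already been visited.

```python
def needs_visit(cw: int, crg: int, ucrg: int) -> bool:
    """Assumed scan rule: a Dirty PTE whose CRG differs from UCRG is unvisited."""
    return cw == 1 and crg != ucrg


def store_keeping_generation(page_generation: int) -> tuple[int, int]:
    """Property 2: CW is set and the page's generation is left unchanged."""
    return 1, page_generation


def zcheri_store(ucrg: int) -> tuple[int, int]:
    """Zcheri 0.9.3 Dirtiable store: CW is set and CRG <- Sstatus.UCRG."""
    return 1, ucrg


# The revoker has started the E0 -> E1 transition by flipping UCRG to 1.
# The background scan has not reached this Dirtiable page yet, so its
# generation (if it could be recorded) would still be E0.
ucrg = 1
page_generation = 0

cw, crg = store_keeping_generation(page_generation)
print(needs_visit(cw, crg, ucrg))  # True: the scan still visits the page

cw, crg = zcheri_store(ucrg)
print(needs_visit(cw, crg, ucrg))  # False: the page looks already visited
```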
\subsection{Two-phase CRG Model}
The two-phase CRG model is derived from the Zcheri 0.9.3 2-bit PTE extension,
with changes to the semantics of the Dirtiable state and the addition of an
extra bit in the Sstatus register.
Adding a bit to Sstatus is significantly less costly than introducing
a new PTE bit; moreover, the bit can in general be placed in another CSR if
Sstatus register bits become scarce.

The bit UDCRG (User Dirty CRG) is added to the Sstatus register and is intended
to complement the existing UCRG (User CRG) bit.
The UDCRG bit represents the epoch that the revoker is currently closing,
as opposed to the UCRG bit, which represents the epoch that is currently open.
This fundamentally enables the revoker to communicate to the hardware that
a revocation sweep is in progress. As an aside, it would be fairly easy to model
this slightly differently, using a ``Revocation In Progress''
bit instead.

The \{UDCRG, UCRG\} bit pair has the following architectural meaning:
\begin{center}
%
\begin{tabular}{ccl}
\textbf{UDCRG} & \textbf{UCRG} & \textbf{Behavior} \\
0 & 0 & Epoch $E_0$ steady state \\
0 & 1 & Revocation in progress for the epoch transition $E_0 \rightarrow E_1$ \\
1 & 0 & Revocation in progress for the epoch transition $E_1 \rightarrow E_0$ \\
1 & 1 & Epoch $E_1$ steady state \\
\end{tabular}
\end{center}

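The table can be read as a small decoding function. The Python sketch below (hypothetical names, illustrative only) also shows that the ``Revocation In Progress'' alternative mentioned above corresponds to the condition UDCRG $\neq$ UCRG.

```python
def revocation_in_progress(udcrg: int, ucrg: int) -> bool:
    """A revocation pass is running while UDCRG has not caught up with UCRG."""
    return udcrg != ucrg


def describe_sstatus(udcrg: int, ucrg: int) -> str:
    """Decode the Sstatus.{UDCRG, UCRG} pair following the table above."""
    if not revocation_in_progress(udcrg, ucrg):
        return f"Epoch E{ucrg} steady state"
    # UDCRG names the epoch being closed, UCRG the epoch now open.
    return f"Revocation in progress for E{udcrg} -> E{ucrg}"


for udcrg in (0, 1):
    for ucrg in (0, 1):
        print(udcrg, ucrg, describe_sstatus(udcrg, ucrg))
```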
The UDCRG bit modifies the PTE CRG update rule when transitioning from the
Dirtiable (CW=0 CRG=1) state to the Dirty state, as well as the load behavior
of Dirtiable PTEs.
The behavior of the CW and CRG bits is modified as follows:
\begin{center}
%
\begin{tabular}{ccl}
\textbf{CW} & \textbf{CRG} & \textbf{Load Behavior} \\
0 & 0 & Capability loads strip tags on loaded result \\
0 & 1 & Load fault when Sstatus.UCRG $\neq$ Sstatus.UDCRG \\
1 & X & Generational load barrier, trap on load if Sstatus.UCRG $\neq$ CRG \\
\end{tabular}
\\[1ex]
\begin{tabular}{ccl}
\textbf{CW} & \textbf{CRG} & \textbf{Store Behavior} \\
0 & 0 & Trap on capability stores \\
0 & 1 & Track capability dirty, CW is set, CRG is set to Sstatus.UDCRG \\
1 & X & Capability stores are unaltered \\
\end{tabular}
%
\end{center}

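As with the Zcheri tables, the modified behaviour can be sketched as a Python model (illustrative; names and outcome strings are hypothetical). The table only specifies that loads through a Dirtiable PTE fault while Sstatus.UCRG $\neq$ Sstatus.UDCRG; for the steady-state phase, the sketch assumes that such loads strip tags as in Zcheri 0.9.3, which is an assumption rather than part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    cw: int
    crg: int


def cap_load(pte: Pte, ucrg: int, udcrg: int) -> str:
    """Capability load outcome under the two-phase CRG model."""
    if pte.cw == 1:
        return "trap" if pte.crg != ucrg else "ok"  # generational load barrier
    if pte.crg == 0:
        return "strip_tags"  # Clean
    # Dirtiable: fault while a revocation pass is in progress.
    if ucrg != udcrg:
        return "fault"
    return "strip_tags"  # ASSUMPTION: steady-state behaviour as in Zcheri 0.9.3


def cap_store(pte: Pte, udcrg: int) -> str:
    """Capability store outcome; a Dirtiable PTE is promoted to Dirty."""
    if pte.cw == 1:
        return "ok"  # Dirty: unaltered
    if pte.crg == 0:
        return "trap"  # Clean
    pte.cw, pte.crg = 1, udcrg  # two-phase rule: CRG <- Sstatus.UDCRG
    return "ok"


# During the E0 -> E1 revocation phase (UDCRG=0, UCRG=1), a store through a
# Dirtiable PTE records the previous epoch; in steady state (1, 1) the current one.
during = Pte(cw=0, crg=1)
cap_store(during, udcrg=0)
steady = Pte(cw=0, crg=1)
cap_store(steady, udcrg=1)
print(during.crg, steady.crg)  # 0 1
```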
\subsubsection{Rationale and Software Operation}
The revoker is responsible for correctly switching the Sstatus.\{UDCRG, UCRG\}
bits in accordance with the revocation state machine.
When moving from epoch $E_0$ to epoch $E_1$, the revoker toggles the
Sstatus.UCRG bit and begins the background scan. This is unchanged from the
Reloaded model; however, the UDCRG bit remains set to the previous epoch.
When the revoker completes the background scan, UDCRG is set to UCRG;
in other words, UDCRG ``catches up'' to the current epoch.

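The revoker-side sequence can be sketched as follows. This Python model is illustrative only: \texttt{Sstatus}, \texttt{begin\_epoch} and \texttt{finish\_background\_scan} are hypothetical names, and the check that a new epoch only starts once the previous background scan has completed is an assumption of the sketch rather than a stated requirement.

```python
from dataclasses import dataclass


@dataclass
class Sstatus:
    udcrg: int = 0  # epoch being closed (equals UCRG outside a revocation pass)
    ucrg: int = 0   # epoch currently open


def begin_epoch(s: Sstatus) -> None:
    """Open a new epoch: toggle UCRG, leave UDCRG on the previous epoch."""
    if s.udcrg != s.ucrg:
        raise RuntimeError("previous revocation pass has not completed")
    s.ucrg ^= 1
    # ... the background scan starts here ...


def finish_background_scan(s: Sstatus) -> None:
    """End of the revocation phase: UDCRG catches up with UCRG."""
    s.udcrg = s.ucrg


s = Sstatus()
trace = [(s.udcrg, s.ucrg)]
for _ in range(2):  # two epoch transitions; the bit patterns repeat every two epochs
    begin_epoch(s)
    trace.append((s.udcrg, s.ucrg))
    finish_background_scan(s)
    trace.append((s.udcrg, s.ucrg))
print(trace)  # [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
```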
This effectively splits the epoch into two phases. The \textit{revocation
phase} spans the period between the beginning of the epoch and the
end of the background scan. The \textit{steady-state phase} starts when
the background scan completes and lasts until the beginning of the next epoch.

The revocation hardware assistance mechanism outlined above is designed to
mitigate the loss of CRG information in the Dirtiable PTE state.
It is important to consider the Zcheri Dirtiable state behavior during both
the steady-state and revocation phases.

In the steady-state phase, the Zcheri Dirtiable PTE behavior is never
problematic because there is no ambiguity about the load-side generation epoch
associated with a PTE that becomes dirty: if a capability is written via a
Dirtiable mapping, the PTE entry should be considered Dirty in the current epoch.

The Zcheri Dirtiable state is problematic during the revocation phase.
In particular, when a capability is written via a Dirtiable PTE, the new value
of the CRG bit cannot be decided without additional information.
Consider the following two cases:
\begin{enumerate}
\item The background scan has not visited the page: the CRG bit should be set
to the previous epoch's CRG value, so that the background scan will correctly detect
that the PTE has not been visited yet.
\item The background scan has visited the page: in this case the background scan
has left the PTE in the Dirtiable state, so a subsequent write should upgrade
the PTE to Dirty in the current epoch.
\end{enumerate}
The problem therefore arises because the correct CRG update
depends on whether the revoker has visited the page or not,
which cannot be determined by observing a Dirtiable PTE entry alone.

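The ambiguity can be illustrated with a small Python sketch (hypothetical names, illustrative only). In both cases hardware sees the same Dirtiable PTE bits (CW=0, CRG=1) and the same Sstatus, so any hardware rule produces the same CRG for both; the sketch checks the two-phase rule, which sets CRG to Sstatus.UDCRG, against the value each case requires.

```python
# Revocation phase for the E0 -> E1 transition.
UDCRG, UCRG = 0, 1
DIRTIABLE = (0, 1)  # (CW, CRG): identical whether or not the scan visited the page


def required_crg(scan_visited_page: bool) -> int:
    """CRG value each case needs after a capability store (from the text)."""
    # Case 2: already visited -> Dirty in the current epoch.
    # Case 1: not yet visited -> previous epoch, so the scan still picks it up.
    return UCRG if scan_visited_page else UDCRG


def two_phase_crg(pte: tuple[int, int]) -> int:
    """Hardware sees only the PTE and Sstatus; the two-phase rule uses UDCRG."""
    assert pte == DIRTIABLE
    return UDCRG


for visited in (False, True):
    ok = two_phase_crg(DIRTIABLE) == required_crg(visited)
    print(f"visited={visited}: {'correct' if ok else 'wrong'}")
# visited=False: correct
# visited=True: wrong
```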
The two-phase CRG model attempts to recover this information without introducing
additional PTE state, at the cost of some changes to the Reloaded
revoker.
This rests on the hypothesis that these changes are less costly and invasive
than other options, such as using a 3-bit PTE solution or implementing a
revoker that does not rely on the Dirtiable state in this way.

The two-phase CRG model has some known limitations that require changes to the
existing revoker implementations.

\begin{enumerate}
\item A capability write via a Dirtiable PTE that has already been observed by
the background scan breaks the load-side invariant in the next epoch.
\item The transition of a page to Idle may require a TLB invalidation because
there may be Dirtiable TLB entries.
\end{enumerate}

\subsubsection{Dirtiable Transition Delay}
The first limitation of the two-phase CRG model introduced above can be addressed
by batching Dirty to Dirtiable transitions at the end of the revocation pass,
without additional synchronisation requirements.

First, note that the first limitation arises when the program writes a
capability through a Dirtiable PTE entry after the revoker has visited the PTE,
but before the revocation pass is finished.
This may occur under either of two conditions:
\begin{enumerate}
\item The revoker observed a Dirtiable PTE entry from the previous pass and
left it in the Dirtiable state for the current epoch.
\item The revoker downgraded a Dirty PTE to Dirtiable as part of the visit.
\end{enumerate}

When the program writes a capability through the Dirtiable PTE before the end of
the revocation pass, Sstatus.UDCRG is still set to the previous epoch
(e.g.\ UDCRG=0, UCRG=1), and the CRG bit is therefore set to the previous epoch.

This is a problem because the page may be left untouched through
the steady-state phase up to the next epoch. When the next epoch begins, the
revoker flips UCRG again, so that epochs $E_0$ and $E_2$ have the same UCRG value.
Under these conditions, the offending PTE entry will not be scanned for
revocation, and unrevoked capabilities from $E_1$ may leak into $E_2$.

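The scenario can be traced step by step with the Python sketch below. It is an illustrative model with hypothetical names; it assumes that a Dirty PTE is treated as unvisited, both by the background scan and by the load barrier, exactly when its CRG differs from Sstatus.UCRG.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    cw: int
    crg: int


def unvisited(pte: Pte, ucrg: int) -> bool:
    """Assumed rule: scan and load barrier act on Dirty PTEs with CRG != UCRG."""
    return pte.cw == 1 and pte.crg != ucrg


udcrg, ucrg = 0, 0
page = Pte(cw=1, crg=0)  # Dirty during epoch E0

ucrg ^= 1  # E0 -> E1: revocation phase begins (UDCRG=0, UCRG=1)
page.cw, page.crg = 0, 1  # revoker visits the page and demotes it to Dirtiable

# The program stores a capability through the Dirtiable PTE before the pass
# ends; the two-phase rule sets CRG to UDCRG, i.e. the previous epoch.
page.cw, page.crg = 1, udcrg

udcrg = ucrg  # background scan completes: steady-state phase of E1
# ... the page is not touched again during the steady-state phase ...

ucrg ^= 1  # E1 -> E2: UCRG is 0 again, as it was in E0
print(unvisited(page, ucrg))  # False: page skipped, E1 capabilities may escape
```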
The proposed solution derives from the observation that the invalid CRG update
can only occur in the window of time between the revoker's visit to the PTE and
the end of the revocation pass, when the revoker leaves the PTE in the
Dirtiable state.
If we delay PTE transitions from Dirty to Dirtiable until the end of the
revocation pass, after UDCRG catches up to UCRG, any write through the Dirtiable PTE
will promote it to Dirty in the correct epoch.

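Applying the delay to the scenario described earlier in this subsection gives the following Python sketch (illustrative; \texttt{deferred} and the helper names are hypothetical). The visited page stays Dirty, with CRG equal to the new UCRG, until UDCRG has caught up; only then is it demoted, so a later store through the Dirtiable PTE records the current epoch and the page is scanned in the next epoch.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    cw: int
    crg: int


def unvisited(pte: Pte, ucrg: int) -> bool:
    """Assumed rule: scan and load barrier act on Dirty PTEs with CRG != UCRG."""
    return pte.cw == 1 and pte.crg != ucrg


def cap_store(pte: Pte, udcrg: int) -> None:
    """Two-phase store rule (Clean PTEs, which trap, are not modelled)."""
    if pte.cw == 0 and pte.crg == 1:  # Dirtiable -> Dirty
        pte.cw, pte.crg = 1, udcrg


udcrg, ucrg = 0, 0
page = Pte(cw=1, crg=0)   # Dirty during epoch E0
deferred: list[Pte] = []  # PTEs to demote once the pass is over

ucrg ^= 1              # E0 -> E1: revocation phase (UDCRG=0, UCRG=1)
page.crg = ucrg        # revoker visits the page; it stays Dirty
deferred.append(page)  # demotion to Dirtiable is postponed

cap_store(page, udcrg)  # racing store: the PTE is Dirty, nothing changes

udcrg = ucrg            # background scan completes
for pte in deferred:    # batch Dirty -> Dirtiable demotion
    pte.cw, pte.crg = 0, 1
deferred.clear()

cap_store(page, udcrg)  # steady-state store: CRG <- UDCRG = 1 (current epoch)

ucrg ^= 1               # E1 -> E2
print(unvisited(page, ucrg))  # True: the page is scanned in E2
```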
Two important assumptions must be noted here.
\begin{enumerate}
\item Delaying a Dirty to Dirtiable PTE transition is safe because the Dirtiable
state is used to encode that the page \textit{may} be cap-clean. This means
that the page will still be scanned during the next revocation pass and the
revoker naturally tolerates racing capability writes between the page visit
and the PTE update.
\item The PTE update to Dirtiable does not require any synchronisation. In
particular, it neither depends on the number of aliasing mappings nor
requires a TLB invalidation. This stems from the fact that the revoker already
tolerates TLB cap-dirtiness drift to a degree.
\end{enumerate}

Therefore, the only cost of this solution should be the need to record
the PTE entries that must be moved to the Dirtiable state.
A possible implementation could use a fixed-size array to track these PTE entries,
so that memory allocation during the revocation pass is avoided.

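A minimal sketch of such a record is shown below in Python (hypothetical class and method names); the preallocated list stands in for a fixed-size array. The text does not specify what happens when the array is full: the sketch assumes that entries which do not fit simply remain Dirty, which errs on the side of scanning more pages rather than fewer.

```python
class DeferredDemotions:
    """Fixed-capacity record of PTEs to demote to Dirtiable after a pass.

    Hypothetical helper: storage is allocated once, up front, so recording an
    entry during the revocation pass never allocates.
    """

    def __init__(self, capacity: int) -> None:
        self._slots = [None] * capacity  # stands in for a fixed-size array
        self._count = 0

    def record(self, pte) -> bool:
        """Remember a visited PTE; returns False when the array is full.

        ASSUMPTION: on overflow the caller leaves the PTE Dirty, which is
        conservative because Dirty pages are scanned again next epoch.
        """
        if self._count == len(self._slots):
            return False
        self._slots[self._count] = pte
        self._count += 1
        return True

    def flush(self, udcrg: int, ucrg: int, demote) -> int:
        """Demote all recorded PTEs; only valid once UDCRG has caught up."""
        if udcrg != ucrg:
            raise RuntimeError("revocation pass still in progress")
        n = self._count
        for i in range(n):
            demote(self._slots[i])
            self._slots[i] = None
        self._count = 0
        return n


if __name__ == "__main__":
    demoted = []
    record = DeferredDemotions(capacity=2)
    # Strings stand in for references to PTEs.
    print([record.record(pte) for pte in ("pte_a", "pte_b", "pte_c")])
    # [True, True, False]: "pte_c" stays Dirty
    print(record.flush(udcrg=1, ucrg=1, demote=demoted.append), demoted)
    # 2 ['pte_a', 'pte_b']
```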
Finally, although this is an implementation detail, when the Cornucopia Reloaded
revoker encounters a Dirtiable PTE entry from a previous epoch, it never leaves
it as Dirtiable. In particular, it moves the entry either to the Dirty state if a
capability is found in the page, or to the Clean state if no
capabilities are found.
This simplifies the delay technique, which requires that no
Dirtiable PTE entries exist between the time of the visit and the end of the
revocation pass.
If the revoker did leave existing Dirtiable PTE entries as Dirtiable, it
would need to temporarily promote them to Dirty and demote them again
to Dirtiable as part of the batch operation at the end of the scan.

% >>>
\section{Recursive Mutable Load Permission} % <<<
\label{app:exp:recmutload}

cheri.bib

Lines changed: 28 additions & 0 deletions
@@ -16421,6 +16421,24 @@ @inproceedings{cornucopia
recent = {true}
}

@inproceedings{cornucopia-reloaded,
author = {Filardo, Nathaniel Wesley and Gutstein, Brett F. and Woodruff, Jonathan and Clarke, Jessica and Rugg, Peter and Davis, Brooks and Johnston, Mark and Norton, Robert and Chisnall, David and Moore, Simon W. and Neumann, Peter G. and Watson, Robert N. M.},
title = {Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety},
year = {2024},
isbn = {9798400703850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3620665.3640416},
doi = {10.1145/3620665.3640416},
abstract = {Violations of temporal memory safety ("use after free", "UAF") continue to pose a significant threat to software security. The CHERI capability architecture has shown promise as a technology for C and C++ language reference integrity and spatial memory safety. Building atop CHERI, prior works - CHERIvoke and Cornucopia - have explored adding heap temporal safety. The most pressing limitation of Cornucopia was its impractical "stop-the-world" pause times. We present Cornucopia Reloaded, a re-designed drop-in replacement implementation of CHERI temporal safety, using a novel architectural feature - a per-page capability load barrier, added in Arm's Morello prototype CPU and CHERI-RISC-V - to nearly eliminate application pauses. We analyze the performance of Reloaded as well as Cornucopia and CHERIvoke on Morello, using the CHERI-compatible SPEC CPU2006 INT workloads to assess its impact on batch workloads and using pgbench and gRPC QPS as surrogate interactive workloads. Under Reloaded, applications no longer experience significant revocation-induced stop-the-world periods, without additional wall- or CPU-time cost over Cornucopia and with median 87\% of Cornucopia's DRAM traffic overheads across SPEC CPU2006 and < 50\% for pgbench.},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages = {251--268},
numpages = {18},
keywords = {capability revocation, CHERI, temporal safety, use after free},
location = {La Jolla, CA, USA},
series = {ASPLOS '24}
}

@TechReport{UCAM-CL-TR-940,
author = {Nienhuis, Kyndylan and Joannou, Alexandre and Fox, Anthony
and Roe, Michael and Bauereiss, Thomas and Campbell, Brian
@@ -16537,6 +16555,16 @@ @manual{arm-morello
label = {Arm}
}

@manual{riscv-cheri-draft-0-9-3,
title={{RISC-V Specification for CHERI Extensions (v0.9.3 pre-release)}},
%url = {https://github.com/riscv/riscv-cheri/releases/tag/v0.9.3-prerelease},
organization = {RISC-V International},
year = 2025,
month = 01,
day = 16,
label = {}
}

@inproceedings{margaritov2019prefetched,
title={Prefetched address translation},
author={Margaritov, Artemiy and Ustiugov, Dmitrii and Bugnion, Edouard and Grot, Boris},
