@@ -108,6 +108,285 @@ \subsection{Non-Temporal (Streaming) CLC} % <<<
108108on the LL/SC link flag. In some toy examples, this seems to make full SMR
109109hazards unnecessary. Unclear that it is worth pursuing.}
110110
111+ % >>>
112+ \section{Reduced PTE Usage of Architectural Revocation} % <<<
113+ \label{app:exp:dirtycrg}
114+
115+ \subsection{Motivation} % <<<
116+ The load-barrier revocation model in the style of Cornucopia Reloaded
117+ \cite{cornucopia-reloaded} has been demonstrated with the CHERI PTE extensions
118+ described in \cref{subsection:riscv:pagetables}, as well as with the Morello
119+ PTE extensions \cite{arm-morello}.
120+ These solutions are considered superset implementations, where the bits
121+ allocated in the PTEs allow a rich set of behaviours for experimentation.
122+ This comes at the cost of additional consumption of PTE bits, which are scarce.
123+
124+ As part of the CHERI RISC-V standardisation effort, it is necessary to
125+ implement hardware that efficiently supports a Cornucopia Reloaded-style revoker
126+ while minimising PTE bit usage. To this end, the CHERI RISC-V standard draft
127+ 0.9.3 \cite{riscv-cheri-draft-0-9-3} (henceforth referred to as Zcheri)
128+ introduces a 2-bit PTE extension that is meant to enable Cornucopia Reloaded
129+ revocation.
130+ This is a significant saving with respect to both the 5-bit extension introduced by
131+ \cref{subsection:riscv:pagetables} and the 4-bit extension in Morello,
132+ which focus on enabling experimentation.
133+ \amnote{Since presumably the details of previous sections may change, should I
134+ refer here to a specific CHERI ISA version? E.g. v9?}
135+
136+ \subsubsection{RISC-V Zcheri 0.9.3 PTE Extensions}
137+ The 2-bit PTE extension as specified in the CHERI RISC-V standard draft 0.9.3
138+ defines the CW and CRG bits. The CW bit is defined to control both capability
139+ loads and stores. The CRG bit behaves similarly to the CHERI ISAv9 CRG bit;
140+ however, when the CW bit is clear, the CRG bit is overloaded to encode the
141+ capability-dirty tracking state as follows:
142+
143+ \begin{center}
144+ %
145+ \begin{tabular}{ccl}
146+ \textbf{CW} & \textbf{CRG} & \textbf{Load Behavior} \\
147+ 0 & 0 & Capability loads strip tags on loaded result \\
148+ 0 & 1 & Capability loads strip tags on loaded result \\
149+ 1 & X & Generational load barrier, trap on load if Sstatus.UCRG $\neq$ CRG \\
150+ \end{tabular}
151+ %
152+ \begin{tabular}{ccl}
153+ \textbf{CW} & \textbf{CRG} & \textbf{Store Behavior} \\
154+ 0 & 0 & Trap on capability stores \\
155+ 0 & 1 & Track capability dirty, CW is set, CRG is set to Sstatus.UCRG \\
156+ 1 & X & Capability stores are unaltered \\
157+ \end{tabular}
158+ %
159+ \end{center}
160+
161+ For this discussion, we associate a name with each combination of the \{CW, CRG\}
162+ bits, according to the semantics of the Cornucopia Reloaded model.
163+ A PTE is in the \textit{Dirty} state when CW=1, in the
164+ \textit{Dirtiable} state when CW=0 and CRG=1, and in the \textit{Clean} state when
165+ both CW=0 and CRG=0.
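
As an illustration only, the two tables and the state names above can be
transcribed into a small executable model. This is a minimal sketch, not
taken from the Zcheri draft: the names \texttt{Pte}, \texttt{state\_name},
\texttt{cap\_load} and \texttt{cap\_store}, and the string outcomes, are
hypothetical and chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Pte:
    cw: int   # CW bit of the PTE
    crg: int  # CRG bit of the PTE


def state_name(pte):
    """Name the {CW, CRG} combination as done in the text."""
    if pte.cw == 1:
        return "Dirty"
    return "Dirtiable" if pte.crg == 1 else "Clean"


def cap_load(pte, ucrg):
    """Capability load outcome, following the Zcheri 0.9.3 load table."""
    if pte.cw == 0:
        # Clean and Dirtiable both strip tags on the loaded result.
        return "strip-tag"
    # Dirty: generational load barrier.
    return "trap" if ucrg != pte.crg else "ok"


def cap_store(pte, ucrg):
    """Capability store outcome; may update the PTE as the store table says."""
    if pte.cw == 1:
        return "ok"
    if pte.crg == 0:
        return "trap"
    # Dirtiable: track capability-dirty, set CW, set CRG to Sstatus.UCRG.
    pte.cw, pte.crg = 1, ucrg
    return "ok"
```

Note how the store through a Dirtiable PTE overwrites CRG: whatever
generation the entry might have carried is lost at that point.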
166+
167+ \subsubsection{Limitations of the Zcheri 0.9.3 PTE Extensions}
168+ The 2-bit PTE extension as specified in the CHERI RISC-V standard draft 0.9.3
169+ has some limitations that do not allow the Cornucopia Reloaded revoker model
170+ to be implemented. This is a consequence both of how the Reloaded revoker is
171+ tuned to reduce TLB shootdowns and of fundamental design choices of the
172+ Zcheri PTE bits extension.
173+
174+ The Zcheri PTE extension has two fundamental limitations:
175+ \begin{itemize}
176+ \item The CRG bit is overloaded to encode both the load-side generation and the
177+ capability-dirty tracking PTE state (also referred to as the Dirtiable
178+ state). This makes it impossible to retain the load-side generation
179+ information when a PTE is in the Dirtiable state.
180+ \item The Dirtiable PTE state interacts poorly with the CW bit
181+ semantics when dealing with aliasing pages. In particular, it is impossible
182+ to use the capability-dirty tracking state for aliasing pages that may
183+ contain capabilities.
184+ \end{itemize}
185+
186+ In particular, the Dirtiable state is problematic because the
187+ Reloaded revoker relies on two important properties:
188+ \begin{enumerate}
189+ \item The PTE can be configured to have load-side barrier semantics while in
190+ the Dirtiable state.
191+ \item A PTE transition from Dirtiable to Dirty as a
192+ result of a capability store leaves the CRG bit unchanged. This
193+ encodes the generation for the entry, depending on whether the
194+ revoker scan has reached it or not.
195+ \end{enumerate}
196+
197+ The Reloaded revoker uses the Dirtiable state as an intermediate state
198+ while transitioning a (presumed) capability-clean page from
199+ \textit{occupied}\footnote{A page is occupied if it holds or may recently have
200+ held capabilities.} to \textit{idle}\footnote{A page is idle if it does not
201+ contain any capabilities. Idle pages can be skipped by the revoker when scanning
202+ memory and play an important part in reducing the number of pages scanned.}.
203+ Because of the limitations outlined above, the revoker has to accept some
204+ trade-offs to use the Zcheri 0.9.3 PTE extension.
205+
206+ \begin{itemize}
207+ \item The Dirtiable state cannot be used for aliasing pages.
208+ This is because of both the tag-clearing semantics on loads and the
209+ CRG update rule.
210+ \item Transitions from Dirty to Dirtiable break the
211+ ability of the revoker to leave the TLB slightly cap-dirtier than
212+ the PTEs. This means that additional TLB invalidations are needed.
213+ \end{itemize}
214+
215+ These trade-offs stem from an analysis of possible races between the program
216+ and the revoker, which result in violations of the Cornucopia Reloaded
217+ invariants.
218+
219+ \subsection{Two-phase CRG Model}
220+ The two-phase CRG model is derived from the Zcheri 0.9.3 2-bit PTE extension,
221+ with changes to the Dirtiable state semantics and the addition of an
222+ extra bit in the Sstatus register.
223+ The additional bit in Sstatus is significantly less impactful than introducing
224+ a new PTE bit and, in general, it can be placed in another CSR if Sstatus
225+ register bits become scarce.
226+
227+ The UDCRG (User Dirty CRG) bit is added to the Sstatus register and is intended
228+ to complement the existing UCRG (User CRG) bit.
229+ The UDCRG bit represents the epoch that the revoker is currently closing,
230+ as opposed to the UCRG bit, which represents the epoch that is currently open.
231+ This enables the revoker to communicate to the hardware that
232+ a revocation sweep is in progress. As an aside, it is fairly easy to model
233+ this in a slightly different way, using a ``Revocation In Progress''
234+ bit instead.
235+
236+ The \{UDCRG, UCRG\} bit pair has the following architectural meaning:
237+ \begin{center}
238+ %
239+ \begin{tabular}{ccl}
240+ \textbf{UDCRG} & \textbf{UCRG} & \textbf{Behavior} \\
241+ 0 & 0 & Epoch $E_0$ steady state \\
242+ 0 & 1 & Revocation in progress for the epoch transition $E_0 \rightarrow E_1$ \\
243+ 1 & 0 & Revocation in progress for the epoch transition $E_1 \rightarrow E_0$ \\
244+ 1 & 1 & Epoch $E_1$ steady state \\
245+ \end{tabular}
246+ \end{center}
247+
248+ The UDCRG bit modifies the PTE CRG update rule when transitioning from the
249+ Dirtiable (CW=0 CRG=1) state to the Dirty state.
250+ The behavior of the CW and CRG bits is modified as follows:
251+ \begin{center}
252+ %
253+ \begin{tabular}{ccl}
254+ \textbf{CW} & \textbf{CRG} & \textbf{Load Behavior} \\
255+ 0 & 0 & Capability loads strip tags on loaded result \\
256+ 0 & 1 & Load fault when Sstatus.UCRG $\neq$ Sstatus.UDCRG \\
257+ 1 & X & Generational load barrier, trap on load if Sstatus.UCRG $\neq$ CRG \\
258+ \end{tabular}
259+ %
260+ \begin{tabular}{ccl}
261+ \textbf{CW} & \textbf{CRG} & \textbf{Store Behavior} \\
262+ 0 & 0 & Trap on capability stores \\
263+ 0 & 1 & Track capability dirty, CW is set, CRG is set to Sstatus.UDCRG \\
264+ 1 & X & Capability stores are unaltered \\
265+ \end{tabular}
266+ %
267+ \end{center}
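
The modified tables can be read as the following pair of functions. This is
a hedged sketch with hypothetical names (\texttt{cap\_load},
\texttt{cap\_store}) and string outcomes. One point is an assumption rather
than something the tables state: for a Dirtiable PTE with UCRG equal to
UDCRG the load table only specifies when a fault occurs, so the sketch
assumes that tags are stripped as in the Clean state.

```python
def cap_load(cw, crg, ucrg, udcrg):
    """Capability load outcome under the two-phase CRG load table."""
    if cw == 1:
        # Dirty: generational load barrier.
        return "trap" if ucrg != crg else "ok"
    if crg == 1 and ucrg != udcrg:
        # Dirtiable while a revocation sweep is in progress.
        return "fault"
    # Clean, or Dirtiable in the steady state.
    # ASSUMPTION: the Dirtiable steady-state case strips tags like Clean;
    # the table only gives the fault condition.
    return "strip-tag"


def cap_store(cw, crg, ucrg, udcrg):
    """Return (outcome, new_cw, new_crg) for a capability store."""
    if cw == 1:
        return ("ok", cw, crg)
    if crg == 0:
        return ("trap", cw, crg)
    # Dirtiable: set CW and take the generation from UDCRG, not UCRG.
    return ("ok", 1, udcrg)
```

Compared with Zcheri 0.9.3, the only differences are the load fault for
Dirtiable PTEs during a sweep and the use of UDCRG in the store-side update.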
268+
269+ \subsubsection{Rationale and Software Operation}
270+ The revoker is responsible for correctly switching the Sstatus.\{UDCRG, UCRG\}
271+ bits, in accordance with the revocation state machine.
272+ When moving from epoch $E_0$ to epoch $E_1$, the revoker toggles the
273+ Sstatus.UCRG bit and begins the background scan. This is unchanged from the
274+ Reloaded model; however, the UDCRG bit remains set to the previous epoch.
275+ When the revoker completes the background scan, UDCRG is set to UCRG;
276+ in other words, UDCRG ``catches up'' to the current epoch.
277+
278+ This effectively splits each epoch into two phases. The \textit{revocation
279+ phase} spans the period of time between the beginning of the epoch and the
280+ end of the background scan. The \textit{steady-state phase} starts when
281+ the background scan completes and lasts until the beginning of the next epoch.
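
The software side of this state machine can be sketched as follows. The
class and method names (\texttt{EpochBits}, \texttt{begin\_epoch},
\texttt{finish\_scan}) are hypothetical and stand in for whatever the revoker
uses to update Sstatus; only the two bit updates are taken from the text.

```python
class EpochBits:
    """Model of the Sstatus.{UDCRG, UCRG} pair as driven by the revoker."""

    def __init__(self):
        self.ucrg = 0   # epoch currently open
        self.udcrg = 0  # epoch the revoker is closing (lags during a sweep)

    def begin_epoch(self):
        # Toggle UCRG and start the background scan; UDCRG is left alone.
        self.ucrg ^= 1

    def finish_scan(self):
        # Background scan complete: UDCRG catches up to UCRG.
        self.udcrg = self.ucrg

    def phase(self):
        return "steady-state" if self.udcrg == self.ucrg else "revocation"


bits = EpochBits()
trace = [(bits.udcrg, bits.ucrg, bits.phase())]
for step in (bits.begin_epoch, bits.finish_scan,
             bits.begin_epoch, bits.finish_scan):
    step()
    trace.append((bits.udcrg, bits.ucrg, bits.phase()))
```

After the loop, \texttt{trace} visits the four rows of the \{UDCRG, UCRG\}
table in the order $E_0$ steady state, $E_0 \rightarrow E_1$, $E_1$ steady
state, $E_1 \rightarrow E_0$, and returns to the $E_0$ steady state.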
282+
283+ The revocation hardware assistance mechanism outlined above is designed to
284+ mitigate the loss of CRG information in the Dirtiable PTE state.
285+ It is important to consider the Zcheri Dirtiable state behavior during both
286+ the steady-state and revocation phases.
287+
288+ In the steady-state phase, the Zcheri Dirtiable PTE behavior is never
289+ problematic because there is no ambiguity about the load-side generation epoch
290+ associated with a PTE that becomes dirty: if a capability is written via a
291+ Dirtiable mapping, the PTE should be considered Dirty at the current epoch.
292+
293+ The Zcheri Dirtiable state is problematic during the revocation phase.
294+ In particular, when a capability is written via a Dirtiable PTE, the new value
295+ of the CRG bit cannot be determined without additional information.
296+ Consider the following two cases:
297+ \begin{enumerate}
298+ \item The background scan has not visited the page: the CRG bit should be set
299+ to the previous epoch CRG, so that the background scan will correctly detect
300+ that the PTE has not been visited yet.
301+ \item The background scan has visited the page: in this case the background scan
302+ has left the PTE in the Dirtiable state, so a subsequent write should upgrade
303+ the PTE to Dirty in the current epoch.
304+ \end{enumerate}
305+ The problem arises from the fact that the CRG update
306+ behavior depends on whether the revoker has visited the page or not,
307+ which cannot be determined by observing a Dirtiable PTE alone.
308+
309+ The two-phase CRG model attempts to recover this information without introducing
310+ additional PTE state. This comes at the cost of some changes to the Reloaded
311+ revoker.
312+ The working hypothesis is that these changes are less costly and invasive
313+ than other options, such as using a 3-bit PTE solution or implementing a
314+ revoker that does not rely on the Dirtiable state in this way.
315+
316+ The two-phase CRG model has some known limitations that require changes to the
317+ existing revoker implementations.
318+
319+ \begin{enumerate}
320+ \item A capability write via a Dirtiable PTE that has already been observed by
321+ the background scan breaks the load-side invariant in the next epoch.
322+ \item The transition of a page to Idle may require a TLB invalidation because
323+ there may be Dirtiable TLB entries.
324+ \end{enumerate}
325+
326+ \subsubsection{Dirtiable Transition Delay}
327+ The first two-phase CRG model limitation introduced above can be addressed
328+ by batching Dirty to Dirtiable transitions at the end of the revocation pass,
329+ without additional synchronisation requirements.
330+
331+ First, consider that the first limitation arises when the program writes a
332+ capability through a Dirtiable PTE after the revoker has visited the PTE,
333+ but before the revocation pass is finished.
334+ This may occur as a result of two conditions:
335+ \begin{enumerate}
336+ \item The revoker observed a Dirtiable PTE from the previous pass and
337+ left it in the Dirtiable state for the current epoch.
338+ \item The revoker has downgraded a Dirty PTE to Dirtiable as part of the visit.
339+ \end{enumerate}
340+
341+ When the program writes a capability through the Dirtiable PTE before the end of
342+ the revocation pass, the Sstatus.UDCRG is still set to the previous epoch
343+ (e.g. UDCRG=0, UCRG=1) and the CRG bit will be set to the previous epoch
344+ as a result.
345+
346+ This is a problem, because it is possible that the page is left untouched through
347+ the steady-state phase up to the next epoch. As the next epoch begins, we will
348+ flip UCRG again, so that UCRG for epochs $E_0$ and $E_2$ is the same.
349+ Under these conditions, the offending PTE will not be scanned for
350+ revocation and unrevoked capabilities from $E_1$ may leak to $E_2$.
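
The sequence of events can be traced step by step. The sketch below is
illustrative only: \texttt{dirtiable\_store\_crg} restates the two-phase
store rule, while \texttt{needs\_scan} encodes the assumption that the
revoker treats a Dirty PTE as unvisited exactly when its CRG differs from
UCRG. The second half shows the same store occurring once UDCRG has caught
up with UCRG.

```python
def dirtiable_store_crg(udcrg):
    """CRG written by a capability store through a Dirtiable PTE."""
    return udcrg


def needs_scan(pte_crg, ucrg):
    """ASSUMPTION: a Dirty PTE counts as unvisited iff CRG != UCRG."""
    return pte_crg != ucrg


# Epoch E1, revocation phase: UCRG was toggled, UDCRG still lags behind.
ucrg, udcrg = 1, 0
# The program stores a capability after the revoker has visited the PTE.
pte_crg = dirtiable_store_crg(udcrg)

# The scan completes and UDCRG catches up; the page is not touched again.
udcrg = ucrg

# Epoch E2 begins: UCRG flips back to the value it had in E0.
ucrg ^= 1
leaked = not needs_scan(pte_crg, ucrg)

# Same store, but performed after UDCRG has caught up in E1.
ucrg, udcrg = 1, 1
late_crg = dirtiable_store_crg(udcrg)
ucrg ^= 1  # E2 begins
late_scanned = needs_scan(late_crg, ucrg)
```

In the first trace the PTE ends up with the CRG value of $E_0$ and is
skipped in $E_2$; in the second it carries the CRG value of $E_1$ and is
picked up by the next scan.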
351+
352+ The proposed solution derives from the observation that the invalid CRG update
353+ can only occur in the window of time between the revoker's visit of the PTE and
354+ the end of the revocation pass, when the revoker leaves the PTE in the
355+ Dirtiable state.
356+ If we delay PTE transitions from Dirty to Dirtiable until the end of the
357+ revocation pass, after UDCRG catches up to UCRG, any write to the Dirtiable PTE
358+ will promote it to Dirty in the correct epoch.
359+
360+ There are two important assumptions that must be noted here.
361+ \begin{enumerate}
362+ \item Delaying a Dirty to Dirtiable PTE transition is safe because the Dirtiable
363+ state is used to encode that the page \textit{may} be cap-clean. This means
364+ that the page will still be scanned during the next revocation pass and the
365+ revoker naturally tolerates racing capability writes between the page visit
366+ and the PTE update.
367+ \item The PTE update to Dirtiable does not require any synchronisation. In
368+ particular, it does not depend on the number of aliasing mappings, nor does it
369+ require a TLB invalidation. This stems from the fact that the revoker already
370+ tolerates TLB cap-dirtiness drift to a degree.
371+ \end{enumerate}
372+
373+ Therefore, the only cost of this solution should be the need to record
374+ the PTEs that need to be moved to the Dirtiable state.
375+ A possible implementation could use a fixed-size array to track these PTEs,
376+ so that memory allocation during the revocation pass is avoided.
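
One way such a fixed-size array could look is sketched below. Everything
here is hypothetical: the class name \texttt{DeferredDowngrades}, the
dictionary representation of a PTE, and the overflow policy. On overflow the
sketch simply declines to record the entry, leaving the page Dirty, which is
assumed to be acceptable because delaying a Dirty to Dirtiable transition is
safe.

```python
class DeferredDowngrades:
    """Fixed-capacity log of PTEs to move from Dirty to Dirtiable."""

    def __init__(self, capacity):
        # Allocated once, before the revocation pass starts.
        self.slots = [None] * capacity
        self.count = 0

    def record(self, pte):
        """Remember a PTE during the scan; False means the log is full."""
        if self.count == len(self.slots):
            return False  # caller leaves the PTE Dirty
        self.slots[self.count] = pte
        self.count += 1
        return True

    def flush(self):
        """Apply the downgrades; call only after UDCRG has caught up."""
        for i in range(self.count):
            pte = self.slots[i]
            pte["cw"], pte["crg"] = 0, 1  # Dirtiable encoding
            self.slots[i] = None
        flushed = self.count
        self.count = 0
        return flushed
```

A revoker using this sketch would call \texttt{record} where it would
otherwise downgrade the PTE during the visit, and \texttt{flush} once the
scan has completed and UDCRG has been set to UCRG.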
377+
378+ Finally, although this is an implementation detail, when the Cornucopia Reloaded
379+ revoker encounters a Dirtiable PTE from a previous epoch, it never leaves
380+ it as Dirtiable. In particular, it is either moved to the Dirty state if a
381+ capability is found in the page, or to the Clean state if no
382+ capabilities are found.
383+ This makes the delay technique less problematic, because we cannot allow
384+ Dirtiable PTEs to exist between the time of visit and the end of the
385+ revocation pass.
386+ If the revoker left existing Dirtiable PTEs as Dirtiable, we
387+ would need to temporarily promote them to Dirty and demote them again
388+ to Dirtiable as part of the batch operation at the end of the scan.
389+
111390% >>>
112391\section {Recursive Mutable Load Permission } % <<<
113392\label {app:exp:recmutload }