Add Caching backend to Trustee and use it for caching AMD VCEKs #989

fitzthum · 2025-10-07T17:24:46Z

Introduce a Cache backend to Trustee. This is added as a new package in deps to avoid circular dependencies (since we use the Cache in the AS and the Verifier). We can use the same cache logic for the KBS and RVPS in the future.

Introduce one simple caching backend which is just a HashMap. This doesn't support any kind of eviction or expiration, but that can be implemented in a more complex caching backend.

The AS config file can be populated with initial cache values that will be added to whatever caching backend is used at startup. This allows cache values (such as VCEKs) to be pre-provisioned in offline environments. cc @bpradipt

Update the SNP verifier to use the cache when fetching the VCEK. Use the VCEK URL as the key for the cache. I haven't tested this with an SNP system yet. cc @ryansavino

Add an interface for caching in the attestation service. Often verifying an attestation involves fetching some information (such as a VCEK). When verifying the same evidence or evidence from the same host many times, this data can be cached. Here we introduce a very simple caching trait and a simple caching backend. Note that the cache can also be provisioned via a config file at startup. This allows for pre-provisioning certificates in offline environments. The caching interface is added as its own package in deps to avoid circular dependencies between the AS and the Verifiers, which both use cache code. This will also allow us to re-use the cache backends for the KBS and RVPS if needed. Signed-off-by: Tobin Feldman-Fitzthum <[email protected]>

Let's use the cache for VCEKs. This will reduce traffic to the KDS, eliminate timeouts when we retry the attestation several times in succession, and allow air-gapped environments to pre-provsion VCEKs. Use the URL for the KDS as the key, since it has all the information needed to define a particular VCEK. Signed-off-by: Tobin Feldman-Fitzthum <[email protected]>

mythi · 2025-10-08T06:04:34Z

The AS config file can be populated with initial cache values that will be added to whatever caching backend is used at startup. This allows cache values (such as VCEKs) to be pre-provisioned in offline environments

This is done differently for other architectures which means the Trustee user experience is not consistent. Did you consider an option to have an external "caching service" for SNP?

bpradipt · 2025-10-08T13:02:23Z

attestation-service/src/config.rs

 use crate::rvps::RvpsConfig;
 use crate::token::AttestationTokenConfig;

+use cache::CacheConfig;


Adding a doc for the caching backend will be helpful, including some guidance on implementing other cache types like local directory with pre-populated certs.

fitzthum · 2025-10-08T13:44:57Z

This is done differently for other architectures which means the Trustee user experience is not consistent. Did you consider an option to have an external "caching service" for SNP?

I mean as soon as any platform has its own external caching, the user experience is not going to be consistent. Either way this PR just adds an additional option for people. You could even use this alongside some external service. I am not thinking this is an SNP-specific feature. That is just the platform that I happen to know the details of. Hopefully other verifiers will follow.

mythi · 2025-10-09T15:48:55Z

This is done differently for other architectures which means the Trustee user experience is not consistent. Did you consider an option to have an external "caching service" for SNP?

I mean as soon as any platform has its own external caching, the user experience is not going to be consistent. Either way this PR just adds an additional option for people. You could even use this alongside some external service. I am not thinking this is an SNP-specific feature. That is just the platform that I happen to know the details of. Hopefully other verifiers will follow.

Besides SNP offline attestation, what other use-cases you have in mind? Asking users to add arbitrary blobs in global config sounds a bit odd. TDX uses external caching and I cannot think of ways to move to use this features. That's the reason I was asking did you consider anything similar. It would be more consistent for users to have verifiers to config vendor service or a caching proxy URL...

fitzthum · 2025-10-09T18:17:48Z

Besides SNP offline attestation, what other use-cases you have in mind?

This is also important for normal SNP attestation. Pre-provisioning is more of an edge case. There is approximately one VCEK per node meaning that many attestation requests will have the same one. There is no need to contact the KDS for every request. Beyond that the KDS has some rate limiting features that will block anyone making too many requests at once. We run into this when the first attestation fails and either the RCAR code in the AA or Kubernetes retries. This leads to 429 errors in the CI, which are misleading.

We can implement something very similar for CSV, which has the same architecture. I'm not sure about other platforms, but fetching certs like this is relatively common.

I think it's nice to give users this feature directly as part of Trustee without having to set anything up.

ryansavino · 2025-10-09T19:01:35Z

We can implement an external caching feature for SNP VEKs (and we have plans to look into this), but this PR feature could help alleviate some of the issues we've seen with the KDS rate limiting we've seen that @fitzthum mentioned. I think it's beneficial to have this for current support and add on later with an external cache feature.

mythi · 2025-10-10T09:03:39Z

I was not very clear. I see value in runtime buffering of values but that is hidden from the users. My main feedback was about CacheConfig and how that looks for the users. So far verifiers have implemented their own config logic (TDX and CCA ) for similar needs. Should the loading of offline VCEK into the cache be something snp-verifier takes care of so that the AS config does not need to change. Alternatively we could look to have a config structure that all verifiers could plug into?

fitzthum · 2025-10-10T18:40:09Z

Ok. I mentioned something related in the NVIDIA Verifier PR (which adds a verifier config via the AS config). I don't think verifiers should have out-of-band configuration files. Ideally all of the configuration will go through one file. I think it's nicer for users to just edit the AS config rather than having to figure out where some other file should go and how to format it.

It's true that for users with a lot of VCEKs it might be clunky to put all of these in the config file. In these cases we would probably want to implement another cache backend that uses the filesystem or even a database to store things persistently. This is future work. For now setting via the config file is a simple way to address the offline use case, which we know is significant.

Xynnn007

Let's consider a scenario where Intel TDX provides a hardware-endorsed certificate (PCK) caching service implementation (PCCS). Typically, cloud vendors maintain such a service to centrally manage the cloud cache (avoiding the single point of bottleneck caused by requesting Intel's official PCS, similar to AMD KDS). The rationale is that cloud machines frequently access these endorsement keys and certificates, eliminating the need to maintain multiple caching services.

In the AMD scenario (including CSV), while the KDS access API is relatively transparent, I still believe that implementing a CEK key caching service is AMD's responsibility. We can have the SNP verifier connect to a designated caching service; this caching service can be maintained by the user or the cloud.

A future I personally am more optimistic about is if the query interface for hardware endorsement can be unified (potentially with https://www.ietf.org/archive/id/draft-howard-rats-coserv-04.html and it will take a very long time and might fail), we can have a unified endorsement cache service implementation, but this implementation does not have to be included in KBS. We can open a new project, such as in CoCo or anywhere else to maintain it. The purpose of this implementation is to unify different architectures and vendors (PCCS, KDS cache, etc.)

mythi · 2025-10-13T05:50:22Z

Ok. I mentioned something related in the NVIDIA Verifier PR (which adds a verifier config via the AS config). I don't think verifiers should have out-of-band configuration files. Ideally all of the configuration will go through one file. I think it's nicer for users to just edit the AS config rather than having to figure out where some other file should go and how to format it.

I agree. I have gotten the necessary changes implemented on Intel DCAP side to be able to support this in Trustee and to get rid of the user-facing /etc/sgx_default_qcnl.conf if we had some generic CollateralConfig AS config that provides the necessary info. This could include service url/tls config, or file paths for this VCEK case.

mkulke

there is also https://crates.io/crates/http-cache-reqwest which is a bit less manual. It uses different heuristics, but it worked for me on KDS

fitzthum · 2025-10-13T16:15:26Z

In the AMD scenario (including CSV), while the KDS access API is relatively transparent, I still believe that implementing a CEK key caching service is AMD's responsibility. We can have the SNP verifier connect to a designated caching service; this caching service can be maintained by the user or the cloud.

I see where you're coming from, but practically speaking, we have a number of people asking for support for this (there are already downstream patches for something similar) and AMD have made no indication that they are going to support anything.

One thing to think about is that for AMD the VCEK can be supplied from the host at boot via an "extended report" or from the verifier. We don't have support on the host side currently (we used to before some kernel APIs changed), but I think in cases where the cloud caches the VCEK that is where they would plug in. I see these two approaches as complimentary. Maybe some clouds will cache the VCEK via the extended report feature. There are also cases where a client might want to load the vceks for just a few nodes in an offline environment. I don't see any issue with supporting both especially since there is currently no other alternative.

Xynnn007 · 2025-10-14T02:33:22Z

I see where you're coming from, but practically speaking, we have a number of people asking for support for this (there are already downstream patches for something similar) and AMD have made no indication that they are going to support anything.

Yes, who implements the code does not matter, and obtaining endorsement keys is a common scenario. My concern is that maintaining endorsement key management code in the AS, especially the Verifier (which is better to be stateless, thus code logic and storage should be decoupled, hopefully), will lead to coupled code logic and extra maintaince burden. I have a few suggestions for your reference.

In the long term, it would be best to implement a separate AMD KDS cache implementation with the same API as KDS. We could configure the KDS address in the SNP verifier configuration file to point to this cache service. We can also, as I said, looking forward to a general endorsement cache service.
In the short term, we can solve the problem as simply as possible by using the external HTTP cache library mentioned by @mkulke . It supports a local cache path, which can be configured in the current SNP Verifier configuration file. There is an example code for request to use StreamingCache (a cache based on local fs)

mkulke · 2025-10-14T08:23:41Z

this is the example I used, I think it uses a bunch of can-I-cache-this heuristics, it worked for me

use http_cache_reqwest::{CACacheManager, Cache, CacheMode, HttpCache, HttpCacheOptions};
use reqwest::Client;
use reqwest_middleware::{ClientBuilder, Result};

const URL: &str = "https://kdsintf.amd.com/vcek/v1/Milan/3ac3fe21e13fb0990eb28a802e3fb6a29483a6b0753590c951bdd3b8e53786184ca39e359669a2b76a1936776b564ea464cdce40c05f63c9b610c5068b006b5d";

#[tokio::main]
async fn main() -> Result<()> {
    let client = ClientBuilder::new(Client::new())
        .with(Cache(HttpCache {
            mode: CacheMode::Default,
            manager: CACacheManager::new("./cache".into(), true),
            options: HttpCacheOptions::default(),
        }))
        .build();
    println!("first attempt!"); 
    client.get(URL).send().await?; // <= takes long
    println!("second attempt!");
    client.get(URL).send().await?; // <= returns quickly
    println!("third attempt!");
    client.get(URL).send().await?; // <= returns quickly
    Ok(())
}

fitzthum · 2025-10-16T16:23:35Z

In the short term, we can solve the problem as simply as possible by using the external HTTP cache library mentioned by @mkulke . It supports a local cache path, which can be configured in the current SNP Verifier configuration file. There is an example code for request to use StreamingCache (a cache based on local fs)

It looks like the local cache stores things as binary, so it would be a bit tricky to pre-provision this. I think if we leave caching to some other service to figure out, there won't be an answer for it anytime soon. Lack of caching causes problems for us with users and in the CI today. I don't really see any advantages to using a separate service or any big reasons not to implement caching in Trustee itself (besides some added complexity). Either way, I'm not going to push on this too hard. If we want some other approach that is fine.

People who would value this feature please chime in @bpradipt @ryansavino

Xynnn007 · 2025-10-17T03:43:43Z

Ok, if you want to pre-provision keys, both by a new cache crate, or reqwest extension we need some operations or a script to do this. (local cache we need to download ceks, run command to insert them to the storage; reqwest, we need to run command to pre-download the ceks, and copy the cache binary to specific path).

When I tried to refactor all storage things in trustee, I met some overlap work as this. There is a comment here.

please take a look if we could also use a common db for this?

fitzthum · 2025-10-17T14:11:15Z

Btw a somewhat related issue is whether the verifiers should be stateful at all. Currently we create a new verifier for each request to the AS. This means that today the only way to have any kind of persistent state with the verifier would be through some kind of cache or storage.

Xynnn007 · 2025-10-18T05:52:33Z

Btw a somewhat related issue is whether the verifiers should be stateful at all. Currently we create a new verifier for each request to the AS. This means that today the only way to have any kind of persistent state with the verifier would be through some kind of cache or storage.

This is hard to have a clean border. In code we have been stateless for now. But for TDX/SGX, its underlying DCAP library will connect a cache service PCCS that stores the endorsements of the PCKs. As we have a verifier config now, it's accept to specify an outer service or API who stores the state to be connected, or a path that stores some state I think.

ryansavino · 2025-10-30T19:27:01Z

Was talking to @amd-aliem, seems like this PR implementation is pretty simple and easy to configure for multiple use cases. We're planning to implement a more permanent caching service solution, but was thinking this was a good interim solution. Can we revive this PR and merge? @amd-aliem has tested this PR and confirmed everything is working. We can add documentation describing this too.

Xynnn007 · 2025-10-31T02:01:40Z

hi @ryansavino Regretfully The code of this PR is somehow outdated and conflicts with current code, so we might need to update the PR or make a new PR to support this. Based on the above discussion, there are currently two methods to implement VCEK caching. Luckily neither of them is hard to implement

Using a KV-cache store. We currently have a KV-Cache crate independent on file-system that can assist in this functionality; we need to add the database (postgres) to the Trustee cluster to work as the KV-cache. If cache migration is required, the content in the database needs to be migrated to the corresponding machine's database. It's easy to pre-configure some VCEKs before first run, in a way that pre-download the VCEKs and use sql to insert.
Implementing caching through the simple reqwest library's caching mechanism. The code can be much simpler than 1. If cache migration is required, a local directory needs to be copied to the corresponding machine. It's not that easy to pro-configure VCEKs before first run , but a a client tool can help.

A question for you is which one do you prefer?

cc @fitzthum

ryansavino · 2025-10-31T16:24:07Z

I don't think option 1 (KV cache store) is a good idea if we're planning on implementing a more comprehensive caching service. Support for the database, migrating, updating, backwards compatibility will be more than I think we want to deal with in regards to a barebones or temporary solution.

Option 2 sounds okay. I'm not too sure what exactly is involved here, or how different it would be from this PR. But I think a configurable simple file cache is the best way to configure this until we have support for a more comprehensive service. I think the implementation in this PR makes it simpler to load vceks initially.

fitzthum · 2025-10-31T16:31:53Z

The conflicts in this PR are probably mostly related to some extra fields being added to the config file. Shouldn't be too difficult to resolve.

There is a bigger question about the relationship between the cache and the KV Storage. They are similar, but a bit different. For instance, we don't really need the cache to be persistent or shared between Trustee instances. On the other hand it would be nice if there were an easy way to pre-provision the cache, which we don't entirely have with storage. So I think they are two different things and I wouldn't mind having both abstractions. We could also have a caching backend that points to KV Storage.

That said, this does all add complexity so maybe it's fine to just use the second option even though it's not as nice for pre-provisioning. I don't have a strong opinion here.

Xynnn007 · 2025-11-01T06:38:22Z

@fitzthum I can help to take a view of provisioning things based on reqwest cache. Principlely, it should not be that hard to use some simple scripts or logic to convert a vcek to a cache.

mythi · 2025-11-07T06:25:58Z

btw #784 seems related topic.

fitzthum · 2025-11-07T14:13:47Z

btw #784 seems related topic.

Yes, definitely related, but I think Larry moved to a different company.

amd-aliem · 2025-11-07T22:27:38Z

Prototyped & tested a solution following the example from @mkulke #989 (comment) - seems to work fine, for my setup I added a nw rule to block traffic to the amd kds and saw it fail w/o prototype enabled and succeed with it enabled.

Likely will change the config format to be more specific (so it's not too big of a deal to keep supporting it alongside new better options in the future), but would appreciate some feedback on this doc to see if there are glaring issues w/ design: https://github.com/amd-aliem/trustee/blob/amd-cert-cache/attestation-service/docs/amd-offline-certificate-cache.md

If not I'll open a PR for review new week hopefully

Xynnn007 · 2025-11-10T01:11:46Z

@amd-aliem Thanks for this! That's the way I prefer to go and I'd like to take a review.

fitzthum requested a review from a team as a code owner October 7, 2025 17:24

fitzthum force-pushed the caching branch 3 times, most recently from 0da0952 to 2287917 Compare October 7, 2025 20:55

fitzthum added 2 commits October 7, 2025 13:58

fitzthum force-pushed the caching branch from 2287917 to 8896ada Compare October 7, 2025 20:58

bpradipt reviewed Oct 8, 2025

View reviewed changes

Xynnn007 reviewed Oct 11, 2025

View reviewed changes

mkulke reviewed Oct 13, 2025

View reviewed changes

fitzthum mentioned this pull request Oct 13, 2025

Add TPM verifier #851

Merged

Xynnn007 mentioned this pull request Oct 17, 2025

Add deps/key-value-storage crate #1000

Merged

amd-aliem mentioned this pull request Nov 14, 2025

snp-verifier: http_cache_reqwest to cache vceks #1061

Merged

Add Caching backend to Trustee and use it for caching AMD VCEKs #989

Are you sure you want to change the base?

Add Caching backend to Trustee and use it for caching AMD VCEKs #989

Uh oh!

Conversation

fitzthum commented Oct 7, 2025

Uh oh!

mythi commented Oct 8, 2025

Uh oh!

bpradipt Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

fitzthum commented Oct 8, 2025

Uh oh!

mythi commented Oct 9, 2025

Uh oh!

fitzthum commented Oct 9, 2025

Uh oh!

ryansavino commented Oct 9, 2025

Uh oh!

mythi commented Oct 10, 2025

Uh oh!

fitzthum commented Oct 10, 2025

Uh oh!

Xynnn007 left a comment

Choose a reason for hiding this comment

Uh oh!

mythi commented Oct 13, 2025

Uh oh!

mkulke left a comment

Choose a reason for hiding this comment

Uh oh!

fitzthum commented Oct 13, 2025

Uh oh!

Xynnn007 commented Oct 14, 2025

Uh oh!

mkulke commented Oct 14, 2025

Uh oh!

fitzthum commented Oct 16, 2025

Uh oh!

Xynnn007 commented Oct 17, 2025

Uh oh!

fitzthum commented Oct 17, 2025

Uh oh!

Xynnn007 commented Oct 18, 2025

Uh oh!

ryansavino commented Oct 30, 2025

Uh oh!

Xynnn007 commented Oct 31, 2025

Uh oh!

ryansavino commented Oct 31, 2025

Uh oh!

fitzthum commented Oct 31, 2025

Uh oh!

Xynnn007 commented Nov 1, 2025

Uh oh!

mythi commented Nov 7, 2025

Uh oh!

fitzthum commented Nov 7, 2025

Uh oh!

amd-aliem commented Nov 7, 2025

Uh oh!

Xynnn007 commented Nov 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants