-
Notifications
You must be signed in to change notification settings - Fork 142
Add Caching backend to Trustee and use it for caching AMD VCEKs #989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
0da0952 to
2287917
Compare
Add an interface for caching in the attestation service. Often verifying an attestation involves fetching some information (such as a VCEK). When verifying the same evidence or evidence from the same host many times, this data can be cached. Here we introduce a very simple caching trait and a simple caching backend. Note that the cache can also be provisioned via a config file at startup. This allows for pre-provisioning certificates in offline environments. The caching interface is added as its own package in deps to avoid circular dependencies between the AS and the Verifiers, which both use cache code. This will also allow us to re-use the cache backends for the KBS and RVPS if needed. Signed-off-by: Tobin Feldman-Fitzthum <[email protected]>
Let's use the cache for VCEKs. This will reduce traffic to the KDS, eliminate timeouts when we retry the attestation several times in succession, and allow air-gapped environments to pre-provsion VCEKs. Use the URL for the KDS as the key, since it has all the information needed to define a particular VCEK. Signed-off-by: Tobin Feldman-Fitzthum <[email protected]>
This is done differently for other architectures which means the Trustee user experience is not consistent. Did you consider an option to have an external "caching service" for SNP? |
| use crate::rvps::RvpsConfig; | ||
| use crate::token::AttestationTokenConfig; | ||
|
|
||
| use cache::CacheConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a doc for the caching backend will be helpful, including some guidance on implementing other cache types like local directory with pre-populated certs.
I mean as soon as any platform has its own external caching, the user experience is not going to be consistent. Either way this PR just adds an additional option for people. You could even use this alongside some external service. I am not thinking this is an SNP-specific feature. That is just the platform that I happen to know the details of. Hopefully other verifiers will follow. |
Besides SNP offline attestation, what other use-cases you have in mind? Asking users to add arbitrary blobs in global config sounds a bit odd. TDX uses external caching and I cannot think of ways to move to use this features. That's the reason I was asking did you consider anything similar. It would be more consistent for users to have verifiers to config vendor service or a caching proxy URL... |
This is also important for normal SNP attestation. Pre-provisioning is more of an edge case. There is approximately one VCEK per node meaning that many attestation requests will have the same one. There is no need to contact the KDS for every request. Beyond that the KDS has some rate limiting features that will block anyone making too many requests at once. We run into this when the first attestation fails and either the RCAR code in the AA or Kubernetes retries. This leads to 429 errors in the CI, which are misleading. We can implement something very similar for CSV, which has the same architecture. I'm not sure about other platforms, but fetching certs like this is relatively common. I think it's nice to give users this feature directly as part of Trustee without having to set anything up. |
|
We can implement an external caching feature for SNP VEKs (and we have plans to look into this), but this PR feature could help alleviate some of the issues we've seen with the KDS rate limiting we've seen that @fitzthum mentioned. I think it's beneficial to have this for current support and add on later with an external cache feature. |
|
I was not very clear. I see value in runtime buffering of values but that is hidden from the users. My main feedback was about |
|
Ok. I mentioned something related in the NVIDIA Verifier PR (which adds a verifier config via the AS config). I don't think verifiers should have out-of-band configuration files. Ideally all of the configuration will go through one file. I think it's nicer for users to just edit the AS config rather than having to figure out where some other file should go and how to format it. It's true that for users with a lot of VCEKs it might be clunky to put all of these in the config file. In these cases we would probably want to implement another cache backend that uses the filesystem or even a database to store things persistently. This is future work. For now setting via the config file is a simple way to address the offline use case, which we know is significant. |
Xynnn007
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider a scenario where Intel TDX provides a hardware-endorsed certificate (PCK) caching service implementation (PCCS). Typically, cloud vendors maintain such a service to centrally manage the cloud cache (avoiding the single point of bottleneck caused by requesting Intel's official PCS, similar to AMD KDS). The rationale is that cloud machines frequently access these endorsement keys and certificates, eliminating the need to maintain multiple caching services.
In the AMD scenario (including CSV), while the KDS access API is relatively transparent, I still believe that implementing a CEK key caching service is AMD's responsibility. We can have the SNP verifier connect to a designated caching service; this caching service can be maintained by the user or the cloud.
A future I personally am more optimistic about is if the query interface for hardware endorsement can be unified (potentially with https://www.ietf.org/archive/id/draft-howard-rats-coserv-04.html and it will take a very long time and might fail), we can have a unified endorsement cache service implementation, but this implementation does not have to be included in KBS. We can open a new project, such as in CoCo or anywhere else to maintain it. The purpose of this implementation is to unify different architectures and vendors (PCCS, KDS cache, etc.)
I agree. I have gotten the necessary changes implemented on Intel DCAP side to be able to support this in Trustee and to get rid of the user-facing |
mkulke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is also https://crates.io/crates/http-cache-reqwest which is a bit less manual. It uses different heuristics, but it worked for me on KDS
I see where you're coming from, but practically speaking, we have a number of people asking for support for this (there are already downstream patches for something similar) and AMD have made no indication that they are going to support anything. One thing to think about is that for AMD the VCEK can be supplied from the host at boot via an "extended report" or from the verifier. We don't have support on the host side currently (we used to before some kernel APIs changed), but I think in cases where the cloud caches the VCEK that is where they would plug in. I see these two approaches as complimentary. Maybe some clouds will cache the VCEK via the extended report feature. There are also cases where a client might want to load the vceks for just a few nodes in an offline environment. I don't see any issue with supporting both especially since there is currently no other alternative. |
Yes, who implements the code does not matter, and obtaining endorsement keys is a common scenario. My concern is that maintaining endorsement key management code in the AS, especially the Verifier (which is better to be stateless, thus code logic and storage should be decoupled, hopefully), will lead to coupled code logic and extra maintaince burden. I have a few suggestions for your reference.
|
|
this is the example I used, I think it uses a bunch of can-I-cache-this heuristics, it worked for me use http_cache_reqwest::{CACacheManager, Cache, CacheMode, HttpCache, HttpCacheOptions};
use reqwest::Client;
use reqwest_middleware::{ClientBuilder, Result};
const URL: &str = "https://kdsintf.amd.com/vcek/v1/Milan/3ac3fe21e13fb0990eb28a802e3fb6a29483a6b0753590c951bdd3b8e53786184ca39e359669a2b76a1936776b564ea464cdce40c05f63c9b610c5068b006b5d";
#[tokio::main]
async fn main() -> Result<()> {
let client = ClientBuilder::new(Client::new())
.with(Cache(HttpCache {
mode: CacheMode::Default,
manager: CACacheManager::new("./cache".into(), true),
options: HttpCacheOptions::default(),
}))
.build();
println!("first attempt!");
client.get(URL).send().await?; // <= takes long
println!("second attempt!");
client.get(URL).send().await?; // <= returns quickly
println!("third attempt!");
client.get(URL).send().await?; // <= returns quickly
Ok(())
} |
It looks like the local cache stores things as binary, so it would be a bit tricky to pre-provision this. I think if we leave caching to some other service to figure out, there won't be an answer for it anytime soon. Lack of caching causes problems for us with users and in the CI today. I don't really see any advantages to using a separate service or any big reasons not to implement caching in Trustee itself (besides some added complexity). Either way, I'm not going to push on this too hard. If we want some other approach that is fine. People who would value this feature please chime in @bpradipt @ryansavino |
|
Ok, if you want to pre-provision keys, both by a new cache crate, or reqwest extension we need some operations or a script to do this. (local cache we need to download ceks, run command to insert them to the storage; reqwest, we need to run command to pre-download the ceks, and copy the cache binary to specific path). When I tried to refactor all storage things in trustee, I met some overlap work as this. There is a comment here. please take a look if we could also use a common db for this? |
|
Btw a somewhat related issue is whether the verifiers should be stateful at all. Currently we create a new verifier for each request to the AS. This means that today the only way to have any kind of persistent state with the verifier would be through some kind of cache or storage. |
This is hard to have a clean border. In code we have been stateless for now. But for TDX/SGX, its underlying DCAP library will connect a cache service PCCS that stores the endorsements of the PCKs. As we have a verifier config now, it's accept to specify an outer service or API who stores the state to be connected, or a path that stores some state I think. |
|
Was talking to @amd-aliem, seems like this PR implementation is pretty simple and easy to configure for multiple use cases. We're planning to implement a more permanent caching service solution, but was thinking this was a good interim solution. Can we revive this PR and merge? @amd-aliem has tested this PR and confirmed everything is working. We can add documentation describing this too. |
|
hi @ryansavino Regretfully The code of this PR is somehow outdated and conflicts with current code, so we might need to update the PR or make a new PR to support this. Based on the above discussion, there are currently two methods to implement VCEK caching. Luckily neither of them is hard to implement
A question for you is which one do you prefer? cc @fitzthum |
|
I don't think option 1 (KV cache store) is a good idea if we're planning on implementing a more comprehensive caching service. Support for the database, migrating, updating, backwards compatibility will be more than I think we want to deal with in regards to a barebones or temporary solution. Option 2 sounds okay. I'm not too sure what exactly is involved here, or how different it would be from this PR. But I think a configurable simple file cache is the best way to configure this until we have support for a more comprehensive service. I think the implementation in this PR makes it simpler to load vceks initially. |
|
The conflicts in this PR are probably mostly related to some extra fields being added to the config file. Shouldn't be too difficult to resolve. There is a bigger question about the relationship between the cache and the KV Storage. They are similar, but a bit different. For instance, we don't really need the cache to be persistent or shared between Trustee instances. On the other hand it would be nice if there were an easy way to pre-provision the cache, which we don't entirely have with storage. So I think they are two different things and I wouldn't mind having both abstractions. We could also have a caching backend that points to KV Storage. That said, this does all add complexity so maybe it's fine to just use the second option even though it's not as nice for pre-provisioning. I don't have a strong opinion here. |
|
@fitzthum I can help to take a view of provisioning things based on reqwest cache. Principlely, it should not be that hard to use some simple scripts or logic to convert a vcek to a cache. |
|
btw #784 seems related topic. |
Yes, definitely related, but I think Larry moved to a different company. |
|
Prototyped & tested a solution following the example from @mkulke #989 (comment) - seems to work fine, for my setup I added a nw rule to block traffic to the amd kds and saw it fail w/o prototype enabled and succeed with it enabled. Likely will change the config format to be more specific (so it's not too big of a deal to keep supporting it alongside new better options in the future), but would appreciate some feedback on this doc to see if there are glaring issues w/ design: https://github.com/amd-aliem/trustee/blob/amd-cert-cache/attestation-service/docs/amd-offline-certificate-cache.md If not I'll open a PR for review new week hopefully |
|
@amd-aliem Thanks for this! That's the way I prefer to go and I'd like to take a review. |
Introduce a Cache backend to Trustee. This is added as a new package in
depsto avoid circular dependencies (since we use the Cache in the AS and the Verifier). We can use the same cache logic for the KBS and RVPS in the future.Introduce one simple caching backend which is just a HashMap. This doesn't support any kind of eviction or expiration, but that can be implemented in a more complex caching backend.
The AS config file can be populated with initial cache values that will be added to whatever caching backend is used at startup. This allows cache values (such as VCEKs) to be pre-provisioned in offline environments. cc @bpradipt
Update the SNP verifier to use the cache when fetching the VCEK. Use the VCEK URL as the key for the cache. I haven't tested this with an SNP system yet. cc @ryansavino