-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] webhooks #7277
base: main
Are you sure you want to change the base?
[nexus] webhooks #7277
Conversation
51f7f8e
to
139cfe6
Compare
140aea4
to
0b80c8f
Compare
41cf0b0
to
2bc5925
Compare
I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that. |
Glob subscription entries in |
As far as GCing old events from the event table, dispatching an event should probably add a count of the number of receivers it was dispatched to, and then when we successfully deliver the event, we increment a count of successes. That way, we would not consider an event entry eligible to be deleted unless the two counts are equal; we want to hang onto events that weren't successfully delivered so any failed deliveries can be re-triggered. GCing an event would also clean up any child delivery attempt records. |
// This performs a `SELECT ... FOR UPDATE SKIP LOCKED` on the | ||
// `webhook_event` table, returning the oldest webhook event which has not | ||
// yet been dispatched to receivers and which is not actively being | ||
// dispatched in another transaction. | ||
// NOTE: it would be kinda nice if this query could also select the | ||
// webhook receivers subscribed to this event, but this requires | ||
// a `FOR UPDATE OF webhook_event` clause to indicate that we only wish | ||
// to lock the `webhook_event` row and not the receiver. | ||
// Unfortunately, I don't believe Diesel supports this at present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this isn't going to work with CRDB, as it doesn't actually implement SKIP LOCKED
. We're going to have to redesign the mechanism that prevents duplicate delivery creation. Another option could be to change the webhook_delivery
table to have a UNIQUE constraint on (rx_id, event_id)
, but this doesn't work with the way I wanted to implement the resend event API (creating a new entry in webhook_delivery
.
I think the correct thing to do here is something like:
INSERT INTO omicron.public.webhook_delivery (
{rx_id},
{event_id},
/* ... */
)
WHERE NOT EXISTS (
SELECT 1 FROM omicron.public.webhook_delivery
WHERE rx_id == {rx_id} AND event_id == {event_id}
)
so that we can prevent creating duplicate deliveries without a lock. But, I can't seem to find a way to do INSERT ... WHERE NOT EXISTS
in Diesel, so this probably needs to be a CTE...
This commit adds (unimplemented) public API endpoints for managing Nexus webhooks, as described in [RFD 364][1]. [1]: https://rfd.shared.oxide.computer/rfd/364#_external_api
Co-authored-by: Augustus Mayo <[email protected]>
It is worth noting that one big thing that's still somewhat missing is the SSRF prevention for webhook delivery requests, which I think we ought not do a release that includes webhooks but doesn't have SSRF prevention. Our plan for protecting against SSRF attacks is that, when constructing the HTTP client that sends webhook delivery requests, bind sockets only on OPTE network interfaces in release builds.1 Doing this should be pretty straightforward, but it unfortunately depends on upstream library changes:
I'm currently planning to hold off on actually merging this branch until this is finished, to avoid releasing an Omicron version that contains a possible SSRF vulnerability we'd have to do a security advisory for. However, it should be pretty straightforward to integrate once all the upstream library changes are released. Therefore, I'd like to start getting reviews on the rest of this branch now, while we wait for all that stuff to be released. Footnotes
|
jj, for whatever reason, is not confused by the diff like git is. The changes to the OpenAPI schema are in fact all adds, as expected. $ jj diff --stat -f main -t eliza/webhook-models openapi/nexus.json
openapi/nexus.json | 1109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1109 insertions(+), 0 deletions(-)
$ git diff --stat main...eliza/webhook-models -- openapi/nexus.json
openapi/nexus.json | 18375 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 9742 insertions(+), 8633 deletions(-) |
nexus/external-api/src/lib.rs
Outdated
path_params: Path<params::WebhookReceiverSelector>, | ||
) -> Result<HttpResponseOk<views::WebhookReceiver>, HttpError>; | ||
|
||
/// Create a new webhook receiver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should remove "a", "new", and the full stop to match the styling of these descriptions for the docs. E.g. "Create webhook receiver"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These become the docs sidebar text.

Here are my suggestions:
List delivery attempts to webhook receiver
Request re-delivery of webhook event
List webhook event classes
Fetch webhook event class
List webhook receivers
Create webhook receiver
Fetch webhook receiver
Update webhook receiver
Delete webhook receiver
Send liveness probe to webhook receiver
List secret IDs for webhook receiver
Add secret to webhook receiver
Remove secret from webhook receiver
Note get -> fetch
for consistency with the other endpoints. To be honest I think we could stand to use get everywhere instead and save two letters, but it's where we are right now. I took out the "configuration for webhook" — I admit it is potentially ambiguous, especially with webhooks, to say "fetch webhook receiver", but I lean toward being consistent with the other endpoints, plus the ambiguity is immediately clarified if you click the endpoint and see what it is. We can even add more description in the doc comment after a blank line if you're worried about it. It shows up on the detail page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was commenting on an outdated version of the list. Will update in a minute with suggestions for the full list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, 4f8c258 reworks the API docs as you've suggested. I also added body text to a few of the endpoints where it felt like there were additional bits that needed a more detailed explanation.
thanks @david-crespo and @benjaminleonard for advice!
Co-authored-by: David Crespo <[email protected]>
OMDB commands added in #7808 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only reviewed the schema so far, still need to go through db-queries, the background tasks, the tests, and the API
#[serde( | ||
default = "WebhookDeliveratorConfig::default_second_retry_backoff" | ||
)] | ||
pub second_retry_backoff_secs: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be a time "after the first retry", or relative to the start of the delivery time?
like, if we try to send a delivery, and fail, and then:
first = 5 seconds
second = 10 seconds
Are we expecting:
@ 0 seconds -> send and fail first delivery
@ 5 seconds -> send first retry delivery
@ 10 seconds -> send second retry delivery? Or is this actually at 15 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's supposed to be the time since the previous attempt. So, in your example, it would be 0, 5, and 15 seconds.
use serde::ser::{Serialize, Serializer}; | ||
use std::fmt; | ||
|
||
impl_enum_type!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How critical is it that these webhook event classes are strongly-typed, as opposed to being raw strings present as "data" rather than "schema" in the database?
Renaming or removing these variants will not be trivial for "cockroachdb is silly sometimes" reasons- see: https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#212-changing-enum-variants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great question, and it's something I've thought about a bit. My initial plan was, in fact, to just store these as strings in the database. @andrewjstone suggested that we represent them as an enum, instead, to avoid storing a whole bunch of duplicate copies of the same relatively small set of strings. This also has the advantage of tying the set of event classes quite closely to the database schema version, which is used to determine when glob subscriptions need to be reprocessed.
Another option, which occupies a middle ground between using an enum and just storing the string representation of the class for every event, would be to instead have a table of event class strings along with some numeric identifier, so we could represent them more concisely on disk using the numeric identifier. That way, they could be inserted or removed by queries rather than by a schema update.
Though that flexibility seems appealing, I'm not actually sure if it's a good thing. On updates that add new event classes, we'd probably have the new Nexus version go and run some queries to add the new classes to the database. This is a bit ad-hoc, and it means that it's at least theoretically possible to change the set of event classes without changing the schema version, so it's much harder to reason about whether the glob subscriptions are correct based on what's presently in the database. With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.
Finally, I'm not actually that convinced that being unable to easily remove event classes is that big of a problem. For enums that represent operational states (e.g. instance_state
or similar), or that represent a policy or mode of behavior, it's actually very important to not be able to represent defunct variants: code that consumes this data needs to handle every possible state or policy or whatever, and it's very unfortunate to have to have match arms or other cases for stuff that's no longer used that just panics or logs an error. In this case, however, these are basically just strings that we do regex matching on, and the enum is mostly used to reduce the size of the database record (and to tie the set of strings to the schema version, as I mentioned). So, if we leave behind some event classes that we've stopped using, but they're still there in the enum...honestly, who cares? They don't actually cause that big of a problem just by being there, and we can filter them out of the "list event classes" API response so that users don't get the idea that they still exist. It's a bit ugly to leave behind stuff that's no longer used, but the consequences are less bad than for something like VmmState
or SledPolicy
...
Hopefully that all makes some amount of sense. I'm certainly not that attached to this approach, but that was the rationale I was operating under.
impl std::str::FromStr for WebhookEventClass { | ||
type Err = EventClassParseError; | ||
fn from_str(s: &str) -> Result<Self, Self::Err> { | ||
for &class in Self::ALL_CLASSES { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many event classes do you think we're gonna end up with, if you had to place a bet, 2-3 years from now?
I don't think think this is a problem now, but I expect this enum will grow quite large as it expands to include at least "all possible faults" and "all possible invocations of the interface", right?
Any operation (outside of tests) that requires iteration over all variants seems like it's worth a little scrutiny IMO
// N.B.: perhaps we ought to use timestamp-based UUIDs for these? | ||
id: WebhookDeliveryUuid::new_v4().into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? What's wrong with using a new_v4
here? AFAICT there shouldn't be any need for determinism for the ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these delivery records represent something that happens at a point in time, I just thought it would be kind of nice to have ordering a list of deliveries by ID also order them temporally (at least, more or less). Yes, they also have timestamps, so it doesn't actually matter that much...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a similar situation with the audit log in #7339 and so far I've landed on paginating by (timestamp, ID)
because of possible ties on the timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, that might be better --- I hadn't wanted to paginate on timestamps for precisely that reason...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely, I'll have to borrow that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@david-crespo would you possibly be interested in factoring out your timestamp pagination bits to a separate PR so we could land it first and both depend on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i suppose i could also just copy and paste it, and whoever merges second could figure out the merge conflicts, but that feels comparatively unwholesome...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I’ll pull it out into a separate PR.
impl WebhookDeliveryAttempt { | ||
fn response_view(&self) -> Option<views::WebhookDeliveryResponse> { | ||
Some(views::WebhookDeliveryResponse { | ||
status: self.response_status? as u16, // i hate that this has to signed in the database... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be interested in https://github.com/oxidecomputer/omicron/blob/main/nexus/db-model/src/unsigned.rs - we have used different types here to make unexpectedly-signed values a serialization error when we try reading them from the database.
mod test { | ||
use super::*; | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we test some of the error cases too?
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>
This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:
delivery status
The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.
Future Work
Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):
webhook_delivery
,webhook_delivery_attempt
, andwebhook_event
CRDB tables (need to figure out a good retention policy for events)omdb db webhooks
commands for actually looking at the webhook database tablesNot currently planned, but possible future work:
/v1/webhooks/event-classes
endpoints, or both)fleet.viewer
(see RFD 538 Appendix B.3); probably requires service accounts