
[nexus] webhooks #7277

Open

hawkw wants to merge 162 commits into main from eliza/webhook-models

Conversation

@hawkw (Member) commented Dec 18, 2024

This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:

  • webhook-related external API endpoints (as described in RFD 538)
  • database tables for storing webhook receiver configurations and webhook events, and for tracking their delivery status
  • background tasks for actually delivering webhook events to receivers

The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.

Future Work

Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):

  • Garbage collection for old records in the webhook_delivery, webhook_delivery_attempt, and webhook_event CRDB tables (need to figure out a good retention policy for events)
  • omdb db webhooks commands for actually looking at the webhook database tables
  • Oximeter metrics tracking webhook delivery attempt outcomes and latencies

Not currently planned, but possible future work:

  • Actually record webhook events when stuff happens :)
  • Some mechanism for communicating JSON schemas for webhook event payloads (either via OpenAPI 3.1, by sticking JSON schemas in the /v1/webhooks/event-classes endpoints, or both)
  • Allow webhook receivers to have roles with more restrictive permissions than fleet.viewer (see RFD 538 Appendix B.3); probably requires service accounts
  • Track receiver liveness and alert when a receiver has gone away (see RFD 538 Appendix B.4)

@hawkw hawkw force-pushed the eliza/webhook-models branch from 51f7f8e to 139cfe6 on December 18, 2024 21:10
@hawkw hawkw changed the base branch from eliza/webhook-api to main December 18, 2024 21:11
@hawkw hawkw requested a review from augustuswm December 18, 2024 21:11
@hawkw hawkw force-pushed the eliza/webhook-models branch 2 times, most recently from 140aea4 to 0b80c8f on January 8, 2025 17:28
@hawkw hawkw changed the title from "[nexus] Webhook DB models" to "[nexus] webhooks" on Jan 11, 2025
@hawkw hawkw force-pushed the eliza/webhook-models branch from 41cf0b0 to 2bc5925 on January 17, 2025 19:20
@hawkw (Member Author) commented Jan 24, 2025

I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.

@hawkw (Member Author) commented Jan 24, 2025

Glob subscription entries in webhook_rx_event_glob should capture the schema version when they're created, so that we can trigger reprocessing (generating the exact event class subscriptions for those globs) if the schema has changed. It's probably fine for nexus to do glob reprocessing on startup rather than in a bg task, although online update might invalidate that assumption.
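
A rough sketch of what capturing that might look like on a glob row (field names and types here are illustrative assumptions, not the actual model):

use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Hypothetical shape of a `webhook_rx_event_glob` row (illustrative only).
pub struct WebhookRxEventGlob {
    pub rx_id: Uuid,
    pub glob: String,
    /// The schema version under which this glob's exact event-class
    /// subscriptions were last generated; if it no longer matches the running
    /// schema version, the glob needs reprocessing.
    pub schema_version: String,
    pub time_created: DateTime<Utc>,
}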

@hawkw (Member Author) commented Jan 24, 2025

As far as GCing old events from the event table: dispatching an event should probably record the number of receivers it was dispatched to, and each successful delivery should increment a count of successes. An event entry would then only be eligible for deletion once the two counts are equal; we want to hang onto events that weren't successfully delivered so that any failed deliveries can be re-triggered.

GCing an event would also clean up any child delivery attempt records.
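
As a rough illustration of that rule (the counters and retention window here are hypothetical, not the actual schema):

use chrono::{DateTime, Duration, Utc};

/// Illustrative only: an event becomes GC-eligible once every receiver it was
/// dispatched to has been successfully delivered to, and the event is older
/// than the retention window.
fn event_is_gc_eligible(
    num_dispatched: u64,
    num_delivered_ok: u64,
    time_dispatched: DateTime<Utc>,
    retention: Duration,
) -> bool {
    num_dispatched == num_delivered_ok && Utc::now() - time_dispatched > retention
}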

Comment on lines 61 to 70
// This performs a `SELECT ... FOR UPDATE SKIP LOCKED` on the
// `webhook_event` table, returning the oldest webhook event which has not
// yet been dispatched to receivers and which is not actively being
// dispatched in another transaction.
// NOTE: it would be kinda nice if this query could also select the
// webhook receivers subscribed to this event, but this requires
// a `FOR UPDATE OF webhook_event` clause to indicate that we only wish
// to lock the `webhook_event` row and not the receiver.
// Unfortunately, I don't believe Diesel supports this at present.
hawkw (Member Author):

Unfortunately, this isn't going to work with CRDB, as it doesn't actually implement SKip LOCKED. We're going to have to redesign the mechanism that prevents duplicate delivery creation. Another option could be to change the webhook_delivery table to have a UNIQUE constraint on (rx_id, event_id), but this doesn't work with the way I wanted to implement the resend event API (creating a new entry in webhook_delivery).

I think the correct thing to do here is something like:

INSERT INTO omicron.public.webhook_delivery (
    rx_id,
    event_id,
    /* ... */
)
SELECT
    {rx_id},
    {event_id},
    /* ... */
WHERE NOT EXISTS (
     SELECT 1 FROM omicron.public.webhook_delivery
     WHERE rx_id = {rx_id} AND event_id = {event_id}
)

so that we can prevent creating duplicate deliveries without a lock. But, I can't seem to find a way to do INSERT ... WHERE NOT EXISTS in Diesel, so this probably needs to be a CTE...
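
For what it's worth, here is a minimal sketch of falling back to raw SQL for the conditional insert (synchronous Diesel and made-up column names, shown only to illustrate the shape; the real thing would presumably be a proper CTE):

use diesel::prelude::*;
use diesel::sql_types::Uuid as SqlUuid;
use uuid::Uuid;

/// Illustrative sketch only: insert a delivery record unless one already
/// exists for this (rx_id, event_id) pair, without taking a row lock.
fn insert_delivery_if_absent(
    conn: &mut PgConnection,
    delivery_id: Uuid,
    rx_id: Uuid,
    event_id: Uuid,
) -> QueryResult<usize> {
    diesel::sql_query(
        "INSERT INTO omicron.public.webhook_delivery (id, rx_id, event_id) \
         SELECT $1, $2, $3 \
         WHERE NOT EXISTS ( \
             SELECT 1 FROM omicron.public.webhook_delivery \
             WHERE rx_id = $2 AND event_id = $3 \
         )",
    )
    .bind::<SqlUuid, _>(delivery_id)
    .bind::<SqlUuid, _>(rx_id)
    .bind::<SqlUuid, _>(event_id)
    .execute(conn)
}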

@hawkw (Member Author) commented Mar 13, 2025

It is worth noting that one big thing still missing is SSRF prevention for webhook delivery requests; I don't think we ought to do a release that includes webhooks but lacks SSRF prevention.

Our plan for protecting against SSRF attacks is, when constructing the HTTP client that sends webhook delivery requests, to bind sockets only on OPTE network interfaces in release builds.[1] Doing this should be pretty straightforward, but it unfortunately depends on upstream library changes:

I'm currently planning to hold off on actually merging this branch until this is finished, to avoid releasing an Omicron version that contains a possible SSRF vulnerability we'd have to do a security advisory for. However, it should be pretty straightforward to integrate once all the upstream library changes are released. Therefore, I'd like to start getting reviews on the rest of this branch now, while we wait for all that stuff to be released.

Footnotes

  1. We can't do this in testing as we would like to be able to run the tests for webhooks on non-Oxide development systems.
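
For a rough sense of the shape of the mitigation, here is a loose sketch using reqwest's local_address to pin the source address (the real approach binds to the OPTE interface itself, which is the part that needs the upstream changes; opte_source_addr is a hypothetical input):

use std::net::IpAddr;

/// Illustrative sketch only: in release builds, send webhook deliveries only
/// from the OPTE address so the client can't be used to reach internal-only
/// services (SSRF).
fn delivery_client(opte_source_addr: Option<IpAddr>) -> reqwest::Result<reqwest::Client> {
    let mut builder = reqwest::Client::builder();
    if cfg!(not(debug_assertions)) {
        builder = builder.local_address(opte_source_addr);
    }
    builder.build()
}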

@hawkw hawkw marked this pull request as ready for review March 13, 2025 19:00
@hawkw hawkw requested review from ahl, smklein and david-crespo March 13, 2025 19:01
@david-crespo (Contributor) commented Mar 13, 2025

jj, for whatever reason, is not confused by the diff like git is. The changes to the OpenAPI schema are in fact all adds, as expected.

$ jj diff --stat -f main -t eliza/webhook-models openapi/nexus.json
openapi/nexus.json | 1109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1109 insertions(+), 0 deletions(-)

$ git diff --stat main...eliza/webhook-models -- openapi/nexus.json
openapi/nexus.json | 18375 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 9742 insertions(+), 8633 deletions(-)

path_params: Path<params::WebhookReceiverSelector>,
) -> Result<HttpResponseOk<views::WebhookReceiver>, HttpError>;

/// Create a new webhook receiver.
Contributor:

Should remove "a", "new", and the full stop to match the styling of these descriptions for the docs. E.g. "Create webhook receiver"

@david-crespo (Contributor) commented Mar 14, 2025

These become the docs sidebar text.

Here are my suggestions:

List delivery attempts to webhook receiver
Request re-delivery of webhook event
List webhook event classes
Fetch webhook event class
List webhook receivers
Create webhook receiver
Fetch webhook receiver
Update webhook receiver
Delete webhook receiver
Send liveness probe to webhook receiver
List secret IDs for webhook receiver
Add secret to webhook receiver
Remove secret from webhook receiver

Note get -> fetch for consistency with the other endpoints. To be honest I think we could stand to use get everywhere instead and save two letters, but it's where we are right now. I took out the "configuration for webhook" — I admit it is potentially ambiguous, especially with webhooks, to say "fetch webhook receiver", but I lean toward being consistent with the other endpoints, plus the ambiguity is immediately clarified if you click the endpoint and see what it is. We can even add more description in the doc comment after a blank line if you're worried about it. It shows up on the detail page.
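
A minimal sketch of the convention being described (the endpoint shown here is hypothetical): the first doc-comment line becomes the short description used for the sidebar, and anything after a blank line becomes the longer description on the detail page.

/// Create webhook receiver
///
/// Anything after the blank line becomes the longer description shown on the
/// endpoint's detail page (e.g. notes about secrets or subscription globs).
async fn webhook_receiver_create() {
    // Hypothetical stub; only the doc-comment shape matters here.
}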

Contributor:

I was commenting on an outdated version of the list. Will update in a minute with suggestions for the full list.

hawkw (Member Author):

Thanks for the suggestions, I'll reword these. I notice that "get" and "fetch" are used pretty interchangeably in other APIs... we should probably give that a pass to make it more consistent (separately from this PR, of course).

hawkw (Member Author):

Okay, 4f8c258 reworks the API docs as you've suggested. I also added body text to a few of the endpoints where it felt like there were additional bits that needed a more detailed explanation.

@hawkw (Member Author) commented Mar 17, 2025

OMDB commands added in #7808

@smklein (Collaborator) left a comment

Only reviewed the schema so far, still need to go through db-queries, the background tasks, the tests, and the API

#[serde(
default = "WebhookDeliveratorConfig::default_second_retry_backoff"
)]
pub second_retry_backoff_secs: u64,
Collaborator:

Is this supposed to be a time "after the first retry", or relative to the start of the delivery time?

like, if we try to send a delivery, and fail, and then:

first = 5 seconds
second = 10 seconds

Are we expecting:

@ 0 seconds -> send and fail first delivery
@ 5 seconds -> send first retry delivery
@ 10 seconds -> send second retry delivery? Or is this actually at 15 seconds?

hawkw (Member Author):

It's supposed to be the time since the previous attempt. So, in your example, it would be 0, 5, and 15 seconds.
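
In other words, each configured backoff is measured from the previous attempt rather than from the original delivery time; a small sketch of the arithmetic (not the actual deliverator code):

use std::time::Duration;

/// Illustrative only: with backoffs of 5s and 10s, attempts land at
/// t = 0s, 5s, and 15s after the initial failure.
fn attempt_times(backoffs: &[Duration]) -> Vec<Duration> {
    let mut t = Duration::ZERO;
    let mut times = vec![t];
    for &backoff in backoffs {
        t += backoff;
        times.push(t);
    }
    times
}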

use serde::ser::{Serialize, Serializer};
use std::fmt;

impl_enum_type!(
Collaborator:

How critical is it that these webhook event classes are strongly-typed, as opposed to being raw strings present as "data" rather than "schema" in the database?

Renaming or removing these variants will not be trivial for "cockroachdb is silly sometimes" reasons; see: https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#212-changing-enum-variants

hawkw (Member Author):

This is a great question, and it's something I've thought about a bit. My initial plan was, in fact, to just store these as strings in the database. @andrewjstone suggested that we represent them as an enum, instead, to avoid storing a whole bunch of duplicate copies of the same relatively small set of strings. This also has the advantage of tying the set of event classes quite closely to the database schema version, which is used to determine when glob subscriptions need to be reprocessed.

Another option, which occupies a middle ground between using an enum and just storing the string representation of the class for every event, would be to instead have a table of event class strings along with some numeric identifier, so we could represent them more concisely on disk using the numeric identifier. That way, they could be inserted or removed by queries rather than by a schema update.

Though that flexibility seems appealing, I'm not actually sure if it's a good thing. On updates that add new event classes, we'd probably have the new Nexus version go and run some queries to add the new classes to the database. This is a bit ad-hoc, and it means that it's at least theoretically possible to change the set of event classes without changing the schema version, so it's much harder to reason about whether the glob subscriptions are correct based on what's presently in the database. With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.

Finally, I'm not actually that convinced that being unable to easily remove event classes is that big of a problem. For enums that represent operational states (e.g. instance_state or similar), or that represent a policy or mode of behavior, it's actually very important to not be able to represent defunct variants: code that consumes this data needs to handle every possible state or policy or whatever, and it's very unfortunate to have to have match arms or other cases for stuff that's no longer used that just panics or logs an error. In this case, however, these are basically just strings that we do regex matching on, and the enum is mostly used to reduce the size of the database record (and to tie the set of strings to the schema version, as I mentioned). So, if we leave behind some event classes that we've stopped using, but they're still there in the enum...honestly, who cares? They don't actually cause that big of a problem just by being there, and we can filter them out of the "list event classes" API response so that users don't get the idea that they still exist. It's a bit ugly to leave behind stuff that's no longer used, but the consequences are less bad than for something like VmmState or SledPolicy...

Hopefully that all makes some amount of sense. I'm certainly not that attached to this approach, but that was the rationale I was operating under.
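
To make the "filter them out of the list" point concrete, something along these lines would do it (is_hidden is a hypothetical predicate, shown only to illustrate the idea):

/// Illustrative only: retired or test-only variants can stay in the DB enum
/// while being filtered out of the public "list event classes" response.
fn visible_event_classes() -> Vec<WebhookEventClass> {
    WebhookEventClass::ALL_CLASSES
        .iter()
        .copied()
        .filter(|class| !class.is_hidden())
        .collect()
}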

impl std::str::FromStr for WebhookEventClass {
type Err = EventClassParseError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
for &class in Self::ALL_CLASSES {
Collaborator:

How many event classes do you think we're gonna end up with, if you had to place a bet, 2-3 years from now?

I don't think this is a problem now, but I expect this enum will grow quite large as it expands to include at least "all possible faults" and "all possible invocations of the interface", right?

Any operation (outside of tests) that requires iteration over all variants seems like it's worth a little scrutiny IMO

Comment on lines +80 to +81
// N.B.: perhaps we ought to use timestamp-based UUIDs for these?
id: WebhookDeliveryUuid::new_v4().into(),
Collaborator:

Why? What's wrong with using a new_v4 here? AFAICT there shouldn't be any need for determinism for the ID

hawkw (Member Author):

Since these delivery records represent something that happens at a point in time, I just thought it would be kind of nice if ordering a list of deliveries by ID also ordered them temporally (at least, more or less). Yes, they also have timestamps, so it doesn't actually matter that much...
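
For reference, the timestamp-ordered option alluded to here would be v7 UUIDs; a minimal sketch using the plain uuid crate (ignoring whatever the typed-UUID wrappers would need):

/// Illustrative only (requires the `uuid` crate's `v7` feature): v7 UUIDs
/// embed a millisecond timestamp in their most significant bits, so sorting
/// delivery IDs also (approximately) sorts deliveries by creation time.
fn example_v7_delivery_id() -> uuid::Uuid {
    uuid::Uuid::now_v7()
}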

Contributor:

I have a similar situation with the audit log in #7339 and so far I've landed on paginating by (timestamp, ID) because of possible ties on the timestamp.

hawkw (Member Author):

Oh, yeah, that might be better --- I hadn't wanted to paginate on timestamps for precisely that reason...

Contributor:

hawkw (Member Author):

Lovely, I'll have to borrow that!

hawkw (Member Author):

@david-crespo would you possibly be interested in factoring out your timestamp pagination bits to a separate PR so we could land it first and both depend on it?

hawkw (Member Author):

(i suppose i could also just copy and paste it, and whoever merges second could figure out the merge conflicts, but that feels comparatively unwholesome...)

Contributor:

Sure, I’ll pull it out into a separate PR.

impl WebhookDeliveryAttempt {
fn response_view(&self) -> Option<views::WebhookDeliveryResponse> {
Some(views::WebhookDeliveryResponse {
status: self.response_status? as u16, // i hate that this has to be signed in the database...
Collaborator:

You might be interested in https://github.com/oxidecomputer/omicron/blob/main/nexus/db-model/src/unsigned.rs - we have used different types here to make unexpectedly-signed values a serialization error when we try reading them from the database.
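
A rough sketch of what that suggestion might look like here (the struct and field names are assumptions, and the exact import path may differ):

use nexus_db_model::SqlU16; // wrapper from nexus/db-model/src/unsigned.rs

/// Illustrative sketch only: with the unsigned wrapper, a negative value read
/// back from the database becomes a deserialization error rather than being
/// silently cast to u16.
pub struct DeliveryAttemptResponseColumns {
    pub response_status: Option<SqlU16>,
}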

mod test {
use super::*;

#[test]
Collaborator:

Could we test some of the error cases too?
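
For instance, a minimal sketch of one such error-case test (the class string here is made up):

#[test]
fn unknown_event_class_fails_to_parse() {
    // Parsing a string that isn't a known event class should yield an
    // EventClassParseError rather than silently succeeding.
    let result = "not.a.real.event.class".parse::<WebhookEventClass>();
    assert!(result.is_err());
}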

hawkw and others added 3 commits March 19, 2025 18:13
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>