
[nexus] webhooks #7277

Open

hawkw wants to merge 162 commits into main from eliza/webhook-models

Conversation

@hawkw (Member) commented Dec 18, 2024

This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:

  • webhook-related external API endpoints (as described in RFD 538)
  • database tables for storing webhook receiver configurations and webhook events, and for tracking their delivery status
  • background tasks for actually delivering webhook events to receivers

The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.

Future Work

Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):

  • Garbage collection for old records in the webhook_delivery, webhook_delivery_attempt, and webhook_event CRDB tables (need to figure out a good retention policy for events)
  • omdb db webhooks commands for actually looking at the webhook database tables
  • Oximeter metrics tracking webhook delivery attempt outcomes and latencies

Not currently planned, but possible future work:

  • Actually record webhook events when stuff happens :)
  • Some mechanism for communicating JSON schemas for webhook event payloads (either via OpenAPI 3.1, by sticking JSON schemas in the /v1/webhooks/event-classes endpoints, or both)
  • Allow webhook receivers to have roles with more restrictive permissions than fleet.viewer (see RFD 538 Appendix B.3); probably requires service accounts
  • Track receiver liveness and alert when a receiver has gone away (see RFD 538 Appendix B.4)

@hawkw hawkw force-pushed the eliza/webhook-models branch from 51f7f8e to 139cfe6 on December 18, 2024 21:10
@hawkw hawkw changed the base branch from eliza/webhook-api to main December 18, 2024 21:11
@hawkw hawkw requested a review from augustuswm December 18, 2024 21:11
@hawkw hawkw force-pushed the eliza/webhook-models branch 2 times, most recently from 140aea4 to 0b80c8f on January 8, 2025 17:28
@hawkw hawkw changed the title from "[nexus] Webhook DB models" to "[nexus] webhooks" on Jan 11, 2025
@hawkw hawkw force-pushed the eliza/webhook-models branch from 41cf0b0 to 2bc5925 on January 17, 2025 19:20
@hawkw (Member Author) commented Jan 24, 2025

I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.

@hawkw (Member Author) commented Jan 24, 2025

Glob subscription entries in webhook_rx_event_glob should capture the schema version when they're created, so that we can trigger reprocessing (generating the exact event class subscriptions for those globs) if the schema has changed. It's probably fine for nexus to do glob reprocessing on startup rather than in a bg task, although online update might invalidate that assumption.
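
A rough sketch of what capturing that might look like on a glob row (field names and types here are illustrative assumptions, not the actual model):

use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Hypothetical shape of a `webhook_rx_event_glob` row (illustrative only).
pub struct WebhookRxEventGlob {
    pub rx_id: Uuid,
    pub glob: String,
    /// The schema version under which this glob's exact event-class
    /// subscriptions were last generated; if it no longer matches the running
    /// schema version, the glob needs reprocessing.
    pub schema_version: String,
    pub time_created: DateTime<Utc>,
}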

@hawkw (Member Author) commented Jan 24, 2025

As far as GCing old events from the event table: dispatching an event should probably record the number of receivers it was dispatched to, and each successful delivery should increment a count of successes. An event entry would then only be eligible for deletion once the two counts are equal; we want to hang onto events that weren't successfully delivered so that any failed deliveries can be re-triggered.

GCing an event would also clean up any child delivery attempt records.
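
As a rough illustration of that rule (the counters and retention window here are hypothetical, not the actual schema):

use chrono::{DateTime, Duration, Utc};

/// Illustrative only: an event becomes GC-eligible once every receiver it was
/// dispatched to has been successfully delivered to, and the event is older
/// than the retention window.
fn event_is_gc_eligible(
    num_dispatched: u64,
    num_delivered_ok: u64,
    time_dispatched: DateTime<Utc>,
    retention: Duration,
) -> bool {
    num_dispatched == num_delivered_ok && Utc::now() - time_dispatched > retention
}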

Comment on lines 61 to 70
// This performs a `SELECT ... FOR UPDATE SKIP LOCKED` on the
// `webhook_event` table, returning the oldest webhook event which has not
// yet been dispatched to receivers and which is not actively being
// dispatched in another transaction.
// NOTE: it would be kinda nice if this query could also select the
// webhook receivers subscribed to this event, but this requires
// a `FOR UPDATE OF webhook_event` clause to indicate that we only wish
// to lock the `webhook_event` row and not the receiver.
// Unfortunately, I don't believe Diesel supports this at present.
hawkw (Member Author):

Unfortunately, this isn't going to work with CRDB, as it doesn't actually implement SKip LOCKED. We're going to have to redesign the mechanism that prevents duplicate delivery creation. Another option could be to change the webhook_delivery table to have a UNIQUE constraint on (rx_id, event_id), but this doesn't work with the way I wanted to implement the resend event API (creating a new entry in webhook_delivery).

I think the correct thing to do here is something like:

INSERT INTO omicron.public.webhook_delivery (
    rx_id,
    event_id,
    /* ... */
)
SELECT
    {rx_id},
    {event_id},
    /* ... */
WHERE NOT EXISTS (
     SELECT 1 FROM omicron.public.webhook_delivery
     WHERE rx_id = {rx_id} AND event_id = {event_id}
)

so that we can prevent creating duplicate deliveries without a lock. But, I can't seem to find a way to do INSERT ... WHERE NOT EXISTS in Diesel, so this probably needs to be a CTE...
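
For what it's worth, here is a minimal sketch of falling back to raw SQL for the conditional insert (synchronous Diesel and made-up column names, shown only to illustrate the shape; the real thing would presumably be a proper CTE):

use diesel::prelude::*;
use diesel::sql_types::Uuid as SqlUuid;
use uuid::Uuid;

/// Illustrative sketch only: insert a delivery record unless one already
/// exists for this (rx_id, event_id) pair, without taking a row lock.
fn insert_delivery_if_absent(
    conn: &mut PgConnection,
    delivery_id: Uuid,
    rx_id: Uuid,
    event_id: Uuid,
) -> QueryResult<usize> {
    diesel::sql_query(
        "INSERT INTO omicron.public.webhook_delivery (id, rx_id, event_id) \
         SELECT $1, $2, $3 \
         WHERE NOT EXISTS ( \
             SELECT 1 FROM omicron.public.webhook_delivery \
             WHERE rx_id = $2 AND event_id = $3 \
         )",
    )
    .bind::<SqlUuid, _>(delivery_id)
    .bind::<SqlUuid, _>(rx_id)
    .bind::<SqlUuid, _>(event_id)
    .execute(conn)
}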

@hawkw (Member Author) commented Mar 13, 2025

It is worth noting that one big thing still missing is SSRF prevention for webhook delivery requests; I don't think we ought to do a release that includes webhooks but lacks SSRF prevention.

Our plan for protecting against SSRF attacks is, when constructing the HTTP client that sends webhook delivery requests, to bind sockets only on OPTE network interfaces in release builds.[1] Doing this should be pretty straightforward, but it unfortunately depends on upstream library changes:

I'm currently planning to hold off on actually merging this branch until this is finished, to avoid releasing an Omicron version that contains a possible SSRF vulnerability we'd have to do a security advisory for. However, it should be pretty straightforward to integrate once all the upstream library changes are released. Therefore, I'd like to start getting reviews on the rest of this branch now, while we wait for all that stuff to be released.

Footnotes

  1. We can't do this in testing as we would like to be able to run the tests for webhooks on non-Oxide development systems.
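
For a rough sense of the shape of the mitigation, here is a loose sketch using reqwest's local_address to pin the source address (the real approach binds to the OPTE interface itself, which is the part that needs the upstream changes; opte_source_addr is a hypothetical input):

use std::net::IpAddr;

/// Illustrative sketch only: in release builds, send webhook deliveries only
/// from the OPTE address so the client can't be used to reach internal-only
/// services (SSRF).
fn delivery_client(opte_source_addr: Option<IpAddr>) -> reqwest::Result<reqwest::Client> {
    let mut builder = reqwest::Client::builder();
    if cfg!(not(debug_assertions)) {
        builder = builder.local_address(opte_source_addr);
    }
    builder.build()
}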

@hawkw hawkw marked this pull request as ready for review March 13, 2025 19:00
@hawkw hawkw requested review from ahl, smklein and david-crespo March 13, 2025 19:01
@david-crespo (Contributor) commented Mar 13, 2025

jj, for whatever reason, is not confused by the diff like git is. The changes to the OpenAPI schema are in fact all adds, as expected.

$ jj diff --stat -f main -t eliza/webhook-models openapi/nexus.json
openapi/nexus.json | 1109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1109 insertions(+), 0 deletions(-)

$ git diff --stat main...eliza/webhook-models -- openapi/nexus.json
openapi/nexus.json | 18375 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 9742 insertions(+), 8633 deletions(-)

path_params: Path<params::WebhookReceiverSelector>,
) -> Result<HttpResponseOk<views::WebhookReceiver>, HttpError>;

/// Create a new webhook receiver.
Contributor:

Should remove "a", "new", and the full stop to match the styling of these descriptions for the docs. E.g. "Create webhook receiver"

@david-crespo (Contributor) commented Mar 14, 2025

These become the docs sidebar text.

Here are my suggestions:

List delivery attempts to webhook receiver
Request re-delivery of webhook event
List webhook event classes
Fetch webhook event class
List webhook receivers
Create webhook receiver
Fetch webhook receiver
Update webhook receiver
Delete webhook receiver
Send liveness probe to webhook receiver
List secret IDs for webhook receiver
Add secret to webhook receiver
Remove secret from webhook receiver

Note get -> fetch for consistency with the other endpoints. To be honest I think we could stand to use get everywhere instead and save two letters, but it's where we are right now. I took out the "configuration for webhook" — I admit it is potentially ambiguous, especially with webhooks, to say "fetch webhook receiver", but I lean toward being consistent with the other endpoints, plus the ambiguity is immediately clarified if you click the endpoint and see what it is. We can even add more description in the doc comment after a blank line if you're worried about it. It shows up on the detail page.
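
A minimal sketch of the convention being described (the endpoint shown here is hypothetical): the first doc-comment line becomes the short description used for the sidebar, and anything after a blank line becomes the longer description on the detail page.

/// Create webhook receiver
///
/// Anything after the blank line becomes the longer description shown on the
/// endpoint's detail page (e.g. notes about secrets or subscription globs).
async fn webhook_receiver_create() {
    // Hypothetical stub; only the doc-comment shape matters here.
}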

Contributor:

I was commenting on an outdated version of the list. Will update in a minute with suggestions for the full list.

hawkw (Member Author):

Thanks for the suggestions, I'll reword these. I notice that "get" and "fetch" are used pretty interchangeably in other APIs... we should probably give that a pass to make it more consistent (separately from this PR, of course).

hawkw (Member Author):

Okay, 4f8c258 reworks the API docs as you've suggested. I also added body text to a few of the endpoints where it felt like there were additional bits that needed a more detailed explanation.

@hawkw (Member Author) commented Mar 17, 2025

OMDB commands added in #7808

@smklein (Collaborator) left a comment

Only reviewed the schema so far, still need to go through db-queries, the background tasks, the tests, and the API

#[serde(
default = "WebhookDeliveratorConfig::default_second_retry_backoff"
)]
pub second_retry_backoff_secs: u64,
Collaborator:

Is this supposed to be a time "after the first retry", or relative to the start of the delivery time?

like, if we try to send a delivery, and fail, and then:

first = 5 seconds
second = 10 seconds

Are we expecting:

@ 0 seconds -> send and fail first delivery
@ 5 seconds -> send first retry delivery
@ 10 seconds -> send second retry delivery? Or is this actually at 15 seconds?

hawkw (Member Author):

It's supposed to be the time since the previous attempt. So, in your example, it would be 0, 5, and 15 seconds.
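
In other words, each configured backoff is measured from the previous attempt rather than from the original delivery time; a small sketch of the arithmetic (not the actual deliverator code):

use std::time::Duration;

/// Illustrative only: with backoffs of 5s and 10s, attempts land at
/// t = 0s, 5s, and 15s after the initial failure.
fn attempt_times(backoffs: &[Duration]) -> Vec<Duration> {
    let mut t = Duration::ZERO;
    let mut times = vec![t];
    for &backoff in backoffs {
        t += backoff;
        times.push(t);
    }
    times
}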

use serde::ser::{Serialize, Serializer};
use std::fmt;

impl_enum_type!(
Collaborator:

How critical is it that these webhook event classes are strongly-typed, as opposed to being raw strings present as "data" rather than "schema" in the database?

Renaming or removing these variants will not be trivial for "cockroachdb is silly sometimes" reasons; see: https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#212-changing-enum-variants

hawkw (Member Author):

This is a great question, and it's something I've thought about a bit. My initial plan was, in fact, to just store these as strings in the database. @andrewjstone suggested that we represent them as an enum, instead, to avoid storing a whole bunch of duplicate copies of the same relatively small set of strings. This also has the advantage of tying the set of event classes quite closely to the database schema version, which is used to determine when glob subscriptions need to be reprocessed.

Another option, which occupies a middle ground between using an enum and just storing the string representation of the class for every event, would be to instead have a table of event class strings along with some numeric identifier, so we could represent them more concisely on disk using the numeric identifier. That way, they could be inserted or removed by queries rather than by a schema update.

Though that flexibility seems appealing, I'm not actually sure if it's a good thing. On updates that add new event classes, we'd probably have the new Nexus version go and run some queries to add the new classes to the database. This is a bit ad-hoc, and it means that it's at least theoretically possible to change the set of event classes without changing the schema version, so it's much harder to reason about whether the glob subscriptions are correct based on what's presently in the database. With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.

Finally, I'm not actually that convinced that being unable to easily remove event classes is that big of a problem. For enums that represent operational states (e.g. instance_state or similar), or that represent a policy or mode of behavior, it's actually very important to not be able to represent defunct variants: code that consumes this data needs to handle every possible state or policy or whatever, and it's very unfortunate to have to have match arms or other cases for stuff that's no longer used that just panics or logs an error. In this case, however, these are basically just strings that we do regex matching on, and the enum is mostly used to reduce the size of the database record (and to tie the set of strings to the schema version, as I mentioned). So, if we leave behind some event classes that we've stopped using, but they're still there in the enum...honestly, who cares? They don't actually cause that big of a problem just by being there, and we can filter them out of the "list event classes" API response so that users don't get the idea that they still exist. It's a bit ugly to leave behind stuff that's no longer used, but the consequences are less bad than for something like VmmState or SledPolicy...

Hopefully that all makes some amount of sense. I'm certainly not that attached to this approach, but that was the rationale I was operating under.
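
To make the "filter them out of the list" point concrete, something along these lines would do it (is_hidden is a hypothetical predicate, shown only to illustrate the idea):

/// Illustrative only: retired or test-only variants can stay in the DB enum
/// while being filtered out of the public "list event classes" response.
fn visible_event_classes() -> Vec<WebhookEventClass> {
    WebhookEventClass::ALL_CLASSES
        .iter()
        .copied()
        .filter(|class| !class.is_hidden())
        .collect()
}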

impl std::str::FromStr for WebhookEventClass {
type Err = EventClassParseError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
for &class in Self::ALL_CLASSES {
Collaborator:

How many event classes do you think we're gonna end up with, if you had to place a bet, 2-3 years from now?

I don't think this is a problem now, but I expect this enum will grow quite large as it expands to include at least "all possible faults" and "all possible invocations of the interface", right?

Any operation (outside of tests) that requires iteration over all variants seems like it's worth a little scrutiny IMO

Comment on lines +80 to +81
// N.B.: perhaps we ought to use timestamp-based UUIDs for these?
id: WebhookDeliveryUuid::new_v4().into(),
Collaborator:

Why? What's wrong with using a new_v4 here? AFAICT there shouldn't be any need for determinism for the ID

hawkw (Member Author):

Since these delivery records represent something that happens at a point in time, I just thought it would be kind of nice if ordering a list of deliveries by ID also ordered them temporally (at least, more or less). Yes, they also have timestamps, so it doesn't actually matter that much...
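
For reference, the timestamp-ordered option alluded to here would be v7 UUIDs; a minimal sketch using the plain uuid crate (ignoring whatever the typed-UUID wrappers would need):

/// Illustrative only (requires the `uuid` crate's `v7` feature): v7 UUIDs
/// embed a millisecond timestamp in their most significant bits, so sorting
/// delivery IDs also (approximately) sorts deliveries by creation time.
fn example_v7_delivery_id() -> uuid::Uuid {
    uuid::Uuid::now_v7()
}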

Contributor:

I have a similar situation with the audit log in #7339 and so far I've landed on paginating by (timestamp, ID) because of possible ties on the timestamp.

hawkw (Member Author):

Oh, yeah, that might be better --- I hadn't wanted to paginate on timestamps for precisely that reason...

Contributor:

hawkw (Member Author):

Lovely, I'll have to borrow that!

hawkw (Member Author):

@david-crespo would you possibly be interested in factoring out your timestamp pagination bits to a separate PR so we could land it first and both depend on it?

hawkw (Member Author):

(i suppose i could also just copy and paste it, and whoever merges second could figure out the merge conflicts, but that feels comparatively unwholesome...)

Contributor:

Sure, I’ll pull it out into a separate PR.

impl WebhookDeliveryAttempt {
fn response_view(&self) -> Option<views::WebhookDeliveryResponse> {
Some(views::WebhookDeliveryResponse {
status: self.response_status? as u16, // i hate that this has to be signed in the database...
Collaborator:

You might be interested in https://github.com/oxidecomputer/omicron/blob/main/nexus/db-model/src/unsigned.rs - we have used different types here to make unexpectedly-signed values a serialization error when we try reading them from the database.
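
A rough sketch of what that suggestion might look like here (the struct and field names are assumptions, and the exact import path may differ):

use nexus_db_model::SqlU16; // wrapper from nexus/db-model/src/unsigned.rs

/// Illustrative sketch only: with the unsigned wrapper, a negative value read
/// back from the database becomes a deserialization error rather than being
/// silently cast to u16.
pub struct DeliveryAttemptResponseColumns {
    pub response_status: Option<SqlU16>,
}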

mod test {
use super::*;

#[test]
Collaborator:

Could we test some of the error cases too?
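
For instance, a minimal sketch of one such error-case test (the class string here is made up):

#[test]
fn unknown_event_class_fails_to_parse() {
    // Parsing a string that isn't a known event class should yield an
    // EventClassParseError rather than silently succeeding.
    let result = "not.a.real.event.class".parse::<WebhookEventClass>();
    assert!(result.is_err());
}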

hawkw and others added 3 commits March 19, 2025 18:13
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>
Co-authored-by: Sean Klein <[email protected]>