-
Notifications
You must be signed in to change notification settings - Fork 200
daemon: Add socket activation via /run/rpm-ostreed.socket #2932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Skipping CI for Draft Pull Request. |
Draft since this depends on https://bugzilla.redhat.com/show_bug.cgi?id=1976303 at least, and there's definitely more cleanup required. |
c24077f
to
b7fd1ca
Compare
This is a much smaller patch related to coreos#2932 which is about ensuring the client gets the actual errors instead of generic DBus activation failure. The obvious step then would be to have the client ask systemd for the status text, but that's slightly messy right now and I think I'd anyways prefer to go the socket route. But I think this is a useful pattern; basically systemd's `StatusText` facility is useful.
This is yet another patch which should help debugging https://bugzilla.redhat.com/show_bug.cgi?id=1958812 like situations in the future. I also have coreos/rpm-ostree#2932 cooking but it's going to take a while to cycle down.
This is to help debug https://bugzilla.redhat.com/show_bug.cgi?id=1958812 like situations in the future. |
(As an example of a next step we could take is to add a |
Though I'm pretty sure this is a dbus-broker bug, or at least design misfeature. If a service fails to activate and has an associated systemd unit, it should probably give us the unit status as a baseline (and |
Interesting idea. Would an alternative approach be to delay all the heavy lifting until |
I think the problem is that in theory there are other consumers of the DBus API that may be expecting the status quo. It'd need analysis. |
For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database, and actually parse all the GPG keys in `/etc/pki/rpm-gpg` etc. This causes two related problems: - By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s. - If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout. The solution to this is to adopt systemd socket activation, which drops out DBus as an intermediary. On daemon startup, we now do the process-global initialization (like ostree sysroot) and if that fails, the daemon just sticks around (but without claiming the bus name), ready to return the error message to each client. After this patch: ``` $ systemctl stop rpm-ostreed $ umount /boot $ rpm-ostree status error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory ```
b7fd1ca
to
d9166e0
Compare
This is a hackier an alternative to coreos#2932 that we can ship immediately because we won't block on SELinux policy. For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database, and actually parse all the GPG keys in `/etc/pki/rpm-gpg` etc. This causes two related problems: - By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s. - If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout. By explicitly using `systemctl start rpm-ostreed.service`, systemd does all of the error checking for us without involving `dbus-broker` as a middleman. Further, by using `systemctl status rpm-ostreed.service` on failure, we reuse systemd's nice rendering of the status of the unit instead of reinventing our own. This PR effectively replicates openshift/machine-config-operator#2642 in our code.
This is a hackier an alternative to coreos#2932 that we can ship immediately because we won't block on SELinux policy. For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database, and actually parse all the GPG keys in `/etc/pki/rpm-gpg` etc. This causes two related problems: - By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s. - If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout. By explicitly using `systemctl start rpm-ostreed.service`, systemd does all of the error checking for us without involving `dbus-broker` as a middleman. Further, by using `systemctl status rpm-ostreed.service` on failure, we reuse systemd's nice rendering of the status of the unit instead of reinventing our own. This PR effectively replicates openshift/machine-config-operator#2642 in our code.
I'm +1 for this idea (only took a quick look at the code but it looked fine). |
Not opposed to this, though... IMO the added complexity in the architecture doesn't seem worth it. Re. other D-Bus consumers, between #2945 and this new API, the former is much easier to adapt to. |
This is a hackier an alternative to #2932 that we can ship immediately because we won't block on SELinux policy. For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database, and actually parse all the GPG keys in `/etc/pki/rpm-gpg` etc. This causes two related problems: - By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s. - If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout. By explicitly using `systemctl start rpm-ostreed.service`, systemd does all of the error checking for us without involving `dbus-broker` as a middleman. Further, by using `systemctl status rpm-ostreed.service` on failure, we reuse systemd's nice rendering of the status of the unit instead of reinventing our own. This PR effectively replicates openshift/machine-config-operator#2642 in our code.
One thing @travier brought up though is that we could speak DBus over this socket which would mirror what systemd and NetworkManager do. That'd simplify a lot of things. And it'd lead towards us dropping the weird "transaction status over private dbus socket" model we already have in that case. When talking over the private socket we'd just require |
OK I created a "deferred" label we can use to find PRs like this that may make sense later, but probably aren't going to merge anytime soon. |
If we revive this PR we should use tokio for the socket handling now. |
OK I thought I could reopen this PR but github won't let me because I accidentally force-pushed to the branch first. So, recreated the PR in #3874 |
Significantly bump this timeout from the default because we do a lot of stuff on daemon startup. Immediate motivation is https://bugzilla.redhat.com/show_bug.cgi?id=2111817 But this is also related to the same problems that motivated coreos#3850 (cc coreos#2932 ) We switched from the default DBus timeout of 25 seconds to the systemd default of 90s; this bumps us all the way up to 5 minutes. I think the right long term fix is the socket activation, but this is an easy backportable fix that will hopefully paper over spurious failures. (That said, anyone who is hitting this regularly probably has a system too slow to really use, but...let's not stand in their way)
Significantly bump this timeout from the default because we do a lot of stuff on daemon startup. Immediate motivation is https://bugzilla.redhat.com/show_bug.cgi?id=2111817 But this is also related to the same problems that motivated #3850 (cc #2932 ) We switched from the default DBus timeout of 25 seconds to the systemd default of 90s; this bumps us all the way up to 5 minutes. I think the right long term fix is the socket activation, but this is an easy backportable fix that will hopefully paper over spurious failures. (That said, anyone who is hitting this regularly probably has a system too slow to really use, but...let's not stand in their way)
For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name. For example,
it calls
ostree_sysroot_load()
. We also end up scanningthe RPM database, and actually parse all the GPG keys
in
/etc/pki/rpm-gpg
etc.This causes two related problems:
race against the (pretty low) DBus service timeout of 25s.
easily see the error because it just appears in the systemd
journal. The client will just see a service timeout.
The solution to this is to adopt systemd socket activation,
which drops out DBus as an intermediary. On daemon startup,
we now do the process-global initialization (like ostree
sysroot) and if that fails, the daemon just sticks around
(but without claiming the bus name), ready to return the
error message to each client.
After this patch: