daemon: Add socket activation via /run/rpm-ostreed.socket #2932

cgwalters · 2021-06-25T19:29:10Z

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name. For example,
it calls ostree_sysroot_load(). We also end up scanning
the RPM database, and actually parse all the GPG keys
in /etc/pki/rpm-gpg etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we
  race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't
  easily see the error because it just appears in the systemd
  journal. The client will just see a service timeout.

The solution to this is to adopt systemd socket activation,
which drops out DBus as an intermediary. On daemon startup,
we now do the process-global initialization (like ostree
sysroot) and if that fails, the daemon just sticks around
(but without claiming the bus name), ready to return the
error message to each client.

After this patch:

```
$ systemctl stop rpm-ostreed
$ umount /boot
$ rpm-ostree status
error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
```
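
A minimal sketch of the flow described above, in Python purely for illustration (it is not the daemon's code). It assumes systemd's standard `LISTEN_PID`/`LISTEN_FDS` environment protocol, with the first inherited descriptor at 3. The names `listener_from_systemd` and `serve_one` and the one-line text reply are hypothetical, not what this PR implements.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor in systemd's protocol


def listener_from_systemd(environ=os.environ, pid=None):
    """Return the inherited listening socket, or None if not socket-activated."""
    pid = os.getpid() if pid is None else pid
    # systemd sets LISTEN_PID to the PID the descriptors were meant for.
    if environ.get("LISTEN_PID") != str(pid):
        return None
    if int(environ.get("LISTEN_FDS", "0")) < 1:
        return None
    # Family and type are auto-detected from the inherited descriptor.
    return socket.socket(fileno=SD_LISTEN_FDS_START)


def serve_one(listener, startup_error):
    """Accept one client and report the cached startup result to it.

    The text reply is a made-up wire format for this sketch.
    """
    conn, _ = listener.accept()
    with conn:
        if startup_error is None:
            conn.sendall(b"ok\n")
        else:
            msg = f"error: Couldn't start daemon: {startup_error}\n"
            conn.sendall(msg.encode())
```

The point of the design is that a failed process-global initialization is kept around as `startup_error` and handed to every client that connects, rather than ending up only in the journal.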

openshift-ci · 2021-06-25T19:29:12Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cgwalters · 2021-06-25T19:29:22Z

Draft since this depends on https://bugzilla.redhat.com/show_bug.cgi?id=1976303 at least, and there's definitely more cleanup required.

This is a much smaller patch related to coreos#2932 which is about ensuring the client gets the actual errors instead of generic DBus activation failure. The obvious step then would be to have the client ask systemd for the status text, but that's slightly messy right now and I think I'd anyways prefer to go the socket route. But I think this is a useful pattern; basically systemd's `StatusText` facility is useful.
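
For reference, a rough sketch of how a service can set its status text, assuming the documented `sd_notify` datagram protocol (a `STATUS=` assignment sent to the socket named by `$NOTIFY_SOCKET`). It is written in Python for illustration only; `notify_status` is a hypothetical helper, not a function from this codebase.

```python
import os
import socket


def notify_status(text, notify_socket=None):
    """Send STATUS=<text> to the service manager.

    Returns False when no notification socket is configured.
    """
    addr = notify_socket or os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        addr = b"\0" + addr[1:].encode()
    # Keep the status on one line; newlines separate assignments.
    line = "STATUS=" + " ".join(text.splitlines())
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode(), addr)
    return True
```

A daemon could call something like `notify_status("error: ...")` when startup fails, so that `systemctl status` shows the reason.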

This is yet another patch which should help debugging https://bugzilla.redhat.com/show_bug.cgi?id=1958812 like situations in the future. I also have coreos/rpm-ostree#2932 cooking but it's going to take a while to cycle down.

cgwalters · 2021-06-25T21:08:59Z

This is to help debug https://bugzilla.redhat.com/show_bug.cgi?id=1958812 like situations in the future.

cgwalters · 2021-06-25T21:12:42Z

(As an example, a next step we could take is to add a RegisterClient-like API here, which would be a whole step towards dropping DBus out of the critical path.)

cgwalters · 2021-06-25T21:13:55Z

> The client will just see a service timeout.

Though I'm pretty sure this is a dbus-broker bug, or at least design misfeature. If a service fails to activate and has an associated systemd unit, it should probably give us the unit status as a baseline (and StatusText if set).

jlebon · 2021-06-29T15:12:20Z

Interesting idea. Would an alternative approach be to delay all the heavy lifting until RegisterClient() (or some other method) so that we could return the error through that?

cgwalters · 2021-06-29T20:16:24Z

> Interesting idea. Would an alternative approach be to delay all the heavy lifting until RegisterClient() (or some other method) so that we could return the error through that?

I think the problem is that in theory there are other consumers of the DBus API that may be expecting the status quo. It'd need analysis.

This is a hackier alternative to coreos#2932 that we can ship immediately because we won't block on SELinux policy.

For historical reasons, the daemon ends up doing a lot of initialization before even claiming the DBus name. For example, it calls `ostree_sysroot_load()`. We also end up scanning the RPM database, and actually parse all the GPG keys in `/etc/pki/rpm-gpg` etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't easily see the error because it just appears in the systemd journal. The client will just see a service timeout.

By explicitly using `systemctl start rpm-ostreed.service`, systemd does all of the error checking for us without involving `dbus-broker` as a middleman. Further, by using `systemctl status rpm-ostreed.service` on failure, we reuse systemd's nice rendering of the status of the unit instead of reinventing our own.

This PR effectively replicates openshift/machine-config-operator#2642 in our code.
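
The start-then-status approach could be sketched roughly as below. This is illustrative Python, not the client's actual implementation: `start_daemon` and its injectable `run` parameter are hypothetical, and the sketch ignores the "only if root" condition mentioned in the title of #2945.

```python
import subprocess


def start_daemon(run=subprocess.run, unit="rpm-ostreed.service"):
    """Start the unit; on failure return systemctl's rendering of its status.

    Returns None when the unit started successfully.
    """
    started = run(["systemctl", "start", unit],
                  capture_output=True, text=True)
    if started.returncode == 0:
        return None
    # Let systemd render the unit state instead of formatting it ourselves.
    status = run(["systemctl", "status", unit],
                 capture_output=True, text=True)
    # Fall back to the start command's stderr if status printed nothing.
    return status.stdout or started.stderr
```

Passing `run` as a parameter is only there so the logic can be exercised without a real systemd.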

travier · 2021-06-30T14:42:41Z

I'm +1 for this idea (only took a quick look at the code but it looked fine).

jlebon · 2021-06-30T15:00:29Z

Not opposed to this, though... IMO the added complexity in the architecture doesn't seem worth it. Re. other D-Bus consumers, between #2945 and this new API, the former is much easier to adapt to.

cgwalters · 2021-06-30T16:19:03Z

One thing @travier brought up though is that we could speak DBus over this socket which would mirror what systemd and NetworkManager do. That'd simplify a lot of things. And it'd lead towards us dropping the weird "transaction status over private dbus socket" model we already have in that case.

When talking over the private socket we'd just require RegisterClient to run before we initialize anything; that'd give us a way to propagate initialization failures back as a DBus error message to the client.
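
To make the idea concrete, here is a small sketch of deferring initialization until the first `RegisterClient` call and turning a failure into an error for the caller. It is plain Python with no DBus involved; the `Daemon` class, `register_client`, and the error text are hypothetical stand-ins, and caching the failure for later clients is an assumption of this sketch.

```python
class Daemon:
    """Run expensive initialization lazily, on the first client registration."""

    def __init__(self, initialize):
        self._initialize = initialize  # callable doing the heavy lifting
        self._done = False
        self._error = None

    def register_client(self, client_id):
        if not self._done:
            try:
                self._initialize()
            except Exception as exc:
                # Remember the failure so every later client sees it too.
                self._error = str(exc)
            self._done = True
        if self._error is not None:
            # Over DBus this would be returned as an error reply.
            raise RuntimeError(f"initialization failed: {self._error}")
        return f"registered {client_id}"
```

With this shape, initialization runs at most once, and each client that registers afterwards gets the same error message back.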

cgwalters · 2021-07-09T15:57:22Z

OK I created a "deferred" label we can use to find PRs like this that may make sense later, but probably aren't going to merge anytime soon.

cgwalters · 2022-07-19T13:13:35Z

If we revive this PR we should use tokio for the socket handling now.

cgwalters · 2022-07-22T18:07:29Z

OK I thought I could reopen this PR but github won't let me because I accidentally force-pushed to the branch first.

So, recreated the PR in #3874

Significantly bump this timeout from the default because we do a lot of stuff on daemon startup.

Immediate motivation is https://bugzilla.redhat.com/show_bug.cgi?id=2111817

But this is also related to the same problems that motivated coreos#3850 (cc coreos#2932)

We switched from the default DBus timeout of 25 seconds to the systemd default of 90s; this bumps us all the way up to 5 minutes.

I think the right long term fix is the socket activation, but this is an easy backportable fix that will hopefully paper over spurious failures. (That said, anyone who is hitting this regularly probably has a system too slow to really use, but...let's not stand in their way)

openshift-ci bot added the do-not-merge/work-in-progress label Jun 25, 2021

cgwalters force-pushed the privsocket-activation branch from c24077f to b7fd1ca Compare June 25, 2021 20:32

cgwalters mentioned this pull request Jun 25, 2021

daemon: If we encounter a startup error, set it as our unit status text #2933

Merged

cgwalters mentioned this pull request Jun 25, 2021

daemon: Dump systemctl status rpm-ostreed on load failure openshift/machine-config-operator#2642

Merged

cgwalters mentioned this pull request Jun 25, 2021

bus activated services that fail can just get Could not activate remote peer errors bus1/dbus-broker#269

Closed

openshift-ci bot added the needs-rebase label Jun 28, 2021

cgwalters force-pushed the privsocket-activation branch from b7fd1ca to d9166e0 Compare June 29, 2021 20:25

openshift-ci bot removed the needs-rebase label Jun 29, 2021

cgwalters mentioned this pull request Jun 29, 2021

client: Explicitly systemctl start rpm-ostreed if root, dump status #2945

Merged

cgwalters added the deferred Work that may make sense later label Jul 9, 2021

cgwalters closed this Jul 9, 2021

cgwalters mentioned this pull request Jul 19, 2022

Support delegation of privilege using LoadCredential=, add socket activation #3850

Open

cgwalters mentioned this pull request Jul 22, 2022

daemon: Add socket activation via /run/rpm-ostreed.socket #3874

Draft

cgwalters mentioned this pull request Aug 4, 2022

unit: Bump TimeoutStartSec=5m #3905

Merged
