Skip to content

daemon: Dump systemctl status rpm-ostreed on load failure #2642

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

cgwalters
Copy link
Member

@cgwalters cgwalters commented Jun 25, 2021

daemon: Rename os variable to hostos

Prep for a future patch; os conflicts with the standard Go package.


daemon: Dump systemctl status rpm-ostreed on load failure

This is yet another patch which should help debugging
https://bugzilla.redhat.com/show_bug.cgi?id=1958812
like situations in the future.

I also have coreos/rpm-ostree#2932
cooking but it's going to take a while to cycle down.


Prep for a future patch; `os` conflicts with the standard Go package.
This is yet another patch which should help debugging
https://bugzilla.redhat.com/show_bug.cgi?id=1958812
like situations in the future.

I also have coreos/rpm-ostree#2932
cooking but it's going to take a while to cycle down.
osImageURL, osVersion, err = nodeUpdaterClient.GetBootedOSImageURL()
if err != nil {
// If this fails for some reason, let's dump the unit status
// into our logs to aid future debugging.
cmd := exec.Command("systemctl", "status", "rpm-ostreed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does printing status will capture enough error logs or should we print here rpm-ostreed journal log from current boot?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status will capture the last few lines of logs at the point in time this happens, which I think will be enough.

openshift/must-gather#244 will help for the general case.

Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 28, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2021
@kikisdeliveryservice
Copy link
Contributor

/test e2e-agnostic-upgrade

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 28, 2021

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-disruptive 1de3aa9 link /test e2e-aws-disruptive
ci/prow/okd-e2e-aws 1de3aa9 link /test okd-e2e-aws
ci/prow/e2e-metal-ipi 1de3aa9 link /test e2e-metal-ipi
ci/prow/e2e-vsphere-upgrade 1de3aa9 link /test e2e-vsphere-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 2bbe717 into openshift:master Jun 28, 2021
cgwalters added a commit to cgwalters/rpm-ostree that referenced this pull request Jun 29, 2021
This is a hackier an alternative to coreos#2932
that we can ship immediately because we won't block on SELinux policy.

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name.  For example,
it calls `ostree_sysroot_load()`.  We also end up scanning
the RPM database, and actually parse all the GPG keys
in `/etc/pki/rpm-gpg` etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we
  race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't
  easily see the error because it just appears in the systemd
  journal.  The client will just see a service timeout.

By explicitly using `systemctl start rpm-ostreed.service`,
systemd does all of the error checking for us without involving
`dbus-broker` as a middleman.

Further, by using `systemctl status rpm-ostreed.service` on
failure, we reuse systemd's nice rendering of the status
of the unit instead of reinventing our own.

This PR effectively replicates openshift/machine-config-operator#2642
in our code.
@cgwalters
Copy link
Member Author

just to xref, coreos/rpm-ostree#2945 is driving this down into rpm-ostree, so eventually we'll want to remove this code. But right now, shipping this PR involved literally nothing more than me submitting it and having it merge, but changing rpm-ostree in RHEL will take much longer and involve a lot more exciting paperwork.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this pull request Jun 30, 2021
This is a hackier an alternative to coreos#2932
that we can ship immediately because we won't block on SELinux policy.

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name.  For example,
it calls `ostree_sysroot_load()`.  We also end up scanning
the RPM database, and actually parse all the GPG keys
in `/etc/pki/rpm-gpg` etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we
  race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't
  easily see the error because it just appears in the systemd
  journal.  The client will just see a service timeout.

By explicitly using `systemctl start rpm-ostreed.service`,
systemd does all of the error checking for us without involving
`dbus-broker` as a middleman.

Further, by using `systemctl status rpm-ostreed.service` on
failure, we reuse systemd's nice rendering of the status
of the unit instead of reinventing our own.

This PR effectively replicates openshift/machine-config-operator#2642
in our code.
cgwalters added a commit to coreos/rpm-ostree that referenced this pull request Jun 30, 2021
This is a hackier an alternative to #2932
that we can ship immediately because we won't block on SELinux policy.

For historical reasons, the daemon ends up doing a lot of
initialization before even claiming the DBus name.  For example,
it calls `ostree_sysroot_load()`.  We also end up scanning
the RPM database, and actually parse all the GPG keys
in `/etc/pki/rpm-gpg` etc.

This causes two related problems:

- By doing all this work before claiming the bus name, we
  race against the (pretty low) DBus service timeout of 25s.
- If something hard fails at initialization, the client can't
  easily see the error because it just appears in the systemd
  journal.  The client will just see a service timeout.

By explicitly using `systemctl start rpm-ostreed.service`,
systemd does all of the error checking for us without involving
`dbus-broker` as a middleman.

Further, by using `systemctl status rpm-ostreed.service` on
failure, we reuse systemd's nice rendering of the status
of the unit instead of reinventing our own.

This PR effectively replicates openshift/machine-config-operator#2642
in our code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants