NO-JIRA: Switch layered build to treefile-apply, drain get-ocp-repo.sh #1780

Merged
jlebon merged 11 commits into openshift:master from jlebon:pr/nuke-okd-c9s
Apr 4, 2025

Conversation

Member

@jlebon jlebon commented Mar 28, 2025

See individual commit messages.

@jlebon jlebon changed the title Nuke okd-c9s variant NO-JIRA: Nuke okd-c9s variant Mar 28, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 28, 2025
@openshift-ci-robot

@jlebon: This pull request explicitly references no jira issue.

Details

In response to this:

This is not built anywhere by anyone. OKD has moved to the new layered node image model and uses the output from the c9s variant we currently build internally.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from c4rt0 and marmijo March 28, 2025 16:03
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2025
Member Author

jlebon commented Mar 28, 2025

[1/3] STEP 9/9: RUN --mount=type=secret,id=yumrepos,target=/os/secret.repo if [[ -n "${VARIANT}" ]]; then MANIFEST="manifest-${VARIANT}.yaml"; EXTENSIONS="extensions-${VARIANT}.yaml"; else MANIFEST="manifest.yaml"; EXTENSIONS="extensions.yaml"; fi && rpm-ostree compose extensions --rootfs=/ --output-dir=/usr/share/rpm-ostree/extensions/ ./"${MANIFEST}" ./"${EXTENSIONS}"
error: Can't open file "./manifest-okd-c9s.yaml" for reading: No such file or directory (os error 2) 
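
For reference, the selection logic in the failing `RUN` step can be sketched as a small shell function. This is only an illustration: the function name `select_manifests` and the up-front existence check are assumptions added here and are not part of the Containerfile. The check makes a removed variant fail with a clearer message before `rpm-ostree` is invoked.

```shell
# Sketch only: mirrors the VARIANT -> manifest/extensions selection from the
# RUN step above. The existence check is a hypothetical addition.
select_manifests() (
    variant="${1:-}"
    dir="${2:-.}"
    if [ -n "${variant}" ]; then
        manifest="manifest-${variant}.yaml"
        extensions="extensions-${variant}.yaml"
    else
        manifest="manifest.yaml"
        extensions="extensions.yaml"
    fi
    # Fail early if either file is missing for the requested variant.
    for f in "${manifest}" "${extensions}"; do
        if [ ! -f "${dir}/${f}" ]; then
            echo "error: ${dir}/${f} not found (variant: '${variant:-none}')" >&2
            exit 1
        fi
    done
    echo "${manifest} ${extensions}"
)
```

Calling `select_manifests okd-c9s .` in a tree where `manifest-okd-c9s.yaml` has been deleted would return non-zero, which matches the failure shown in the log.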

Man, this get-ocp-repo.sh script has become quite the beast. I'm thinking of reworking how that works entirely to make it saner.

@jlebon jlebon changed the title NO-JIRA: Nuke okd-c9s variant NO-JIRA: Switch layered build to treefile-apply, drain get-ocp-repo.sh Mar 31, 2025
Member Author

jlebon commented Mar 31, 2025

This requires coreos/rpm-ostree#5351 and coreos/coreos-assembler#4054.

Member Author

jlebon commented Mar 31, 2025

cc @Prashanth684 since this also touches OKD

Contributor

@jbtrystram jbtrystram left a comment

Awesome !
Just a couple of questions :)

Comment thread common.yaml
Comment thread extensions-okd-c9s.yaml
Comment thread Containerfile Outdated
Comment thread README.md
- `rhel-9.6`: RHEL 9.6-based CoreOS; without OpenShift components.
- `ocp-rhel-9.6`: RHEL 9.6-based CoreOS; including OpenShift components.
- `c9s`: CentOS Stream-based CoreOS, without OKD components.
- `okd-c9s`: CentOS Stream-based CoreOS, including OpenShift components. This
Contributor

I still see the okd-c9s variant used in the okd/scos build pipeline [1] run in MOC and more specifically in the latest commit okd-project/okd-coreos-pipeline@d4be53e for 4.19.
But according to openshift/release#62296 the scos imagestream (to be used as node image) is now populated by the OpenShift CI itself instead of the MOC pipeline.

So maybe we should decommission the MOC pipeline [2] before merging this patch? What do you think @Prashanth684? It's not a blocker though; the MOC builds would just fail, and that can be dealt with as a follow-up.

[1] https://github.com/search?q=repo%3Aokd-project%2Fokd-coreos-pipeline%20okd-c9s&type=code
[2] https://github.com/okd-project/okd-coreos-pipeline

Member Author

The okd-c9s "variant" used in that pipeline is for the extensions build, not the base OS AFAIK.

That said, there is indeed a small cleanup possible there which is that it no longer needs to provide a VARIANT argument to the extensions build since that's auto-detected now.
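
As an illustration of what such auto-detection could look like, here is a minimal sketch that picks the manifest and extensions files from an os-release file. The function name `detect_manifests` and the `ID`/`VERSION_ID` to filename mapping are assumptions made for this example; the actual logic in the extensions build may differ.

```shell
# Hypothetical sketch: derive the manifest/extensions filenames from an
# os-release file instead of a VARIANT build argument.
detect_manifests() (
    os_release="${1:-/etc/os-release}"
    # os-release is a shell-compatible KEY=value file, so it can be sourced.
    . "${os_release}"
    case "${ID}" in
        rhel)
            echo "manifest-rhel-${VERSION_ID}.yaml extensions-ocp-rhel-${VERSION_ID}.yaml"
            ;;
        centos)
            echo "manifest-c${VERSION_ID}s.yaml extensions-okd-c${VERSION_ID}s.yaml"
            ;;
        *)
            echo "error: unsupported OS ID '${ID}'" >&2
            exit 1
            ;;
    esac
)
```

With an os-release that has `ID="centos"` and `VERSION_ID="9"`, this sketch prints `manifest-c9s.yaml extensions-okd-c9s.yaml`, so the caller no longer needs to pass a variant.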

Contributor

So maybe we should decommission the MOC pipeline [2] before merging this patch? What do you think @Prashanth684? It's not a blocker though; the MOC builds would just fail, and that can be dealt with as a follow-up.

Correct. MOC is only used for 4.18. Once we release 4.19 as stable, we will stop those also. We are working to migrate off MOC (we still do OKD release promotions from there) to an internal cluster.

@joelcapitao
Contributor

It looks like the CI base image does not contain the /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial file containing the CentOS Stream GPG keys.

Member Author

jlebon commented Apr 1, 2025

It looks like the CI base image does not contain the /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial file containing the CentOS Stream GPG keys.

Yeah, it needed coreos/coreos-assembler#4054. That's merged now but there's no point retriggering the tests until coreos/rpm-ostree#5351 percolates down too. We're working on that.

Comment thread ci/get-ocp-repo.sh
Member Author

jlebon commented Apr 2, 2025

/retest

@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from b8fbcd9 to ba35f21 Compare April 2, 2025 19:10
Member Author

jlebon commented Apr 2, 2025

OK, we have new enough rpm-ostree and cosa now. Let's try this out!

Member Author

jlebon commented Apr 2, 2025

@Prashanth684 Can you take care of this bit once this lands:

That said, there is indeed a small cleanup possible there which is that it no longer needs to provide a VARIANT argument to the extensions build since that's auto-detected now.

Member

@dustymabe dustymabe left a comment

mostly LGTM - some comments

Comment thread Containerfile
Comment thread extensions/Dockerfile Outdated
Comment on lines +44 to +45
# buildah doesn't seem to support heredoc output
# redirection like buildkit so do it manually here
Member

is there a bug for this we can link to?

Member Author

I didn't find one, but I could file one I guess.

Ahhh OK, just saw your other comment below related to this. The feature I'm talking about here isn't just bash redirection, but a RUN feature so you can write e.g.

RUN <<EOF > /out
echo foobar
EOF

and it'll go to /out. Which would've been perfect for our use case here.
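
Without that `RUN`-level redirection, the same effect can be had from a plain shell script, which is roughly the "do it manually" approach. A minimal sketch, where the function name `write_block_output` is made up for this example:

```shell
# Sketch: run a heredoc body as a script and redirect its stdout to a file,
# emulating `RUN <<EOF > /out` with ordinary shell redirection.
write_block_output() (
    out="$1"
    # The quoted delimiter ('EOF') prevents expansion in the outer shell;
    # the inner `sh` reads the heredoc from stdin and executes it.
    sh > "${out}" <<'EOF'
echo foobar
EOF
)
```

Running `write_block_output /out` writes the output of the heredoc body to `/out`.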

Comment thread extensions/Dockerfile Outdated
Comment on lines +16 to +17
MANIFEST="manifest-rhel-9.6.yaml"
EXTENSIONS="extensions-ocp-rhel-9.6.yaml"
Member

Does this imply these paths will need to be updated each time we, say, bump to the next version of RHEL (i.e. when RHEL 9.7 or 9.8 is here)?

I would kind of prefer that we didn't have to remember to update these variables when that happens.

Member Author

Not 9.7, but 9.8 yeah. It would be part of all the tree updates we do after branching for 4.22 I guess? When bumping the base variant from rhel-9.6 to rhel-9.8. Though by then we should be on top of rhel-bootc which likely means we can rework this too since it's then driven by the base RHEL version of the rhel-bootc image.

Comment thread ci/get-ocp-repo.sh
Comment thread Containerfile Outdated
Comment thread Containerfile Outdated
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ba35f21 to 7bbe649 Compare April 3, 2025 03:19
Member Author

jlebon commented Apr 3, 2025

Hmm, the builder is choking on the heredocs. I think possibly the Dockerfile parser there (which runs before it's handed off to buildah) is getting tripped up on something. :( I'll dig into this a bit, but if it can't be easily worked around, I think I'll just add a commit that moves the heredocs to shell scripts for now.

c9s builds are failing on

 error: Installing packages: Updating rpm-md repo 'c9s-baseos-mirror': Failed to download gpg key for repo 'c9s-baseos-mirror': Curl error (37): Could not read a file:// file for file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial [Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial] 

which I have no idea why. It's using new enough cosa with coreos/coreos-assembler#4054. And using that new cosa locally I can't reproduce this.

Contributor

@joelcapitao joelcapitao left a comment

See my inline comment, sorry I missed it in my previous review.
Otherwise LGTM 👍

Comment thread c9s.repo Outdated
@joelcapitao
Contributor

Hmm, the builder is choking on the heredocs. I think possibly the Dockerfile parser there (which runs before it's handed off to buildah) is getting tripped up on something. :( I'll dig into this a bit, but if it can't be easily worked around, I think I'll just add a commit that moves the heredocs to shell scripts for now.

c9s builds are failing on

 error: Installing packages: Updating rpm-md repo 'c9s-baseos-mirror': Failed to download gpg key for repo 'c9s-baseos-mirror': Curl error (37): Could not read a file:// file for file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial [Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial] 

which I have no idea why. It's using new enough cosa with coreos/coreos-assembler#4054. And using that new cosa locally I can't reproduce this.

It's because the latest CI job run pulled cosa commit b45a4066b16a2332517f659111d4f474372f77d5 [1]

+ [[ -d /cosa ]]
+ jq .
{
  "date": "2025-04-02T17:18:40Z",
  "git": {
    "commit": "b45a4066b16a2332517f659111d4f474372f77d5",
    "origin": "https://github.com/coreos/coreos-assembler.git",
    "branch": "HEAD",
    "dirty": "false"
  },

and not the latest one with the changes we want coreos/coreos-assembler@ae0e86a

Maybe some cache issue or race condition issue ? I'm not yet familiar with the CI workflow to be sure, maybe a simple /retest should work

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_os/1780/pull-ci-openshift-os-master-scos-9-build-test-qemu/1907634163477909504/build-log.txt

@joelcapitao
Contributor

Hmm, the builder is choking on the heredocs. I think possibly the Dockerfile parser there (which runs before it's handed off to buildah) is getting tripped up on something. :( I'll dig into this a bit, but if it can't be easily worked around, I think I'll just add a commit that moves the heredocs to shell scripts for now.

c9s builds are failing on

 error: Installing packages: Updating rpm-md repo 'c9s-baseos-mirror': Failed to download gpg key for repo 'c9s-baseos-mirror': Curl error (37): Could not read a file:// file for file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial [Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial] 

which I have no idea why. It's using new enough cosa with coreos/coreos-assembler#4054. And using that new cosa locally I can't reproduce this.

It's because the latest CI job run pulled cosa commit b45a4066b16a2332517f659111d4f474372f77d5 [1]

+ [[ -d /cosa ]]
+ jq .
{
  "date": "2025-04-02T17:18:40Z",
  "git": {
    "commit": "b45a4066b16a2332517f659111d4f474372f77d5",
    "origin": "https://github.com/coreos/coreos-assembler.git",
    "branch": "HEAD",
    "dirty": "false"
  },

and not the latest one with the changes we want coreos/coreos-assembler@ae0e86a

Maybe some cache issue or race condition issue ? I'm not yet familiar with the CI workflow to be sure, maybe a simple /retest should work

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_os/1780/pull-ci-openshift-os-master-scos-9-build-test-qemu/1907634163477909504/build-log.txt

Hmm, forget about it, that was a wrong assumption: the codebase up to b45a4 contains the changes we want from https://github.com/coreos/coreos-assembler/pull/4054/commits
It's odd that it does not work in CI.

Member Author

jlebon commented Apr 3, 2025

OK, coreos/coreos-assembler#4059 should fix the GPG issue!

@jbtrystram
Contributor

/retest

@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch 2 times, most recently from 100d9f8 to ebd3ff7 Compare April 4, 2025 14:18
jlebon added 2 commits April 4, 2025 10:50
Now that (1) we've reworked the layered node image build to only enable
the repos it needs, and (2) we've simplified the CentOS Stream GPG keys,
we can delete all of the complex logic in this repo. It basically just
boils down to curl'ing down all the repo files we may need to build the
various artifacts that use this script.

We only want certain packages to come from the 4.19 plashet. And we
can't just rely on NVRs because the plashet may sometimes win. Long-term
we should sever that dependence on ART packages, but for now, let's add
a hack to essentially generate a repo on the fly from the 4.19 repo with
the filters we need.

The advantage of doing it this way instead of e.g. in the
`get-ocp-repo.sh` script is that this applies both in CI and locally.
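
One way to picture "generating a repo on the fly with filters" is a derived repo stanza that only admits selected packages. The sketch below is an assumption, not the code from this commit: the function name `make_filtered_repo`, the `-filtered` suffix, the placeholder URL, and the package names are all hypothetical. It uses the `includepkgs` repo option from `dnf.conf`; whether the commit relies on that option is not stated in the message.

```shell
# Hypothetical sketch: emit a repo definition derived from an existing repo,
# restricted to an allowlist of packages via the dnf `includepkgs` option.
make_filtered_repo() (
    src_id="$1"
    baseurl="$2"
    shift 2
    # Remaining arguments are package names; join them with commas.
    pkgs="$(echo "$*" | tr ' ' ',')"
    cat <<EOF
[${src_id}-filtered]
name=${src_id} (filtered)
baseurl=${baseurl}
includepkgs=${pkgs}
EOF
)
```

For example, `make_filtered_repo ocp-4.19 https://example.com/repo pkg-a pkg-b > filtered.repo` would write a stanza from which only `pkg-a` and `pkg-b` can be installed, so higher NVRs of other packages in that repo cannot win.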
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ebd3ff7 to ac98e93 Compare April 4, 2025 14:51
jlebon added 2 commits April 4, 2025 11:26
The OCP builder API path isn't parsing the heredoc correctly for some
reason:

     error: build error: EOF: unterminated heredoc

This will be fixed by openshift/builder#469.

Anyway, just work around this for now by moving all the logic to
scripts. It does make the Containerfiles cleaner at least, now that the
logic has gotten so large, and we get syntax highlighting, ShellCheck,
etc., so it's probably for the best.

Before, we inherited this from the ocp-rhel-9.6 manifest. But now that
we're inheriting from the rhel-9.6 manifest, that repo isn't enabled
by default there since it's not strictly needed (because we don't ship
openvswitch in the base).

So we need to enable it here ourselves.
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ac98e93 to 7997d1a Compare April 4, 2025 15:35
Member Author

jlebon commented Apr 4, 2025

OK, the OKD node image build is failing on

Error: Unknown repo: 'c9s-baseos'

which I know why. Got a fix for that, but let's see if the other CI tests pass. If they do, then let's get this in, and I'll add my fix to #1498 instead when I rebase it.

@dustymabe
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 4, 2025
Contributor

openshift-ci Bot commented Apr 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustymabe, jbtrystram, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [dustymabe,jbtrystram,jlebon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 48a1891 and 2 for PR HEAD 7997d1a in total

Member Author

jlebon commented Apr 4, 2025

This is a known failure. It'll be fixed in #1498.

/override ci/prow/okd-scos-images

OK, as for the RHCOS failure, it seems to have just... timed out?

 --- FAIL: ext.config.shared.rpm-ostree.kernel-replace (1206.73s)
        harness.go:106: TIMEOUT[20m0s]: ssh: sudo /usr/local/bin/kolet run-test-unit kola-runext.service
        harness.go:106: TIMEOUT[20m0s]: ssh: journalctl -t kola-runext-kernel-replace 

Nothing in particular in the journal logs. Last operation was

Apr  4 18:20:36.204321 rpm-ostreed.service[1391]: Fetching ostree-unverified-image:oci-archive:/var/tmp/coreos-derived.ociarchive

So possibly a genuine timeout because of e.g. slow I/O.

Anyway, don't think it needs to block this. It'll rerun in #1498.

Contributor

openshift-ci Bot commented Apr 4, 2025

@jlebon: Overrode contexts on behalf of jlebon: ci/prow/okd-scos-images

Details

In response to this:

This is a known failure. It'll be fixed in #1498.

/override ci/prow/okd-scos-images

OK, as for the RHCOS failure, it seems to have just... timed out?

--- FAIL: ext.config.shared.rpm-ostree.kernel-replace (1206.73s)
       harness.go:106: TIMEOUT[20m0s]: ssh: sudo /usr/local/bin/kolet run-test-unit kola-runext.service
       harness.go:106: TIMEOUT[20m0s]: ssh: journalctl -t kola-runext-kernel-replace 

Nothing in particular in the journal logs. Last operation was

Apr  4 18:20:36.204321 rpm-ostreed.service[1391]: Fetching ostree-unverified-image:oci-archive:/var/tmp/coreos-derived.ociarchive

So possibly a genuine timeout because of e.g. slow I/O.

Anyway, don't think it needs to block this. It'll rerun in #1498.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci Bot commented Apr 4, 2025

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 7997d1a | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/rhcos-9-build-test-qemu | 7997d1a | link | true | `/test rhcos-9-build-test-qemu` |
| ci/prow/e2e-aws | 7997d1a | link | false | `/test e2e-aws` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jlebon jlebon merged commit 902882d into openshift:master Apr 4, 2025
10 of 14 checks passed
@jlebon jlebon deleted the pr/nuke-okd-c9s branch April 4, 2025 18:51
jbtrystram added a commit to jbtrystram/openshift-os that referenced this pull request Apr 7, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
dustymabe pushed a commit to dustymabe/os that referenced this pull request Apr 8, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
dustymabe pushed a commit to jbtrystram/openshift-os that referenced this pull request Apr 8, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
joelcapitao added a commit to joelcapitao/release that referenced this pull request Apr 17, 2025
Since [1], the extension build automatically detects the release
based on the builder image [2]. So this argument can be removed.

[1] openshift/os#1780
[2] https://github.com/openshift/os/blob/master/extensions/build.sh#L9-L23
joelcapitao added a commit to joelcapitao/release that referenced this pull request Apr 22, 2025
Since [1], the extension build automatically detects the release
based on the builder image [2]. So this argument can be removed.

[1] openshift/os#1780
[2] https://github.com/openshift/os/blob/master/extensions/build.sh#L9-L23
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants