Split RHCOS into layers #1637
Conversation
Skipping CI for Draft Pull Request.
Force-pushed from 067ece5 to f79684b.
Awesome work on this!
openshift/kubernetes has a specific workflow where jobs will build a new kubelet to use during the job run. This helps with rebase work and validating new kubernetes versions coming into OpenShift. We should preserve this workflow when migrating to RHCOS layering. /cc @soltysh

I don't expect any issues there. That workflow should keep working as is.
Force-pushed from f79684b to a6a7438.
/cc @cybertron @andfasano

I believe this was the pre-req work done in openshift/kubernetes#1805, which ensured we won't have problems in o/k.
OK, so let's resume the bootstrapping issue. Restating some of the things from above and from researching further:
What I'm playing with now is basically to have a special systemd target for doing the pivot. This is in effect like a more aggressive version of the WIP in openshift/installer#8742.
@jlebon That sounds like it might work. Where will the kubelet be coming from? An OpenShift-built image?
Won't doing …
From the node image (i.e. for OCP, the `rhel-coreos` payload image).
No. The system boots into …

Via a generator overriding …
These images are built as part of the CoreOS pipeline. They will be used as bases for building the node images containing OCP-versioned content for CI. Part of openshift/enhancements#1637.
As part of openshift/enhancements#1637, we want to start building the node image as a layered build on top of an RHCOS base image. For now, promote this image as `node`. In the future, when we're ready to switch CI over to the node image, it'll take the place of `rhel-coreos`.
As part of openshift/enhancements#1637, we want to start building the node image as a layered build on top of an RHCOS base image. For now, don't promote this image. In the future, when we're ready to switch CI over, it'll take the place of `rhel-coreos`.
* openshift/os: start building node image

  As part of openshift/enhancements#1637, we want to start building the node image as a layered build on top of an RHCOS base image. For now, don't promote this image. In the future, when we're ready to switch CI over, it'll take the place of `rhel-coreos`.

* openshift/os: add an e2e-aws test

  Now that we're building the node image in CI, we can run cluster tests with it. Let's start simple for now with just the standard e2e-aws test. Note that it doesn't run by default. This means that we can request it on specific PRs only using `/test`.
As per openshift/enhancements#1637, we're trying to get rid of all OpenShift-versioned components from the bootimages. This means that there will no longer be `oc`, `kubelet`, or `crio` binaries, for example, which bootstrapping obviously relies on.

Instead, we now change things up so that early on when booting the bootstrap node, we pull down the node image, unencapsulate it (this just means converting it back to an OSTree commit), mount over its `/usr`, and import new `/etc` content. This is done by isolating to a different systemd target to only bring up the minimum number of services needed to do the pivot, and then carrying on with bootstrapping. This does not incur additional reboots and should be compatible with AI/ABI/SNO. But it is, of course, a huge conceptual shift in how bootstrapping works.

With this, we would now always be sure that we're using the same binaries as the target version as part of bootstrapping, which should alleviate some issues such as AI late-binding (see e.g. https://issues.redhat.com/browse/MGMT-16705). The big exception, of course, is the kernel. Relatedly, note we do persist `/usr/lib/modules` from the booted system so that loading kernel modules still works.

To be conservative, the new logic only kicks in when using bootimages which do not have `oc`. This will allow us to ratchet this in more easily. Down the line, we should be able to replace some of this with `bootc apply-live` once that's available (and also works in a live environment). (See containers/bootc#76.)

For full context, see the linked enhancement and discussions there.
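For illustration, here's a minimal sketch of the pivot using plain bind mounts (the real implementation unencapsulates to an OSTree commit and is driven by systemd units; the pullspec and paths here are placeholders, not the installer's actual logic):

```bash
#!/bin/bash
# Illustrative sketch only -- not the actual installer implementation.
set -euo pipefail

NODE_IMAGE=$1  # node image pullspec (placeholder)

# Keep the booted kernel's modules reachable so module loading still
# works after /usr is replaced.
mkdir -p /run/booted-modules
mount --bind /usr/lib/modules /run/booted-modules

# Pull the node image and mount its root filesystem.
podman pull "$NODE_IMAGE"
mnt=$(podman image mount "$NODE_IMAGE")

# Overlay the node image's /usr over the booted /usr (no reboot), then
# put the booted kernel's modules back on top.
mount --bind "$mnt/usr" /usr
mount --bind /run/booted-modules /usr/lib/modules

# Import new /etc content, without clobbering existing files.
cp -an "$mnt/etc/." /etc/
```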
As per openshift/enhancements#1637, we're trying to get rid of all OpenShift-versioned components from the bootimages. This means that there will no longer be `oc`, `kubelet`, or `crio` binaries, for example, which bootstrapping obviously relies on.

To adapt to this, the OpenShift installer now ships a new `node-image-overlay.service` in its bootstrap Ignition config. This service takes care of pulling down the node image and overlaying it, effectively updating the system to the node image version.

Here, we accordingly also adapt assisted-installer so that we run `node-image-overlay.service` before starting e.g. `kubelet.service` and `bootkube.service`.

See also: openshift/installer#8742
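As a sketch of the ordering change: the unit names come from the commit message, but the drop-in path and contents below are illustrative assumptions, not the actual assisted-installer change (which adjusts ordering in its own code):

```bash
# Ensure kubelet only starts once the node image overlay has completed;
# an analogous drop-in would apply to bootkube.service.
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/10-node-image-overlay.conf <<'EOF'
[Unit]
Wants=node-image-overlay.service
After=node-image-overlay.service
EOF
systemctl daemon-reload
```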
* ops: add new FileExists method

  Prep for next patch. Also use that in one spot where we were manually calling `stat`.

* overlay node image before bootstrapping if necessary

  As per openshift/enhancements#1637, we're trying to get rid of all OpenShift-versioned components from the bootimages. This means that there will no longer be `oc`, `kubelet`, or `crio` binaries, for example, which bootstrapping obviously relies on. To adapt to this, the OpenShift installer now ships a new `node-image-overlay.service` in its bootstrap Ignition config. This service takes care of pulling down the node image and overlaying it, effectively updating the system to the node image version. Here, we accordingly also adapt assisted-installer so that we run `node-image-overlay.service` before starting e.g. `kubelet.service` and `bootkube.service`. See also: openshift/installer#8742
As part of openshift/enhancements#1637, the version string for RHCOS will change from being OCP+RHEL-based (e.g. 419.96...) to being purely RHEL-based (e.g. 9.6...). Adapt the logic to this new scheme so that it links to the right stream. We should eventually clean this up, though, so that the stream name is available more directly to the release controller and it doesn't need to do any guessing.
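A sketch of the kind of guess involved, assuming version strings like the examples above (the patterns are assumptions, not the release controller's actual code):

```bash
ver=9.6.20250523-0  # hypothetical RHEL-only RHCOS version string
if [[ $ver =~ ^[0-9]{3}\. ]]; then
  # e.g. 419.96...: first component encodes the OCP version (4.19)
  echo "legacy OCP+RHEL scheme: derive stream from the OCP version"
else
  # e.g. 9.6...: version is purely RHEL-based
  echo "new RHEL-only scheme: derive stream from the RHEL version"
fi
```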
As part of openshift/enhancements#1637, the CoreOS pipeline now only builds a RHEL-only RHCOS base image, and later on, a node image is built on top of this base image to add all the OCP-specific packages. As a result, the RHCOS release browser will only display the diff of the _base_ image content, and will not have any OCP content.

Often, that's sufficient. E.g. if you're just interested in kernel or systemd changes, the RHCOS release browser is enough. However, users can be confused by the lack of OCP packages in the list.

Let's add an info box to the changelog page with instructions to generate a full diff, but only display it if one of the RHCOS versions is of the new RHEL-only kind. As a bonus, these instructions also conveniently serve as a way to get any diff at all without VPN access.

Of course, being able to generate this diff ourselves and render it would be useful. And such a mechanism need not be specific to the CoreOS image; any of the many OCP images we ship which contain RPMs would benefit from being able to view package diffs. The most likely candidate for implementing this would be in `oc adm release info`, but downloading images to generate diffs is a much more expensive operation than the git changelog-based one, so a caching service might be better instead. (That said, it's possible with Konflux that we'll end up storing RPM lockfiles in git, in which case an RPM diff _is_ a git diff, which matches nicely the existing semantics.)
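The instructions could look something like the following sketch, assuming the rpmdb is queryable at its usual location inside the image (`$IMG1`/`$IMG2` stand in for the two node-image pullspecs; the info box's actual wording may differ):

```bash
# Dump each node image's package list and diff them.
for img in "$IMG1" "$IMG2"; do
  sudo podman pull "$img"
  mnt=$(sudo podman image mount "$img")
  # On OSTree-based images, /var/lib/rpm typically points at
  # /usr/share/rpm, so a plain --root query finds the rpmdb.
  sudo rpm -qa --root="$mnt" | sort > "/tmp/pkgs-$(basename "$img" | tr ':' '-')"
  sudo podman image umount "$img"
done
diff /tmp/pkgs-*
```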
It's often useful when looking up release images to know the list of RPM packages that shipped in the node image. Add new switches for this:

- `oc adm release info --rpmdb $IMG` will list all the packages in the node image for the given release image payload
- `oc adm release info --rpmdb-diff $IMG1 $IMG2` will diff the set of packages in the node image for the given release image payloads

The code is generic over the actual target image. By default, the node image is used, but `--rpmdb-image` can be used to select a different one.

The primary motivation for this is openshift/enhancements#1637, in which the node image will no longer be built within the CoreOS pipeline as a base image. Instead, it will be a layered image built in OpenShift CI/Konflux. As a result, layered packages will not show up in the CoreOS release browser differ. With this functionality, the release controller will be able to render RPM diffs in the web UI, greatly de-emphasizing the CoreOS differ and effectively dropping the requirement for having VPN access.

Some notes on the implementation:

- The rpmdb for a given image is cached, keyed by the image digest.
- We don't try to be smart here and e.g. only download some layers; there are some issues with doing that. We literally do download the full image, _but_ we only cache the rpmdb content and throw away the rest. That said, the high cost isn't an issue in practice because the release controller can nicely represent operations which take time, so it didn't feel worth the effort to optimize this further.

Once we have SBOMs available for all our images, they should be a much cheaper way to query RPM contents. Additionally/alternatively, for the node image specifically, if we ever end up with lockfiles in the git repo, this would effectively mean that the git changelog _is_ the RPM changelog too, meshing nicely with the existing infrastructure around that.
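For example (the switches are the ones proposed above; the release pullspecs, and `machine-config-operator` as a `--rpmdb-image` value, are placeholders):

```bash
# List the node image's packages for one release:
oc adm release info --rpmdb \
  quay.io/openshift-release-dev/ocp-release:4.19.0-x86_64

# Diff the node image's packages between two releases:
oc adm release info --rpmdb-diff \
  quay.io/openshift-release-dev/ocp-release:4.18.9-x86_64 \
  quay.io/openshift-release-dev/ocp-release:4.19.0-x86_64

# Target a different payload image instead of the node image:
oc adm release info --rpmdb --rpmdb-image=machine-config-operator \
  quay.io/openshift-release-dev/ocp-release:4.19.0-x86_64
```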
As part of openshift/enhancements#1637, we've moved OCP 4.19 to use bootimages with RHEL content only. This means that the bootimages built with OCP content will never be used in practice. Add a `skip_disk_images` knob to disable building them.

We still generate the QEMU image, both so kola tests can run and because it's useful for debugging, but we drop everything else. (Actually, we could also skip the QEMU image and sanity-check the OCI image with `kola run --oscontainer`, but that requires more rewiring.) We don't generate live media since the test coverage from that isn't meaningfully different from the RHEL-only variants, given that the additional OCP packages are unrelated.
This enhancement describes improvements to the way RHEL CoreOS (RHCOS) is built so that it will better align with image mode for RHEL, while also providing benefits on the OpenShift side. Currently, RHCOS is built as a single layer that includes both RHEL and OCP content. This enhancement proposes splitting it into three layers. Going from bottom to top:

1. the bootc layer: the base image, shared with image mode for RHEL
2. the CoreOS layer: adding the CoreOS-specific configuration and packages on top
3. the node layer: adding the OpenShift-versioned components (e.g. `kubelet`, `crio`, `oc`)
The terms "bootc layer", "CoreOS layer", and "node layer" will be used throughout this enhancement to refer to these.
The details of this enhancement focus on doing the first split: creating the node layer as distinct from the CoreOS layer (which will not yet be rebased on top of a bootc layer). The two changes involved which most affect OCP are:

1. the `rhel-coreos` payload image will be built in Prow/Konflux (as any other)
2. the bootimages will ship RHEL content only, with no OpenShift-versioned components (e.g. `oc`, `kubelet`, `crio`)

Tracked at: https://issues.redhat.com/browse/OCPSTRAT-1190