Installation with agent installer or assisted installer with UPI on baremetal fails for v4.16.0-0.okd-scos-2024-08-21-155613 #2018

Open
titou10titou10 opened this issue Aug 22, 2024 · 5 comments
Labels
OKD SCOS 4.16 pre-release-testing Items related to testing nightlies before a release.


titou10titou10 commented Aug 22, 2024

Context

Trying to install a cluster (3 masters + 2 workers):

  • with OKD-SCOS v4.16.0-0.okd-scos-2024-08-21-155613 ("stable" version)
  • on baremetal with UPI
  • using agent based installer (ABI) or Assisted Installer
  • on Proxmox (qemu) with 5 vms : 32G RAM, 100G SSD, 6 vCPUS

Note that the installation works perfectly well with the exact same agent and install config files for:

  • OKD-SCOS v4.15.0-0.okd-scos-2024-01-18-223523 ("stable")
  • OKD-FCOS v4.15.0-0.okd-2024-03-10-010116 ("stable")

Summary

It fails with the following error from the "release-image-pivot" service:

okd5-master1 bootstrap-pivot.sh[25771]: error: Remounting /sysroot read-write: Permission denied

The cause of the problem is the OS image used as bootstrap: fedora-coreos-39.20231101.3.0-live.x86_64.iso
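
To check whether a node hit this same failure, the service journal can be searched for the error signature. This is only a minimal sketch: the helper name `has_pivot_remount_error` is hypothetical, and the unit name `release-image-pivot.service` is an assumption based on the service name quoted above.

```shell
# Hypothetical helper: reads log text on stdin and succeeds (exit 0)
# if it contains the failure signature quoted in this issue.
has_pivot_remount_error() {
  grep -q 'Remounting /sysroot read-write: Permission denied'
}

# Usage on a bootstrap node (assumption: the systemd unit is named
# release-image-pivot.service):
#   journalctl -u release-image-pivot.service --no-pager | has_pivot_remount_error \
#     && echo "pivot failed: /sysroot could not be remounted read-write"
```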

Details

All the details, with debug info and configuration files, are described in this discussion. The logs there are for v4.16.0-0.okd-scos-2024-08-01-132038, but they are the same for v4.16.0-0.okd-scos-2024-08-21-155613.

Workarounds

Overriding the bootstrap OS image with an RHCOS image makes the installation succeed.

I did not choose a random bootstrap OS image: this is the one specified for v4.16 for an OCP installation via the ABI, as shown here: https://github.com/openshift/assisted-service/blob/d3324b06a7c7772f4619c3ab13dd8c0706e55fd9/deploy/podman/configmap.yml#L25

It is probably possible to use another RHCOS image because, during the install process, the nodes upgrade to version 418.9.202408211033-0:

rpm-ostree status
State: idle
Deployments:
● ostree-unverified-registry:quay.io/okd/scos-content@sha256:3f4ca57e8ec68fb5a8ba5e2461c69162e211adba667dac299baf58ccf7923dad
                   Digest: sha256:3f4ca57e8ec68fb5a8ba5e2461c69162e211adba667dac299baf58ccf7923dad
                  Version: 418.9.202408211033-0 (2024-08-21T10:39:04Z)
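
To confirm after the install that a node ended up on the SCOS content rather than on the RHCOS bootstrap image, the version can be pulled out of the status output. A small sketch, assuming the output layout shown above; `first_ostree_version` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `rpm-ostree status` output on stdin and
# prints the value of the first "Version:" line it finds.
first_ostree_version() {
  awk '$1 == "Version:" { print $2; exit }'
}

# Usage on a node:
#   rpm-ostree status | first_ostree_version
```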

Workaround for an Agent Installer (ABI) successful install:

Before building the ISO image, override the bootstrap OS image like this:

export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-live.x86_64.iso
oc adm release extract --command=openshift-install quay.io/okd/scos-release:4.16.0-0.okd-scos-2024-08-21-155613
./openshift-install agent create image --dir install --log-level=debug

Workaround for an Assisted Installer successful install:

The procedure is described here: https://github.com/openshift/assisted-service/tree/master/deploy/podman
In the okd-configmap.yml file, replace (at least) the following variables:

OS_IMAGES: '[{"openshift_version":"4.16","cpu_architecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-live.x86_64.iso","version":"416.94.202406251923-0"}]'
RELEASE_IMAGES: '[{"openshift_version":"4.16","cpu_architecture":"x86_64","cpu_architectures":["x86_64"],"url":"quay.io/okd/scos-release:4.16.0-0.okd-scos-2024-08-21-155613","version":"4.16.0-0.okd-scos-2024-08-21-155613","default":true,"support_level":"beta"}]'

titou10titou10 changed the title from "Install with agent installer or assisted installer fails for 4.16.0-0.okd-scos-2024-08-21-155613" to "Installation with agent installer or assisted installer with UPI on baremetal fails for v4.16.0-0.okd-scos-2024-08-21-155613" on Aug 22, 2024
JaimeMagiera added the "OKD SCOS 4.16" and "pre-release-testing" labels on Aug 23, 2024


0xHexE commented Aug 25, 2024

Hi @titou10titou10,

I tried the workaround, but I think RHEL is missing zincati. The quay.io/okd/scos-content@sha256:cb68498aceefa81f105c4ce6c74787c3e1281d141725b0e20df555aa549dc5aa container exits with:

Error msg: error running preset on unit: Failed to preset unit: Unit file zincati.service does not exist.\n)\nI0825 06:38:53.624260    6508 file_writers.go:293] Writing systemd unit \"install-to-disk.service\"\n"

and the installation is stuck at "Installing: bootstrap". Even creating a dummy zincati.service still fails.


0xHexE commented Aug 25, 2024

I spoke too soon.

It took some hours for the progress to be reflected in the console. It turns out that zincati is not required.

The bootkube commands take a while and, while they are running, they don't produce any logs in systemctl or change status.

There was one issue, though: I had to run this code to fix the network (I am setting up a single-node installation):

cat << EOF | tee /etc/kubernetes/cni/net.d/10-containerd-net.conflist
{
 "cniVersion": "1.0.0",
 "name": "containerd-net",
 "plugins": [
   {
     "type": "bridge",
     "bridge": "cni0",
     "isGateway": true,
     "ipMasq": true,
     "promiscMode": true,
     "ipam": {
       "type": "host-local",
       "ranges": [
         [{
           "subnet": "10.128.0.0/14"
         }]
       ],
       "routes": [
         { "dst": "0.0.0.0/0" },
         { "dst": "::/0" }
       ]
     }
   },
   {
     "type": "portmap",
     "capabilities": {"portMappings": true},
     "externalSetMarkChain": "KUBE-MARK-MASQ"
   }
 ]
}
EOF

Ref: #1966


titou10titou10 commented Aug 25, 2024

I'm not sure what exactly your code is doing, but maybe you are not aware that "extra" manifests can be added before the creation of the ISO image. Inside the directory where you put the install-config and agent-config files, create an "openshift" directory and add the additional manifests there.

Refs:

This page seems related to what you are doing; maybe you can create a manifest from it and put it under the install/openshift directory?

In my install, I have this extra "network-03-config.yaml" manifest file in install/openshift:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    ovnKubernetesConfig:
      genevePort: 6082
      # not necessary, as OKD detects the underlying MTU and sets the value to 9000-100 by itself
      mtu: 8900
      ipsecConfig:
        mode: Disabled
      ipv4:
        internalJoinSubnet: 100.65.0.0/16
        internalTransitSwitchSubnet: 100.89.0.0/16
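
The "extra manifests" step described in this comment could be scripted roughly as follows. This is a sketch, not a verified procedure: the function name `write_extra_network_manifest` is hypothetical, the manifest values are simply copied from the example above and need adapting to your network, and the directory is assumed to be the same one passed to `--dir` when building the ISO.

```shell
# Hypothetical helper: writes the extra network manifest into
# <install-dir>/openshift so it is present before the ISO is built.
write_extra_network_manifest() {
  install_dir="${1:-install}"   # assumption: same directory as --dir
  mkdir -p "${install_dir}/openshift"
  # Values below are copied from the comment above; adjust as needed.
  cat > "${install_dir}/openshift/network-03-config.yaml" << 'EOF'
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    ovnKubernetesConfig:
      genevePort: 6082
      mtu: 8900
      ipsecConfig:
        mode: Disabled
      ipv4:
        internalJoinSubnet: 100.65.0.0/16
        internalTransitSwitchSubnet: 100.89.0.0/16
EOF
}

# Usage, followed by the image build shown in the issue description:
#   write_extra_network_manifest install
#   ./openshift-install agent create image --dir install --log-level=debug
```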


0xHexE commented Aug 25, 2024

When I booted the OKD control plane node for the first time, the network plugin was not configured. In journalctl I had a log saying "No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?", so I created that file manually. I had a dual-stack configuration; maybe that caused it. I am installing again to see whether I get the same issue. I think this was caused by some bug.

After some time I restarted the server, actually a couple of times, and after that OVN was not working at all, so I am trying to reinstall. I had some issues in my network, which I have resolved; let's see if it works or not.


0xHexE commented Aug 25, 2024

I was being too impatient: it took some time, and then the "No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" message went away.

But @titou10titou10, thanks a lot for the investigation. It was a really big help and saved a ton of time.
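
Since the CNI configuration eventually appeared by itself, waiting for it may be safer than writing the file by hand. A minimal polling sketch, assuming the directory named in the kubelet message; `wait_for_cni_config` and its default timeout and interval are hypothetical choices, not values from this thread.

```shell
# Hypothetical helper: polls until the CNI config directory contains at
# least one entry, or gives up after <timeout> seconds.
# Arguments: [dir] [timeout_seconds] [interval_seconds, must be >= 1]
wait_for_cni_config() {
  dir="${1:-/etc/kubernetes/cni/net.d}"
  timeout="${2:-1800}"
  interval="${3:-10}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Succeed as soon as the directory exists and is not empty.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Usage on the node:
#   wait_for_cni_config && echo "CNI configuration is present"
```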
