Stale UEFI boot entry left behind after reprovisioning #946

Open
bgilbert opened this issue Aug 31, 2021 · 19 comments

@bgilbert
Contributor

bgilbert commented Aug 31, 2021

On UEFI systems, shim is invoked as the fallback bootloader on first boot and creates a UEFI boot entry. That boot entry encodes the partition GUID of the ESP, so each reprovision of the machine will create a new boot entry and leave the stale one in place. In principle this might eventually fill up flash.
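
To make the failure mode concrete, here is a minimal sketch of how stale entries could be spotted. It assumes the usual `efibootmgr -v` line layout (`BootXXXX* label HD(part,GPT,guid,...)`); the sample output, GUIDs, and the `stale_entries` helper are all made up for illustration and are not part of any existing tool.

```python
import re

# Illustrative `efibootmgr -v` output; the GUIDs are invented for this sketch.
SAMPLE = r"""BootCurrent: 0001
BootOrder: 0001,0000
Boot0000* Fedora  HD(2,GPT,11111111-2222-3333-4444-555555555555,0x1000,0x3f800)/File(\EFI\fedora\shimx64.efi)
Boot0001* Fedora  HD(2,GPT,aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee,0x1000,0x3f800)/File(\EFI\fedora\shimx64.efi)
"""

# Matches only entries whose device path starts with a GPT hard-drive node.
ENTRY_RE = re.compile(
    r"^Boot(?P<num>[0-9A-Fa-f]{4})\*?\s+(?P<label>.*?)\s+"
    r"HD\(\d+,GPT,(?P<guid>[0-9A-Fa-f-]{36}),",
    re.MULTILINE,
)


def stale_entries(output, present_guids):
    """Return (bootnum, label, guid) for entries whose partition GUID
    is not among the partition GUIDs currently present on the machine."""
    present = {g.lower() for g in present_guids}
    return [
        (m["num"], m["label"], m["guid"].lower())
        for m in ENTRY_RE.finditer(output)
        if m["guid"].lower() not in present
    ]


if __name__ == "__main__":
    # Pretend only the second ESP still exists after reprovisioning.
    print(stale_entries(SAMPLE, ["aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"]))
```

In this sample, `Boot0000` points at an ESP GUID that no longer exists, which is the kind of entry each reprovision would leave behind.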

On first boot, we currently randomize the disk's GUID and the filesystem UUIDs of / and /boot, but we do not randomize partition GUIDs. (And we can't randomize the ESP partition GUID then, since it'll break the boot entry just created by shim.) cosa's create_disk.sh does randomize partition GUIDs, however, so each OS release will have different ones. That seems like an awkward middle ground.

Perhaps we should hardcode the partition GUID for the ESP, allowing the existing boot entry to be reused after reprovisioning. That would undoubtedly cause trouble if two disks have FCOS installed, but there are already other reasons the OS won't boot successfully in that case. It would also double down on violating the GPT spec, which wants GUIDs to be globally unique.

xref https://bugzilla.redhat.com/show_bug.cgi?id=1997805
xref https://bugzilla.redhat.com/show_bug.cgi?id=1977983

@dustymabe dustymabe added the meeting topics for meetings label Sep 1, 2021
@dustymabe
Member

@bgilbert - does this overlap with #563?

@dustymabe
Member

We discussed this at the community meeting today.

13:08:27   dustymabe | #agreed we'll drop the shim fallback .CSV file
                     | (/boot/efi/EFI/fedora/*.CSV) OR pick up a new shim subpackage which
                     | does the same, which will prevent the creation of a new UEFI boot
                     | entry (variable stored in nvram/flash) every time FCOS is
                     | reprovisioned

@dustymabe
Member

I assume we need to make sure this applies to both the platform artifact images and the live ISO/PXE images (anywhere EFI is used to boot)?

@dustymabe dustymabe added status/decided status/pending-action Needs action and removed meeting topics for meetings labels Sep 1, 2021
@bgilbert
Contributor Author

bgilbert commented Sep 2, 2021

I'm not sure yet about EFI PXE boot, but it does apply to the live ISO. However, dropping the CSV file from the ISO's EFI image causes the machine to bootloop.

@bgilbert
Contributor Author

bgilbert commented Sep 2, 2021

And yes, #563 looks like shim initializing the boot variable and rebooting. It appears that shim < 15.3 (in some circumstances?) then reboots rather than falling through to GRUB, while ≥ 15.3 offers a menu with the choice to keep going.

@bgilbert
Contributor Author

bgilbert commented Sep 3, 2021

@vathpela It looks like when the CSV file is removed, shim 15.4 boot-loops in the Boot Option Restoration screen.

@travier
Member

travier commented Sep 3, 2021

See openshift/assisted-installer#229 & openshift/assisted-installer#313

@travier
Member

travier commented Sep 3, 2021

See answer in openshift/assisted-installer#229 (comment)

@cgwalters
Member

We may need to offer something like coreos-installer install --reset-uefi or so?

@bgilbert
Contributor Author

bgilbert commented Sep 7, 2021

It'd be good to avoid adding obscure flags to things if possible. I'm hopeful that #946 (comment) will solve the problem going forward. But it's possible that once we start bumping bootimages in existing OCP clusters, we'll uncover enough badly-behaved firmware that we'll need a workaround for existing deployments. We could probably have a limited-purpose boot variable scrubber that runs automatically as part of install, inside coreos-installer or as a separate unit.
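
As a rough illustration of what such a limited-purpose scrubber might decide, here is a sketch that only builds the `efibootmgr` command lines and does not run them. The `BootEntry` type, the `scrub_commands` helper, and the assumption that the entries to clean up are labelled "Fedora" are hypothetical; only the `--bootnum` and `--delete-bootnum` options are real `efibootmgr` flags.

```python
from typing import NamedTuple


class BootEntry(NamedTuple):
    num: str       # four hex digits, e.g. "0000"
    label: str     # description stored in the boot entry
    esp_guid: str  # partition GUID encoded in the entry's device path


def scrub_commands(entries, current_esp_guid, label="Fedora"):
    """Build argv lists that would delete entries carrying our label
    but pointing at an ESP other than the one just installed.

    Entries with a different label are never touched, so boot options
    belonging to other operating systems or the firmware are left alone.
    """
    current = current_esp_guid.lower()
    return [
        ["efibootmgr", "--bootnum", e.num, "--delete-bootnum"]
        for e in entries
        if e.label == label and e.esp_guid.lower() != current
    ]


if __name__ == "__main__":
    entries = [
        BootEntry("0000", "Fedora", "11111111-2222-3333-4444-555555555555"),
        BootEntry("0001", "Fedora", "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"),
        BootEntry("0002", "Other OS", "99999999-2222-3333-4444-555555555555"),
    ]
    for argv in scrub_commands(entries, "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"):
        print(" ".join(argv))
```

Returning the commands rather than executing them keeps the decision logic testable without touching NVRAM. A real implementation would also have to decide how to handle firmware that rejects variable deletion.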

@bgilbert
Contributor Author

bgilbert commented Sep 8, 2021

Looking through the shim code, I don't think we want to drop the CSV file; I think we want to drop fallback.efi altogether and have shim chain directly into GRUB. shim explicitly supports doing this when fallback.efi isn't available.

RFC cosa PR at coreos/coreos-assembler#2421, though probably bootupd should handle it instead.

@dustymabe
Member

I haven't reviewed the PR, but reading this I thought of one thing that could be a showstopper:

> Looking through the shim code, I don't think we want to drop the CSV file; I think we want to drop fallback.efi altogether and have shim chain directly into GRUB.

I recall pjones saying during our meeting a few weeks back that the fallback.efi functionality was going to be merged into something else soon, so fallback.efi wouldn't exist (i.e. it wouldn't be easy to just drop it). Do you remember that? If so, was that considered here?

@bgilbert
Contributor Author

bgilbert commented Sep 9, 2021

Ah, indeed:

16:44:04 <pjones> is it possible to just not include /boot/efi/EFI/fedora/*.CSV in your images?
16:44:18 <pjones> that's where fallback is getting the information from
16:44:25 <bgilbert> pjones: would that disable the behavior?  I actually haven't looked
16:44:48 <pjones> it should, yes.  removing fallback.efi would also do it currently, but that particular file will go away in the future.

If we were only working around this in cosa, I think it'd be okay to just add a test and then fix up cosa once shim is refactored. But I've been thinking we should handle this in bootupd instead, and I'm more hesitant to make code changes there if we expect the invariants to change later.

@bgilbert
Contributor Author

I chatted with pjones, and his advice is to delete both fallback.efi and the CSV, bearing in mind that fallback.efi may not be present.
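
A minimal sketch of that advice, assuming the ESP is mounted at some root directory: delete the fallback binary and the CSV files, and treat a missing fallback binary as normal. The `remove_fallback_files` helper is hypothetical, and the `EFI/BOOT/fb*.efi` pattern is an assumption about where the fallback binary lives in a given image; the `EFI/fedora/*.CSV` pattern follows the path quoted earlier in this thread.

```python
from pathlib import Path


def remove_fallback_files(
    esp_root,
    patterns=("EFI/BOOT/fb*.efi", "EFI/fedora/*.CSV"),
):
    """Delete shim fallback files under esp_root.

    Patterns that match nothing are simply skipped, so an image that
    no longer ships a separate fallback binary is handled the same way.
    Returns the removed paths relative to esp_root.
    """
    root = Path(esp_root)
    removed = []
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            path.unlink()
            removed.append(path.relative_to(root).as_posix())
    return removed


if __name__ == "__main__":
    # /boot/efi is the ESP mount point used elsewhere in this thread.
    print(remove_fallback_files("/boot/efi"))
```

Because the function is driven by glob patterns, running it twice is harmless: the second call finds nothing and returns an empty list.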

@dustymabe
Member

Where does this stand? From coreos/coreos-assembler#2421 (comment) it looks like half of it is done but we still need to do some work in bootupd?

@bgilbert
Contributor Author

bgilbert commented Nov 9, 2021

Yup, that's correct. We've removed fallback.efi from the ISO but not yet from the regular disk image.

@dustymabe
Member

Is there a tracker issue for the bootupd work (or more importantly, is the work on anyone's radar)?

@bgilbert
Contributor Author

I'd say this is the tracker issue. It's on my plate.

aleskandro added a commit to aleskandro/openshift-release that referenced this issue Feb 23, 2023
Some servers' firmware push any new detected boot options to the tail of the boot order.
When other boot options are present and bootable, such a server will boot from them instead of the new one.
As a (temporary?) workaround, we manually add the boot option.
NOTE: it's assumed that old OSes boot options are removed from the boot options list during the wipe operations.
 xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805
        coreos/fedora-coreos-tracker#946
        coreos/fedora-coreos-tracker#947
openshift-merge-robot pushed a commit to openshift/release that referenced this issue Feb 23, 2023
* Support Dell IPMI power commands

On Dell servers, `ipmi power (off|on|reset)` returns errors when the host is in a state that doesn't allow the requested transition. Enforcing two commands (on + off) instead of reset, and ignoring any power off errors to ignore those validation errors.

* Set the efi boot order after installing RHCOS in UPI/UEFI/PXE scenarios

Some servers' firmware push any new detected boot options to the tail of the boot order.
When other boot options are present and bootable, such a server will boot from them instead of the new one.
As a (temporary?) workaround, we manually add the boot option.
NOTE: it's assumed that old OSes boot options are removed from the boot options list during the wipe operations.
 xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805
        coreos/fedora-coreos-tracker#946
        coreos/fedora-coreos-tracker#947
@bgilbert bgilbert removed their assignment May 7, 2023
@bgilbert
Contributor Author

bgilbert commented Jun 6, 2023

This isn't actively being worked on. Some old draft pieces of the work:

I'll close those PRs for now, but they're there for future reference if anyone picks this back up.
