Stale UEFI boot entry left behind after reprovisioning #946
Comments
We discussed this at the community meeting today.
I assume we need to make sure this applies to both the platform artifact images and the live ISO/PXE images (anywhere EFI is used to boot)?
I'm not sure yet about EFI PXE boot, but it does apply to the live ISO. However, dropping the CSV file from the ISO's EFI image causes the machine to bootloop.
And yes, #563 looks like shim initializing the boot variable and rebooting. It appears that shim < 15.3 (in some circumstances?) then reboots rather than falling through to GRUB, while ≥ 15.3 offers a menu with the choice to keep going.
@vathpela It looks like when the CSV file is removed, shim 15.4 boot-loops in the Boot Option Restoration screen.
See answer in openshift/assisted-installer#229 (comment)
We may need to offer something like
It'd be good to avoid adding obscure flags to things if possible. I'm hopeful that #946 (comment) will solve the problem going forward. But it's possible that once we start bumping bootimages in existing OCP clusters, we'll uncover enough badly-behaved firmware that we'll need a workaround for existing deployments. We could probably have a limited-purpose boot variable scrubber that runs automatically as part of install, inside coreos-installer or as a separate unit.
Looking through the shim code, I don't think we want to drop the CSV file; I think we want to drop RFC cosa PR at coreos/coreos-assembler#2421, though probably bootupd should handle it instead.
I haven't reviewed the PR, but reading this I thought of one thing that could be a show stopper for this
I recall pjones saying during our meeting a few weeks back that
Ah, indeed:
If we were only working around this in cosa, I think it'd be okay to just add a test and then fix up cosa once shim is refactored. But I've been thinking we should handle this in bootupd instead, and I'm more hesitant to make code changes there if we expect the invariants to change later.
I chatted with pjones, and his advice is to delete both
Where does this stand? From coreos/coreos-assembler#2421 (comment) it looks like half of it is done but we still need to do some work in bootupd?
Yup, that's correct. We've removed
Is there a tracker issue for the bootupd work (or more importantly, is the work on anyone's radar)?
I'd say this is the tracker issue. It's on my plate.
Some servers' firmware pushes any newly detected boot option to the tail of the boot order. When other boot options are present and bootable, such a server will boot from them instead of the new one. As a (temporary?) workaround, we manually add the boot option. NOTE: it's assumed that old OSes' boot options are removed from the boot options list during the wipe operations. xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805 coreos/fedora-coreos-tracker#946 coreos/fedora-coreos-tracker#947
* Support Dell IPMI power commands

  On Dell servers, `ipmi power (off|on|reset)` returns errors when the host is in a state that doesn't allow the requested transition. Use two commands (on + off) instead of reset, and ignore any power-off errors to work around those validation errors.

* Set the efi boot order after installing RHCOS in UPI/UEFI/PXE scenarios

  Some servers' firmware pushes any newly detected boot option to the tail of the boot order. When other boot options are present and bootable, such a server will boot from them instead of the new one. As a (temporary?) workaround, we manually add the boot option. NOTE: it's assumed that old OSes' boot options are removed from the boot options list during the wipe operations.

  xrefs: https://bugzilla.redhat.com/show_bug.cgi?id=1997805 coreos/fedora-coreos-tracker#946 coreos/fedora-coreos-tracker#947
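The boot-order workaround described in the commit message above amounts to moving the freshly created entry to the head of `BootOrder`. A minimal Python sketch of that reordering follows, assuming entries are identified by their four-digit hex numbers as printed by `efibootmgr`. The helper names `promote_boot_entry` and `bootorder_command` are hypothetical and do not come from the referenced commits; the sketch only builds the command line and does not run it.

```python
def promote_boot_entry(boot_order, entry):
    """Return a new boot order with `entry` first.

    `boot_order` is a list of four-digit hex entry numbers (e.g. "0001").
    The relative order of all other entries is preserved, and the entry
    is added even if it was not in the list before.
    """
    wanted = entry.upper()
    rest = [e for e in boot_order if e.upper() != wanted]
    return [wanted] + rest


def bootorder_command(boot_order):
    """Build (but do not run) an efibootmgr invocation that sets BootOrder.

    Hypothetical helper: assumes efibootmgr is installed and the caller
    runs the resulting command with sufficient privileges.
    """
    return ["efibootmgr", "--bootorder", ",".join(boot_order)]


if __name__ == "__main__":
    new_order = promote_boot_entry(["0000", "0003", "0001"], "0001")
    print(" ".join(bootorder_command(new_order)))
```

Keeping the reordering separate from the command construction makes the logic easy to check without touching real firmware variables.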
This isn't actively being worked on. Some old draft pieces of the work:
I'll close those PRs for now, but they're there for future reference if anyone picks this back up.
On UEFI systems, shim is invoked as the fallback bootloader on first boot and creates a UEFI boot entry. That boot entry encodes the partition GUID of the ESP, so each reprovision of the machine will create a new boot entry and leave the stale one in place. In principle this might eventually fill up flash.
On first boot, we currently randomize the disk's GUID and the filesystem UUIDs of `/` and `/boot`, but we do not randomize partition GUIDs. (And we can't randomize the ESP partition GUID then, since it'll break the boot entry just created by shim.) cosa's `create_disk.sh` does randomize partition GUIDs, however, so each OS release will have different ones. That seems like an awkward middle ground.

Perhaps we should hardcode the partition GUID for the ESP, allowing the existing boot entry to be reused after reprovisioning. That would undoubtedly cause trouble if two disks have FCOS installed, but there are already other reasons the OS won't boot successfully in that case. It would also double down on violating the GPT spec, which wants GUIDs to be globally unique.
xref https://bugzilla.redhat.com/show_bug.cgi?id=1997805
xref https://bugzilla.redhat.com/show_bug.cgi?id=1977983
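
To illustrate the problem, here is a rough Python sketch of how stale entries could be spotted: it scans text in the format printed by `efibootmgr -v` and reports hard-disk entries whose partition GUID differs from the current ESP's. The function name `find_stale_entries` is hypothetical, and the approach assumes the caller already knows the current ESP partition GUID. It is not how shim, bootupd, or coreos-installer handle this.

```python
import re

# Matches hard-disk boot entries in `efibootmgr -v` style output, e.g.:
#   Boot0001* Fedora<TAB>HD(2,GPT,<partition-guid>,0x1000,0x3f800)/File(...)
# Entries without an HD(...,GPT,...) device path (PXE, firmware apps) are skipped.
ENTRY_RE = re.compile(
    r"^Boot(?P<num>[0-9A-Fa-f]{4})\*?\s+(?P<label>.*?)\s+"
    r"HD\(\d+,GPT,(?P<guid>[0-9A-Fa-f-]{36}),"
)


def find_stale_entries(efibootmgr_output, current_esp_guid):
    """Return (entry_number, label) pairs whose partition GUID is not the ESP's.

    Assumption: every hard-disk entry that does not point at
    `current_esp_guid` is a leftover. On a machine with several disks or
    operating systems, a real tool would also need to filter by label.
    """
    stale = []
    for line in efibootmgr_output.splitlines():
        m = ENTRY_RE.match(line)
        if m and m.group("guid").lower() != current_esp_guid.lower():
            stale.append((m.group("num"), m.group("label")))
    return stale
```

Because the function only parses text, it can be tried against captured `efibootmgr -v` output from an affected machine without modifying any boot variables.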