-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC/RFC: Support reinstalls via kexec
#712
Conversation
Interesting.. Do we have any concerns here about preserving data for filesystems that we want to re-use? i.e. is the kexec safe enough for mounted filesystems that we want to preserve? |
If the filesystems to preserve were created by Ignition and part of the same Ignition passed back to |
We should probably unmount as much as possible and remount everything else RO before doing the kexec. |
Maybe creating a special systemd target with a similar behavior as shutdown could help here. |
We use Edit: I think my demo output above was a bit misleading because I didn't clarify that I've edited out a lot of stuff. I've attached the full logs now; you can see there systemd shutting down before we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow cool stuff, I should clearly be watching PRs to coreos-installer more.
I think the code here would be a lot smaller if we only supported stream metadata.
I'd also say I'd be a lot more confident in offering this when it has an e2e test.
arch: &str, | ||
retries: FetchRetries, | ||
) -> Result<(Url, Url)> { | ||
let release_meta_url = format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this covered by https://github.com/coreos/stream-metadata-rust ?
Oh, I see the comment above.
But...hum, shouldn't we mainly support re-installing from a well known stream (i.e. usually stable
)?
The use case of reinstalling from a specific release seems like it can be covered by having the user manually provide a URL to a build.
That would allow them to use a specific release from the FCOS prod builds (until we GC them) and would allow them to use mirrored builds more easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stream-metadata-go already has structs for release metadata, with a comment warning that they shouldn't be used. We could do that in -rust also.
But yes, I agree. I understand the desire to reinstall the release that's currently running, but it's also odd not to follow our own recommendation to always run the latest release. @cgwalters' suggestion seems like a good compromise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned this when I demo'ed it but not here: the idea with using the same release was that I had a faint hope we could reuse all the artifacts we have on hand to make it a fully local experience. Obviously the rootfs squashfs would've taken some creativity. :)
Ok(f) | ||
} | ||
|
||
// XXX: copied from Zincati; need https://github.com/coreos/rpm-ostree/issues/2389 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah...I guess what we should really do is take the hit of splitting the bindings out into a separate git repo, publish as a crate, but have rpm-ostree pull it in from git main.
That would just make it hard to add a new required field, but we can live with that I think.
)?; | ||
|
||
eprintln!("Pivoting"); | ||
runcmd!("systemctl", "kexec")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kexec is great but I think can sometimes have bugs, so it might be useful to think about a mode that instead writes the kernel/initrd to e.g. /boot/coreos-installer-reprovision/{kernel,initramfs.img}
and updates the bootloader entries to use them manually. Then we can do the reinstall from the initramfs on the next boot.
The hit is a double reboot, but it avoids kexec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, possible. One thing worth mentioning though is that this strategy could be useful not just for reinstalling CoreOS from CoreOS, but any situation in which you're running on top of the same block device you'd like to install to. So I could imagine this being used in the container workflow too. In which case, not having to fiddle with boot configuration is a plus because you don't know what the setup is (and of course reinstall
doesn't really make sense in that context; maybe would belong better under install
with a switch?).
Can you explain how it works from a high level perspective? I.e. what will the result be after the reinstall: Will the root partition be completely reset as if CoreOS was installed to a blank drive but additional filesystems (like a separate |
Yup exactly. The idea is to be equivalent to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This looks promising.
@@ -410,6 +410,50 @@ pub fn download_to_tempfile(url: &Url, retries: FetchRetries) -> Result<File> { | |||
Ok(f) | |||
} | |||
|
|||
// XXX: try to dedupe with write_image? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd favor just using write_image
, or wrapping it if we want a convenience function. It looks like this implements a subset of what write_image
supports?
let mut sig_url = url.clone(); | ||
sig_url.set_path(&format!("{}.sig", url.path())); | ||
download_and_verify_to_tempfile(url, &sig_url, retries) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to let UrlLocation
handle the signature fetching, and unify these paths?
arch: &str, | ||
retries: FetchRetries, | ||
) -> Result<(Url, Url)> { | ||
let release_meta_url = format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stream-metadata-go already has structs for release metadata, with a comment warning that they shouldn't be used. We could do that in -rust also.
But yes, I agree. I understand the desire to reinstall the release that's currently running, but it's also odd not to follow our own recommendation to always run the latest release. @cgwalters' suggestion seems like a good compromise.
} | ||
|
||
// Hackily uses https://releases-art-rhcos.svc.ci.openshift.org; this is probably not kosher. | ||
fn get_rhcos_live_urls(version: &str, arch: &str, retries: FetchRetries) -> Result<(Url, Url)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far we've avoided hardcoding distro-specific logic, so this is unfortunate. The bootimage semantics of RHCOS may make it difficult to avoid, though.
Ok((kver_dir.join("vmlinuz"), kver_dir.join("initramfs.img"))) | ||
} | ||
|
||
fn concat_initrds(bottom_initrd: &mut File, top_initrd: &mut File) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're in the process of disrecommending concatenation; xref coreos/fedora-coreos-tracker#1055 (comment). We could teach the initrd to extract the rootfs from a file on a local partition, but that does seem like a lot of extra work.
initrd | ||
}; | ||
|
||
// build kargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you said in #712 (comment), let's try to avoid this. I'd rather see us pass an Ignition config with an embedded installer config to the live boot via an appended initrd.
kargs.push_str(" coreos.inst.insecure"); | ||
} | ||
if config.skip_reboot { | ||
kargs.push_str(" coreos.inst.skip_reboot"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could handle this by having the live boot Ignition config create a unit that removes /run/coreos-installer-reboot
.
let mut v = VerifyReader::new(&mut bf, Some(sig.as_slice()), VerifyKeys::Production) | ||
.with_context(|| format!("creating verifier for {}", path))?; | ||
copy(&mut v, &mut std::io::sink()).with_context(|| format!("reading {}", path))?; | ||
v.verify_without_logging_failure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is intended for try_
paths where we fall back to another code path on signature mismatch. Since we do actually fail here, we should use the logging variant.
config.fetch_retries, | ||
)?, | ||
"rhcos" => get_rhcos_live_urls( | ||
&booted_deployment.version, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To preserve bootimage semantics, shouldn't we be installing the aleph-version and not the current version?
I think before picking this up again, we should determine what kind of factory reset we want and what constraints we want to be able operate within. Then we can see how to adapt this best given those constraints. |
Re-opened in #791. |
Edit: moved to #791.
This is a proof-of-concept showing how we can use
kexec
to support reinstalling assuming we have access to the rootfs image (this is strongly related to or exactly the same as coreos/fedora-coreos-tracker#399 depending on which flavour of "reset" we're talking about).Demo:
(I've cut out a lot of stuff from the logs there to concentrate on the interesting bits; see this attachment for the full output if you're curious.)
It also supports specifying the target Ignition config, image URL, using a local live initramfs and live rootfs, etc...
FCOS is also supported though for there seems to be a bug in the kexec code there.