
Improve NFS home directory restore documentation #5291

Open · wants to merge 10 commits into base: main
Conversation

@sunu (Contributor) commented Dec 17, 2024

Also adds the option to mount an EBS volume to a pod through the deployer exec root-homes command.

Follow-up to #5286. Related to #5061.

@yuvipanda @sgibson91 - This is still a draft and I haven't really tested it yet, but does the overall approach look ok?


@@ -25,18 +25,24 @@ To restore a home directory from a snapshot, we need to create a new EBS volume
Please follow AWS's guidance for [restoring EBS volumes from a snapshot](https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html#restore-files) to create a new EBS volume from the snapshot.
```

Once we have created a new EBS volume from the snapshot, we need to mount it to the EC2 instance attached to the existing EBS volume containing the NFS home directories. To do this, we need to find the instance ID of the EC2 instance attached to the existing EBS volume. This involves the following steps:
Once we have created a new EBS volume from the snapshot, we can use the `deployer exec` command to mount the new EBS volume to a pod along with the existing NFS home directories volume.
Member
Suggested change
Once we have created a new EBS volume from the snapshot, we can use the `deployer exec` command to mount the new EBS volume to a pod along with the existing NFS home directories volume.
Once we have created a new EBS volume from the snapshot, we can use the `deployer exec root-homes` command to mount the new EBS volume to a pod along with the existing NFS home directories volume.

@sgibson91 sgibson91 added the deployer:skip-deploy Skips deployment of anything (support, staging, prod) label Dec 18, 2024
@sunu sunu marked this pull request as ready for review December 19, 2024 06:24
@sunu sunu requested review from sgibson91 and yuvipanda December 19, 2024 06:25
@sunu (Contributor Author) commented Dec 19, 2024

@sgibson91 I did another test run to verify the "wait for the pod to be ready" command that we've added. Everything worked as expected. I can merge this once you give your approval.

@sgibson91 (Member): I've pinged Yuvi on this too, because I want to make sure the concerns that caused him to reopen #5061 are addressed.

@yuvipanda (Member) left a comment:

This is great! Some minor suggested changes, but otherwise this looks pretty close.

While not necessarily done as part of this PR, I do think it's important for us to fully end to end test the restore process at least once before calling it done (see pt 7 of https://mastodon.social/@nixCraft/113671487670162948). So let's make sure that at least one person tries all these steps top to bottom once, and verifies the completeness of the whole process.

Exciting to have this be so close to done!

@@ -37,6 +37,15 @@ def root_homes(
extra_nfs_mount_path: str = typer.Option(
None, help="Mount point for the extra NFS share"
),
restore_volume_id: str = typer.Option(
Member: Let's call this additional_ebs_volume_id or something like that, as the code itself doesn't do anything related to restoring - it just mounts additional EBS volumes.

Contributor Author (sunu): Should we use additional_volume_id to keep it platform-agnostic? I can add in the help text that only EBS volumes are supported for now.

Member: That sounds reasonable to me for now. Of course, we can always update things as we begin work on other platforms.

Member: Ya, I leave it to you both for now!

help="ID of volume to mount (e.g. vol-1234567890abcdef0) "
"to restore data from",
),
restore_mount_path: str = typer.Option(
Member: See the comment on restore_volume_id about using 'restore' in the name. I'd say this could be additional_ebs_volume_mount_path?
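For illustration, here is a minimal sketch of how the two renamed options might read under the naming discussed above; the exact names, default mount path, and help text are assumptions for this sketch, not the final implementation:

```python
import typer

app = typer.Typer()


@app.command()
def root_homes(
    # the deployer's existing cluster/hub options are elided in this sketch
    additional_volume_id: str = typer.Option(
        None,
        help="ID of an additional volume to mount (e.g. vol-1234567890abcdef0). "
        "Only EBS volumes are supported for now.",
    ),
    additional_volume_mount_path: str = typer.Option(
        "/additional-volume",
        help="Mount point for the additional volume",
    ),
):
    ...
```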

"kind": "PersistentVolume",
"metadata": {"name": pv_name},
"spec": {
"capacity": {"storage": restore_volume_size},
Member: Is this necessary for us to specify? I worry that if we get it wrong, it will trigger the Kubernetes volume-resizing feature.

@sunu (Contributor Author) commented Dec 20, 2024:
I suspect storage capacity is a required field; I'll try to confirm whether it's required or not.

Reading this warning note, it looks like volume expansion is only triggered when the storage capacity is updated on an existing PVC that is already bound to a PV, creating a mismatch between the capacity specified on the PVC and that of the bound PV.

I will confirm what happens when the size specified on the PV and the PVC is larger than the actual size of the disk and ALLOWVOLUMEEXPANSION is enabled on the StorageClass.

Contributor Author (sunu): I did some testing, and storage capacity is indeed a required field.

I also tried passing a size larger than the actual disk size to see if that triggers a resize. The PV and PVC were labelled with the wrong capacity, but the underlying disk was not resized.

@yuvipanda How would you like to move forward?

Member: @sunu ah, if it doesn't trigger a resize, I would say:

  1. Hard-code an intentionally wrong number here (df and the other commands we use don't rely on it; they use lower-level Linux filesystem calls instead).
  2. Write an inline comment explaining what is going on and that it doesn't trigger a resize.

This would let us avoid requiring the engineer doing the restore to know the actual filesystem size (see the sketch below).
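For illustration, a sketch of what that could look like for the PV (the PVC's resources.requests would carry the same placeholder); the placeholder value, the name, and the comment wording are assumptions, not the actual deployer code:

```python
pv_name = "restore-pv"  # hypothetical name, used only for this sketch

pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": pv_name},
    "spec": {
        # Intentionally hard-coded placeholder: df and the other commands used
        # during a restore read sizes from the filesystem itself, and the
        # testing described above showed that a mismatch here does not resize
        # the underlying EBS volume, so the engineer doing the restore doesn't
        # need to know the volume's real size.
        "capacity": {"storage": "16384Gi"},
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "",
        # the volume source (EBS CSI driver configuration) is elided in this sketch
    },
}
```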

"accessModes": ["ReadWriteOnce"],
"volumeName": pv_name,
"storageClassName": "",
"resources": {"requests": {"storage": restore_volume_size}},
Member: See the comment above about specifying sizes.

restore_mount_path: str = typer.Option(
"/restore-volume", help="Mount point for the volume"
),
restore_volume_size: str = typer.Option("100Gi", help="Size of the volume"),
Member: See the comment later about specifying the size.

@@ -140,6 +200,17 @@ def root_homes(
tmpf.flush()

with cluster.auth():
# If we have a volume to restore from, create the PV and PVC first
if restore_volume_id:
with tempfile.NamedTemporaryFile(mode="w", suffix=".json") as pv_tmpf:
Member: This is being created outside the try whose finally block contains the cleanup code, so errors here could leave the PV not cleaned up. I think this should all be moved inside the same try whose finally has the cleanup code.

Contributor Author (sunu):
Ah, good catch, thank you!
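For illustration, a rough sketch of the restructuring being suggested, with resource creation inside the same try whose finally performs the cleanup; the function, manifest handling, and kubectl calls are assumptions chosen to show the pattern, not the actual deployer code:

```python
import subprocess


def apply_and_cleanup(manifest_paths):
    """Apply manifests (PV, PVC, pod, ...) and always delete whatever was created."""
    created = []  # track successful creations so cleanup only targets those
    try:
        for path in manifest_paths:
            subprocess.check_call(["kubectl", "apply", "-f", path])
            created.append(path)
        # ...wait for the pod, exec into it, run the restore, etc.
    finally:
        # Runs even if creation or the restore fails partway through.
        for path in reversed(created):
            subprocess.check_call(["kubectl", "delete", "-f", path])
```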

@sgibson91 (Member) commented Dec 20, 2024:

> While not necessarily done as part of this PR, I do think it's important for us to fully end to end test the restore process at least once before calling it done (see pt 7 of https://mastodon.social/@nixCraft/113671487670162948). So let's make sure that at least one person tries all these steps top to bottom once, and verifies the completeness of the whole process.

So I actually did do this in a meeting with Tarashish the other day. I created a volume from a snapshot and mounted it to a pod using the deployer exec root-homes command. We verified that both volumes were there and of similar disk usage (they were identical, I think because the snapshot was less than a day old), and that the username subdirs were the same. The only thing I didn't do was the rsync process itself.

We definitely need to spread all of this knowledge out beyond just me and Tarashish though.

@yuvipanda (Member):
That's excellent, @sgibson91!

I think once #5291 (comment) is addressed, this can be merged.
