LV node migration #314
Comments
Thanks for detailing this very clearly.
I did consider it, yes. What wasn't clear to me there: do I get the guarantee that pods will only be placed on nodes with existing replicas, or will replicas automatically be moved to the pod's node? So, can I be sure that IO is always local? Because in some situations it would be worse to have an application member with significantly worse IO latency in the cluster than to just not have the member available.
Also: I assume Mayastor guarantees consistency between replicas, which forces some write overhead, because every write must be replicated to at least one other replica. I'm not sure if async/eventually consistent replication is supported.
So my priority for these deployments is good write latency, which makes synchronously replicated storage basically a no-go. Async replication would be viable to speed up application recovery, as it may not need to start its recovery from scratch. I see these types of applications:

Applications of type 1 would need a full copy of the old volume to restore a member. In case of a node failure that would not be possible, so these would need synchronous replication anyway, but I don't know of an example of such an application and I'm not convinced one exists.

Applications of type 2 would need some parts of the state to be able to restore the rest; I guess this will usually be some form of cluster membership/peer information, similar to what qdrant does. These applications need some form of replication to recover from a node failure, but since these parts of the state don't see as much IO it might be fine to use async replication. I could also imagine splitting the volume into a part that uses sync replication with Mayastor and a part that uses localpv for the best latency (see the sketch below). That would effectively turn part of the application into type 3 and part of it into type 1.

Applications of type 3 can restore cluster membership without any state, so these would be fine with simply deleting the state entirely. They might benefit from async replication to be able to start migrated pods quicker, but they don't need it. Imagine a MinIO node with terabytes of data.

So considering node failure, only type 3 applications would be able to work without some form of replication, and these are also the applications that are fine with just deleting the PV. With this in mind, this feature request could be reduced to a simple option to disable the LV <--> node pinning that's currently happening. Mayastor or some other replicated storage system would be required for the other application types anyway.
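To make the split-volume idea for type 2 a bit more concrete, a minimal sketch could look like the two claims below, both mounted by the same pod. The claim names and the `mayastor-replicated` class name are made up for illustration; `openebs-lvmpv` stands in for the usual lvm-localpv class.

```yaml
# Hypothetical split: a small synchronously replicated volume for the state
# needed to rejoin the cluster, and a large local LV for the bulk data the
# application can rebuild from its peers.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-membership        # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: mayastor-replicated   # assumed Mayastor class with repl >= 2
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: openebs-lvmpv          # local LV for low-latency IO
  resources:
    requests:
      storage: 500Gi
```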
Yes, Mayastor, being a block storage solution, needs to maintain strict consistency between all replicas of a volume, which calls for synchronous replication.
Related issue: openebs/dynamic-localpv-provisioner#87
When I was reading through the initial requirements, I wondered whether Mayastor wouldn't be able to do the trick. Let's say a pod is running on node 1 with a locally attached PV and you want to transition to node 2, where no PV is currently present locally. With Mayastor you should be able to start the pod on node 2 with an NVMe/TCP connection back to node 1, add a replica on node 2, and after it is in sync with the replica on node 1, retire the replica on node 1 so the pod accesses the data on node 2 locally again. I think we can add/remove replicas to a Mayastor PV at any time, so this should give you the mobility you are looking for.
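For reference, if I understand the kubectl-mayastor plugin correctly, the replica count of an existing volume can be changed at runtime; a rough sequence might look like the following (the volume UUID is a placeholder, and I'm assuming the `scale volume` subcommand of recent plugin versions):

```bash
# Find the UUID of the volume backing the PV that should be moved
kubectl mayastor get volumes

# Add a replica so the data is rebuilt on a second node
kubectl mayastor scale volume <volume-uuid> 2

# Once the new replica has finished rebuilding, scale back down to 1;
# note that the control plane picks which replica is retired, so making
# sure the surviving replica sits on the desired node may need extra care
kubectl mayastor scale volume <volume-uuid> 1
```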
I can confirm this works: using Mayastor to replicate/migrate data within a cluster and remove disks from a pool without any downtime. I decommissioned the node by first increasing the replication factor of the PVs to 2, forcing another node to replicate the data, and finally draining the original node. Unfortunately "deleting" the
So running Mayastor with replication factor 1 guarantees local/non-network IO?
There are a few things one needs to be aware of:
Thanks for the detailed explanation. In that case Mayastor wouldn't serve my use case.
May I ask the reason?
@avishnu as I wrote in the first section of the issue description: I'd like to avoid network IO for applications that require high IO performance and replicate state themselves. I would also prefer a node of e.g. a database cluster not to run at all rather than having it run with bad performance. So all the features that probably make Mayastor great as a replicated block store don't help here.
If the application is scheduled on storage nodes and the volume has a replica count of 1, then the whole volume stack is placed on the same node. We also have UBLK on the roadmap, which would allow us to connect to the volume in a more efficient way without requiring a TCP connection, in fact without any network at all.
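For illustration, a single-replica setup is mainly a matter of setting `repl: "1"` in the StorageClass; a minimal sketch (the class name is arbitrary) could be:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-single-replica
parameters:
  repl: "1"        # one replica, so the data lives on a single node
  protocol: nvmf   # the volume is still exposed over NVMe-oF to the application node
provisioner: io.openebs.csi-mayastor
```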
Awesome, thanks for the feedback @cbcoutinho!
Describe the problem/challenge you have
I'm hosting various clustered and stateful applications in kubernetes. Some of these applications, like databases and message queues, require low-latency IO to perform well; that's why I use local PVs for them, which works great. This way I can put very fast SSDs into these servers and use them without network overhead.
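For context, the setup I'm referring to is roughly the standard lvm-localpv StorageClass; the volume group name is just an example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv
parameters:
  storage: "lvm"
  volgroup: "lvmvg"        # volume group on the fast local SSDs
provisioner: local.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
```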
My only pain-point with this setup is (unsurprisingly): the pods, once scheduled, are pinned to their node forever. The only way to move a pod is to delete both the PVC and the pod and hope that the scheduler doesn't decide to put it back onto the same node (sure, this can be helped with node selectors, affinities, anti-affinities and taints, but that's even more complexity). An additional problem, possibly more serious depending on the application, is that node failures can't be recovered from automatically. Even if the application is able to restore its state from the remaining peers in its cluster, kubernetes won't start the pod because it's pinned to a node that's unavailable.
Describe the solution you'd like
Currently, at least that's my current understanding, when kubernetes schedules the pod it works like this (simplified):

- either the PVC uses late binding (WaitForFirstConsumer) and the LV is created after the scheduler has picked a node for the pod,
- or the LV is created eagerly (Immediate binding) and the pod is then scheduled onto the node that holds it.

The former means that lvm-localpv will create an LV on the node that's selected for the pod; the latter means k8s places the pod on the single node that carries the LV that has been eagerly created. Either way, it ends with a pod pinned to a node.
What I would love to see is to make an LV available to all nodes in the cluster independent of where it is physically placed. If the LV is already allocated on a node and kubernetes happens to pick a different node, then just create a new LV on the new node, transfer the LV content over the network and delete the old LV. If the LV does not exist already, then it can simply be created on the node that was picked.
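Just to illustrate what such a transfer would amount to, a manual equivalent could look roughly like this (volume group, LV name, size and the node hostname are placeholders; nothing in the driver does this today):

```bash
# On the target node: create an LV of the same size as the source
lvcreate -L 50G -n pvc-xyz lvmvg

# On the source node: stream the LV contents to the target node
dd if=/dev/lvmvg/pvc-xyz bs=4M status=progress \
  | ssh node2 'dd of=/dev/lvmvg/pvc-xyz bs=4M'

# On the source node, after verifying the copy: remove the old LV
lvremove -y /dev/lvmvg/pvc-xyz
```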
That would obviously delay pod startup significantly depending on the size of the volume, and it might require a dedicated high-bandwidth network for the transfer so as not to interrupt other communication in the kubernetes cluster, but for application clusters that are highly redundant and can cover for a failed replica for a prolonged period, this could be perfectly fine.
And actually this could go one step further: Assuming that the application can restore its state from peers in its cluster, a feasible LV migration strategy would be to create a new empty LV without transferring data and let the application do the "transfer".
I could imagine this as a StorageClass option like `dataMigrationMode` with the following values (a sketch follows after the list):

- `Disabled` (default): current behavior, pin the application to the node with the LV
- `Application`: just delete the LV on the old node, create a new one on the new node, and let the application handle the migration
- `VolumeTransfer`: create a new LV and transfer the data to it before mounting it
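As a sketch, the option could sit next to the existing parameters in the StorageClass; `dataMigrationMode` is of course hypothetical and not something lvm-localpv supports today:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv-migratable
parameters:
  storage: "lvm"
  volgroup: "lvmvg"
  dataMigrationMode: "Application"   # hypothetical: Disabled | Application | VolumeTransfer
provisioner: local.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
```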
Anything else you would like to add:
While the VolumeTransfer option would be awesome, it is also probably quite involved, so being able to just get a new LV on a new node would probably be easier. I guess this also requires applications to be well behaved and deployments to be well configured so that a rolling upgrade doesn't accidentally delete all the data.