appsconfig.proto: Add edge node cluster defines #46
Conversation
It would make sense to have a separate document focusing on the cluster configuration and the API and publish that on the LF-Edge EVE wiki.
I'll hold off creating the generated files until we have agreement on the modified files.
This one was posted last week on the LF-Edge wiki: https://wiki.lfedge.org/display/EVE/Edge-Node+Clustering+Design
@famleebob I think we discussed not putting internal docs in the public domain.
(force-pushed cb75589 to b104f42)
@zedi-pramodh I updated the comments to remove links to internal docs. Erik is right, though, that we need a public design doc on the LF-Edge wiki or in this repo, so the wiki link is a good one.
Yes we did, my bad. I just wanted to tie the work done to a specific description of the API.
I took the liberty of adding that link to the PR description.
proto/config/appconfig.proto
Outdated
// This edge-node UUID for the Designate Node for the Application
string designated_node_id = 24;
Still not clear we need this in the API as opposed to the cluster selecting one control-plane node to do the download and volume creation. If we do it inside the cluster it means that the cluster can re-select the node should the originally selected one fail. Doing it in the API means that the controller (or the user??) now has to react to the failed node and based on that change the configuration it sends to the nodes.
From what I understand of this Longhorn persistent-volume provisioning, this will be used at initial setup to decide who should perform this task. Even if this node later fails, this does not change.
Another way to do this is to allow the cluster to decide, but in general we would need to send down a set of nodes for the replication (even though initially we only have 3 nodes), and each volume instance would need some election going on among the nodes. That can also be complex.
Sorry, I thought this part was for the volume instance.
For the app instance here: we may later want to allow 'manual' provisioning from the user, so this will use the same scheme, letting the API indicate which node should run the app.
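For illustration only, a hedged sketch of how the field comment could spell this out in appconfig.proto (the field name and number are from this PR; the comment wording is just a suggestion, not the final API):
// UUID of the edge node designated by the controller to bring up this
// app instance in an edge-node cluster (e.g. perform the initial download
// and launch). The cluster may later reschedule the app instance to
// another node at runtime.
string designated_node_id = 24;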
// To inform the edge-node if the device receiving this Volume is
// responsible for volume creation, convert PVC, or not
string designated_node_id = 10;
See comment about this field for the app instance.
I put the reply in the app-instance comment above.
FWIW I still think this will cause technical debt, since the controller has to be involved in re-selecting a designated node if the previously selected one fails. So I'm approving this holding my nose.
proto/info/info.proto
Outdated
Node_Status_READY = 1; // cluster reports our node is ready
Node_Status_NOTREADY = 2; // cluster reports our node is ready
Comments are identical for the two fields.
But how does the state of the node relate to the state of the cluster?
Take the case of a cluster where its config lists node1, node2, and node3.
When, e.g., node3 reports READY, does it mean not only that it has joined the cluster but also that it sees node2 and node1 as part of the cluster?
What status do we expect to see reported by the nodes if/when one of the nodes fails to join the cluster? Will the other two report it as READY?
These semantics should presumably be described in comments in the proto file.
I think this is only for the node to report its own status as part of the cluster. We'll need another info or metric message to report the cluster status (more like the Kubernetes view) with an elected reporter in the cluster, but that is more complex and is not part of this PR.
So this is only saying, as a device, what my node's status in this cluster is. If all nodes report 'unknown', then the cluster is down.
Can you 1) correct the comment and 2) add the above information as comments?
Also, does it make sense to have a separate value for DOWN vs. using UNKNOWN? Either one works for me as long as this is clearly documented.
Will correct the comment.
We can use 'DOWN' or 'CANTCONNECT'; basically it says the node cannot talk to the cluster's API server.
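As a sketch only (not the merged definition), the corrected enum comments discussed here could look something like this; the UNKNOWN and DOWN values and the exact wording are suggestions taken from this thread:
// Each node reports only its own membership status in the cluster; a
// cluster-wide view would need a separate info/metric message.
Node_Status_UNKNOWN = 0;  // status not (yet) known or not reported
Node_Status_READY = 1;    // this node has joined and the cluster reports it as ready
Node_Status_NOTREADY = 2; // this node has joined but the cluster reports it as not ready
Node_Status_DOWN = 3;     // this node cannot talk to the cluster's API server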
proto/info/info.proto
Outdated
// The App in cluster mode is currently running on this edge-node
bool cluster_app_running = 20;
I think I already commented on this field but I don't see the comment any more.
Does "running" imply that 1) this node was selected as the place to run the app instance and 2) the app instance has reached "Online" state?
What about the cases when the node has been selected but the state of the app instance is, e.g., booting, halting, halted, or error?
It would be helpful to know the intended use (who is going to do something based on this field? Is it merely for display? Or will a user/controller somehow react to this value to change things?).
Once we know that, the semantics can be documented in the proto file.
The main usage of this info status is for app 'migration': the node is not the designated node for the app (related to the app-instance DNID above), but the Kubernetes cluster migrated this app onto this node and the app is now running here. The controller can then change the app-instance config via the API to indicate that this node is now the designated node. All the app domain operations will work normally on this node, e.g. the 'activated' bool will be set to true on the domain status, etc.
I think this design with a designated node is problematic, since you depend on the controller to adjust the config based on where the app (or volume) is currently running. What if the controller does not adjust it? Should EVE interpret that as meaning it should be migrated back? (Well, it can't, since that would be a race in the eventual-consistency model.)
Thus I think the placement needs to be based on policy statements (only run on node X, avoid running on node Y, run on the same node as app 2) and not involve the controller reacting to where the app instance is currently running; that is not a declarative API statement.
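To illustrate the alternative being argued for here (purely hypothetical, not part of this PR or the EVE API), such declarative placement constraints could be expressed roughly as:
// Hypothetical sketch, NOT in this PR: placement expressed as declarative
// constraints instead of a controller-maintained designated node.
message PlacementPolicy {
  repeated string required_node_ids = 1;       // only run on these nodes
  repeated string excluded_node_ids = 2;       // never run on these nodes
  repeated string colocate_with_app_uuids = 3; // run on the same node as these apps
}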
This is just an 'info' message to report that this app is now running on this node; it does not involve any placement decision, since the app has already been rescheduled by the cluster itself.
The node does not depend on the controller to change the DNID for the app in the downloaded config. The plan is to have the controller eventually change the DNID to this node, so the app status on this new node sort of becomes 'legal' and it starts to report app metrics and that sort of thing, since the reporting of the metrics depends on the app's domain status being fully activated.
I think that, regardless of the details of the process on this node, reporting 'app is running on this node' as an 'info' message is important for the controller, or the user, to know.
@naiming-zededa We still need to make it clear what "running" means here, in cases where the node is trying to run it (so the state could be booting) or tried to run it (and it failed, so the state is currently halting, halted, or the app instance has an error).
Thus, is this "running" as in "trying to run and the state is currently online", or "selected" as in trying/intending to run on this node? Please clarify in comments and in the name of the field.
@eriknordmark makes sense. We will add comments for this item: the meaning is that the deployed app has been scheduled and launched on this node and carries this node's 'node-id' property; the app can be in any operating state.
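A possible wording for that comment (a sketch only; the field name and number are as quoted above, the text is just a suggestion):
// True when the cluster has scheduled and launched this app instance on
// this edge node (i.e. this node currently owns the app), regardless of
// the app's operating state (booting, online, halting, halted, error).
// It does not by itself imply the app instance is Online.
bool cluster_app_running = 20;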
proto/config/edge_node_cluster.proto
Outdated
string cluster_ip_prefix = 4;
// This device is an 'Agent' node
bool is_agent = 5;
Should this be is_agent or is_worker? I think maybe worker? The agent/server language is a k3s one; control plane/worker is generic Kubernetes. We are using k3s here, but that is an implementation detail. I think the API should be generic to Kubernetes.
Sure, we can change it to is_worker_node. Although we have things like join_server_ip, we don't want to be too Kubernetes-specific with something like join_control_plane_node_ip, and 'master' should not be used these days :-)
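For reference, the renamed field could then read as follows (a sketch only, assuming it keeps field number 5 from the quoted snippet above):
// This device is a worker node (as opposed to a control-plane/server node)
bool is_worker_node = 5;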
Yeah, I think server is fine though. Lots of people use "server" as the thing to talk to, and even refer to a "control plane server" or "master server". Besides, the only thing I can join is a control plane server anyway.
This still needs to be addressed
Thanks! I missed it yesterday. Done; it is now is_worker_node.
I'll do another rebase when ready.
You've done some strange pull+merge from main so now other PRs appear in the diffs.
Note that you should NEVER do merge commits. Instead fetch from main and rebase on that.
And for this particular repo with the protobuf files it makes sense to always keep two separate commits: one with the changes to the proto files and a separate one with the generated files. That makes it trivial to rebase since you can drop the second commit (in git rebase -i for instance), and then invoke make to recreate the generated files once you've rebased the commit with the proto changes on top of main.
So please rebase and create the two commits which are needed.
(force-pushed 86fb9db to dbf886d)
Completed the rebase; now at two commits, and force-pushed.
Looks good to me, other than Erik's requests (the multiple commits).
(force-pushed b0b81da to 8d19735)
Fixed the Yetus errors, committed the changes, rebased, built, and committed the generated files. This should answer all outstanding requests.
LGTM, but @eriknordmark needs to approve.
Quick update to try to fix the Yetus errors; did I shoot too soon?
(force-pushed 9c12cbc to e41e113)
I committed these changes shortly after they were requested; can we at least mark this as done, or as not blocking the merge, please?
proto/config/appconfig.proto
Outdated
// This edge-node UUID for the Designate Node for the Application
string designated_node_id = 25;
FWIW I still think this will cause technical debt, since the controller has to be involved in re-selecting a designated node if the previously selected one fails. So I'm approving this holding my nose.
There are still two outstanding clarification requests; once those are addressed I will approve this, holding my nose.
(force-pushed e41e113 to 65320ce)
LGTM.
You probably have to redo the generated-files commit, since I just merged another API addition.
First pass adding edge node cluster to the API; fix some Yetus errors. Signed-off-by: Gerald (Bob) Lee <[email protected]>
All generated files after `make`. Signed-off-by: Gerald (Bob) Lee <[email protected]>
(force-pushed 65320ce to ec0f94e)
PR #46 accidentally added the allow_to_discover field to app_instance, which was removed in #52. Not deprecating field number 25 (allow_to_discover) because we haven't released EVE with this API yet, so we don't have to. Signed-off-by: Pavel Abramov <[email protected]>
First pass adding edge node cluster to the API
Based on the design and API in https://wiki.lfedge.org/display/EVE/Edge-Node+Clustering+Design