Flip 298: Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack) #296

Open
wants to merge 16 commits into
base: main

@AlexHentschel (Member) commented Oct 31, 2024

This flip is tracked by issue #298.

Motivation

We want to move away from the existing Version Beacon for the following reasons:

  1. The current Version Beacon requires that nodes have a (potentially long) history: they must have seen the version beacon service event, which is not guaranteed for nodes joining at epoch boundaries.

    Better: each block specifies which component version is to be used for processing it.

  2. The current Version Beacon is based on height and hence not usable for upgrading most consensus-related aspects (and many other protocol aspects).

    Better: using view instead of height for triggering behaviour changes is generally applicable and more robust (view increases monotonically over time, while height might also decrease).

  3. The current Version Beacon does not sufficiently differentiate between a software version and the specification of how the software should behave, and it makes very strong assumptions about downwards compatibility, which are too easily broken.

Context

This flip is a draft resulting from the Core Protocol Working Group meeting on Oct 31, 2024.
The original version was drafted in Notion: https://www.notion.so/flowfoundation/Core-Protocol-WG-Meeting-07-October-31-2024-1271aee1232480d0850eced4fafc1596?pvs=4

@AlexHentschel AlexHentschel changed the title [Flip ] Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack) [Flip 296] Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack) Oct 31, 2024
@AlexHentschel (Member Author)

P.S. As part of this PR, I have added the markdown [.md] extension to filenames that were previously missing it

@AlexHentschel AlexHentschel marked this pull request as ready for review October 31, 2024 22:40
@turbolent (Member) left a comment

Great proposal! +100 on adopting this for the execution stack and Cadence.


- Dynamic Protocol State should ingest the Version Beacon Service Event and track the Execution Stack’s Component Version

![Illustration of Process](20241031-execution-stack-versioning/Execution_Stack_Versioning_(2).png)
Member

There is a lot going on in this diagram. Maybe add a description of what is important to this proposal.

Regarding this interface: maybe make this more abstract and explain it. I had a hard time understanding this core piece of the proposal. What is a "KVStoreReader"? What is a "ViewBasedActivator"? Maybe improve that naming and make it less based on current implementation details (Go/flow-go).

Maybe also make it clearer that the idea is that for each component, there will be two functions:

  • A function returning the current required version for this component
  • A function returning future/upcoming versioning for this component

It is not very clear that the example shows the versioning of the component named "ExecutionStack".
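The two per-component functions suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ComponentVersioning`, `VersionUpgrade`, and `executionStackVersioning` are hypothetical and not taken from flow-go, and the sketch assumes an integer component version whose change is triggered by view, as the proposal describes.

```go
package main

import "fmt"

// VersionUpgrade is a hypothetical record of a scheduled version change:
// the component must run Version for all blocks with view >= ActivationView.
type VersionUpgrade struct {
	Version        uint64
	ActivationView uint64
}

// ComponentVersioning is a hypothetical, implementation-agnostic interface
// exposing the two functions per component.
type ComponentVersioning interface {
	// RequiredVersion returns the component version required at the given view.
	RequiredVersion(view uint64) uint64
	// UpcomingUpgrade returns the scheduled upgrade that is not yet active
	// at the given view, if there is one.
	UpcomingUpgrade(view uint64) (VersionUpgrade, bool)
}

// executionStackVersioning is a toy in-memory implementation for a single
// component (here: the Execution Stack).
type executionStackVersioning struct {
	current uint64          // version in effect before any pending upgrade
	pending *VersionUpgrade // nil if no upgrade is scheduled
}

func (s executionStackVersioning) RequiredVersion(view uint64) uint64 {
	if s.pending != nil && view >= s.pending.ActivationView {
		return s.pending.Version
	}
	return s.current
}

func (s executionStackVersioning) UpcomingUpgrade(view uint64) (VersionUpgrade, bool) {
	if s.pending != nil && view < s.pending.ActivationView {
		return *s.pending, true
	}
	return VersionUpgrade{}, false
}

func main() {
	var v ComponentVersioning = executionStackVersioning{
		current: 5,
		pending: &VersionUpgrade{Version: 6, ActivationView: 1000},
	}
	fmt.Println(v.RequiredVersion(999), v.RequiredVersion(1000))
	upgrade, scheduled := v.UpcomingUpgrade(999)
	fmt.Println(upgrade.Version, scheduled)
}
```

Because the lookup takes only a view, a node that joins late can answer both questions for any block without having observed the original service event.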

@AlexHentschel AlexHentschel changed the title [Flip 296] Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack) Flip 298: Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack) Nov 1, 2024
@turbolent turbolent added the flip: protocol Protocol FLIP label Nov 1, 2024
• adding sections `Objectives` and `Motivation` and `User Benefit` according to Flip template
• adding short explanation of terminology (Version Beacon, HCU)
@AlexHentschel AlexHentschel force-pushed the version-beacon-flip_initial-draft branch from b5f8620 to fd9c54d Compare November 1, 2024 23:42
@bluesign (Collaborator) commented Nov 4, 2024

hmm I think semver is not a good idea here, actually I think version alone is bad idea tbh. I think this should be version + an array of feature flags instead.

Comment on lines +150 to +152
- In addition, we introduce compatibility requirement from semantic versioning:

$\textnormal{Component Version :} \quad \texttt{major}\,.\,\texttt{minor}$
Member

I disagree with this mapping from semver to component versions.

In semver, only major version changes are breaking changes. All other changes must be backward-compatible.

MAJOR version when you make incompatible API changes
MINOR version when you add functionality in a backward compatible manner
PATCH version when you make backward compatible bug fixes

We need to coordinate a component version upgrade only when that component is upgraded in a backward-incompatible manner (major version increment in semver). Otherwise a rolling upgrade suffices.

We may choose to require coordination of a backward-compatible component version upgrade (minor version increment in semver). Maybe we want to coordinate the release of a feature at a specific time. But by doing so, we are turning a backward-compatible upgrade into a backward-incompatible upgrade. Which is fine, but now it is a major version upgrade in semver terms.

So, if some component is using semver to internally version itself, then only major version changes should correspond to component version increments.
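A minimal sketch of the mapping argued for in this comment, assuming the component version is simply the semver major number. The type and function names are hypothetical and only illustrate the reviewer's point; they are not part of the flip.

```go
package main

import "fmt"

// SemVer is a minimal semantic version triple.
type SemVer struct {
	Major, Minor, Patch uint64
}

// ComponentVersion derives the protocol-coordinated component version from a
// component's internal semver. Assumption of this sketch: only the major
// number is tracked by the protocol.
func ComponentVersion(v SemVer) uint64 {
	return v.Major
}

// NeedsCoordinatedUpgrade reports whether moving between two software
// releases changes the component version. If it does not, a rolling upgrade
// suffices under the semver reading given in the comment.
func NeedsCoordinatedUpgrade(from, to SemVer) bool {
	return ComponentVersion(from) != ComponentVersion(to)
}

func main() {
	base := SemVer{Major: 1, Minor: 4, Patch: 2}
	fmt.Println(NeedsCoordinatedUpgrade(base, SemVer{Major: 1, Minor: 5, Patch: 0}))
	fmt.Println(NeedsCoordinatedUpgrade(base, SemVer{Major: 2, Minor: 0, Patch: 0}))
}
```

Under this reading, a feature whose release is deliberately coordinated would be published as a major increment, even if the code change itself is backward compatible.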

@AlexHentschel (Member Author) commented Nov 5, 2024

We need to coordinate a component version upgrade only when that component is upgraded in a backward-incompatible manner (major version increment in semver). Otherwise a rolling upgrade suffices.

Maybe I misunderstand. I added a new section Discussion of Possible Versioning Schemes, which in part discusses your point, if I understood it correctly.

In a nutshell, a pure feature addition is still something that needs to be coordinated, because we need to agree on when the new feature becomes usable. Nevertheless, I think one can make an argument for differentiating between major breaking changes and pure feature additions. That's all I want to say here.
In the end, though, I agree with you: for many scenarios SemVer does not make sense to me. I tried to explain that better in the new section Discussion of Possible Versioning Schemes. Please take a look. Curious about your thoughts.

@turbolent (Member)

@bluesign the idea is to keep the solution/mechanism simple. you could imagine that a particular component could translate a version to feature flags internally, so there isn't really any reason to put features into the "component version" that is tracked/coordinated by the protocol.
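As a rough illustration of a component translating the coordinated integer version into feature flags internally, consider the sketch below. The feature names and the table `featuresByVersion` are invented for this example; the protocol would only ever see the integer.

```go
package main

import "fmt"

// Feature names are purely illustrative.
type Feature string

const (
	FeatureA Feature = "feature_A"
	FeatureB Feature = "feature_B"
	FeatureC Feature = "feature_C"
)

// featuresByVersion is a hypothetical table that lives inside the component.
// It is part of the component's behaviour specification, not protocol state.
var featuresByVersion = map[uint64][]Feature{
	1: {},
	2: {FeatureA, FeatureB},
	3: {FeatureA, FeatureB, FeatureC},
}

// Enabled reports whether a feature is active under the given component
// version. Unknown versions yield an error so the caller can halt instead of
// guessing at the intended behaviour.
func Enabled(version uint64, f Feature) (bool, error) {
	features, ok := featuresByVersion[version]
	if !ok {
		return false, fmt.Errorf("unsupported component version %d", version)
	}
	for _, enabled := range features {
		if enabled == f {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	for _, version := range []uint64{1, 2, 3, 4} {
		on, err := Enabled(version, FeatureC)
		fmt.Println(version, on, err)
	}
}
```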

@turbolent (Member)

We've discussed this proposal in the Cadence team today. Some feedback based on questions to avoid confusion:
It might make sense to

  • explicitly state that it is not a requirement that a software version must support all previous component versions
  • add an example where a component version is supported by multiple software versions, to illustrate that e.g. any change that does not change the behaviour of execution (i.e. the result, for example a performance optimization), does not require a component version bump
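Both points can be illustrated with a small sketch. The software release names and the supported ranges below are made up for the example; the only assumption is that each software release declares an inclusive range of component versions it implements.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is an inclusive range of supported component versions.
type Range struct{ Min, Max uint64 }

// Contains reports whether the component version v lies within the range.
func (r Range) Contains(v uint64) bool { return r.Min <= v && v <= r.Max }

// supported maps hypothetical software releases to the component versions
// they implement.
var supported = map[string]Range{
	"sw-1.0.0": {Min: 5, Max: 5},
	"sw-1.0.1": {Min: 5, Max: 5}, // performance optimization only: same execution results
	"sw-1.1.0": {Min: 5, Max: 6}, // adds component version 6, still supports 5
	"sw-2.0.0": {Min: 6, Max: 6}, // drops support for component version 5
}

// SoftwareSupporting lists, in sorted order, all software releases that can
// process blocks requiring the given component version.
func SoftwareSupporting(componentVersion uint64) []string {
	var releases []string
	for name, r := range supported {
		if r.Contains(componentVersion) {
			releases = append(releases, name)
		}
	}
	sort.Strings(releases)
	return releases
}

func main() {
	fmt.Println(SoftwareSupporting(5))
	fmt.Println(SoftwareSupporting(6))
}
```

Here `sw-1.0.0` and `sw-1.0.1` share component version 5 because the optimization does not change execution results, while `sw-2.0.0` shows that a release need not support all previous component versions.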

@AlexHentschel (Member Author)

@bluesign also in response to your suggestion

version alone is bad idea tbh. I think this should be version + an array of feature flags instead.

I have extended the section Discussion of Possible Versioning Schemes. It discusses Feature Vectors for versioning in detail. I would agree that certain features we would want to repeatedly turn off and on at runtime and in those cases feature flags make sense.

In most cases we batch updates and for those I think an integer version would just be fine. So I agree with your general assessment of version + an array of feature flags instead. Though, at the moment I am not sure of any feature that we want to repeatedly turn on and off on mainnet. Hence, I'd start with the integer version alone.

@bluesign (Collaborator) commented Nov 5, 2024

I would agree that certain features we would want to repeatedly turn off and on at runtime and in those cases feature flags make sense.

I was thinking more of rollback scenarios without building a new version. It can also be helpful for things that get deprecated (without releasing a new version). It can also make backward-compatibility guarantees easier: you can guarantee that a component is compatible with one previous version, and the next one is the previous version plus feature flags.

something like this:

1.0 
1.0 + [feature_A, feature_B]
1.1 ( includes feature_A, feature_B) + [feature_C] 

additional complexity of SemVer with the associated correctness risks (SemVer assumes downwards compatibility by default, while maintaining downwards compatibility
in the implementation is generally additional work, so the default assumption of compatibility induces additional risks for the happy path of block execution - not a good tradeoff in my opinion).

### Versioning Scheme based on Feature Vectors
Member Author

@bluesign I wanted to consolidate the conversation around feature flags into one thread here, which we can possibly resolve after we have reached alignment and transferred the conclusions into the flip. This also allows other contributors to comment on the ideas without different conversations interleaving.


@bluesign's initial comment: hmm I think semver is not a good idea here, actually I think version alone is bad idea tbh. I think this should be version + an array of feature flags instead.


@AlexHentschel reply: I have extended the section Discussion of Possible Versioning Schemes. It discusses Feature Vectors for versioning in detail. I would agree that certain features we would want to repeatedly turn off and on at runtime and in those cases feature flags make sense.

In most cases we batch updates and for those I think an integer version would just be fine. So I agree with your general assessment of version + an array of feature flags instead. Though, at the moment I am not sure of any feature that we want to repeatedly turn on and off on mainnet. Hence, I'd start with the integer version alone.


@bluesign's reply: I was thinking more of rollback scenarios without building a new version. It can also be helpful for things that get deprecated (without releasing a new version). It can also make backward-compatibility guarantees easier: you can guarantee that a component is compatible with one previous version, and the next one is the previous version plus feature flags.

something like this:

1.0 
1.0 + [feature_A, feature_B]
1.1 ( includes feature_A, feature_B) + [feature_C] 

@AlexHentschel (Member Author) commented Nov 5, 2024

Thanks for the clarification @bluesign. It makes sense to me to introduce a feature flag to potentially roll back one specific feature that we are worried about. However, we could also just roll back to the prior version if the software supports that:

• version 6
• version 7 (extends version 6 by adding features A and B)
  ⋮
• version 6: we learn feature A is bugged and roll back to version 6
• version 8 (extends version 6 by adding a fixed version of feature A, plus feature B)

I would guess we share the opinion that many different schemes could potentially be useful for different scenarios. Though, I have been asked repeatedly for "best practise guidelines" on which versioning scheme should be picked when, which I am struggling with, because I don't think I have the necessary experience to give properly educated suggestions.
That's mainly the reason why I tend to come back to the questions:

  1. Can we not express the same evolutionary path of the protocol by using solely integer versions?
  2. And what is the strong benefit of using a more sophisticated versioning scheme (here SemVer + feature flags)? Sure, it can be useful in specific cases, though a more sophisticated versioning scheme also introduces more intellectual and implementation complexity.

I agree that integer versioning is a comparatively simplistic tool. Though it is also extremely general as it makes effectively no assumptions about usage patterns.

Member

Interesting, I had not realized that this mechanism would support rollbacks 🤔 For whatever reason I assumed that versions would have to increment, even if a higher version effectively might just have the same behaviour as an earlier one, i.e. I had assumed that "version 6: we learn feature A is bugged and roll back to version 6" would rather be a "version 7: we learn feature A is bugged and roll back to the behaviour of version 6, features A and B are disabled"

@turbolent (Member) commented Nov 5, 2024

Can we not express the same evolutionary path of the protocol by using solely integer versions?
[...] what is the strong benefit of using a more sophisticated versioning scheme

That was mostly the realization in prior discussions: A simple integer is sufficient (e.g. the notion of a supported component version range can be implemented in the component implementation, like software version 2 supports component version range [1,2]) and any benefits (which?) are likely not worth the complexity

Collaborator

Thanks @AlexHentschel. Unfortunately I lack experience in the area of distributed protocols, but from my experience in mobile game publishing, where deploying a new version is somewhat delayed and many users are affected, it is always best to think with "bad version release" in mind.

There are a few parameters to consider, I guess:

  • Do we need rollback, or a new version release? (By the way, I was thinking like @turbolent when I commented: a component rollback would be a new version.)

I don't know the cost of releasing a new version to the network; I assume (from the execution context only) it involves updating and restarting the nodes. This can certainly be covered with version ranges, but I think there would need to be some guarantee of supporting at least one previous version in case things go wrong.

  • What should the component support range be?

I think one version back is a must, just in case of a failure. Version N supporting N-1 is probably enough (at least for security patches).

Bigger batches (features bundled in a version) were the origin of the version + flags idea: since all those features are parallel additions to the previous version, a linear rollback does not let us remove one feature without removing another. (I am not sure how important this one is to the business case here.)

It is a really tricky subject. I don't think there is a single best practice to follow; it all depends on targets and trade-offs.

@AlexHentschel (Member Author) commented Nov 8, 2024

always best to think with "bad version release" in mind

💯 agree. Especially for high-assurance systems such as flow, it's super important to have a fallback plan, because every software change requires time-intensive coordination and if mainnet were to break it would take on the order of hours to coordinate with a supermajority of node operators to deploy a fix. We thoroughly test, but of course that is not a guarantee that the software behaves as intended.

I really appreciate this discussion, because it helps find a good balance: we want to make our toolset better, but it doesn't have to be perfect. Nevertheless, the improvement of our tooling needs to be significant enough to warrant the respective investment of engineering resources. In the context of this Flip, the goal is to extend our toolset for deploying software upgrades and addressing problems with upgrades:

  • I personally think that Integer Versioning via the Dynamic Protocol State for some (few) software components would already be a significant step forward. We are going to learn a lot about which upgrade scenarios our new approach works well for and which scenarios pose challenges. Furthermore, integer versioning is simple and the most general, so it makes sense to me to start with that.

  • Nevertheless, @bluesign is raising very good points: What mechanisms do we need in addition, for integer versioning to be a significant improvement? Paraphrasing (my understanding) of @bluesign's previous points:

    (i) Should we recommend that software supports version N plus N-1, to keep the option for quick rollback without requiring time-intensive software upgrades on supermajority of nodes?
    (ii) How should we represent the rollback of a new version in the Component Versioning Scheme?
    (iii) In case of releasing a bigger batch of features, do we want the ability to roll back select features without needing to reverse the entire upgrade as a whole?

    (Hope I understood correctly 😅, don't hesitate to jump in @bluesign)

For the points (i)-(iii), the software implementation as well as the component versioning scheme have to support it.

regarding (i):
Whether we can make the implementation work (one software supporting component version [N-1, N]) will strongly depend on the feature we want to upgrade/add and the component it lives in. We should recommend support for component versions [N-1, N] but not depend on it always being possible. I feel that in many cases, it will be possible for the implementation to at least support component versions [N-1, N]. So for this flip, I think we should consider this scenario explicitly.

regarding (ii): (rollback of new version as a whole)
For this scenario, I think integer versioning is sufficient. Let's walk through the two important sub-cases:

  • The software only supports version N, but some time after deployment we discover there are problems with the new behaviour and we need to roll back. We need to deploy new software to a majority of nodes, because the only behaviour the software supports is the bricked behaviour. Then, it doesn't really matter whether we increment or decrement the version number, for the following reason:

    Example: Let's say we just deployed component version N=6 and now find out there are severe problems. If we switch back to exactly the same behaviour as version 5, we could decrement the component version from 6 to 5 in the protocol state. The nodes with bricked behaviour will halt at the switchover point; the node operators would have to deploy the version-5 software and restart the nodes, which then proceed from the switchover point. Nevertheless, there is also no strict reason why two component specifications couldn't behave identically. In other words, we could specify version 7 to behave exactly as 5 and then increase the version number from the bricked version 6 to 7. The deployment process for the node software would be exactly the same.

  • The other scenario is that our software supports component versions [N-1, N]; we just switched to version N=6 and now find out about the problems. But unlike the prior case, our software still supports the old component version 5. Then we could quickly switch back to version 5 via a governance transaction without needing to touch the software on the nodes. In contrast, if we hypothetically chose to increase the version to 7 to remove the bricked behaviour, we would need new software. So decrementing the version number is clearly advantageous in my opinion.

Hence, we conclude: decrementing the component version is generally suitable for rolling back an upgrade as a whole. There are exceptions, but I would recommend this approach as the default.
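The two sub-cases can be compared in a small sketch. It assumes, for illustration only, that deployed software declares an inclusive range of supported component versions and halts when the protocol state requires a version outside that range; `SupportedRange` and `RedeployNeeded` are hypothetical names.

```go
package main

import "fmt"

// SupportedRange is the inclusive range of component versions that the
// currently deployed node software implements.
type SupportedRange struct{ Min, Max uint64 }

// RedeployNeeded reports whether node operators must install different
// software before nodes can proceed once the protocol state requires the
// target component version.
func RedeployNeeded(deployed SupportedRange, target uint64) bool {
	return target < deployed.Min || target > deployed.Max
}

func main() {
	onlyN := SupportedRange{Min: 6, Max: 6}   // sub-case 1: supports only N=6
	prevAndN := SupportedRange{Min: 5, Max: 6} // sub-case 2: supports [N-1, N]

	// Rolling back from the faulty version 6 either by decrementing to 5
	// or by introducing a version 7 that behaves exactly like 5.
	for _, target := range []uint64{5, 7} {
		fmt.Printf("target %d: only-N redeploy=%t, [N-1,N] redeploy=%t\n",
			target, RedeployNeeded(onlyN, target), RedeployNeeded(prevAndN, target))
	}
}
```

Only the combination of software supporting [N-1, N] and a decrement to 5 avoids a redeploy, which matches the recommendation to decrement by default.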

regarding (iii): (rollback of new selective features of a new version)
So we have seen that the Integer Versioning Scheme works well for rolling back to a prior version (abandoning the new version as a whole). However, it doesn't allow us to roll back selective features (arbitrary combinations) that were all deployed as one new version. I am not sure how prominent this scenario will be - my gut feeling is that we need this fine-grained control only in a minority of cases. Hence, I am inclined to stick with Integer Versioning Scheme and think about optional extensions for this edge-case scenario (👇 proposal below).

Here is my proposal: "Integer Versioning with Variable Feature Vector Extension"

The following could be the format of a version

type ExtendedIntegerVersion struct {
	N uint64
	FeatureControls []byte // by default empty or nil
}

With empty FeatureControls, this collapses to the pure Integer Versioning Scheme.

FeatureControls is a binary vector, which is solely interpreted by the software. There is one and only one way to interpret the FeatureControls, which is technically part of the component behaviour specification. So software that understands component version N can decode and interpret FeatureControls for exactly that version. In its simplest form, every bit could correspond to a particular feature flag for the component, though we can also put much more complex structures in this FeatureControls slice. FeatureControls should be small in size; that's the only limitation. Otherwise, it is essentially a completely generic binary blob, which carries a version-specific configuration for the component.
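For the simplest interpretation, one bit per feature, a decoding sketch could look as follows. The bit layout (least significant bit first within each byte) and the convention that a set bit switches a feature off are assumptions made for this example; the proposal leaves the interpretation to each component version's specification.

```go
package main

import "fmt"

// ExtendedIntegerVersion mirrors the struct proposed above.
type ExtendedIntegerVersion struct {
	N               uint64
	FeatureControls []byte // by default empty or nil
}

// FeatureDisabled interprets FeatureControls as a bit vector in which bit i
// (least significant bit first within each byte) being set means "feature i
// of version N is switched off". Bits beyond the end of the slice read as
// unset, so an empty FeatureControls collapses to the pure integer scheme.
func (v ExtendedIntegerVersion) FeatureDisabled(i uint) bool {
	byteIdx := i / 8
	if byteIdx >= uint(len(v.FeatureControls)) {
		return false
	}
	return v.FeatureControls[byteIdx]&(1<<(i%8)) != 0
}

func main() {
	pure := ExtendedIntegerVersion{N: 7}
	partial := ExtendedIntegerVersion{N: 7, FeatureControls: []byte{0b00000001}}

	fmt.Println(pure.FeatureDisabled(0))
	fmt.Println(partial.FeatureDisabled(0), partial.FeatureDisabled(1))
}
```

With this convention, rolling back only feature 0 of version 7 would mean publishing the same N with a single set bit, without touching the other features of that version.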

@AlexHentschel AlexHentschel force-pushed the version-beacon-flip_initial-draft branch from 97f19b6 to cba3277 Compare November 5, 2024 21:09
@AlexHentschel (Member Author)

@turbolent, thanks for discussing the flip with the Cadence team and providing a summary:

• explicitly state that it is not a requirement that a software version must support all previous component versions
• add an example where a component version is supported by multiple software versions, to illustrate that e.g. any change that does not change the behaviour of execution (i.e. the result, for example a performance optimization), does not require a component version bump

I have extended the section Relationships between Software and Component Version and added explanations of both points. Thanks
