Database encryption key backup / recovery feature for entropy-tss
#1249
base: peg/non-persistant-tss-keys
Conversation
…equesting recover backup
)
.await
.map_err(|e| {
    tracing::error!("Could not make key backup: {}", e);
Rather than giving up here, we could have a loop which attempts to request backups from other TSS servers until we find one which works, and also reports the ones who failed to make a backup for us. I would leave this as a follow-up.
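A rough sketch of what such a loop might look like (the helper `request_backup_from` and the error variant used here are hypothetical, just to illustrate the idea):

```rust
// Hypothetical sketch of the suggested retry loop - `request_backup_from` and
// `BackupProviderError::NoProviderFound` are illustrative names, not code from this PR.
async fn make_key_backup_with_retry(
    candidate_providers: Vec<SubxtAccountId32>,
    encryption_key: [u8; 32],
) -> Result<SubxtAccountId32, BackupProviderError> {
    let mut failed = Vec::new();
    for provider in candidate_providers {
        match request_backup_from(&provider, encryption_key).await {
            Ok(()) => {
                // Report the providers which failed before we found a working one
                for bad_provider in &failed {
                    tracing::warn!("TSS node {:?} failed to make a backup for us", bad_provider);
                }
                return Ok(provider);
            },
            Err(e) => {
                tracing::error!("Could not make key backup with {:?}: {}", provider, e);
                failed.push(provider);
            },
        }
    }
    Err(BackupProviderError::NoProviderFound)
}
```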
Good work on this 💪
Still making my way through the review of this, will try and finish it up soon 🙏
) -> Result<(), BackupProviderError> {
    let tss_account = SubxtAccountId32(sr25519_pair.public().0);
    // Select a provider by making chain query and choosing a tss node
    let key_provider_details = select_backup_provider(api, rpc, tss_account).await?;
What happens if we try and recover our DB at a later point in time and the selected server isn't available anymore?
Good point.
If the 'backup provider' unbonds cleanly, we need a way of knowing this so we can make another backup. If they just disappear, then maybe through the slashing mechanism we could figure out that they have gone and make another backup. For example, the propagation pallet makes a request with the TSS ids of leaving validator(s), and if they match our backup provider then we make another call to `make_key_backup`. Otherwise, well, we just cannot recover.
@JesseAbram suggested sending the key out to several other tss nodes: #1247 (comment)
But I want to keep things simple in this PR and then iterate, as deciding how many backups to make is a tricky design decision where there's a security tradeoff.
I think it's worth bearing in mind that not being able to recover is only really a big problem if you are a signer, and even then we can tolerate losing n - t keyshares (although we haven't yet figured out the practicalities of doing a reshare in that situation).
> If the 'backup provider' unbonds cleanly, we need a way of knowing this so we can make another backup.

If we put the onus onto the TSS here, they'd need to be querying the chain at every session change or anytime a validator changes, which might be annoying. If we make it the chain's responsibility to inform TSS servers then we might have to put backup information on-chain, or just send a request to all TSS servers informing them of this (even if it doesn't necessarily affect them).

> If they just disappear, then maybe through the slashing mechanism we could figure out that they have gone and make another backup. For example, the propagation pallet makes a request with the TSS ids of leaving validator(s), and if they match our backup provider then we make another call to `make_key_backup`. Otherwise, well, we just cannot recover.

Yeah, similar to my second point above, we'll end up sending a bunch of messages out to the TSS servers. Kind of annoying but I guess it can work. I wouldn't implement this soon though, seems a bit overkill for now.

> @JesseAbram suggested sending the key out to several other tss nodes: #1247 (comment)
> But I want to keep things simple in this PR and then iterate, as deciding how many backups to make is a tricky design decision where there's a security tradeoff.

I'd like to avoid spreading sensitive material around the network, as the design starts to become similar to our previous approach where we were sending keyshares to other TSS servers. And yep, it also does add complexity to this whole feature.

> I think it's worth bearing in mind that not being able to recover is only really a big problem if you are a signer, and even then we can tolerate losing n - t keyshares (although we haven't yet figured out the practicalities of doing a reshare in that situation).

Yeah, fair enough
Ya, but if the keys are stored in the memory of the TDX they are unreachable, and within our current security model this does not weaken the system, it only strengthens redundancy.
@@ -114,107 +112,3 @@ pub type EncodedVerifyingKey = [u8; VERIFICATION_KEY_LENGTH as usize];
#[cfg(not(feature = "wasm"))]
pub type BoundedVecEncodedVerifyingKey =
    sp_runtime::BoundedVec<u8, sp_runtime::traits::ConstU32<VERIFICATION_KEY_LENGTH>>;

/// Input data to be included in a TDX attestation
pub struct QuoteInputData(pub [u8; 64]);
I ended up doing a refactor here to put all the attestation-related stuff into an `attestation` module as it was starting to get messy.
I should really have put this in a separate PR as it affects the pallets, which are otherwise untouched by this PR.
Yeah, agreed probably should've been done in a follow up PR 😅
Looks good, and gives us a bit of resilience in case of process crashes 👍
I'd say we first merge the PR this is built on top of, rebase this PR onto master, and then merge it. If it's not too much work I'd also consider splitting out the Attestation refactor stuff into a follow-up PR, but nbd if you can't do it.
@@ -221,30 +225,31 @@ pub struct AppState {
    pub configuration: Configuration,
    /// Key-value store
    pub kv_store: KvManager,
    /// Storage for encryption key backups for other TSS nodes
    /// Maps TSS account id to encryption key
    pub encryption_key_backups: Arc<RwLock<HashMap<[u8; 32], [u8; 32]>>>,
The subxt variant doesn't but the sp-core one does - and I think we can convert between them.
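For reference, the conversion is a sketch along these lines (assuming the usual `subxt::utils::AccountId32` and `sp_core::crypto::AccountId32` definitions, both of which wrap a `[u8; 32]`):

```rust
use sp_core::crypto::AccountId32 as SpAccountId32;
use subxt::utils::AccountId32 as SubxtAccountId32;

// Both account id types are thin wrappers around a 32-byte array, so converting
// between them just moves the raw bytes across.
fn subxt_to_sp_core(account: SubxtAccountId32) -> SpAccountId32 {
    SpAccountId32::new(account.0)
}

fn sp_core_to_subxt(account: SpAccountId32) -> SubxtAccountId32 {
    // sp-core's AccountId32 exposes its inner bytes via `AsRef<[u8; 32]>`
    SubxtAccountId32(*account.as_ref())
}
```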
tss_account: SubxtAccountId32,
/// An ephemeral encryption public key used to receive an encrypted response
response_key: X25519PublicKey,
/// A TDX quote
We may want to specify what fields are expected here
I don't get what you mean. You mean what we expect as input data to the quote?
Yeah exactly. Like if somebody were crafting a quote, how would they do so in order for it to pass verification?
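Something like this, perhaps (purely illustrative - the struct name and field docs are a sketch of what could be documented, and what the quote input commits to is an assumption rather than something stated in this PR):

```rust
/// Illustrative only - a request struct with the quote expectations spelled out.
/// The exact quote input-data format is an assumption, not confirmed by this PR.
pub struct RecoverEncryptionKeyRequest {
    /// The TSS account id of the node requesting recovery of its encryption key
    tss_account: SubxtAccountId32,
    /// An ephemeral encryption public key used to receive an encrypted response
    response_key: X25519PublicKey,
    /// A TDX quote. To pass verification, its report input data should commit to
    /// `tss_account` and `response_key`, so that a quote cannot be replayed for a
    /// different account or redirected to a different response key.
    quote: Vec<u8>,
}
```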
I think I have now addressed everything from @HCastano's review. Gonna leave this up till early next week so @JesseAbram can have a look.
…ready (#1263)
* Add an extra TSS state for connected to chain but not funded / fully ready
* Clippy
@JesseAbram I'd like to get your thumbs up on this before merging - doesn't need to be a deep dive as Hernando has already done some nitpicking, just wanna be sure you are into the idea generally. We can still iterate on the design - I'm not planning to close the feature issue on merging this.
Part of #1247
This adds a way to back up and recover the symmetric encryption key for the key-value store, which is used to store the secret keys of the TSS account and x25519 encryption keypair, as well as the network keyshare.
This is used to recover following a VM process restart, meaning we can keep the contents of the key-value database persistent, but inaccessible to the TSS node operator.
Backups are provided by other TSS nodes - that is, every TSS node can hold encryption keys for other TSS nodes' data.
When a TSS node starts, it checks for the presence of an existing database.
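Roughly, that startup decision looks like this (a simplified sketch; helper names like `recover_encryption_key` and `make_key_backup` are for illustration and may not match the code exactly):

```rust
// Simplified sketch of the startup flow described above.
async fn get_db_encryption_key(
    db_path: &std::path::Path,
) -> Result<[u8; 32], BackupProviderError> {
    if db_path.exists() {
        // An existing database means the process restarted: ask the TSS node
        // holding our backup to send the encryption key back so we can reopen
        // the key-value store.
        recover_encryption_key().await
    } else {
        // Fresh start: generate a new encryption key and ask another TSS node
        // to hold a backup of it before using it for a new key-value store.
        let encryption_key: [u8; 32] = rand::random();
        make_key_backup(encryption_key).await?;
        Ok(encryption_key)
    }
}
```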
This PR is based off #1216 because I prefer having the keypair directly accessible from `AppState`. But it could be rebased to master.

Unresolved issues / TODOs:

* `KvManager` during testing.
* `setup_kv_store` … `ready` state. However, since we need to make on-chain queries this is not possible. This could result in a deadlock where the initial genesis nodes cannot become ready because they cannot find another ready node to provide a backup. Since we only need read access to the chain, this could be solved by replacing `ready: bool` with a multi-state enum, which means we can require that we have access to the chain, but don't yet need to be funded or have made a backup ourselves in order to provide a backup. A rough sketch of such an enum is below.
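For the last point, the multi-state enum could look something like this (a sketch with illustrative variant names; the later commit referencing #1263 adds a state for "connected to chain but not funded / fully ready"):

```rust
/// Sketch of a multi-state replacement for `ready: bool` - variant names are
/// illustrative, not taken from the PR.
pub enum TssState {
    /// Not yet connected to a chain node
    NoChainConnection,
    /// Can read from the chain, and so can provide backups for other nodes,
    /// but is not yet funded or backed up itself
    ReadOnlyChainAccess,
    /// Fully funded, has made its own key backup, and can take part in protocols
    Ready,
}

impl TssState {
    /// Whether this node is able to act as a backup provider for other TSS nodes
    pub fn can_provide_backups(&self) -> bool {
        matches!(self, TssState::ReadOnlyChainAccess | TssState::Ready)
    }
}
```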