
Database encryption key backup / recovery feature for entropy-tss #1249

Open

wants to merge 43 commits into peg/non-persistant-tss-keys

Conversation

Contributor

@ameba23 ameba23 commented Jan 14, 2025

Part of #1247

This adds a way to back up and recover the symmetric encryption key for the key-value store, which is used to store the secret keys of the TSS account and x25519 encryption keypair, as well as the network keyshare.

This is used to recover the key following a VM process restart, meaning we can keep the contents of the key-value database persistent but inaccessible to the TSS node operator.

Backups are provided by other TSS nodes - that is, every TSS node can hold encryption keys for other TSS nodes' data.

When a TSS node starts, it checks for the presence of an existing database.

  • If there is no database, a fresh encryption key is generated, a new database is initialized, and as part of the 'prerequisite' checks, another TSS node is chosen pseudo-randomly, the encryption key is sent to them to be backed up, and their details are stored on disk in plaintext.
  • If there is an existing database, the details of the backup provider are retrieved and a recovery of the encryption key is requested. This is a two-step process: a quote nonce is requested first, and then a TDX quote containing that nonce is sent. On verifying the quote, the backup provider responds with the encryption key.
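
To make the startup flow concrete, here is a minimal sketch of that decision. Only make_key_backup and the nonce-then-quote recovery exchange come from this PR; the other helper names, the error type and the BackupProviderDetails shape are illustrative stand-ins, and the signatures are simplified (the real functions take chain-connection and signer arguments not shown here).

```rust
use std::path::Path;

/// Hypothetical error type standing in for the PR's BackupProviderError.
#[derive(Debug)]
struct BackupProviderError(String);

/// Details of the node holding our key backup, kept on disk in plaintext.
/// Illustrative shape only; the real struct may differ.
struct BackupProviderDetails {
    tss_account: [u8; 32],
    endpoint: String,
}

// Stand-in helpers; in the PR these involve chain queries, HTTP requests
// and TDX quote generation, and take more arguments than shown here.
fn generate_key() -> [u8; 32] {
    todo!("generate a fresh random symmetric key")
}
fn store_provider_details(_details: &BackupProviderDetails) -> Result<(), BackupProviderError> {
    todo!("write the provider details to disk in plaintext")
}
fn load_provider_details() -> Result<BackupProviderDetails, BackupProviderError> {
    todo!("read the provider details from disk")
}
async fn make_key_backup(_key: [u8; 32]) -> Result<BackupProviderDetails, BackupProviderError> {
    todo!("choose a TSS node pseudo-randomly and send it the key")
}
async fn recover_encryption_key(
    _provider: &BackupProviderDetails,
) -> Result<[u8; 32], BackupProviderError> {
    todo!("request a nonce, send a TDX quote containing it, receive the key")
}

/// On startup, either create a fresh encryption key and back it up with
/// another TSS node, or recover the existing key from the stored provider.
async fn setup_encryption_key(db_path: &Path) -> Result<[u8; 32], BackupProviderError> {
    if !db_path.exists() {
        // Fresh start: generate a key, have another TSS node back it up,
        // and remember who holds the backup.
        let key = generate_key();
        let provider = make_key_backup(key).await?;
        store_provider_details(&provider)?;
        Ok(key)
    } else {
        // Existing database: ask the stored backup provider for the key.
        let provider = load_provider_details()?;
        recover_encryption_key(&provider).await
    }
}
```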

This PR is based on #1216 because I prefer having the keypair directly accessible from AppState, but it could be rebased onto master.

Unresolved issues / TODOs:

  • We need an additional request-response cycle when requesting to recover a key, to get a quote nonce
  • Quote verification logic should probably be moved somewhere common (e.g. entropy-shared), as it is now duplicated in the attestation pallet and entropy-tss
  • We need pre-known encryption keys for test validators alice, bob, etc., as we don't have access to their KvManager during testing.
  • Need to merge master in order to have the TDX quote errors implement the Error trait, which comes with the latest version
  • Remove unwraps from setup_kv_store
  • Originally I planned to have the HTTP endpoints to backup and recover work even when not yet in a ready state. However, since we need to make on-chain queries, this is not possible. This could result in a deadlock where the initial genesis nodes cannot become ready because they cannot find another ready node to provide a backup. Since we only need read access to the chain, this could be solved by replacing ready: bool with a multi-state enum, so we can require access to the chain, but not yet require being funded or having made a backup ourselves, in order to provide a backup (see the sketch after this list).
  • A way for the chain to notify entropy-tss when another TSS node leaves the validator set (either 'cleanly' by unbonding intentionally, or by being slashed after being unresponsive). This is needed so we can make another encryption key backup (and maybe rotate the encryption key).
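
For the ready-state point above, a rough sketch of what a multi-state replacement for ready: bool could look like - the variant names here are hypothetical, not this PR's final API:

```rust
/// Hypothetical replacement for `ready: bool`, letting a node serve backup
/// requests as soon as it can read the chain, before it is funded and has
/// made its own backup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TssState {
    /// Not yet connected to the chain
    NoChainConnection,
    /// Can make read-only chain queries - enough to act as a backup provider
    ReadOnlyChainAccess,
    /// Funded, backed up, and fully ready to participate in protocols
    Ready,
}

impl TssState {
    /// Providing a backup only requires read access to the chain.
    fn can_provide_backups(&self) -> bool {
        matches!(self, TssState::ReadOnlyChainAccess | TssState::Ready)
    }
}
```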

@ameba23 ameba23 changed the base branch from master to peg/non-persistant-tss-keys January 14, 2025 11:49
@ameba23 ameba23 marked this pull request as draft January 14, 2025 11:50
@ameba23 ameba23 self-assigned this Jan 14, 2025
@ameba23 ameba23 added the Feature introduces a new feature label Jan 14, 2025
@ameba23 ameba23 added this to the v0.4.0 milestone Jan 14, 2025
)
.await
.map_err(|e| {
tracing::error!("Could not make key backup: {}", e);
Contributor Author

Rather than giving up here, we could have a loop which attempts to request backups from other TSS servers until we find one which works, and also report the ones which failed to make a backup for us. I would leave this as a follow-up.
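
A rough sketch of what such a retry loop could look like - Provider, try_backup_with and the error type here are hypothetical stand-ins, not this PR's API:

```rust
/// Hypothetical stand-ins for the real types in this PR.
#[derive(Debug, Clone)]
struct Provider {
    endpoint: String,
}

#[derive(Debug)]
struct BackupProviderError(String);

impl std::fmt::Display for BackupProviderError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.0)
    }
}

async fn try_backup_with(_provider: &Provider, _key: [u8; 32]) -> Result<(), BackupProviderError> {
    todo!("send the encryption key to this provider and await confirmation")
}

/// Try candidate providers in turn until one accepts the backup, collecting
/// the ones that failed so they can be reported.
async fn backup_with_first_working_provider(
    candidates: Vec<Provider>,
    key: [u8; 32],
) -> Result<(Provider, Vec<Provider>), BackupProviderError> {
    let mut failed = Vec::new();
    for provider in candidates {
        match try_backup_with(&provider, key).await {
            Ok(()) => return Ok((provider, failed)),
            Err(err) => {
                tracing::warn!("Backup with {} failed: {}", provider.endpoint, err);
                failed.push(provider);
            }
        }
    }
    Err(BackupProviderError("No provider accepted the backup".into()))
}
```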

@ameba23 ameba23 changed the title Key provider / recovery feature for entropy-tss Database encryption key backup / recovery feature for entropy-tss Jan 21, 2025
@ameba23 ameba23 marked this pull request as ready for review January 21, 2025 13:14
Collaborator

@HCastano HCastano left a comment

Good work on this 💪

Still making my way through the review of this, will try and finish it up soon 🙏

crates/shared/src/types.rs (outdated, resolved)
) -> Result<(), BackupProviderError> {
let tss_account = SubxtAccountId32(sr25519_pair.public().0);
// Select a provider by making chain query and choosing a tss node
let key_provider_details = select_backup_provider(api, rpc, tss_account).await?;
Collaborator

What happens if we try and recover our DB at a later point in time and the selected server isn't available anymore?

Contributor Author

@ameba23 ameba23 Jan 22, 2025

Good point.

If the 'backup provider' unbonds cleanly, we need a way of knowing this so we can make another backup. If they just disappear, then maybe through the slashing mechanism we could figure out that they have gone and make another backup. For example, the propagation pallet makes a request with the TSS ids of leaving validator(s), and if they match our backup provider then we make another call to make_key_backup. Otherwise, well, we just cannot recover.

@JesseAbram suggested sending the key out to several other tss nodes: #1247 (comment)

But I want to keep things simple in this PR and then iterate, as deciding how many backups to make is a tricky design decision where there's a security tradeoff.

I think it's worth bearing in mind that not being able to recover is only really a big problem if you are a signer, and even then we can tolerate losing n - t keyshares (although we haven't yet figured out the practicalities of doing a reshare in that situation).

Collaborator

> If the 'backup provider' unbonds cleanly, we need a way of knowing this so we can make another backup.

If we put the onus onto the TSS here, they'd need to be querying the chain at every session change or anytime a validator changes, which might be annoying. If we make it the chain's responsibility to inform TSS servers then we might have to put backup information on-chain, or just send a request to all TSS servers informing them of this (even if it doesn't necessarily affect them).

> If they just disappear, then maybe through the slashing mechanism we could figure out that they have gone and make another backup. For example, the propagation pallet makes a request with the TSS ids of leaving validator(s), and if they match our backup provider then we make another call to make_key_backup. Otherwise, well, we just cannot recover.

Yeah, similar to my second point above, we'll end up sending a bunch of messages out to the TSS servers. Kind of annoying but I guess it can work. I wouldn't implement this soon though, seems a bit overkill for now.

> @JesseAbram suggested sending the key out to several other tss nodes: #1247 (comment)
>
> But I want to keep things simple in this PR and then iterate, as deciding how many backups to make is a tricky design decision where there's a security tradeoff.

I'd like to avoid spreading sensitive material around the network, as the design starts to become similar to our previous approach where we were sending keyshares to other TSS servers. And yep, it also does add complexity to this whole feature.

> I think it's worth bearing in mind that not being able to recover is only really a big problem if you are a signer, and even then we can tolerate losing n - t keyshares (although we haven't yet figured out the practicalities of doing a reshare in that situation).

Yeah fair enough

Member

Ya, but if the keys are stored in the memory of the TDX they are unreachable, and within our current security model this does not weaken the system - it only strengthens redundancy.

crates/threshold-signature-server/src/attestation/api.rs (outdated, resolved)
@@ -114,107 +112,3 @@ pub type EncodedVerifyingKey = [u8; VERIFICATION_KEY_LENGTH as usize];
#[cfg(not(feature = "wasm"))]
pub type BoundedVecEncodedVerifyingKey =
sp_runtime::BoundedVec<u8, sp_runtime::traits::ConstU32<VERIFICATION_KEY_LENGTH>>;

/// Input data to be included in a TDX attestation
pub struct QuoteInputData(pub [u8; 64]);
Contributor Author

I ended up doing a refactor here to put all the attestation-related stuff into an attestation module, as it was starting to get messy.

I should really have put this in a separate PR as it affects the pallets, which are otherwise untouched by this PR.

Collaborator

Yeah, agreed probably should've been done in a follow up PR 😅

Collaborator

@HCastano HCastano left a comment

Looks good, and gives us a bit of resilience in case of process crashes 👍

I'd say we first merge the PR this is built on top of, rebase this PR onto master, and then merge it. If it's not too much work I'd also consider splitting out the Attestation refactor stuff into a follow-up PR, but nbd if you can't do it.

crates/threshold-signature-server/src/lib.rs (outdated, resolved)
@@ -221,30 +225,31 @@ pub struct AppState {
pub configuration: Configuration,
/// Key-value store
pub kv_store: KvManager,
/// Storage for encryption key backups for other TSS nodes
/// Maps TSS account id to encryption key
pub encryption_key_backups: Arc<RwLock<HashMap<[u8; 32], [u8; 32]>>>,
Collaborator

The subxt variant doesn't, but the sp-core one does - and I think we can convert between them
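
For reference, both account id types wrap the same 32 bytes, so converting between them (or to the raw [u8; 32] used as the HashMap key above) just moves the bytes across. A minimal sketch, assuming current subxt and sp-core APIs:

```rust
use sp_core::crypto::AccountId32 as SpAccountId32;
use subxt::utils::AccountId32 as SubxtAccountId32;

/// Convert a subxt account id to the sp-core one via its inner bytes.
fn subxt_to_sp(account: SubxtAccountId32) -> SpAccountId32 {
    SpAccountId32::from(account.0)
}

/// Convert an sp-core account id back to the subxt one.
fn sp_to_subxt(account: SpAccountId32) -> SubxtAccountId32 {
    let bytes: [u8; 32] = account.into();
    SubxtAccountId32(bytes)
}
```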

crates/threshold-signature-server/src/lib.rs (outdated, resolved)
crates/threshold-signature-server/src/main.rs (resolved)
tss_account: SubxtAccountId32,
/// An ephemeral encryption public key used to receive an encrypted response
response_key: X25519PublicKey,
/// A TDX quote
Collaborator

We may want to specify what fields are expected here

Contributor Author

I don't get what you mean. You mean what we expect as input data to the quote?

Collaborator

Yeah exactly. Like if somebody were crafting a quote, how would they do so in order for it to pass verification?
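
For illustration only, a hedged sketch of the kind of quote input construction being discussed: the verifier recomputes the same 64-byte report data from the fields it knows and compares it against the quote. The choice of fields and the use of Blake2b here are guesses, not necessarily the exact scheme in entropy-shared:

```rust
use blake2::{Blake2b512, Digest};

/// Illustrative version of the 64-byte quote input data; the real
/// construction lives in entropy-shared and may differ.
pub struct QuoteInputData(pub [u8; 64]);

impl QuoteInputData {
    /// Bind the quote to the requesting TSS account, the ephemeral x25519
    /// response key and the nonce issued by the backup provider, so a quote
    /// cannot be replayed for a different recovery request.
    pub fn new(
        tss_account: &[u8; 32],
        response_x25519_public_key: &[u8; 32],
        nonce: &[u8; 32],
    ) -> Self {
        let mut hasher = Blake2b512::new();
        hasher.update(tss_account);
        hasher.update(response_x25519_public_key);
        hasher.update(nonce);
        let digest: [u8; 64] = hasher
            .finalize()
            .as_slice()
            .try_into()
            .expect("Blake2b512 output is 64 bytes");
        Self(digest)
    }
}
```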


@ameba23
Contributor Author

ameba23 commented Jan 24, 2025

I think I have now addressed everything from @HCastano's review. Gonna leave this up till early next week so @JesseAbram can have a look.

…ready (#1263)

* Add an extra TSS state for connected to chain but not funded / fully ready

* Clippy
@ameba23 ameba23 requested a review from JesseAbram January 30, 2025 08:03
@ameba23
Contributor Author

ameba23 commented Jan 30, 2025

@JesseAbram I'd like to get your thumbs up on this before merging - it doesn't need to be a deep dive as Hernando has already done some nitpicking, I just wanna be sure you are into the idea generally. We can still iterate on the design - I'm not planning to close the feature issue on merging this.

Labels
Feature introduces a new feature
Projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

Design for a recovery feature using other entropy-tss nodes as key-providers
3 participants