Before starting, **please make sure to have a clear understanding of node and validator operational requirements**. This guide is medium to high difficulty. Operation of `horcrux` assumes significant prior knowledge of these systems. Debugging problems that arise carries significant financial risk (double signing) if you are running on mainnet, so a clear understanding of the systems you are working with is important. Please attempt this operation on a testnet before you do so on a mainnet validator.
CAUTION: This operation will require you to take your validator down for some time. If you work quickly and follow the guide, this downtime shouldn't be more than 5-10 minutes. But regardless, be aware of the downtime slashing on your chain and be careful not to exceed that limit.
This document describes a migration from a "starting system" to a 2-of-3 multisig cluster running `horcrux`, signing blocks for an array of 3 sentry nodes connected to the p2p network for your particular network. The starting system is a single node performing all these operations: i.e. a full node that is also a validator node, signing with a `$NODE_HOME/config/priv_validator_key.json` and running on a single VM. If you have a different starting system (say 2 sentry nodes and a validator connected to them), map the existing resources onto the desired final state so that your migration follows a structure similar to the one described here.
- VM: 4 CPU, 16 GB RAM, 500GB SSD storage running fully synced chain daemon also acting as a validator
- Sentries: 3x VM w/ 4 CPU, 16GB RAM, 500GB SSD storage running fully synced chain daemon
  - These chain daemons should only expose the `:26656` (p2p) port to the open internet
  - The daemons will need to expose `:1234` (priv validator port) to the `horcrux` nodes, but not to the open internet
- Signers: 3x VM w/ 1 CPU, 1 GB RAM, 20 GB SSD storage running `horcrux`
  - These nodes should not expose any ports to the open internet and should only connect with the sentries
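The port exposure rules above can be sketched as host firewall rules. This is a minimal dry run that only prints commands. It assumes `ufw` is the firewall in use and that the signers have the example private IPs used later in this guide; cloud security groups or another firewall would need equivalent rules, and a default-deny inbound policy is assumed to already be in place.

```shell
#!/bin/sh
# Assumption: example signer IPs from this guide; replace with your own.
SIGNERS="10.168.1.1 10.168.1.2 10.168.1.3"

# Print (do not run) the ufw rules for one sentry.
sentry_rules() {
  # p2p port is open to everyone
  echo "ufw allow 26656/tcp"
  # priv validator port is open to the signers only
  for ip in $SIGNERS; do
    echo "ufw allow from $ip to any port 1234 proto tcp"
  done
}

sentry_rules
```

Review the printed rules before running any of them with `sudo` on a sentry.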
The first step of the migration is to sync the full nodes you will be using as sentries. To follow this guide, ensure that you have 3 nodes from the chain you are validating on which are synced. Follow the instructions for the individual chain for spinning up those nodes. This is the part of setting up `horcrux` that takes the longest.
NOTE: This is also a great use case for state sync, or for one of the existing quick sync services.
To set up the signer nodes, start by recording the private IPs for each of the signer and sentry nodes. Order matters, and you will need these values to configure the signers. Make a table like so:
# EXAMPLE
sentry-1: 10.168.0.1
sentry-2: 10.168.0.2
sentry-3: 10.168.0.3
signer-1: 10.168.1.1
signer-2: 10.168.1.2
signer-3: 10.168.1.3
When installing `horcrux` we recommend using the prebuilt binaries from the releases page. Pick the release corresponding to the `tendermint` dependency in the `go.mod` of your chain binary. You should be able to get this with `{binary} version --long`. Install like so:
# On each signer VM
$ wget https://github.com/strangelove-ventures/horcrux/releases/download/v2.0.0/horcrux_2.0.0_linux_amd64.tar.gz
$ tar -xzf horcrux_2.0.0_linux_amd64.tar.gz
$ sudo mv horcrux /usr/bin/horcrux && rm horcrux_2.0.0_linux_amd64.tar.gz README.md LICENSE.md
Once the binary is installed in `/usr/bin`, install the `systemd` unit file. You can find an example here:
# On each signer VM
$ sudo nano /etc/systemd/system/horcrux.service
# copy file contents and modify to fit your environment
$ sudo systemctl daemon-reload
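For orientation, a unit file for a signer might look like the sketch below. It writes an example file to the current directory instead of `/etc/systemd/system`. The service user `horcrux` and the `cosigner start` subcommand are assumptions not stated in this guide; confirm the start command with `horcrux --help` for your release and prefer the example unit file shipped with the project.

```shell
#!/bin/sh
# Assumptions (not from this guide): service user "horcrux" and the
# "cosigner start" subcommand. Verify both before installing the unit.
cat > horcrux.service.example <<'EOF'
[Unit]
Description=horcrux MPC signer
After=network-online.target

[Service]
Type=simple
User=horcrux
ExecStart=/usr/bin/horcrux cosigner start
Restart=on-failure
RestartSec=3
LimitNOFILE=4096

[Install]
WantedBy=multi-user.target
EOF
echo "wrote horcrux.service.example"
```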
After that is done, initialize the configuration for each node using the `horcrux` CLI. Each node will require a slightly different command. Below are the commands for each of the 3 signer nodes given the private IPs above. Input your own data here:
# Run this command on the signer-1 VM
# signer-1 connects to sentry-1
$ horcrux config init {my_chain_id} "tcp://10.168.0.1:1234" -c -p "tcp://10.168.1.2:2222|2,tcp://10.168.1.3:2222|3" -l "tcp://10.168.1.1:2222" -t 2 --timeout 1500ms
# Run this command on the signer-2 VM
# signer-2 connects to sentry-2
$ horcrux config init {my_chain_id} "tcp://10.168.0.2:1234" -c -p "tcp://10.168.1.1:2222|1,tcp://10.168.1.3:2222|3" -l "tcp://10.168.1.2:2222" -t 2 --timeout 1500ms
# Run this command on the signer-3 VM
# signer-3 connects to sentry-3
$ horcrux config init {my_chain_id} "tcp://10.168.0.3:1234" -c -p "tcp://10.168.1.1:2222|1,tcp://10.168.1.2:2222|2" -l "tcp://10.168.1.3:2222" -t 2 --timeout 1500ms
NOTE: Note the node address (e.g. "tcp://10.168.0.1:1234") of each command. In this example, each horcrux node is communicating with a corresponding sentry. It is also possible to include a comma separated list of node addresses (e.g. "tcp://chain-node-1:1234,tcp://chain-node-2:1234", etc), allowing all horcrux nodes to communicate with all sentries.
NOTE: The `-c` or `--cosigner` flag here says to configure the signer for cosigner operations. The signer can also be run in single signer configuration; to do that, don't pass `-c`, `-p`, `-t` or `--timeout`.
NOTE: The `-p` or `--peers` flag lets you set the addresses of the other signer nodes in the config. Two ports are required, the P2P port for RPC traffic, and the Raft port for key-value sharing. Note that each signer also has an index. This index corresponds to the shard of the private key it will sign with. Keeping the node names and the indexes the same helps avoid errors and allows you to work more quickly.
NOTE: The `-l` or `--listen` flag lets you set the listen address for the cosigner, which is used for communication between cosigners, Raft and gRPC. The DNS/IP used for this must be reachable by the other peers, i.e. do not use 0.0.0.0 for the hostname.
NOTE: The `-k` or `--keyfile` flag lets you set the file path for the private key share file if you would like to use a different path than `~/.horcrux/share.json`.
NOTE: The `--timeout` value defaults to `1000ms`. If you are running in disconnected data centers (i.e. across Amazon AZs or GCP zones), increasing the timeout slightly helps to avoid missed blocks, especially around proposals.
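The three `config init` commands above differ only in the signer index, so they can be generated rather than typed by hand. The sketch below only prints the commands and uses exactly the flags shown in this guide. It assumes, as in the example table, that the last octet of each private IP equals the node index; the variable and function names are illustrative.

```shell
#!/bin/sh
# Example values from the table in this guide; replace with your own.
CHAIN_ID="{my_chain_id}"
SENTRY_NET="10.168.0"   # assumption: sentry-N is SENTRY_NET.N
SIGNER_NET="10.168.1"   # assumption: signer-N is SIGNER_NET.N

# Print the `horcrux config init` command for signer number $1 (1, 2 or 3).
init_cmd() {
  self=$1
  peers=""
  for i in 1 2 3; do
    # A signer does not list itself as a peer.
    [ "$i" -eq "$self" ] && continue
    peers="${peers:+$peers,}tcp://$SIGNER_NET.$i:2222|$i"
  done
  printf 'horcrux config init %s "tcp://%s.%s:1234" -c -p "%s" -l "tcp://%s.%s:2222" -t 2 --timeout 1500ms\n' \
    "$CHAIN_ID" "$SENTRY_NET" "$self" "$peers" "$SIGNER_NET" "$self"
}

for n in 1 2 3; do
  init_cmd "$n"
done
```

Generating the commands from one template keeps the peer indexes and listen addresses consistent, which is where hand-edited copies tend to go wrong.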
CAUTION: The security of any key material is outside the scope of this guide. The suggested procedure here is not necessarily the one you will use. We aim to make this guide easy to understand, not necessarily the most secure. The tooling here is all written in Go and can be compiled and used in an airgapped setup if needed. Please open issues if you have questions about how to fit `horcrux` into your infra.
On a computer that contains your `priv_validator_key.json`, create a folder and split the key with the following command. This may take a moment to complete:
$ ls
priv_validator_key.json
$ horcrux create-shares priv_validator_key.json 2 3
Created Share 1
Created Share 2
Created Share 3
$ ls
priv_validator_key.json
private_share_1.json
private_share_2.json
private_share_3.json
The shares need to be moved to their corresponding signer nodes at `~/.horcrux/share.json`. It is very important to make sure the share id (in `private_share_<id>.json`) is on the corresponding cosigner node, otherwise your signer cluster won't communicate properly and will not sign blocks. If you have named your nodes with their index as the signer index, as in this guide, this operation should be easy to check.
At the end of this step, each of your `horcrux` nodes will have a `~/.horcrux/share.json` file with the contents matching the appropriate `private_share_<id>.json` file corresponding to the node number.
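The share-to-node mapping described above can be written down as a small dry run before anything is copied. This sketch prints one `scp` command per share. The `signer-N` SSH host names are hypothetical; how you actually move key material depends on your own security procedures.

```shell
#!/bin/sh
# Assumption: "signer-1".."signer-3" are SSH-reachable host names.
# Prints the copy commands instead of running them (dry run).
share_copy_cmds() {
  for id in 1 2 3; do
    echo "scp private_share_${id}.json signer-${id}:~/.horcrux/share.json"
  done
}

share_copy_cmds
```

After copying, comparing `sha256sum private_share_<id>.json` locally against `sha256sum ~/.horcrux/share.json` on the matching signer is one way to confirm that each node holds the intended share.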
Now is the moment of truth. There will be a few minutes of downtime for this step, so ensure you have read the following directions completely before moving forward.
You need to take your validator offline and trust that the `horcrux` setup you have created is going to pick up signing for you soon. Ensure the validator is off and not signing.
NOTE: Leave your current validator turned off, but able to be restarted to resume signing in case of failure. Once you are certain that the `horcrux` cluster is signing for you and your validator is back online, it will be safe to decommission the old infrastructure.
Once the validator has been stopped, you will need the contents of the `$NODE_HOME/data/priv_validator_state.json` file. This file represents the last time your validator key was used to sign for consensus and acts as a "high water" mark to prevent your validator from double signing. `horcrux` uses the same file structure to provide this service. Each node maintains the last state that the node signed as well as the last state the whole cluster signed. In this way we can ensure that the cluster doesn't double sign. It should look something like the below:
{
"height": "361402",
"round": 0,
"step": 3,
"signature": "IEOS7EJ8C6ZZxwwXiGeMhoO8mwtgTiq6VPR/F1cpLZuz0ZvUZdsgQjTt0GniAIgosfEjC5izKw4Nvvs3ZIceAw==",
"signbytes": "6B080211BA8305000000000022480A205D4E1F722F53A3FD9E0D28639D7CE7B588338570EBA5C340687C30609C47BCA41224080112208283B6E16BEA46797F8AD4EE0ACE424AC7A4827202446B2D56E7F4438541B7BD2A0C08E4ACE28B0610CCD0AC830232066A756E6F2D31"
}
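To illustrate the "high water" idea, here is a simplified sketch of the comparison: a request is only acceptable when its height, round and step, compared in that order, lie beyond the last signed values. This is an illustration only, not the actual implementation in `horcrux` or the chain daemon, which also handle cases such as re-signing identical sign bytes. The function name is hypothetical.

```shell
#!/bin/sh
# Returns 0 (true) when the requested height/round/step is strictly beyond
# the last signed height/round/step, comparing the fields in that order.
# usage: beyond_high_water LAST_H LAST_R LAST_S REQ_H REQ_R REQ_S
beyond_high_water() {
  [ "$4" -gt "$1" ] && return 0
  [ "$4" -lt "$1" ] && return 1
  [ "$5" -gt "$2" ] && return 0
  [ "$5" -lt "$2" ] && return 1
  [ "$6" -gt "$3" ]
}

# Last signed state from the example file: height 361402, round 0, step 3.
if beyond_high_water 361402 0 3 361403 0 1; then
  echo "361403/0/1: ok to sign"
fi
if ! beyond_high_water 361402 0 3 361402 0 3; then
  echo "361402/0/3: refuse, not above the high-water mark"
fi
```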
You will need to replace the contents of the `~/.horcrux/state/{chain-id}_priv_validator_state.json` and `~/.horcrux/state/{chain-id}_share_sign_state.json` files on each signer node with a truncated and slightly modified version of the file. Note the quotes (`""`) around the `"round"` value in particular:
{
"height": "361402",
"round": "0",
"step": 3
}
NOTE: This step can be error prone. We will be adding a feature to allow using the CLI to set these values, but for now `nano`/`vi`, `cat` and `jq` are your friends.
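One way to reduce hand-editing mistakes is to derive the truncated file from the original. The sketch below prints only `height`, `round` (converted to a string) and `step`. It uses `jq` when it is installed and otherwise falls back to `sed`, which assumes the flat layout shown above. The function name `truncate_state` is illustrative.

```shell
#!/bin/sh
# Print a truncated copy of a priv_validator_state.json file containing
# only height, round (as a string) and step.
truncate_state() {
  src=$1
  if command -v jq >/dev/null 2>&1; then
    jq '{height: .height, round: (.round | tostring), step: .step}' "$src"
  else
    # Fallback: assumes the flat key layout shown in this guide.
    height=$(sed -n 's/.*"height": *"\([0-9][0-9]*\)".*/\1/p' "$src")
    round=$(sed -n 's/.*"round": *"\{0,1\}\([0-9][0-9]*\).*/\1/p' "$src")
    step=$(sed -n 's/.*"step": *\([0-9][0-9]*\).*/\1/p' "$src")
    printf '{\n  "height": "%s",\n  "round": "%s",\n  "step": %s\n}\n' \
      "$height" "$round" "$step"
  fi
}

# Usage: sh truncate_state.sh /path/to/priv_validator_state.json
if [ -n "${1:-}" ]; then
  truncate_state "$1"
fi
```

Redirect the output to a file, inspect it, and then copy it to both state file paths on each signer node.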
Once you have all of the signer nodes fully configured, it's time to start them. Start all of them at roughly the same time:
sudo systemctl start horcrux && journalctl -u horcrux -f
The following logs should be flowing on each signer node:
I[2021-09-24|02:10:09.022] Tendermint Validator module=validator mode=mpc priv_key=...
I[2021-09-24|02:10:09.023] Starting CosignerRPCServer service module=validator impl=CosignerRPCServer
I[2021-09-24|02:10:09.025] Signer module=validator pubkey=PubKeyEd25519{9A66109B69C...
I[2021-09-24|02:10:09.025] Starting RemoteSigner service module=validator impl=RemoteSigner
E[2021-09-24|02:10:09.027] Dialing module=validator err="dial tcp 10.180.0.16:1234...
I[2021-09-24|02:10:09.027] Retrying module=validator sleep(s)=3 address=tcp://10.180...
...
The signer will continue retrying attempts to reach the sentries until we turn the sentry `priv_validator` listener on in the next step. Any panic-causing errors are likely due to one of the two following issues:
- Misnaming or incorrect structure of the files in `~/.horcrux/state`. Double check these if you see errors
- Misnaming or misplacement of the `~/.horcrux/share.json` file
NOTE: Leaving these logs streaming in separate terminal windows will enable you to watch the cluster connect to the sentries.
Once the signer cluster has started successfully, it's time to reconfigure and restart your sentry nodes. On each node, enable the priv validator listener and verify the config changes with the following commands:
$ sed -i 's#priv_validator_laddr = ""#priv_validator_laddr = "tcp://0.0.0.0:1234"#g' $NODE_HOME/config/config.toml
$ cat $NODE_HOME/config/config.toml | grep priv_validator_laddr
priv_validator_laddr = "tcp://0.0.0.0:1234"
Ensure any local or network firewalls on the sentry machines are allowing communication from the horcrux cluster to port 1234. Next, restart your nodes for the changes to take effect and see them connect to the signer cluster:
$ sudo systemctl restart {node_service} && journalctl -u {node_service} -f
Common failure modes:
- Ports on your cloud service aren't properly configured and prevent signers/sentries from communicating
- The node crashes because the signer didn't retry in time. This can be fixed by trying again and/or restarting the signer, and may take some fiddling
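When a port problem is suspected, a quick reachability check from a signer can narrow it down. This is a minimal sketch that assumes `bash` and the coreutils `timeout` command are installed on the signer; the function name `port_open` is illustrative.

```shell
#!/bin/sh
# Returns 0 when a TCP connection to host $1, port $2 succeeds within
# 3 seconds. Relies on bash's /dev/tcp feature and coreutils `timeout`.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage: sh check_ports.sh 10.168.0.1 10.168.0.2 10.168.0.3
# Checks the priv validator port (1234) on each sentry address given.
for host in "$@"; do
  if port_open "$host" 1234; then
    echo "$host:1234 reachable"
  else
    echo "$host:1234 NOT reachable"
  fi
done
```

A sentry that shows as not reachable after its listener has been enabled usually points at a firewall or security group rule rather than at `horcrux` itself.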
You now can sleep much better at night because you are much less likely to have a down validator wake you up in the middle of the night. You have also completed a stressful migration on a production system. Go run around outside screaming, pet your dog, eat a nice meal, hug your kids/significant other, etc... and enjoy the rest of your day!
`horcrux elect` - Elect a new cluster leader. Pass an optional argument with the intended leader ID to elect that cosigner as the new leader, e.g. `horcrux elect 3` to elect cosigner with `ID: 3` as leader
`horcrux cosigner address` - Get the public key address as both hex and optionally the validator consensus bech32 address. To retrieve the valcons bech32 address, pass an optional argument with the chain's bech32 valcons prefix, e.g. `horcrux cosigner address cosmosvalcons`