diff --git a/neps/nep-0568.md b/neps/nep-0568.md index 5491ba349..1aa488a5d 100644 --- a/neps/nep-0568.md +++ b/neps/nep-0568.md @@ -10,75 +10,48 @@ Created: 2024-10-24 LastUpdated: 2024-10-24 --- -## Summary +## Summary -This proposal introduces a new resharding implementation and shard layout for -production networks. +This proposal introduces a new resharding implementation and shard layout for production networks. -The primary objective of Resharding V3 is to increase chain capacity by -splitting overutilized shards. A secondary aim is to lay the groundwork for -supporting Dynamic Resharding, Instant Resharding and Shard Merging in future -updates. +The primary objective of Resharding V3 is to increase chain capacity by splitting overutilized shards. A secondary aim is to lay the groundwork for supporting Dynamic Resharding, Instant Resharding, and Shard Merging in future updates. ## Motivation -The sharded architecture of the NEAR Protocol is a cornerstone of its design, enabling parallel and distributed execution that significantly boosts overall throughput. Resharding plays a pivotal role in this system, allowing the network to adjust the number of shards to accommodate growth. By increasing the number of shards, resharding ensures the network can scale seamlessly, alleviating existing congestion, managing the rising traffic demands, and welcoming new customers. This adaptability is essential for maintaining the protocol's performance, reliability, and capacity to support a thriving, ever-expanding ecosystem. +The sharded architecture of the NEAR Protocol is a cornerstone of its design, enabling parallel and distributed execution that significantly boosts overall throughput. Resharding plays a pivotal role in this system, allowing the network to adjust the number of shards to accommodate growth. By increasing the number of shards, resharding ensures the network can scale seamlessly, alleviating existing congestion, managing rising traffic demands, and welcoming new customers. This adaptability is essential for maintaining the protocol's performance, reliability, and capacity to support a thriving, ever-expanding ecosystem. -Resharding V3 is a significantly redesigned approach, addressing limitations of -the previous versions, [Resharding V1][NEP-040] and [Resharding V2][NEP-508]. -The earlier solutions became obsolete due to major protocol changes since -Resharding V2, including the introduction of Stateless Validation, Single Shard -Tracking, and Mem-Trie. +Resharding V3 is a significantly redesigned approach, addressing limitations of the previous versions, [Resharding V1][NEP-040] and [Resharding V2][NEP-508]. The earlier solutions became obsolete due to major protocol changes since Resharding V2, including the introduction of Stateless Validation, Single Shard Tracking, and Mem-Trie. ## Specification -Resharding will be scheduled in advance by the NEAR developer team. The new -shard layout will be hardcoded into the neard binary and linked to the protocol -version. As the protocol upgrade progresses, resharding will be triggered during -the post-processing phase of the last block of the epoch. At this point, the -state of the parent shard will be split between two child shards. From the first -block of the new protocol version onward, the chain will operate with the new -shard layout. - -There are two key dimensions to consider: state storage and protocol features, -along with a few additional details. 
- -1) State Storage: Currently, the state of a shard is stored in three distinct -formats: the state, the flat state, and the mem-trie. Each of these -representations must be resharded. Logically, resharding is an almost -instantaneous event that occurs before the first block under the new shard -layout. However, in practice, some of this work may be deferred to -post-processing, as long as the chain's view reflects a fully resharded state. - -1) Protocol Features: Several protocol features must integrate smoothly with the - resharding process, including: - -* Stateless Validation: Resharding must be validated and proven through - stateless validation mechanisms. -* State Sync: Nodes must be able to sync the states of the child - shards post-resharding. -* Cross-Shard Traffic: Receipts sent to the parent shard may need to be - reassigned to one of the child shards. -* Receipt Handling: Delayed, postponed, buffered, and promise-yield receipts - must be correctly distributed between the child shards. -* ShardId Semantics: The shard identifiers will become abstract identifiers - where today they are number in the 0..num_shards range. -* Congestion Info: CongestionInfo in the chunk header would be recalculated for the child - shards at the resharding boundary. Proof must be compatible with Stateless Validation. +Resharding will be scheduled in advance by the NEAR developer team. The new shard layout will be hardcoded into the neard binary and linked to the protocol version. As the protocol upgrade progresses, resharding will be triggered during the post-processing phase of the last block of the epoch. At this point, the state of the parent shard will be split between two child shards. From the first block of the new protocol version onward, the chain will operate with the new shard layout. + +There are two key dimensions to consider: state storage and protocol features, along with a few additional details. + +1. **State Storage**: Currently, the state of a shard is stored in three distinct formats: the state, the flat state, and the mem-trie. Each of these representations must be resharded. Logically, resharding is an almost instantaneous event that occurs before the first block under the new shard layout. However, in practice, some of this work may be deferred to post-processing, as long as the chain's view reflects a fully resharded state. + +2. **Protocol Features**: Several protocol features must integrate smoothly with the resharding process, including: + + * **Stateless Validation**: Resharding must be validated and proven through stateless validation mechanisms. + * **State Sync**: Nodes must be able to sync the states of the child shards post-resharding. + * **Cross-Shard Traffic**: Receipts sent to the parent shard may need to be reassigned to one of the child shards. + * **Receipt Handling**: Delayed, postponed, buffered, and promise-yield receipts must be correctly distributed between the child shards. + * **ShardId Semantics**: The shard identifiers will become abstract identifiers where today they are numbers in the 0..num_shards range. + * **Congestion Info**: CongestionInfo in the chunk header would be recalculated for the child shards at the resharding boundary. Proof must be compatible with Stateless Validation. ### State Storage - MemTrie MemTrie is the in-memory representation of the trie that the runtime uses for all trie accesses. This is kept in sync with the Trie representation in state. 
-As of today it isn't mandatory for nodes to have MemTrie feature enabled but going forward, with ReshardingV3, all nodes would require to have MemTrie enabled for resharding to happen successfully. +As of today, it isn't mandatory for nodes to have the MemTrie feature enabled, but going forward, with Resharding V3, all nodes will be required to have MemTrie enabled for resharding to happen successfully. For the purposes of resharding, we need an efficient way to split the MemTrie into two child tries based on the boundary account. This splitting happens at the epoch boundary when the new epoch is expected to have the two child shards. The set of requirements around MemTrie splitting are: -* MemTrie splitting needs to be "instant", i.e. happen efficiently within the span of one block. The child tries need to be available for the processing of the next block in the new epoch. -* MemTrie splitting needs to be compatible with stateless validation, i.e. we need to generate a proof that the memtrie split proposed by the chunk producer is correct. -* The proof generated for splitting the MemTrie needs to be compatible with the limits of the size of state witness that we send to all chunk validators. This prevents us from doing things like iterating through all trie keys for delayed receipts etc. +* MemTrie splitting needs to be "instant", i.e., happen efficiently within the span of one block. The child tries need to be available for the processing of the next block in the new epoch. +* MemTrie splitting needs to be compatible with stateless validation, i.e., we need to generate a proof that the MemTrie split proposed by the chunk producer is correct. +* The proof generated for splitting the MemTrie needs to be compatible with the limits of the size of the state witness that we send to all chunk validators. This prevents us from doing things like iterating through all trie keys for delayed receipts, etc. -With ReshardingV3 design, there's no protocol change to the structure of MemTries, however the implementation constraints required us to introduce the concept of a Frozen MemTrie. More details are in the [implementation](#state-storage---memtrie-1) section below. +With the Resharding V3 design, there's no protocol change to the structure of MemTries; however, the implementation constraints required us to introduce the concept of a Frozen MemTrie. More details are in the [implementation](#state-storage---memtrie-1) section below. Based on the requirements above, we came up with an algorithm to efficiently split the parent trie into two child tries. Trie entries can be divided into three categories based on whether the trie keys have an `account_id` prefix and based on the total number of such trie keys. Splitting of these keys is handled in different ways. @@ -94,236 +67,182 @@ This category includes the trie keys `TrieKey::DelayedReceiptIndices`, `TrieKey: #### Indexed TrieKey -This category includes the trie keys `TrieKey::DelayedReceipt`, `TrieKey::PromiseYieldTimeout` and `TrieKey::BufferedReceipt`. The number of entries for these keys can potentially be arbitrarily large and it's not feasible to iterate through all the entries. In pre-stateless validation world, where we didn't care about state witness size limits, for ReshardingV2 we could just iterate over all delayed receipts and split them into the respective child shards. +This category includes the trie keys `TrieKey::DelayedReceipt`, `TrieKey::PromiseYieldTimeout`, and `TrieKey::BufferedReceipt`. 
The number of entries for these keys can potentially be arbitrarily large, and it's not feasible to iterate through all the entries. In the pre-stateless validation world, where we didn't care about state witness size limits, for Resharding V2 we could just iterate over all delayed receipts and split them into the respective child shards. -For ReshardingV3, these are handled by either of the two strategies +For Resharding V3, these are handled by either of the two strategies: * `TrieKey::DelayedReceipt` and `TrieKey::PromiseYieldTimeout` are handled by duplicating entries across both child shards as each entry could belong to either of the child shards. More details in the [Delayed Receipts](#delayed-receipt-handling) and [Promise Yield](#promiseyield-receipt-handling) sections below. -* `TrieKey::BufferedReceipt` are independent of the account_id and therefore can be sent to either of the child shards, but not both. We copy the buffered receipts and the associated metadata to the child shard with the lower index. More details in the [Buffered Receipts](#buffered-receipt-handling) section below. +* `TrieKey::BufferedReceipt` is independent of the account_id and therefore can be sent to either of the child shards, but not both. We copy the buffered receipts and the associated metadata to the child shard with the lower index. More details in the [Buffered Receipts](#buffered-receipt-handling) section below. ### State Storage - Flat State -Flat State is a collection of key-value pairs stored on disk and each entry -contains a reference to its ShardId. When splitting a shard, every item inside -its Flat State must be correctly reassigned to either one of the new children; -due to technical limitations such operation can't be completed instantaneously. +Flat State is a collection of key-value pairs stored on disk, and each entry contains a reference to its ShardId. When splitting a shard, every item inside its Flat State must be correctly reassigned to either one of the new children; due to technical limitations, such an operation can't be completed instantaneously. -Flat State main purposes are allowing the creation of State Sync snapshots and -the construction of Mem Tries. Fortunately, these two operations can be delayed -until resharding is completed. Note also that with Mem Tries enabled the chain -can move forward even if the current status of Flat State is not in sync with -the latest block. +Flat State's main purposes are allowing the creation of State Sync snapshots and the construction of Mem Tries. Fortunately, these two operations can be delayed until resharding is completed. Note also that with Mem Tries enabled, the chain can move forward even if the current status of Flat State is not in sync with the latest block. -For the reason stated above, the chosen strategy is to reshard Flat State in a -long-running background task. The new shards' states must converge with their -Mem Tries representation in a reasonable amount of time. +For the reason stated above, the chosen strategy is to reshard Flat State in a long-running background task. The new shards' states must converge with their Mem Tries representation in a reasonable amount of time. Splitting a shard's Flat State is performed in multiple steps: -1) A post-processing 'split' task is created during the last block of the old - shard layout, instantaneously. -2) The 'split' task runs in parallel with the chain for a certain amount of - time. 
Inside this routine every key-value pair belonging to the shard being - split (also called parent shard) is copied into either the left or the right - child Flat State. Entries linked to receipts are handled in a special way. -3) Once the task is completed, the parent shard Flat State is cleaned up. The - children shards Flat States have their state in sync with last block of the - old shard layout. -4) Children shards must apply the delta changes from the first block of the new - shard layout until the final block of the canonical chain. This operation is - done in another background task to avoid slowdowns while processing blocks. -5) Children shards Flat States are now ready and can be used to take State Sync - snapshots and to reload Mem Tries. +1) A post-processing 'split' task is created during the last block of the old shard layout, instantaneously. +2) The 'split' task runs in parallel with the chain for a certain amount of time. Inside this routine, every key-value pair belonging to the shard being split (also called the parent shard) is copied into either the left or the right child Flat State. Entries linked to receipts are handled in a special way. +3) Once the task is completed, the parent shard Flat State is cleaned up. The children shards' Flat States have their state in sync with the last block of the old shard layout. +4) Children shards must apply the delta changes from the first block of the new shard layout until the final block of the canonical chain. This operation is done in another background task to avoid slowdowns while processing blocks. +5) Children shards' Flat States are now ready and can be used to take State Sync snapshots and to reload Mem Tries. ### State Storage - State -Each shard’s Trie is stored in the `State` column of the database, with keys prefixed by `ShardUId`, followed by a node's hash. -This structure uniquely identifies each shard’s data. To avoid copying all entries under a new `ShardUId` during resharding, -a mapping strategy allows child shards to access ancestor shard data without directly creating new entries. +Each shard’s Trie is stored in the `State` column of the database, with keys prefixed by `ShardUId`, followed by a node's hash. This structure uniquely identifies each shard’s data. To avoid copying all entries under a new `ShardUId` during resharding, a mapping strategy allows child shards to access ancestor shard data without directly creating new entries. -A naive approach to resharding would involve copying all `State` entries with a new `ShardUId` for a child shard, effectively duplicating the state. -This method, while straightforward, is not feasible because copying a large state would take too much time. -Resharding needs to appear complete between two blocks, so a direct copy would not allow the process to occur quickly enough. +A naive approach to resharding would involve copying all `State` entries with a new `ShardUId` for a child shard, effectively duplicating the state. This method, while straightforward, is not feasible because copying a large state would take too much time. Resharding needs to appear complete between two blocks, so a direct copy would not allow the process to occur quickly enough. -To address this, Resharding V3 employs an efficient mapping strategy, using the `DBCol::ShardUIdMapping` column -to link each child shard’s `ShardUId` to the closest ancestor’s `ShardUId` holding the relevant data. 
-This allows child shards to access and update state data under the ancestor shard’s prefix without duplicating entries. +To address this, Resharding V3 employs an efficient mapping strategy, using the `DBCol::ShardUIdMapping` column to link each child shard’s `ShardUId` to the closest ancestor’s `ShardUId` holding the relevant data. This allows child shards to access and update state data under the ancestor shard’s prefix without duplicating entries. -Initially, `ShardUIdMapping` is empty, as existing shards map to themselves. During resharding, a mapping entry is added to `ShardUIdMapping`, -pointing each child shard’s `ShardUId` to the appropriate ancestor. Mappings persist as long as any descendant shard references the ancestor’s data. -Once a node stops tracking all children and descendants of a shard, the entry for that shard can be removed, allowing its data to be garbage collected. +Initially, `ShardUIdMapping` is empty, as existing shards map to themselves. During resharding, a mapping entry is added to `ShardUIdMapping`, pointing each child shard’s `ShardUId` to the appropriate ancestor. Mappings persist as long as any descendant shard references the ancestor’s data. Once a node stops tracking all children and descendants of a shard, the entry for that shard can be removed, allowing its data to be garbage collected. -This mapping strategy enables efficient shard management during resharding events, -supporting smooth transitions without altering storage structures directly. +This mapping strategy enables efficient shard management during resharding events, supporting smooth transitions without altering storage structures directly. #### Integration with cold storage (archival nodes) Cold storage uses the same mapping strategy to manage shard state during resharding: * When state data is migrated from hot to cold storage, it retains the parent shard’s `ShardUId` prefix, ensuring consistency with the mapping strategy. -* While copying data for the last block of the epoch where resharding occured, the `DBCol::StateShardUIdMapping` column is copied into cold storage. This ensures that mappings are updated alongside the shard state data. +* While copying data for the last block of the epoch where resharding occurred, the `DBCol::StateShardUIdMapping` column is copied into cold storage. This ensures that mappings are updated alongside the shard state data. * These mappings are permanent in cold storage, aligning with its role in preserving historical state. This approach minimizes complexity while maintaining consistency across hot and cold storage. ### Stateless Validation -As only a fraction of nodes track the split shard, there is a need to prove the transition from state root of parent shard -to new state roots for children shards to other validators. -Otherwise the chunk producers for split shard may collude and provide invalid state roots, -which may compromise the protocol, for example, with minting tokens out of thin air. +Since only a fraction of nodes track the split shard, it is necessary to prove the transition from the state root of the parent shard to the new state roots for the child shards to other validators. Without this proof, chunk producers for the split shard could collude and provide invalid state roots, potentially compromising the protocol, such as by minting tokens out of thin air. + +The design ensures that generating and verifying this state transition is negligible in time compared to applying a chunk. 
As detailed in the [State Storage - MemTrie](#state-storage---memtrie) section, the generation and verification logic involves a constant number of trie lookups. Specifically, we implement the `retain_split_shard(boundary_account, RetainMode::{Left, Right})` method for the trie, which retains only the keys in the trie that belong to the left or right child shard. Internally, this method uses `retain_multi_range(intervals)`, where `intervals` is a vector of trie key intervals to retain. Each interval corresponds to a unique trie key type prefix byte (`Account`, `AccessKey`, etc.) and defines an interval from the empty key to the `boundary_account` key for the left shard, or from the `boundary_account` to infinity for the right shard. -The design allows to generate and check this state transition in the time, negligible compared to the time it takes to apply chunk. -As shown above in [State Storage - MemTrie](#state-storage---memtrie) section, generation and verification logic consists of constant number of trie lookups. -More specifically, we implement `retain_split_shard(boundary_account, RetainMode::{Left, Right})` method for trie, which leaves only keys in trie that -belong to the left or right child shard. -Inside, we implement `retain_multi_range(intervals)` method, where `intervals` is a vector of trie key intervals to retain. -Each interval corresponds to unique trie key type prefix byte (`Account`, `AccessKey`, etc.) and either defines an interval from empty key to `boundary_account` key for left shard, or from `boundary_account` to infinity for right shard. -`retain_multi_range` is recursive. Based on current trie key prefix covered by current node, it either: +The `retain_multi_range` method is recursive. Based on the current trie key prefix covered by the current node, it either: -* returns node back, if subtree is fully contained within some interval; -* returns "empty" node, if subtree is outside of all intervals; -* otherwise, descends into all children and constructs new node with children returned by recursive calls. +* Returns the node if the subtree is fully contained within an interval. +* Returns an "empty" node if the subtree is outside all intervals. +* Descends into all children and constructs a new node with children returned by recursive calls. -Implementation is agnostic to the trie storage used for retrieving nodes, it applies to both memtrie and partial storage (state proof). +This implementation is agnostic to the trie storage used for retrieving nodes and applies to both memtrie and partial storage (state proof). -* calling it for memtrie generates a proof and new state root; -* calling it for partial storage generates a new state root. If method doesn't fail with error that node wasn't found in the proof, it means that proof was sufficient, and it remains to compare generated state root with the one proposed by chunk producer. +* Calling it for memtrie generates a proof and a new state root. +* Calling it for partial storage generates a new state root. If the method does not fail with an error indicating that a node was not found in the proof, it means the proof was sufficient, and it remains to compare the generated state root with the one proposed by the chunk producer. ### State Witness -Resharding state transition becomes one of `implicit_transitions` in `ChunkStateWitness`. It must be validated between processing last chunk (potentially missing) in the old epoch and the first chunk (potentially missing) in the new epoch. 
`ChunkStateTransition` fields also nicely correspond to the resharding state transition: in `block_hash` we store the hash of the last block of the parent shard, in `base_state` we store the resharding proof, and in `post_state_root` we store the proposed state root. +The resharding state transition becomes one of the `implicit_transitions` in `ChunkStateWitness`. It must be validated between processing the last chunk (potentially missing) in the old epoch and the first chunk (potentially missing) in the new epoch. The `ChunkStateTransition` fields correspond to the resharding state transition: the `block_hash` stores the hash of the last block of the parent shard, the `base_state` stores the resharding proof, and the `post_state_root` stores the proposed state root. -Note that it leads to **two** state transitions corresponding to the same block hash. On the chunk producer side, the first transition is stored for the `(block_hash, parent_shard_uid)` pair and the second one is stored for the `(block_hash, child_shard_uid)` pair. +This results in **two** state transitions corresponding to the same block hash. On the chunk producer side, the first transition is stored for the `(block_hash, parent_shard_uid)` pair, and the second one is stored for the `(block_hash, child_shard_uid)` pair. -The chunk validator has all the blocks, so it identifies whether implicit transition corresponds to applying missing chunk or resharding independently. This is implemented in `get_state_witness_block_range`, which iterates from `state_witness.chunk_header.prev_block_hash()` to the block with includes last last chunk for the (parent) shard, if it exists. +The chunk validator, having all the blocks, identifies whether the implicit transition corresponds to applying a missing chunk or resharding independently. This is implemented in `get_state_witness_block_range`, which iterates from `state_witness.chunk_header.prev_block_hash()` to the block that includes the last chunk for the (parent) shard, if it exists. -Then, on `validate_chunk_state_witness`, if implicit transition corresponds to resharding, chunk validator calls `retain_split_shard` and proves state transition from parent to child shard. +Then, in `validate_chunk_state_witness`, if the implicit transition corresponds to resharding, the chunk validator calls `retain_split_shard` and proves the state transition from the parent to the child shard. ### State Sync -Changes to the state sync protocol aren't typically conisdered protocol changes requiring a version bump, since it's concerned with downloading state that isn't present locally, rather than with the rules of execution of blocks and chunks. But it might still be helpful to outline some planned changes to state sync related to resharding. +Changes to the state sync protocol are not typically considered protocol changes requiring a version bump, as they concern downloading state that is not present locally rather than the rules for executing blocks and chunks. However, it is helpful to outline some planned changes to state sync related to resharding. -When nodes sync state (either because they've fallen far behind the chain, or because they're going to become a chunk producer for a new shard in a future epoch), they first identify a point in the chain they'd like to sync to. Then they download the tries corresponding to that point in the chain and apply all chunks from that point until they're caught up. 
Currently, the tries downloaded initially are those corresponding to the `prev_state_root` field of the last new chunk before the current epoch's first block. This means the state downloaded is the state at some point in the previous epoch. +When nodes sync state (either because they have fallen far behind the chain or because they will become a chunk producer for a new shard in a future epoch), they first identify a point in the chain to sync to. They then download the tries corresponding to that point in the chain and apply all chunks from that point until they are caught up. Currently, the tries downloaded initially correspond to the `prev_state_root` field of the last new chunk before the first block of the current epoch. This means the state downloaded is from some point in the previous epoch. -The change we propose is to move the initial state download point to one in the current epoch rather than the previous. This keeps shard IDs consistent throughout the state sync logic, allows some simplification in the resharding implementation, and reduces the size of the state we need to download. Suppose that the previous epoch's shard `S` was split into shards `S'` and `S''` in the current epoch, and that a chunk producer that wasn't tracking shard `S` or any of its children in the current epoch will become a chunk producer for `S'` in the next epoch. Then with the old state sync algorithm, that chunk producer would download the pre-split state for shard `S`. Then when it's done, it would need to perform the resharding that all the other nodes had already done. This isn't a correctness issue, but it simplifies the implementation somewhat if we instead download only the state for shard `S'`, and it allows the node to download only the state belonging to `S'`, which is much smaller. +The proposed change is to move the initial state download point to one in the current epoch rather than the previous one. This keeps shard IDs consistent throughout the state sync logic, simplifies the resharding implementation, and reduces the size of the state to be downloaded. Suppose the previous epoch's shard `S` was split into shards `S'` and `S''` in the current epoch, and a chunk producer that was not tracking shard `S` or any of its children in the current epoch will become a chunk producer for `S'` in the next epoch. With the old state sync algorithm, that chunk producer would download the pre-split state for shard `S`. Then, when it is done, it would need to perform the resharding that all other nodes had already done. While this is not a correctness issue, it simplifies the implementation if we instead download only the state for shard `S'`, allowing the node to download only the state belonging to `S'`, which is much smaller. -### Cross Shard Traffic +### Cross-Shard Traffic -When the shard layout changes, it is crucial to handle cross-shard traffic -correctly, especially in the presence of missing chunks. Care must be taken to -ensure that no receipt is lost or duplicated. There are two important receipt -types that need to be considered - the outgoing receipts and the incoming -receipts. +When the shard layout changes, it is crucial to handle cross-shard traffic correctly, especially in the presence of missing chunks. Care must be taken to ensure that no receipt is lost or duplicated. There are two important receipt types that need to be considered: outgoing receipts and incoming receipts. -Note - this proposal reuses the approach taken by ReshardingV2. 
+Note: This proposal reuses the approach taken by Resharding V2. #### Outgoing Receipts -Each new chunk in a shard contains a list of outgoing receipts generated during -the processing of the previous chunk in that shard. +Each new chunk in a shard contains a list of outgoing receipts generated during the processing of the previous chunk in that shard. -In cases where chunks are missing at the resharding boundary, both child shards -could theoretically include the outgoing receipts from their shared ancestor -chunk. However, this naive approach would lead to the duplication of receipts, -which must be avoided. +In cases where chunks are missing at the resharding boundary, both child shards could theoretically include the outgoing receipts from their shared ancestor chunk. However, this naive approach would lead to the duplication of receipts, which must be avoided. -The proposed solution is to reassign the outgoing receipts from the parent chunk -to only one of the child shards. Specifically, the child shard with the lower -shard ID will claim all outgoing receipts from the parent, while the other child -will receive none. This ensures that all receipts are processed exactly once. +The proposed solution is to reassign the outgoing receipts from the parent chunk to only one of the child shards. Specifically, the child shard with the lower shard ID will claim all outgoing receipts from the parent, while the other child will receive none. This ensures that all receipts are processed exactly once. #### Incoming Receipts -To process a chunk in a shard, it is necessary to gather all outgoing receipts -from other shards that are targeted at this shard. These receipts must then be -included as incoming receipts. +To process a chunk in a shard, it is necessary to gather all outgoing receipts from other shards that are targeted at this shard. These receipts must then be included as incoming receipts. -In the presence of missing chunks, the new chunk must collect receipts from all -previous blocks, spanning the period since the last new chunk in this shard. -This range may cross the resharding boundary. +In the presence of missing chunks, the new chunk must collect receipts from all previous blocks, spanning the period since the last new chunk in this shard. This range may cross the resharding boundary. -When this occurs, the chunk must also consider receipts that were previously -targeted at its parent shard. However, it must filter these receipts to include -only those where the recipient lies within the current shard, discarding those -where the recipient belongs to the sibling shard in the new shard layout. This -filtering process ensures that every receipt is processed exactly once and in -the correct shard. +When this occurs, the chunk must also consider receipts that were previously targeted at its parent shard. However, it must filter these receipts to include only those where the recipient lies within the current shard, discarding those where the recipient belongs to the sibling shard in the new shard layout. This filtering process ensures that every receipt is processed exactly once and in the correct shard. ### Delayed Receipt Handling -The delayed receipts queue contains all incoming receipts that could not be executed as part of a block due to resource constraints like compute cost, gas limits etc. The entries in the delayed receipt queue can belong to any of the accounts as part of the shard. 
During a resharding event, we ideally need to split the delayed receipts across both the child shards according to the associated account_id with the receipt. +The delayed receipts queue contains all incoming receipts that could not be executed as part of a block due to resource constraints like compute cost, gas limits, etc. The entries in the delayed receipt queue can belong to any of the accounts within the shard. During a resharding event, we ideally need to split the delayed receipts across both child shards according to the account_id associated with the receipt. -The singleton trie key `DelayedReceiptIndices` holds the start_index and end_index associated with the delayed receipt entries for the shard. The trie key `DelayedReceipt { index }` contains the actual delayed receipt associated with some account_id. These are processed in a fifo queue order during chunk execution. +The singleton trie key `DelayedReceiptIndices` holds the start_index and end_index associated with the delayed receipt entries for the shard. The trie key `DelayedReceipt { index }` contains the actual delayed receipt associated with some account_id. These are processed in a FIFO queue order during chunk execution. -Note that the delayed receipt trie keys do not have the `account_id` prefix. In ReshardingV2, we followed the trivial solution of iterating through all the delayed receipt queue entries and assigning them to the appropriate child shard, however due to constraints on the state witness size limits and instant resharding, this approach is no longer feasible for ReshardingV3. +Note that the delayed receipt trie keys do not have the `account_id` prefix. In Resharding V2, we followed the trivial solution of iterating through all the delayed receipt queue entries and assigning them to the appropriate child shard. However, due to constraints on the state witness size limits and instant resharding, this approach is no longer feasible for Resharding V3. -For ReshardingV3, we decided to handle the resharding by duplicating the entries of the delayed receipt queue across both the child shards. This is great from the perspective of state witness size and instant resharding as we only need to access the delayed receipt queue root entry in the trie, however it breaks the assumption that all delayed receipts in a shard belong to the accounts within that shard. +For Resharding V3, we decided to handle the resharding by duplicating the entries of the delayed receipt queue across both child shards. This is beneficial from the perspective of state witness size and instant resharding, as we only need to access the delayed receipt queue root entry in the trie. However, it breaks the assumption that all delayed receipts in a shard belong to the accounts within that shard. -To resolve this, with the new protocol version, we changed the implementation of runtime to discard executing delayed receipts that don't belong to the account_id on that shard. +To resolve this, with the new protocol version, we changed the runtime implementation to skip the execution of delayed receipts whose account_id does not belong to that shard. -Note that no delayed receipts are lost during resharding as all receipts get executed exactly once based on which of the child shards does the associated account_id belong to. +Note that no delayed receipts are lost during resharding, as every receipt is executed exactly once, by the child shard to which the associated account_id belongs.
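To make this filtering rule concrete, the sketch below shows the idea with simplified, hypothetical types (`ShardLayout`, `DelayedReceipt`, and `should_execute` are stand-ins, not the nearcore API): each child keeps the duplicated queue, and when an entry is dequeued it is executed only if its receiver falls inside the child's account range as defined by the boundary account.

```rust
// Hypothetical, simplified types; not the actual nearcore API.

/// A shard layout described by its boundary accounts: accounts that sort
/// before boundary `i` belong to shard `i`, the rest move to the right.
struct ShardLayout {
    boundary_accounts: Vec<String>,
}

impl ShardLayout {
    /// Index of the shard that owns `account_id` under this layout.
    fn shard_index_for_account(&self, account_id: &str) -> usize {
        self.boundary_accounts
            .iter()
            .take_while(|boundary| account_id >= boundary.as_str())
            .count()
    }
}

struct DelayedReceipt {
    receiver_id: String,
    // ... payload fields omitted
}

/// Both children start with an identical copy of the parent's delayed receipt
/// queue; each child executes only the receipts whose receiver belongs to it
/// and drops the other entries as they are dequeued.
fn should_execute(receipt: &DelayedReceipt, layout: &ShardLayout, my_shard_index: usize) -> bool {
    layout.shard_index_for_account(&receipt.receiver_id) == my_shard_index
}

fn main() {
    // One boundary account splits the parent into a left (0) and right (1) child.
    let layout = ShardLayout {
        boundary_accounts: vec!["m_boundary.near".to_string()],
    };
    let receipt = DelayedReceipt {
        receiver_id: "alice.near".to_string(),
    };

    // "alice.near" sorts below the boundary, so only the left child executes
    // the receipt; the right child skips its duplicated copy.
    assert!(should_execute(&receipt, &layout, 0));
    assert!(!should_execute(&receipt, &layout, 1));
}
```

Because every receipt is executed by exactly one child under this rule, duplicating the queue does not lead to lost or double-executed receipts.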
### PromiseYield Receipt Handling -Promise Yield were introduced as part of NEP-519 to enable defer replying to caller function while response is being prepared. As part of Promise Yield implementation, it introduced three new trie keys, `PromiseYieldIndices`, `PromiseYieldTimeout` and `PromiseYieldReceipt`. +Promise Yield was introduced as part of NEP-519 to enable deferring replies to the caller function while the response is being prepared. As part of the Promise Yield implementation, it introduced three new trie keys: `PromiseYieldIndices`, `PromiseYieldTimeout`, and `PromiseYieldReceipt`. -* `PromiseYieldIndices`: This is a singleton key that holds the start_index and end_index of the keys in `PromiseYieldTimeout` -* `PromiseYieldTimeout { index }`: Along with the receiver_id and data_id, this stores the expires_at block height till which we need to wait to receive a response. +* `PromiseYieldIndices`: This is a singleton key that holds the start_index and end_index of the keys in `PromiseYieldTimeout`. +* `PromiseYieldTimeout { index }`: Along with the receiver_id and data_id, this stores the expires_at block height until which we need to wait to receive a response. * `PromiseYieldReceipt { receiver_id, data_id }`: This is the receipt created by the account. An account can call the `promise_yield_create` host function that increments the `PromiseYieldIndices` along with adding a new entry into the `PromiseYieldTimeout` and `PromiseYieldReceipt`. The `PromiseYieldTimeout` is sorted as per the time of creation and has an increasing value of expires_at block height. In the runtime, we iterate over all the expired receipts and create a blank receipt to resolve the entry in `PromiseYieldReceipt`. -The account can call the `promise_yield_resume` host function multiple times and if this is called before the expiry period, we use this to resolve the promise yield receipt. Note that the implementation allows for multiple resolution receipts to be created, including the expiry receipt, but only the first one is used for the actual resolution of the promise yield receipt. +The account can call the `promise_yield_resume` host function multiple times, and if this is called before the expiry period, we use this to resolve the promise yield receipt. Note that the implementation allows for multiple resolution receipts to be created, including the expiry receipt, but only the first one is used for the actual resolution of the promise yield receipt. -We use this implementation quirk to facilitate resharding implementation. The resharding strategy for the three trie keys are: +We use this implementation quirk to facilitate the resharding implementation. The resharding strategy for the three trie keys is: * `PromiseYieldIndices`: Duplicate across both child shards. * `PromiseYieldTimeout { index }`: Duplicate across both child shards. * `PromiseYieldReceipt { receiver_id, data_id }`: Since this key has the account_id prefix, we can split the entries across both child shards based on the prefix. 
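Taken together with the delayed and buffered receipt strategies described in the neighbouring sections, the per-key treatment can be summarized in a small sketch. The enum and function names below are illustrative only and do not correspond to the actual `TrieKey` variants or any nearcore API.

```rust
// Illustrative names only; not the actual nearcore `TrieKey` variants or API.

/// Simplified view of the trie key kinds that matter for the split.
enum TrieKeyKind {
    /// Keys with an account_id prefix: Account, AccessKey, ContractData, ...
    AccountPrefixed,
    /// PromiseYield receipts are keyed by receiver_id, so they also split cleanly.
    PromiseYieldReceipt,
    // Queue entries and their singleton index keys.
    DelayedReceiptIndices,
    DelayedReceipt,
    PromiseYieldIndices,
    PromiseYieldTimeout,
    BufferedReceiptIndices,
    BufferedReceipt,
}

/// What the resharding split does with each kind of key.
enum SplitAction {
    /// Entries go to the left or right child based on the boundary account.
    SplitByBoundaryAccount,
    /// Entries are copied to both children; the runtime later skips the ones
    /// that do not belong to the executing child.
    DuplicateToBothChildren,
    /// Entries go to the lower-index child only (the higher-index child keeps
    /// an empty queue).
    CopyToLowerIndexChild,
}

fn split_action(kind: TrieKeyKind) -> SplitAction {
    use SplitAction::*;
    use TrieKeyKind::*;
    match kind {
        AccountPrefixed | PromiseYieldReceipt => SplitByBoundaryAccount,
        DelayedReceiptIndices | DelayedReceipt | PromiseYieldIndices | PromiseYieldTimeout => {
            DuplicateToBothChildren
        }
        BufferedReceiptIndices | BufferedReceipt => CopyToLowerIndexChild,
    }
}

fn main() {
    // The PromiseYield receipt itself is split by account, while its timeout
    // queue is duplicated and filtered at expiry time.
    assert!(matches!(
        split_action(TrieKeyKind::PromiseYieldReceipt),
        SplitAction::SplitByBoundaryAccount
    ));
    assert!(matches!(
        split_action(TrieKeyKind::PromiseYieldTimeout),
        SplitAction::DuplicateToBothChildren
    ));
}
```

In short, the account-prefixed keys (including `PromiseYieldReceipt`) split cleanly at the boundary account, while the queue-style keys rely on either duplication plus runtime filtering or assignment to the lower-index child.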
-After duplication of the `PromiseYieldIndices` and `PromiseYieldTimeout`, when the entries of `PromiseYieldTimeout` eventually get dequeued at the expiry height of the following happens: +After duplication of the `PromiseYieldIndices` and `PromiseYieldTimeout`, when the entries of `PromiseYieldTimeout` eventually get dequeued at the expiry height, the following happens: -* If the promise yield receipt associated with the dequeued entry IS NOT a part of the child trie, we create a timeout resolution receipt and it gets ignored. +* If the promise yield receipt associated with the dequeued entry IS NOT part of the child trie, we create a timeout resolution receipt, and it gets ignored. * If the promise yield receipt associated with the dequeued entry IS part of the child trie, the promise yield implementation continues to work as expected. -This means we don't have to make any special changes in the runtime for handling resharding of promise yield receipts. +This means we don't have to make any special changes in the runtime for handling the resharding of promise yield receipts. ### Buffered Receipt Handling -Buffered Receipts were introduced as part of NEP-539, cross-shard congestion control. As part of the implementation, it introduced two new trie keys, `BufferedReceiptIndices` and `BufferedReceipt`. +Buffered Receipts were introduced as part of NEP-539 for cross-shard congestion control. As part of the implementation, it introduced two new trie keys: `BufferedReceiptIndices` and `BufferedReceipt`. * `BufferedReceiptIndices`: This is a singleton key that holds the start_index and end_index of the keys in `BufferedReceipt` for each shard_id. * `BufferedReceipt { receiving_shard, index }`: This holds the actual buffered receipt that needs to be sent to the receiving_shard. -Note that the targets of the buffered receipts belong to external shards and during a resharding event, we would need to handle both, the set of buffered receipts in the parent shard, as well as the set of buffered receipts in other shards that target the parent shard. +Note that the targets of the buffered receipts belong to external shards, and during a resharding event, we would need to handle both the set of buffered receipts in the parent shard and the set of buffered receipts in other shards that target the parent shard. -#### Handling buffered receipts in parent shard +#### Handling Buffered Receipts in Parent Shard -Since buffered receipts target external shards, it is fine to assign buffered receipts to either of the child shards. For simplicity, we assign all the buffered receipts to the child shard with the lower index, i.e. copy `BufferedReceiptIndices` and `BufferedReceipt` to the child shard with lower index while keeping `BufferedReceiptIndices` as empty for child shard with higher index. +Since buffered receipts target external shards, it is fine to assign buffered receipts to either of the child shards. For simplicity, we assign all the buffered receipts to the child shard with the lower index, i.e., copy `BufferedReceiptIndices` and `BufferedReceipt` to the child shard with the lower index while keeping `BufferedReceiptIndices` empty for the child shard with the higher index. -#### Handling buffered receipts that target parent shard +#### Handling Buffered Receipts that Target Parent Shard -This scenario is slightly more complex to deal with. At the boundary of resharding, we may have buffered receipts created before the resharding event targeting the parent shard. 
At the same time, we may also have buffered receipts that are generated after the resharding event that directly target the child shard. The receipts from both parent and child buffered receipts queue need to appropriately sent to the child shard depending on the account_id while respecting the outgoing limits calculated by bandwidth scheduler and congestion control. +This scenario is slightly more complex to deal with. At the boundary of resharding, we may have buffered receipts created before the resharding event targeting the parent shard. At the same time, we may also have buffered receipts that are generated after the resharding event that directly target the child shard. The receipts from both the parent and child buffered receipts queue need to be appropriately sent to the child shard depending on the account_id while respecting the outgoing limits calculated by the bandwidth scheduler and congestion control. -The flow of handling buffered receipts before ReshardingV3 is as follows: +The flow of handling buffered receipts before Resharding V3 is as follows: 1. Calculate `outgoing_limit` for each shard. 2. For each shard, try and forward as many in-order receipts as possible from the buffer while respecting `outgoing_limit`. -3. Apply chunk and `try_forward` newly generated receipts. The newly generated receipts are forwarded if we have enough limit else they are put in the buffered queue. +3. Apply chunk and `try_forward` newly generated receipts. The newly generated receipts are forwarded if we have enough limit; otherwise, they are put in the buffered queue. -The solution for ReshardingV3 is to first try draining the parent queue before moving onto draining the child queue. The modified flow would look something like: +The solution for Resharding V3 is to first try draining the parent queue before moving on to draining the child queue. The modified flow would look something like this: -1. Calculate `outgoing_limit` for both the child shards using congestion info from parent. -2. Forwarding receipts - * First try forward as many in-order receipts as possible from parent shard buffer. Stop either when we drain the parent buffer or as soon as we exhaust the `outgoing_limit` of either of the children shards. - * Next try forward as many in-order receipts as possible from child shard buffer. +1. Calculate `outgoing_limit` for both child shards using congestion info from the parent. +2. Forwarding receipts: + * First, try to forward as many in-order receipts as possible from the parent shard buffer. Stop either when we drain the parent buffer or as soon as we exhaust the `outgoing_limit` of either of the child shards. + * Next, try to forward as many in-order receipts as possible from the child shard buffer. 3. Apply chunk and `try_forward` newly generated receipts remains the same. -The minor downside to this approach is that we don't have guarantees between order of receipt generation and order of receipt forwarding, but that's anyway the case today with buffered receipts. +The minor downside to this approach is that we don't have guarantees between the order of receipt generation and the order of receipt forwarding, but that's anyway the case today with buffered receipts. ### Congestion Control @@ -343,26 +262,26 @@ pub struct CongestionInfoV1 { } ``` -After a resharding event, we need to properly initialize the congestion info for the child shards. 
Here's how we handle each of the fields +After a resharding event, it is essential to properly initialize the congestion info for the child shards. Here is how each field is handled: #### delayed_receipts_gas -Since the resharding strategy for delayed receipts is to duplicate them across both the child shards, we simply copy the value of `delayed_receipts_gas` across both shards. +Since the resharding strategy for delayed receipts is to duplicate them across both child shards, we simply copy the value of `delayed_receipts_gas` to both shards. #### buffered_receipts_gas -Since the resharding strategy for buffered receipts is to assign all the buffered receipts to the lower index child, we copy the `buffered_receipts_gas` from parent to lower index child and set `buffered_receipts_gas` to zero for upper index child. +Given that the strategy for buffered receipts is to assign all buffered receipts to the lower index child, we copy the `buffered_receipts_gas` from the parent to the lower index child and set `buffered_receipts_gas` to zero for the upper index child. #### receipt_bytes -This field is harder to deal with as it contains the information from both delayed receipts and buffered receipts. To calculate this field properly, we would need the distribution of the receipt_bytes across both delayed receipts and buffered receipts. The current solution is to start storing metadata about the total `receipt_bytes` for buffered receipts in the trie. This way we have the following: +This field is more complex as it includes information from both delayed receipts and buffered receipts. To calculate this field accurately, we need to know the distribution of `receipt_bytes` across both delayed receipts and buffered receipts. The current solution is to store metadata about the total `receipt_bytes` for buffered receipts in the trie. This way, we have the following: -* For child with lower index, receipt_bytes is the sum of both delayed receipts bytes and congestion control bytes, hence `receipt_bytes = parent.receipt_bytes` -* For child with upper index, receipt_bytes is just the bytes from delayed receipts, hence `receipt_bytes = parent.receipt_bytes - buffered_receipt_bytes` +* For the child with the lower index, `receipt_bytes` is the sum of both the delayed receipt bytes and the buffered receipt bytes, hence `receipt_bytes = parent.receipt_bytes`. +* For the child with the upper index, `receipt_bytes` is just the bytes from delayed receipts, hence `receipt_bytes = parent.receipt_bytes - buffered_receipt_bytes`. #### allowed_shard -This field is calculated by a round-robin mechanism which can be independently calculated for both the child shards. Since we are changing the [ShardId semantics](#shardid-semantics), we need to change implementation to use `ShardIndex` instead of `ShardID` which is just an assignment for each shard_id to the contiguous index `[0, num_shards)`. +This field is calculated using a round-robin mechanism, which can be independently determined for both child shards. Since we are changing the [ShardId semantics](#shardid-semantics), we need to update the implementation to use `ShardIndex` instead of `ShardID`, which is simply an assignment for each shard_id to the contiguous index `[0, num_shards)`.
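Putting the four rules together, the sketch below derives both children's congestion info from the parent's at the resharding boundary. The struct is a simplified stand-in for `CongestionInfoV1` with illustrative field types, `buffered_receipt_bytes` stands for the new buffered-receipt size metadata described above, and the per-child `allowed_shard` values are assumed to be recomputed separately by the round-robin rule.

```rust
// Simplified stand-in for the congestion info; not the actual nearcore struct.
struct CongestionInfo {
    delayed_receipts_gas: u128,
    buffered_receipts_gas: u128,
    receipt_bytes: u64,
    allowed_shard: u16,
}

/// Derives the congestion info of the lower-index and higher-index children
/// from the parent's values at the resharding boundary.
fn split_congestion_info(
    parent: CongestionInfo,
    // Total size of the parent's buffered receipts, read from the new trie metadata.
    buffered_receipt_bytes: u64,
    // Recomputed per child via the round-robin rule over shard indexes.
    lower_child_allowed_shard: u16,
    higher_child_allowed_shard: u16,
) -> (CongestionInfo, CongestionInfo) {
    // Delayed receipts are duplicated, so both children keep the full delayed gas.
    // Buffered receipts (and their bytes) all go to the lower-index child.
    let lower = CongestionInfo {
        delayed_receipts_gas: parent.delayed_receipts_gas,
        buffered_receipts_gas: parent.buffered_receipts_gas,
        receipt_bytes: parent.receipt_bytes,
        allowed_shard: lower_child_allowed_shard,
    };
    let higher = CongestionInfo {
        delayed_receipts_gas: parent.delayed_receipts_gas,
        buffered_receipts_gas: 0,
        receipt_bytes: parent.receipt_bytes - buffered_receipt_bytes,
        allowed_shard: higher_child_allowed_shard,
    };
    (lower, higher)
}

fn main() {
    let parent = CongestionInfo {
        delayed_receipts_gas: 5_000,
        buffered_receipts_gas: 2_000,
        receipt_bytes: 10_000,
        allowed_shard: 3,
    };
    let (lower, higher) = split_congestion_info(parent, 4_000, 4, 5);
    assert_eq!(lower.receipt_bytes, 10_000);
    assert_eq!(higher.receipt_bytes, 6_000);
    assert_eq!(higher.buffered_receipts_gas, 0);
    assert_eq!(lower.delayed_receipts_gas, higher.delayed_receipts_gas);
}
```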
### ShardId Semantics @@ -427,81 +346,72 @@ If the whole `ChunkStateWitness` is valid, then chunk validator sends endorsemen ### State Storage - MemTrie -The current implementation of MemTrie uses a pool of memory (`STArena`) to allocate and deallocate nodes and internal pointers in this pool to reference child nodes. MemTries, unlike the State representation of Trie, do not work with the hash of the nodes but internal memory pointers directly. Additionally, MemTries are not thread safe and one MemTrie exists per shard. +The current implementation of MemTrie uses a memory pool (`STArena`) to allocate and deallocate nodes, with internal pointers in this pool referencing child nodes. Unlike the State representation of Trie, MemTries do not work with node hashes but with internal memory pointers directly. Additionally, MemTries are not thread-safe, and one MemTrie exists per shard. -As described in [MemTrie](#state-storage---memtrie) section above, we need an efficient way to split the MemTrie into two child MemTries within a span of 1 block. What makes this challenging is that the current implementation of MemTrie is not thread safe and can not be shared across two shards. +As described in the [MemTrie](#state-storage---memtrie) section above, we need an efficient way to split the MemTrie into two child MemTries within the span of one block. The challenge lies in the current implementation of MemTrie, which is not thread-safe and cannot be shared across two shards. -The naive way to create two MemTries for the child shards would be to iterate through all the entries of the parent MemTrie and fill in these values into the child MemTries. This however is prohibitively time consuming. +A naive approach to creating two MemTries for the child shards would involve iterating through all entries of the parent MemTrie and populating these values into the child MemTries. However, this method is prohibitively time-consuming. -The solution to this problem was to introduce the concept of Frozen MemTrie (with a `FrozenArena`) which is a cloneable, read-only, thread-safe snapshot of a MemTrie. We can call the `freeze` method on an existing MemTrie that converts it into a Frozen MemTrie. Note that this process consumes the original MemTrie and we can no longer allocate and deallocate nodes to it. +The solution to this problem is to introduce the concept of a Frozen MemTrie (with a `FrozenArena`), which is a cloneable, read-only, thread-safe snapshot of a MemTrie. By calling the `freeze` method on an existing MemTrie, we convert it into a Frozen MemTrie. This process consumes the original MemTrie, making it no longer available for node allocation and deallocation. -Along with `FrozenArena`, we also introduce a `HybridArena` which is effectively a base made of `FrozenArena` with a top layer of `STArena` where we support allocating and deallocating new nodes into the MemTrie. Newly allocated nodes can reference/point to nodes in the `FrozenArena`. We use this Hybrid MemTrie as a temporary MemTrie while the flat storage is being constructed in the background. +Along with `FrozenArena`, we also introduce a `HybridArena`, which effectively combines a base `FrozenArena` with a top layer of `STArena` that supports allocating and deallocating new nodes into the MemTrie. Newly allocated nodes can reference nodes in the `FrozenArena`. This Hybrid MemTrie serves as a temporary MemTrie while the flat storage is being constructed in the background. 
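The following is a minimal sketch of the Frozen/Hybrid arena idea using plain standard-library containers; the real `FrozenArena`, `STArena`, and `HybridArena` are custom memory pools with internal pointers, so the types here are only stand-ins. It shows the property the design relies on: freezing yields a shared, read-only snapshot that both children clone cheaply, while each child allocates new nodes in its own small overlay that can still reference the frozen base.

```rust
// Hypothetical stand-ins for the arena types; not the nearcore implementation.

use std::collections::HashMap;
use std::sync::Arc;

type NodeId = u64;

/// Read-only snapshot of the parent arena. Cloning only bumps a refcount.
#[derive(Clone)]
struct FrozenArena {
    nodes: Arc<HashMap<NodeId, Vec<u8>>>,
}

/// Frozen base plus a mutable overlay for nodes created after the split.
struct HybridArena {
    frozen: FrozenArena,
    overlay: HashMap<NodeId, Vec<u8>>, // plays the role of the child's STArena
}

impl HybridArena {
    fn new(frozen: FrozenArena) -> Self {
        Self { frozen, overlay: HashMap::new() }
    }

    /// New allocations after resharding land in the overlay only.
    fn alloc(&mut self, id: NodeId, node: Vec<u8>) {
        self.overlay.insert(id, node);
    }

    /// Lookups first check the overlay, then fall back to the frozen base.
    fn get(&self, id: NodeId) -> Option<&Vec<u8>> {
        self.overlay.get(&id).or_else(|| self.frozen.nodes.get(&id))
    }
}

fn main() {
    // "Freeze" the parent arena into a shared, read-only snapshot.
    let mut parent_nodes = HashMap::new();
    parent_nodes.insert(1, b"parent root".to_vec());
    let frozen = FrozenArena { nodes: Arc::new(parent_nodes) };

    // Both children clone the snapshot cheaply and write only to their overlay.
    let mut left = HybridArena::new(frozen.clone());
    let mut right = HybridArena::new(frozen);
    left.alloc(2, b"left-only node".to_vec());
    right.alloc(3, b"right-only node".to_vec());

    assert!(left.get(1).is_some() && right.get(1).is_some()); // shared frozen base
    assert!(left.get(3).is_none()); // overlays are independent per child
}
```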
-While Frozen MemTries provide the benefits of being compatible with instant resharding, they come at the cost of memory consumption. Once a MemTrie is frozen, since it doesn't support deallocation of memory, it continues to consume as much memory as it did at the time of freezing. In case a node is tracking only one of the child shards, a Frozen MemTrie would continue to use the same amount of memory as the parent trie. Due to this, Hybrid MemTries are only a temporary solution and we rebuild the MemTrie for the children once the post-processing step for Flat Storage is completed. +While Frozen MemTries facilitate instant resharding, they come at the cost of memory consumption. Once a MemTrie is frozen, it continues to consume the same amount of memory as it did at the time of freezing, as it does not support memory deallocation. If a node tracks only one of the child shards, a Frozen MemTrie would continue to use the same amount of memory as the parent trie. Therefore, Hybrid MemTries are only a temporary solution, and we rebuild the MemTrie for the children once the post-processing step for Flat Storage is completed. -Additionally, a node would have to support 2x the memory footprint of a single trie as after resharding, we would have two copies of the trie in memory, one from the temporary Hybrid MemTrie in use for block production, and other from the background MemTrie that would be under construction. Once the background MemTrie is fully constructed and caught up with the latest block, we do an in-place swap of the Hybrid MemTrie with the new child MemTrie and deallocate the memory from the Hybrid MemTrie. +Additionally, a node would need to support twice the memory footprint of a single trie. After resharding, there would be two copies of the trie in memory: one from the temporary Hybrid MemTrie used for block production and another from the background MemTrie under construction. Once the background MemTrie is fully constructed and caught up with the latest block, we perform an in-place swap of the Hybrid MemTrie with the new child MemTrie and deallocate the memory from the Hybrid MemTrie. -During a resharding event, at the boundary of the epoch, when we need to split the parent shard into the two child shards, we do the following steps: +During a resharding event at the epoch boundary, when we need to split the parent shard into two child shards, we follow these steps: -1. Freeze the parent MemTrie arena to create a read-only frozen arena that represents a snapshot of the state as of the time of freezing, i.e. after postprocessing last block of epoch. Note that we no longer require the parent MemTrie in runtime going forward. -2. We cheaply clone the Frozen MemTrie for both the child MemTries to use. Note that this doesn't clone the parent arena memory, but just increases the refcount. -3. We then create a new MemTrie with HybridArena for each of the children. The base of the MemTrie is the read-only FrozenArena while all new node allocations happens on a dedicated STArena memory pool for each child MemTrie. This is the temporary MemTrie that we use while Flat Storage is being built in the background. -4. Once the Flat Storage is constructed in the post processing step of resharding, we use that to load a new MemTrie and catchup to the latest block. -5. After the new child MemTrie has caught up to the latest block, we do an in-place swap in Client and discard the Hybrid MemTrie. +1. 
Freeze the parent MemTrie arena to create a read-only frozen arena representing a snapshot of the state at the time of freezing, i.e., after post-processing the last block of the epoch. The parent MemTrie is no longer required in runtime going forward. +2. Clone the Frozen MemTrie cheaply for both child MemTries to use. This does not clone the parent arena memory but merely increases the reference count. +3. Create a new MemTrie with HybridArena for each child. The base of the MemTrie is the read-only FrozenArena, while all new node allocations occur in a dedicated STArena memory pool for each child MemTrie. This temporary MemTrie is used while Flat Storage is being built in the background. +4. Once the Flat Storage is constructed in the post-processing step of resharding, we use it to load a new MemTrie and catch up to the latest block. +5. After the new child MemTrie has caught up to the latest block, we perform an in-place swap in the Client and discard the Hybrid MemTrie. ![Hybrid MemTrie diagram](assets/nep-0568/NEP-HybridMemTrie.png) ### State Storage - Flat State -Resharding Flat State is a time consuming operation and it runs in parallel with block processing for several block heights. -Thus, there are a few important aspects to consider during implementation: +Resharding Flat State is a time-consuming operation that runs in parallel with block processing for several block heights. Therefore, several important aspects must be considered during implementation: -* Flat State's own status should be resilient to application crashes. +* Flat State's status should be resilient to application crashes. * The parent shard's Flat State should be split at the correct block height. -* New shards' Flat States should eventually converge to same representation the chain is using to process blocks (MemTries). +* New shards' Flat States should eventually converge to the same representation the chain uses to process blocks (MemTries). * Resharding should work correctly in the presence of chain forks. -* Retired shards are cleaned up. +* Retired shards should be cleaned up. -Note that the Flat States of the newly created shards won't be available until resharding is completed. This is fine because the temporary MemTries are -built instantly and they can satisfy all block processing needs. +Note that the Flat States of the newly created shards will not be available until resharding is completed. This is acceptable because the temporary MemTries are built instantly and can satisfy all block processing needs. -The main component responsible to carry out resharding on Flat State is the [FlatStorageResharder](https://github.com/near/nearcore/blob/f4e9dd5d6e07089dfc789221ded8ec83bfe5f6e8/chain/chain/src/flat_storage_resharder.rs#L68). +The main component responsible for carrying out resharding on Flat State is the [FlatStorageResharder](https://github.com/near/nearcore/blob/f4e9dd5d6e07089dfc789221ded8ec83bfe5f6e8/chain/chain/src/flat_storage_resharder.rs#L68). -#### Flat State's status persistence +#### Flat State's Status Persistence -Every shard Flat State has a status associated to it and stored in the database, called `FlatStorageStatus`. We propose to extend the existing object -by adding the new enum variant named `FlatStorageStatus::Resharding`. This approach has two benefits. First, the progress of any Flat State resharding is -persisted to disk, which makes the operation resilient to a node crash or restart. 
Second, resuming resharding on node restart shares the same code path as Flat -State creation (see `FlatStorageShardCreator`), reducing the code duplication factor. +Every shard Flat State has a status associated with it and stored in the database, called `FlatStorageStatus`. We propose extending the existing object by adding a new enum variant named `FlatStorageStatus::Resharding`. This approach has two benefits. First, the progress of any Flat State resharding is persisted to disk, making the operation resilient to a node crash or restart. Second, resuming resharding on node restart shares the same code path as Flat State creation (see `FlatStorageShardCreator`), reducing code duplication. -`FlatStorageStatus` is updated at every committable step of resharding. The commit points are the following: +`FlatStorageStatus` is updated at every committable step of resharding. The commit points are as follows: -* Beginning of resharding or, in other words, the last block of the old shard layout. +* Beginning of resharding, or the last block of the old shard layout. * Scheduling of the _"split parent shard"_ task. -* Execution, cancellation or failure of the _"split parent shard"_ task. +* Execution, cancellation, or failure of the _"split parent shard"_ task. * Execution or failure of any _"child catchup"_ task. -#### Splitting a shard Flat State +#### Splitting a Shard Flat State -When, at the end of an epoch, the shard layout changes we identify a so called _resharding block_ that corresponds to the last block of the current epoch. -A task to split the parent shard's Flat State is scheduled to happen after the _resharding block_ becomes final. The reason to wait for the finality condition -is to avoid a split on a block that might be excluded from the canonical chain; needless to say, such situation would lock the node -into an erroneous state. +When the shard layout changes at the end of an epoch, we identify a _resharding block_ corresponding to the last block of the current epoch. A task to split the parent shard's Flat State is scheduled to occur after the _resharding block_ becomes final. The finality condition is necessary to avoid splitting on a block that might be excluded from the canonical chain, which would lock the node into an erroneous state. -Inside the split task we iterate over the Flat State and copy each element into either child. This routine is performed in batches in order to lessen the performance -impact on the node. +Inside the split task, we iterate over the Flat State and copy each element into either child. This routine is performed in batches to lessen the performance impact on the node. -Finally, if the split completes successfully, the parent shard Flat State is removed from the database and the children Flat States enter a catch-up phase. +Finally, if the split completes successfully, the parent shard Flat State is removed from the database, and the children Flat States enter a catch-up phase. One current technical limitation is that, upon a node crash or restart, the _"split parent shard"_ task will start copying all elements again from the beginning. A reference implementation of splitting a Flat State can be found in [FlatStorageResharder::split_shard_task](https://github.com/near/nearcore/blob/fecce019f0355cf89b63b066ca206a3cdbbdffff/chain/chain/src/flat_storage_resharder.rs#L295). 
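As a rough illustration of the shape of this split loop, the sketch below copies parent entries into the children in fixed-size batches. The types and helper names are hypothetical; the linked `split_shard_task` is the authoritative implementation:

```rust
const BATCH_SIZE: usize = 10_000;

/// Hypothetical stand-ins for the real Flat State types.
struct FlatStateEntry {
    key: Vec<u8>,
    value: Vec<u8>,
}

enum Child {
    Left,
    Right,
}

/// Copy every parent entry into one of the children, flushing in batches so
/// that the split can run alongside regular block processing.
fn split_parent_flat_state(
    parent_entries: impl Iterator<Item = FlatStateEntry>,
    assign_to_child: impl Fn(&FlatStateEntry) -> Child,
    mut commit_batch: impl FnMut(Vec<(Child, FlatStateEntry)>),
) {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    for entry in parent_entries {
        let child = assign_to_child(&entry);
        batch.push((child, entry));
        if batch.len() >= BATCH_SIZE {
            // Committing in small batches lessens the performance impact on
            // a node that is processing blocks at the same time.
            commit_batch(std::mem::take(&mut batch));
        }
    }
    if !batch.is_empty() {
        commit_batch(batch);
    }
}
```

The assignment of each entry to a child follows the inheritance rules described in the next subsection.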
-#### Values assignment from parent to child shards +#### Values Assignment from Parent to Child Shards -Key-value pairs in the parent shard Flat State are inherited by children according to the rules stated below. +Key-value pairs in the parent shard Flat State are inherited by children according to the following rules: -Elements inherited by the child shard which tracks the `account_id` contained in the key: +Elements inherited by the child shard tracking the `account_id` contained in the key: * `ACCOUNT` * `CONTRACT_DATA` @@ -520,39 +430,34 @@ Elements inherited by both children: * `PROMISE_YIELD_TIMEOUT` * `BANDWIDTH_SCHEDULER_STATE` -Elements inherited only be the lowest index child: +Elements inherited only by the lowest index child: * `BUFFERED_RECEIPT_INDICES` * `BUFFERED_RECEIPT` -#### Bring children shards up to date with the chain's head +#### Bringing Children Shards Up to Date with the Chain's Head -Children shards' Flat States build a complete view of their content at the height of the `resharding block` sometime during the new epoch -after resharding. At that point in time many new blocks have been processed already, and these will most likely contain updates for the new shards. A catch-up step is necessary to apply all Flat State deltas accumulated until now. +Children shards' Flat States build a complete view of their content at the height of the `resharding block` sometime during the new epoch after resharding. At that point, many new blocks have already been processed, and these will most likely contain updates for the new shards. A catch-up step is necessary to apply all Flat State deltas accumulated until now. -This phase of resharding doesn't have to take extra steps to handle chain forks. On one hand, the catch-up task doesn't start until the parent shard -splitting is done, and at such point we know the `resharding block` is final; on the other hand, Flat State deltas are capable of handling forks automatically. +This phase of resharding does not require extra steps to handle chain forks. The catch-up task does not start until the parent shard splitting is done, ensuring the `resharding block` is final. Additionally, Flat State deltas can handle forks automatically. -The catch-up task commits to the database "batches" of Flat State deltas. If the application crashes or restarts the task will resume from where it left. +The catch-up task commits batches of Flat State deltas to the database. If the application crashes or restarts, the task will resume from where it left off. -Once all Flat State deltas are applied, the child shard's status is changed to `Ready` and clean up of Flat State deltas leftovers is performed. +Once all Flat State deltas are applied, the child shard's status is changed to `Ready`, and cleanup of Flat State deltas leftovers is performed. A reference implementation of the catch-up task can be found in [FlatStorageResharder::shard_catchup_task](https://github.com/near/nearcore/blob/fecce019f0355cf89b63b066ca206a3cdbbdffff/chain/chain/src/flat_storage_resharder.rs#L564). -#### Failure of Flat State resharding +#### Failure of Flat State Resharding -In the current proposal any failure during Flat State resharding is considered non-recoverable. -`neard` will attempt resharding again on restart, but no automatic recovery is implemented. +In the current proposal, any failure during Flat State resharding is considered non-recoverable. `neard` will attempt resharding again on restart, but no automatic recovery is implemented. 
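To summarize the values-assignment rules above in code, the sketch below expresses them as a single routing function. It is illustrative only: the enum and function are not nearcore identifiers, and the match is abridged to the column names shown in the lists above:

```rust
/// Which child shard(s) inherit a Flat State entry during a split.
enum Inheritance {
    /// Inherited by the child that tracks the account_id contained in the key.
    ByAccountId,
    /// Inherited by both children.
    BothChildren,
    /// Inherited only by the lowest index child.
    LowestIndexChild,
}

/// Route a column to its destination; abridged to the entries shown above.
fn inheritance_for(column: &str) -> Inheritance {
    match column {
        "ACCOUNT" | "CONTRACT_DATA" => Inheritance::ByAccountId,
        "PROMISE_YIELD_TIMEOUT" | "BANDWIDTH_SCHEDULER_STATE" => Inheritance::BothChildren,
        "BUFFERED_RECEIPT_INDICES" | "BUFFERED_RECEIPT" => Inheritance::LowestIndexChild,
        _ => unimplemented!("remaining columns follow the full lists above"),
    }
}
```

This is the decision applied to each key-value pair as it is copied from the parent during the split described above.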
-### State Storage - State mapping
+### State Storage - State Mapping
-To enable efficient shard state management during resharding, Resharding V3 uses the `DBCol::ShardUIdMapping` column.
-This mapping allows child shards to reference ancestor shard data, avoiding the need for immediate duplication of state entries.
+To enable efficient shard state management during resharding, Resharding V3 uses the `DBCol::ShardUIdMapping` column. This mapping allows child shards to reference ancestor shard data, avoiding the need for immediate duplication of state entries.
-#### Mapping application in adapters
+#### Mapping Application in Adapters
-The core of the mapping logic is applied in `TrieStoreAdapter` and `TrieStoreUpdateAdapter`, which act as layers over the general `Store` interface.
-Here’s a breakdown of the key functions involved:
+The core of the mapping logic is applied in `TrieStoreAdapter` and `TrieStoreUpdateAdapter`, which act as layers over the general `Store` interface. Here’s a breakdown of the key functions involved:
* **Key resolution**: The `get_key_from_shard_uid_and_hash` function is central to determining the correct `ShardUId` for state access. @@ -584,33 +489,32 @@ Here’s a breakdown of the key functions involved: The `TrieStoreAdapter` and `TrieStoreUpdateAdapter` use `get_key_from_shard_uid_and_hash` to correctly resolve the key for both reads and writes. Example methods include:
- ```rust - // In TrieStoreAdapter - pub fn get(&self, shard_uid: ShardUId, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> { + ```rust + // In TrieStoreAdapter + pub fn get(&self, shard_uid: ShardUId, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> { let key = get_key_from_shard_uid_and_hash(self.store, shard_uid, hash); self.store.get(DBCol::State, &key) - } + } - // In TrieStoreUpdateAdapter - pub fn increment_refcount_by( + // In TrieStoreUpdateAdapter + pub fn increment_refcount_by( &mut self, shard_uid: ShardUId, hash: &CryptoHash, data: &[u8], increment: NonZero<u32>, - ) { + ) { let key = get_key_from_shard_uid_and_hash(self.store, shard_uid, hash); self.store_update.increment_refcount_by(DBCol::State, key.as_ref(), data, increment); - } - ``` + } + ``` The `get` function retrieves data using the resolved `ShardUId` and key, while `increment_refcount_by` manages reference counts, ensuring correct tracking even when accessing data under an ancestor shard.
-#### Mapping retention and cleanup
+#### Mapping Retention and Cleanup
-Mappings in `DBCol::ShardUIdMapping` persist as long as any descendant relies on an ancestor’s data.
-To manage this, the `set_shard_uid_mapping` function in `TrieStoreUpdateAdapter` adds a new mapping during resharding:
+Mappings in `DBCol::ShardUIdMapping` persist as long as any descendant relies on an ancestor’s data. To manage this, the `set_shard_uid_mapping` function in `TrieStoreUpdateAdapter` adds a new mapping during resharding:
```rust fn set_shard_uid_mapping(&mut self, child_shard_uid: ShardUId, parent_shard_uid: ShardUId) { @@ -622,59 +526,57 @@ } ```
-When a node stops tracking all descendants of a shard, the associated mapping entry can be removed, allowing RocksDB to perform garbage collection.
For archival nodes, mappings are retained permanently to ensure access to the historical state of all shards. -This implementation ensures efficient and scalable shard state transitions, -allowing child shards to use ancestor data without creating redundant entries. +This implementation ensures efficient and scalable shard state transitions, allowing child shards to use ancestor data without creating redundant entries. ### State Sync -The state sync algorithm defines a `sync_hash` that is used in many parts of the implementation. This is always the first block of the current epoch, which the node should be aware of once it has synced headers to the current point in the chain. A node performing state sync first makes a request (currently to centralized storage on GCS, but in the future to other nodes in the network) for a `ShardStateSyncResponseHeader` corresponding to that `sync_hash` and the Shard ID of the shard it's interested in. Among other things, this header includes the last new chunk before `sync_hash` in the shard, and a `StateRootNode` with hash equal to that chunk's `prev_state_root` field. Then the node downloads (again from GCS, but in the future it'll be from other nodes) the nodes of the trie with that `StateRootNode` as its root. Afterwards, it applies new chunks in the shard until it's caught up. +The state sync algorithm defines a `sync_hash` used in many parts of the implementation. This is always the first block of the current epoch, which the node should be aware of once it has synced headers to the current point in the chain. A node performing state sync first makes a request (currently to centralized storage on GCS, but in the future to other nodes in the network) for a `ShardStateSyncResponseHeader` corresponding to that `sync_hash` and the Shard ID of the shard it's interested in. Among other things, this header includes the last new chunk before `sync_hash` in the shard and a `StateRootNode` with a hash equal to that chunk's `prev_state_root` field. Then the node downloads (again from GCS, but in the future from other nodes) the nodes of the trie with that `StateRootNode` as its root. Afterwards, it applies new chunks in the shard until it's caught up. - As described above, the state we download is the state in the shard after applying the second to last new chunk before `sync_hash`, which belongs to the previous epoch (since `sync_hash` is the first block of the new epoch). To move the point in the chain of the initial state download to the current epoch, we could either move the `sync_hash` forward or we could change the state sync protocol (perhaps changing the meaning of the `sync_hash` and the fields of the `ShardStateSyncResponseHeader`, or somehow changing these structures more significantly). The former is an easier first implementation, since it would not require any changes to the state sync protocol other than to the expected `sync_hash`. We would just need to move the `sync_hash` to a point far enough along in the chain so that the `StateRootNode` in the `ShardStateSyncResponseHeader` refers to state in the current epoch. Currently we plan on implementing it that way, but we may revisit making more extensive changes to the state sync protocol later. +As described above, the state we download is the state in the shard after applying the second to last new chunk before `sync_hash`, which belongs to the previous epoch (since `sync_hash` is the first block of the new epoch). 
To move the point in the chain of the initial state download to the current epoch, we could either move the `sync_hash` forward or change the state sync protocol (perhaps changing the meaning of the `sync_hash` and the fields of the `ShardStateSyncResponseHeader`, or somehow changing these structures more significantly). The former is an easier first implementation, as it would not require any changes to the state sync protocol other than to the expected `sync_hash`. We would just need to move the `sync_hash` to a point far enough along in the chain so that the `StateRootNode` in the `ShardStateSyncResponseHeader` refers to the state in the current epoch. Currently, we plan on implementing it that way, but we may revisit making more extensive changes to the state sync protocol later.
## Security Implications
### Fork Handling
-In theory, it can happen that there will be more than one candidate block which finishes the last epoch with old shard layout. For previous implementations it didn't matter because resharding decision was made in the beginning previous epoch. Now, the decision is made on the epoch boundary, so the new implementation handles this case as well.
+In theory, more than one candidate block may finish the last epoch with the old shard layout. For previous implementations, it didn't matter because the resharding decision was made at the beginning of the previous epoch. Now, the decision is made at the epoch boundary, so the new implementation handles this case as well.
### Proof Validation
-With single shard tracking, nodes can't independently validate new state roots after resharding, because they don't have state of shard being split. That's why we generate resharding proofs, whose generation and validation may be a new weak point. However, `retain_split_shard` is equivalent to constant number of lookups in the trie, so its overhead its negligible. Even if proof is invalid, it will only imply that `retain_split_shard` fails early, similarly to other state transitions.
+With single shard tracking, nodes can't independently validate new state roots after resharding because they don't have the state of the shard being split. That's why we generate resharding proofs, whose generation and validation may be a new weak point. However, `retain_split_shard` is equivalent to a constant number of lookups in the trie, so its overhead is negligible. Even if the proof is invalid, it will only imply that `retain_split_shard` fails early, similarly to other state transitions.
## Alternatives
-In the solution space which would keep blockchain stateful, we also considered an alternative to handle resharding through mechanism of `Receipts`. The workflow would be to:
+Within the solution space that keeps the blockchain stateful, we also considered an alternative that handles resharding through the mechanism of `Receipts`. The workflow would be to:
-* create empty `target_shard`,
-* require `source_shard` chunk producers to create special `ReshardingReceipt(source_shard, target_shard, data)` where `data` would be an interval of key-value pairs in `source_shard` alongside with the proof,
-* then, `target_shard` trackers and validators would process that receipt, validate the proof and insert the key-value pairs into the new shard.
+* Create an empty `target_shard`.
+* Require `source_shard` chunk producers to create a special `ReshardingReceipt(source_shard, target_shard, data)`, where `data` would be an interval of key-value pairs in `source_shard` along with the corresponding proof.
+* Then, `target_shard` trackers and validators would process that receipt, validate the proof, and insert the key-value pairs into the new shard.
-However, `data` would occupy most of the whole state witness capacity and introduce overhead of proving every single interval in `source_shard`. Moreover, approach to sync target shard "dynamically" also requires some form of catchup, which makes it much less feasible than chosen approach.
+However, `data` would occupy most of the state witness capacity and introduce the overhead of proving every single interval in `source_shard`. Moreover, the approach of syncing the target shard "dynamically" also requires some form of catchup, which makes it much less feasible than the chosen approach.
-Another question is whether we should tie resharding to epoch boundaries. This would allow to come from resharding decision to completion much faster. But for that, we would need to:
+Another question is whether we should tie resharding to epoch boundaries. This would allow us to move from the resharding decision to completion much faster. But for that, we would need to:
-* agree if we should reshard in the middle of the epoch or allow "fast epoch completion" which has to be implemented,
-* keep chunk producers tracking "spare shards" ready to receive items from split shards,
-* on resharding event, implement specific form of state sync, on which source and target chunk producers would agree on new state roots offline,
-* then, new state roots would be validated by chunk validators in the same fashion.
+* Agree on whether we should reshard in the middle of the epoch or allow "fast epoch completion", which would have to be implemented.
+* Keep chunk producers tracking "spare shards" ready to receive items from split shards.
+* On a resharding event, implement a specific form of state sync in which source and target chunk producers would agree on new state roots offline.
+* Then, the new state roots would be validated by chunk validators in the same fashion.
-While it is much closer to Dynamic Resharding (below), it requires much more changes to the protocol. And the considered idea works very well as intermediate step to that, if needed.
+While it is much closer to Dynamic Resharding (below), it requires many more changes to the protocol. The approach chosen in this proposal also works well as an intermediate step toward it, if needed.
-## Future possibilities
+## Future Possibilities
* Dynamic Resharding - In this proposal, resharding is scheduled in advance and hardcoded within the neard binary. In the future, we aim to enable the chain to dynamically trigger and execute resharding autonomously, allowing it to adjust capacity automatically based on demand.
* Fast Dynamic Resharding - In the Dynamic Resharding extension, the new shard layout is configured for the second upcoming epoch. This means that a full epoch must pass before the chain transitions to the updated shard layout. In the future, our goal is to accelerate this process by finalizing the previous epoch more quickly, allowing the chain to adopt the new layout as soon as possible.
-* Shard Merging - In this proposal the only allowed resharding operation is shard splitting. In the future, we aim to enable shard merging, allowing underutilized shards to be combined with neighboring shards.
This would allow the chain to free up resources and reallocate them where they are most needed.
+* Shard Merging - In this proposal, the only allowed resharding operation is shard splitting. In the future, we aim to enable shard merging, allowing underutilized shards to be combined with neighboring shards. This would allow the chain to free up resources and reallocate them where they are most needed.
## Consequences
### Positive
* The protocol is able to execute resharding even while only a fraction of nodes track the split shard.
-* State for new shard layouts is computed in the matter of minutes instead of hours, thus ecosystem stability during resharding is greatly increased. As before, from the point of view of NEAR users it is instantaneous.
+* State for new shard layouts is computed in a matter of minutes instead of hours, which greatly increases ecosystem stability during resharding. As before, from the point of view of NEAR users, it is instantaneous.
### Neutral
@@ -682,7 +584,7 @@ N/A
### Negative
-* The storage components need to handle additional complexity of controlling the shard layout change.
+* The storage components need to handle the additional complexity of controlling the shard layout change.
### Backwards Compatibility