Optimizations to reduce the total number of keys in the network #163
Ideas that don't seem to work for the canonical transaction index.

Binary search from tiered indexes

The idea is to store multiple tiers of indexes, each of which allows you to perform a sort of binary search. We might have an index for each million blocks, and then successively smaller chunks. To find the entry for a given transaction hash you would scan the top level indices, and then work down to the lower level, more granular indices as you find the index containing the transaction you need.

This approach suffers from a few problems. First, those hosting the top level indexes would be constantly slammed with requests, since every lookup for a transaction would need to hit multiple of those indexes. This traffic pattern likely doesn't scale. (NOTE: skip graphs don't suffer from this problem.) Second, the top level indexes would be incredibly expensive to build, and prohibitively expensive to verify. Both processes would require fetching all of the blocks under that index, which for the top level ones would mean fetching millions of blocks.

Grouping by truncated hash

We could group transactions by the first few bytes of the hash and index all of the transactions that start with the same common prefix. These indexes would be near impossible to build and verify, since it would require fetching every single transaction in order to find the ones that share a prefix. We can potentially mitigate this by building these indexes iteratively, however that does introduce complexity that would be nice to avoid: it would mean that these indexes continually change and are never finalized, which violates some current invariants in how we treat content. There is also the problem of picking the prefix size. Over time the total number of transactions will grow and we would need to adjust the prefix size, which would be disruptive and prohibitively difficult to migrate.
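For concreteness, a rough sketch of how the truncated-hash grouping would bucket transactions (the 2-byte prefix length and the helper names here are made up, not anything proposed in this thread):

```python
# Minimal sketch of the "grouping by truncated hash" idea, only to make the
# bucketing concrete. The 2-byte prefix length is a hypothetical choice and
# would need to grow as the transaction count grows.
from collections import defaultdict

PREFIX_BYTES = 2


def build_prefix_buckets(tx_hashes):
    """Group transaction hashes by their first PREFIX_BYTES bytes.

    Building this for real would require iterating every transaction in
    history, which is exactly why the idea looks infeasible.
    """
    buckets = defaultdict(list)
    for tx_hash in tx_hashes:
        buckets[tx_hash[:PREFIX_BYTES]].append(tx_hash)
    return buckets


def lookup_bucket(tx_hash):
    """The content key a lookup would target: just the hash prefix."""
    return tx_hash[:PREFIX_BYTES]
```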
Exploring whether it is feasible to do the canonical transaction index without any reduction. At ~1 billion transactions and a network size between 1k-100k nodes, with the following replication factors we would expect each node in the network to store:
With each advertisement being about 128 bytes we get roughly the following storage sizes
In order to stick to the desired <1GB per node storage requirement, I believe we can establish an upper bound of about 1M advertisements that each node can store. That means that storing 1 billion transaction index entries with a healthy replication factor of 10 or more would only leave the network healthy at over 10k nodes, which is feasible but far from ideal. We can also set an upper bound on the total number of advertisements that it is feasible for a node to broadcast in a given time. At 100/sec, which is an over-estimate, we can broadcast:
Ideally, a node coming back online should be able to publish its full database of advertisements in less than an hour. This suggests that we need to aim for an upper bound of around 200k items per node. Ideally we come in much lower than this.
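A quick back-of-the-envelope script for the numbers above (the 1B entries, 128-byte advertisements, and 100/sec broadcast rate come from this comment; the node counts and replication factors are just illustrative):

```python
# Back-of-the-envelope numbers for the canonical transaction index without
# any reduction. Constants come from the discussion above; the node counts
# and replication factors below are only illustrative.
TOTAL_TX = 1_000_000_000          # ~1 billion transactions
AD_SIZE_BYTES = 128               # rough size of one advertisement
BROADCAST_RATE = 100              # ads/sec, an over-estimate

for nodes in (1_000, 10_000, 100_000):
    for replication in (5, 10, 20):
        ads_per_node = TOTAL_TX * replication // nodes
        storage_gb = ads_per_node * AD_SIZE_BYTES / 1e9
        republish_hours = ads_per_node / BROADCAST_RATE / 3600
        print(
            f"{nodes:>7} nodes, replication {replication:>2}: "
            f"{ads_per_node:>12,} ads/node, "
            f"~{storage_gb:6.2f} GB, "
            f"~{republish_hours:7.1f} h to republish"
        )
```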
Brainstorming some ideas:
@lithp I may be misunderstanding your post, but I don't see any part that actually reduces the overall cost of the network hosting the CTI. Here's the best plan I have so far, so that maybe we can anchor to something concrete. For any block, there is some set of transactions within it. Things I understand about this setup:
This setup is possibly viable, but it means that as much as 99% of the content in the network will be CTI entries, and that sucks.
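For reference, a minimal sketch of what an individual CTI entry presumably boils down to (field names and types are mine, not the project's actual schema); one such record per historical transaction is where the ~1 billion objects come from:

```python
# Sketch of an individual canonical transaction index entry: a mapping from
# transaction hash to the transaction's location in the chain. Names and
# types here are hypothetical, not the project's actual schema.
from typing import NamedTuple


class CTIEntry(NamedTuple):
    transaction_hash: bytes  # 32-byte keccak hash of the transaction
    block_hash: bytes        # 32-byte hash of the block containing it
    transaction_index: int   # position within the block's transaction list


# One entry per historical transaction is what produces the ~1 billion
# objects discussed above, even though each individual entry is tiny.
```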
Yes, sorry, I started out trying to answer this question, but I think I forgot what the original question had been while writing this up. The idea behind both ideas was to prevent anyone from needing to advertise individual CTI entries. Advertising and storing a billion entries is not going to be fun. Each node which decides it wants to serve CTI entries picks the prefix that it wants to serve. It can then make a single advertisement: it only has to advertise that prefix. The problem then becomes how they build their index, and I was trying to brainstorm ways to solve that problem; it seems like it might be easier than the problem of managing 1B advertisements.
Yeah, I'm leaning that way more and more. Advertising 1B things SUCKS; if we can trim that down, we immediately drop 2 orders of magnitude to only being in the tens of millions.
A skip graph of the transaction hashes would allow for ordered traversal of the transaction keys for building the index.... but that requires the network to maintain the skip graph.... which would have 1B entries in it... which would require every one of those entries to be broadcast at some point... but it would not require those to be in the advertisements table.... I think that's an improvement over just doing it with advertisements, since it reduces the advertisement size, and reduces total storage needs since skip graph entries don't have to be constantly renewed like advertisements. I think some complexity still shows back up in managing the skip graph. Ideally we want to be able to reject "junk" entries that don't represent a canonical transaction hash. That means that for each skip graph entry you'd need to be able to validate that the transaction hash is indeed part of the canonical chain... so we'd need to do something like bundle the block number and transaction index into the skip graph entry so that those values can be validated... which means the skip graph alone would be the canonical transaction index and we wouldn't need to use the advertisement system for that at all.
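A rough sketch of that validation idea: each skip graph entry bundles the block number and transaction index alongside the hash so junk entries can be rejected. The `get_canonical_block_body` helper is hypothetical; only the bundle-and-check shape comes from the comment above.

```python
# Sketch of a skip graph entry that bundles location data so junk entries can
# be rejected. `get_canonical_block_body` is a hypothetical helper standing in
# for however a node retrieves a canonical block body.
from typing import NamedTuple


class SkipGraphEntry(NamedTuple):
    transaction_hash: bytes
    block_number: int
    transaction_index: int


def validate_entry(entry: SkipGraphEntry, get_canonical_block_body) -> bool:
    """Accept an entry only if the hash really sits at (block_number, index)
    on the canonical chain; otherwise it is junk and gets rejected."""
    body = get_canonical_block_body(entry.block_number)
    if body is None or entry.transaction_index >= len(body.transactions):
        return False
    actual_hash = body.transactions[entry.transaction_index].hash
    return actual_hash == entry.transaction_hash
```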
What was wrong?
A naive approach to storing the chain history in Alexandria looks like this (as of 2020/11/05):
Depending on how you add these numbers up, you get something in the 2-3 billion range for the total number of objects that need to be stored and retrievable from the network.
Content retrieval depends on the healthy spread of advertisements. We can put a reasonably firm upper bound on how many advertisements we can send out per second at 100, which is probably optimistic. This means that a node that wished to serve all of the content can send about 8.6 million per day (100 * 24 * 60 * 60). That means it'll take about a full year running 24/7 to broadcast everything.
We can also look at this from the view of the number of nodes in the network. If we assume we can bootstrap the network to have somewhere between 1k-100k nodes, we can examine what each node on the network will need to store.
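The arithmetic behind the two paragraphs above, for reference (the object counts, the 100/sec rate, and the 1k-100k node range are taken from the text; nothing else is assumed):

```python
# The arithmetic behind the two paragraphs above. Object counts, the 100/sec
# broadcast rate, and the 1k-100k node range come from the text.
TOTAL_OBJECTS = 2_500_000_000     # middle of the 2-3 billion range
BROADCAST_RATE = 100              # advertisements per second (optimistic)

per_day = BROADCAST_RATE * 24 * 60 * 60          # ~8.6 million per day
days_to_broadcast_all = TOTAL_OBJECTS / per_day  # ~290 days, i.e. about a year
print(f"{per_day:,} ads/day -> ~{days_to_broadcast_all:.0f} days to send all")

for nodes in (1_000, 10_000, 100_000):
    objects_per_node = TOTAL_OBJECTS // nodes    # ignoring replication
    print(f"{nodes:>7} nodes -> {objects_per_node:,} objects per node")
```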
How can it be fixed?
We need to optimize the storage scheme to reduce the total number of objects in the network.
Headers
For headers we may be able to group headers by an "epoch" boundary based on block height. We would pick an epoch size `N` and then group every `N` headers together for storage and retrieval.
The primary problem with this scheme is that it makes headers only retrievable if you know the block numbers. Since the block hash tends to be the canonical way to definitively reference a header, this would make serving `eth_getBlockByHash` difficult, requiring some sort of secondary index to look up block numbers by block hash.
It's likely that we need to store headers individually.
Canonical Header Index
The "Canonical Header Index" maps block number to block hash. Since this index is based on block numbers, it should be easy to group by epochs. Instead of storing a single entry for each block numbers, we would instead store an entry for each
N
blocks.If we used an epoch size of 1000, then retrieving the index entry for height 4567 would mean fetching the index of epoch 5 and digging out the 467th entry from the index.
This would let us reduce 11 million entries down to 11 thousand.
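A minimal sketch of the epoch math described above, using the epoch size of 1000 from the example (function names are illustrative only):

```python
# Epoch math for the canonical header index, using the epoch size of 1000
# from the example above. Function names here are illustrative only.
EPOCH_SIZE = 1000


def epoch_for(block_number: int) -> int:
    """Which epoch index a block number falls into (0-based)."""
    return block_number // EPOCH_SIZE


def offset_within_epoch(block_number: int) -> int:
    """Position of the block's entry inside its epoch's index."""
    return block_number % EPOCH_SIZE


assert epoch_for(4567) == 4            # the epoch covering blocks 4000-4999
assert offset_within_epoch(4567) == 567
```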
Block Bodies
Block bodies house uncles and transactions.
We do want uncles and transactions to be individually retrievable.
Simply by grouping these by block we can reduce the 1 billion transactions down to just the 11 million blocks that they are naturally grouped by. This also works well with `eth_getTransactionByBlockHashAndIndex` style endpoints, which retrieve transactions by their location in the block.
Receipts
Receipts naturally group themselves by blocks as well, letting us reduce the 1 billion receipts down to just the 11 million blocks they are part of.
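A small sketch of how retrieval would look under per-block grouping (the content keys and function names are hypothetical; the shape just mirrors the `eth_getTransactionByBlockHashAndIndex` pattern described above):

```python
# Retrieval under per-block grouping: one network object per block body and
# per receipt bundle, keyed by block hash. `fetch_content` is a hypothetical
# stand-in for the network lookup.
def get_transaction_by_block_hash_and_index(block_hash: bytes, index: int,
                                            fetch_content):
    """Mirrors eth_getTransactionByBlockHashAndIndex: fetch the whole block
    body (one object) and pull out the requested transaction."""
    body = fetch_content(b"block-body/" + block_hash)
    return body.transactions[index]


def get_receipt_by_block_hash_and_index(block_hash: bytes, index: int,
                                        fetch_content):
    """Receipts group the same way: fetch the block's receipt bundle once."""
    receipts = fetch_content(b"receipts/" + block_hash)
    return receipts[index]
```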
Canonical Transaction Index
This one is unsolved. Transactions need to be retrievable by hash.