
merkle: innerHash dominates CPU/RAM consumption by sha256.Sum256(append(bytes...)) but it can be further sped up and allocations reduced as well by using a pool and reusing hashes #43

Closed
odeke-em opened this issue Feb 29, 2024 · 4 comments
Labels: bug (Something isn't working), external

Comments

@odeke-em (Contributor)

Summary of Bug

I was auditing the code for both security vulnerabilities and performance improvements. After reading through the history of HashFromByteSlicesIterative at

func HashFromByteSlicesIterative(input [][]byte) []byte {

I saw that the prior speed-ups had hit their limit. Profiling confirmed that innerHash consumes a lot of CPU and RAM, which is why the recursive vs. iterative hashing strategies show no significant difference; the profiles confirmed that Amdahl's law does hold.

(screenshot: CPU/RAM profile with innerHash dominating)

Analysis

If we examine the code for innerHash in go-square/merkle/hash.go (lines 23 to 31 at c20eda9):

```go
// returns sha256(0x01 || left || right)
func innerHash(left []byte, right []byte) []byte {
	return hash(append(innerPrefix, append(left, right...)...))
}

func hash(bz []byte) []byte {
	h := sha256.Sum256(bz)
	return h[:]
}
```

we can notice an interesting property of

h := sha256.Sum256(bz)

that code calls the package-level sha256.Sum256, which creates a new digest on every call and then discards it, per https://cs.opensource.google/go/go/+/refs/tags/go1.22.0:src/crypto/sha256/sha256.go;l=253
(screenshot: the sha256.Sum256 source in the Go standard library)

and right there is our opportunity!
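To make the cost concrete, here is a minimal, self-contained sketch of the append-based pattern above (assuming innerPrefix is the 0x01 domain-separation byte stated in the code comment); it uses testing.AllocsPerRun to show that every call allocates intermediate buffers plus the returned digest slice:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// innerPrefix mirrors the 0x01 domain-separation byte from the snippet above.
var innerPrefix = []byte{0x01}

// innerHash is the original append-based version: every call builds fresh
// concatenation buffers and a fresh digest, so each call allocates.
func innerHash(left, right []byte) []byte {
	h := sha256.Sum256(append(innerPrefix, append(left, right...)...))
	return h[:]
}

func main() {
	left, right := make([]byte, 32), make([]byte, 32)
	// Average heap allocations across many calls.
	allocs := testing.AllocsPerRun(1000, func() {
		_ = innerHash(left, right)
	})
	fmt.Printf("allocs per innerHash call: %.0f\n", allocs)
}
```

These per-call allocations are exactly what the profiles above attribute to innerHash.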

Remedy and optimization

We can reuse hash states and reset them each time; even better, we can avoid the append calls entirely by invoking h.Write(byteslice) for each piece, with this code:

```go
// requires imports: "crypto/sha256", "hash", "sync"
var sha256Pool = &sync.Pool{New: func() any { return sha256.New() }}

func innerHashPooled(left, right []byte) []byte {
	h := sha256Pool.Get().(hash.Hash)
	defer sha256Pool.Put(h)

	h.Reset()
	h.Write(innerPrefix)
	h.Write(left)
	h.Write(right)

	return h.Sum(nil)
}
```

Results

When we use that new routine, voilà, the results are remarkable:

```shell
$ benchstat before.txt after.txt
name                old time/op    new time/op    delta
HashAlternatives-8    79.9ms ± 3%    73.8ms ± 3%   -7.55%  (p=0.000 n=10+10)

name                old alloc/op   new alloc/op   delta
HashAlternatives-8    28.0MB ± 0%    13.6MB ± 0%  -51.42%  (p=0.000 n=9+10)

name                old allocs/op  new allocs/op  delta
HashAlternatives-8      500k ± 0%      300k ± 0%  -40.00%  (p=0.000 n=10+10)
```

which shaves off more than a second, with comparable gains against leafHash :-)

/cc @elias-orijtech @rootulp @liamsi @musalbas

Version

4135f35

Steps to Reproduce

Please read the above!

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@odeke-em added the bug (Something isn't working) label Feb 29, 2024
odeke-em added a commit to orijtech/go-square that referenced this issue Feb 29, 2024
This change optimizes SHA256 hashing for leafHash & innerHash by pooling and reusing hashes, which massively reduces RAM consumption and even CPU time (benchmarking on my noisy machine is skewed, but the results can be verified individually), producing the results below:

```shell
$ benchstat before.txt after.txt
name                          old time/op    new time/op    delta
HashAlternatives/recursive-8    77.6µs ± 2%    77.0µs ± 2%     ~     (p=0.165 n=10+10)
HashAlternatives/iterative-8    76.3µs ± 1%    76.2µs ± 3%     ~     (p=0.720 n=9+10)

name                          old alloc/op   new alloc/op   delta
HashAlternatives/recursive-8    25.4kB ± 0%     6.4kB ± 0%  -74.94%  (p=0.000 n=10+10)
HashAlternatives/iterative-8    28.1kB ± 0%     9.1kB ± 0%  -67.78%  (p=0.000 n=10+10)

name                          old allocs/op  new allocs/op  delta
HashAlternatives/recursive-8       497 ± 0%       199 ± 0%  -59.96%  (p=0.000 n=10+10)
HashAlternatives/iterative-8       498 ± 0%       200 ± 0%  -59.84%  (p=0.000 n=10+10)
```

Fixes celestiaorg#43
@odeke-em (Contributor, Author)

I've mailed out a PR #44

@Wondertan (Member)

This merkle tree implementation is forked from Tendermint/Comet, and ideally such optimizations should also be submitted there.

Btw, @odeke-em, do you know of any high-quality open source merkle tree implementations shipped as standalone modules? Maintaining yet another merkle tree in go-square doesn't seem like a good idea. Potentially, we could nudge the cosmos ecosystem to make one, depend on it, and submit all the improvements there.

@odeke-em (Contributor, Author)

I've gotten jaded trying to contribute back to Tendermint repositories; I'm not sure I have the time and bandwidth to push up that hill, but at least I've highlighted what y'all need to do :-)

As for high-quality Merkle tree implementations, I had found some, but I'm dashing off to dinner; I'll look at my list tomorrow.

@cmwaters (Collaborator)

We've reverted to using the merkle tree in the tendermint/cometbft library, so this issue can be closed.

If there are any other good candidates for a merkle library, I would also be interested to hear of them.
