
merkle: innerHash dominates CPU/RAM consumption by sha256.Sum256(append(bytes...)) but it can be further sped up and allocations reduced as well by using a pool and reusing hashes #43

Closed
odeke-em opened this issue Feb 29, 2024 · 4 comments
Labels: bug (Something isn't working), external

Comments

@odeke-em (Contributor)

Summary of Bug

I was auditing the code for both security vulnerabilities and performance improvements. After reading through the history of HashFromByteSlicesIterative at

func HashFromByteSlicesIterative(input [][]byte) []byte {

I saw that the prior speed-ups had hit their limit. Profiling confirmed that innerHash consumes a lot of CPU and RAM, which is why the recursive vs. iterative hashing strategies show no significant difference; the profiles confirmed that Amdahl's law does hold.

(screenshot: CPU/RAM profile with innerHash dominating)

Analysis

If we examine the code for innerHash in go-square/merkle/hash.go (lines 23 to 31 at c20eda9):

```go
// returns sha256(0x01 || left || right)
func innerHash(left []byte, right []byte) []byte {
	return hash(append(innerPrefix, append(left, right...)...))
}

func hash(bz []byte) []byte {
	h := sha256.Sum256(bz)
	return h[:]
}
```

we can notice an interesting property of

h := sha256.Sum256(bz)

that code calls the package-level sha256.Sum256, which creates a new digest on every call and then discards it, per https://cs.opensource.google/go/go/+/refs/tags/go1.22.0:src/crypto/sha256/sha256.go;l=253
(screenshot: the sha256.Sum256 source in the Go standard library)

and right there is our opportunity!
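To make the cost concrete, here is a minimal, self-contained sketch of the append-based pattern above (assuming innerPrefix is the 0x01 domain-separation byte stated in the code comment); it uses testing.AllocsPerRun to show that every call allocates intermediate buffers plus the returned digest slice:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// innerPrefix mirrors the 0x01 domain-separation byte from the snippet above.
var innerPrefix = []byte{0x01}

// innerHash is the original append-based version: every call builds fresh
// concatenation buffers and a fresh digest, so each call allocates.
func innerHash(left, right []byte) []byte {
	h := sha256.Sum256(append(innerPrefix, append(left, right...)...))
	return h[:]
}

func main() {
	left, right := make([]byte, 32), make([]byte, 32)
	// Average heap allocations across many calls.
	allocs := testing.AllocsPerRun(1000, func() {
		_ = innerHash(left, right)
	})
	fmt.Printf("allocs per innerHash call: %.0f\n", allocs)
}
```

These per-call allocations are exactly what the profiles above attribute to innerHash.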

Remedy and optimization

We can reuse hash states and reset them each time; even better, we can avoid the append calls entirely by invoking h.Write(byteslice) for each piece, with this code:

```go
// requires imports: "crypto/sha256", "hash", "sync"
var sha256Pool = &sync.Pool{New: func() any { return sha256.New() }}

func innerHashPooled(left, right []byte) []byte {
	h := sha256Pool.Get().(hash.Hash)
	defer sha256Pool.Put(h)

	h.Reset()
	h.Write(innerPrefix)
	h.Write(left)
	h.Write(right)

	return h.Sum(nil)
}
```

Results

When we use that new routine, voilà, the results are remarkable:

```shell
$ benchstat before.txt after.txt
name                old time/op    new time/op    delta
HashAlternatives-8    79.9ms ± 3%    73.8ms ± 3%   -7.55%  (p=0.000 n=10+10)

name                old alloc/op   new alloc/op   delta
HashAlternatives-8    28.0MB ± 0%    13.6MB ± 0%  -51.42%  (p=0.000 n=9+10)

name                old allocs/op  new allocs/op  delta
HashAlternatives-8      500k ± 0%      300k ± 0%  -40.00%  (p=0.000 n=10+10)
```

which shaves off more than a second, with comparable gains against leafHash :-)

/cc @elias-orijtech @rootulp @liamsi @musalbas

Version

4135f35

Steps to Reproduce

Please read the above!

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@odeke-em added the bug (Something isn't working) label Feb 29, 2024
odeke-em added a commit to orijtech/go-square that referenced this issue Feb 29, 2024
This change optimizes SHA256 hashing for leafHash & innerHash by pooling and reusing hashes, which massively reduces RAM consumption and even CPU time (benchmarking on my noisy machine is skewed, but the results can be verified individually), producing the results below:

```shell
$ benchstat before.txt after.txt
name                          old time/op    new time/op    delta
HashAlternatives/recursive-8    77.6µs ± 2%    77.0µs ± 2%     ~     (p=0.165 n=10+10)
HashAlternatives/iterative-8    76.3µs ± 1%    76.2µs ± 3%     ~     (p=0.720 n=9+10)

name                          old alloc/op   new alloc/op   delta
HashAlternatives/recursive-8    25.4kB ± 0%     6.4kB ± 0%  -74.94%  (p=0.000 n=10+10)
HashAlternatives/iterative-8    28.1kB ± 0%     9.1kB ± 0%  -67.78%  (p=0.000 n=10+10)

name                          old allocs/op  new allocs/op  delta
HashAlternatives/recursive-8       497 ± 0%       199 ± 0%  -59.96%  (p=0.000 n=10+10)
HashAlternatives/iterative-8       498 ± 0%       200 ± 0%  -59.84%  (p=0.000 n=10+10)
```

Fixes celestiaorg#43
@odeke-em (Contributor, Author)

I've mailed out a PR #44

@Wondertan (Member)

This merkle tree implementation is forked from Tendermint/Comet, and ideally such optimizations should also be submitted there.

Btw, @odeke-em, do you know of any high-quality open source merkle tree implementations shipped as standalone modules? Maintaining yet another merkle tree in go-square doesn't seem like a good idea. Potentially, we could nudge the cosmos ecosystem to make one, depend on it, and submit all the improvements there.

@odeke-em (Contributor, Author)

I've gotten jaded trying to contribute back to Tendermint repositories; I'm not sure I have the time and bandwidth to push up that hill, but at least I've highlighted what y'all need to do :-)

As for high-quality Merkle tree implementations, I had found some, but I'm dashing off to dinner; I'll look at my list tomorrow.

@cmwaters (Collaborator)

We've reverted to using the merkle tree in the tendermint/cometbft library, so this issue can be closed.

If there are any other good candidates for a merkle library, I would also be interested to hear of them.
