Security concerns of having Ehcache as the default #28546

Open
ben-manes opened this issue Jan 26, 2025 · 14 comments

@ben-manes

ben-manes commented Jan 26, 2025

From Using a cache:

Ehcache is the default cache with monoliths in JHipster.

Ehcache 3 is vulnerable to a trivial hash flooding denial-of-service attack, which was reported against 3.0-M4 in 2015. Unfortunately, the authors simply closed the issue without fixing it and have ignored community concerns ever since. When testing the latest release, 3.10.8, the problem is still present. I do not think this should be the default cache, not because there are bugs, but because it has been 10 years of hiding a severe flaw instead of making a trivial fix.

The performance of their eviction policy degrades as the cache size increases, due to a poor sampling implementation in their custom hash table. This implies that the more critical, performance-sensitive caches are the most likely to be vulnerable and the easiest to exploit, due to their central usage. On my M3 Max laptop, Ehcache takes nearly 20 minutes in a database simulation, whereas Guava and Caffeine each take 15 seconds or less. The runtime will be significantly worse on typical, virtualized cloud compute than on a high-performance laptop.

If we apply a simple patch that uses the mix function from SplittableRandom (Stafford's variant 13), the execution time drops to 40 seconds. Further improvements could be made by the developers, but users can mitigate this problem if they are careful with their hash codes.
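
For reference, this is Stafford's variant 13 mixer as it appears in SplittableRandom; a minimal sketch, where the method names and the fold to 32 bits are mine:

static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
}

// e.g. a hash table could widen, mix, and fold the key's hash code
static int mix32(int hashCode) {
    long z = mix64(hashCode);
    return (int) (z ^ (z >>> 32));
}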

HashDoS demonstration

$ git clone https://github.com/ben-manes/caffeine.git
$ pushd caffeine ; git checkout 2fee4bdaff40a76c44cccbdfe954059ed6c3f2bc ; popd
$ wget https://github.com/moka-rs/cache-trace/raw/ef0a9de8cf0202a1f2fee186a0af497774b0f0a9/arc/DS1.lis.zst
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
  -Dcaffeine.simulator.files.format="arc" \
  -Dcaffeine.simulator.maximum-size=4_000_000 \
  -Dcaffeine.simulator.policies.0="product.Ehcache3" \
  -Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤════════════╤════════════╤════════════╤════════════╤═══════════╗
║ Policy           │ Hit Rate │ Miss Rate │ Hits       │ Misses     │ Requests   │ Evictions  │ Time      ║
╠══════════════════╪══════════╪═══════════╪════════════╪════════════╪════════════╪════════════╪═══════════╣
║ product.Ehcache3 │ 28.01 %  │ 71.99 %   │ 12,240,263 │ 31,464,716 │ 43,704,979 │ 27,464,716 │ 18.24 min ║
╚══════════════════╧══════════╧═══════════╧════════════╧════════════╧════════════╧════════════╧═══════════╝

While it runs, we can confirm that this is the problem by using jstack:

$ jps | grep Simulator
29929 Simulator
$ jstack 29929 | grep \"Ehcache3Policy\" -A10
"Ehcache3Policy" #32 [43267] daemon prio=5 os_prio=31 cpu=188377.19ms elapsed=193.78s tid=0x0000000126018400 nid=43267 runnable  [0x0000000172bae000]
   java.lang.Thread.State: RUNNABLE
	at org.ehcache.impl.internal.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3410)
	at org.ehcache.impl.internal.concurrent.ConcurrentHashMap.getEvictionCandidate(ConcurrentHashMap.java:6493)
	at org.ehcache.impl.internal.store.heap.SimpleBackend.getEvictionCandidate(SimpleBackend.java:53)
	at org.ehcache.impl.internal.store.heap.OnHeapStore.evict(OnHeapStore.java:1579)
	at org.ehcache.impl.internal.store.heap.OnHeapStore.enforceCapacity(OnHeapStore.java:1558)
	at org.ehcache.impl.internal.store.heap.OnHeapStore.put(OnHeapStore.java:365)
	at org.ehcache.core.Ehcache.doPut(Ehcache.java:93)
	at org.ehcache.core.EhcacheBase.put(EhcacheBase.java:187)
	at com.github.benmanes.caffeine.cache.simulator.policy.product.Ehcache3Policy.record(Ehcache3Policy.java:80)

We can observe that other caches, such as Guava's LRU, do not have this problem:

$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
  -Dcaffeine.simulator.files.format="arc" \
  -Dcaffeine.simulator.maximum-size=4_000_000 \
  -Dcaffeine.simulator.policies.0="product.Guava" \
  -Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔═══════════════╤══════════╤═══════════╤═══════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy        │ Hit Rate │ Miss Rate │ Hits      │ Misses     │ Requests   │ Evictions  │ Time    ║
╠═══════════════╪══════════╪═══════════╪═══════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Guava │ 20.24 %  │ 79.76 %   │ 8,847,984 │ 34,856,995 │ 43,704,979 │ 30,856,995 │ 14.44 s ║
╚═══════════════╧══════════╧═══════════╧═══════════╧════════════╧════════════╧════════════╧═════════╝

or Caffeine's W-TinyLFU:

$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
  -Dcaffeine.simulator.files.format="arc" \
  -Dcaffeine.simulator.maximum-size=4_000_000 \
  -Dcaffeine.simulator.policies.0="product.Caffeine" \
  -Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤════════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy           │ Hit Rate │ Miss Rate │ Hits       │ Misses     │ Requests   │ Evictions  │ Time    ║
╠══════════════════╪══════════╪═══════════╪════════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Caffeine │ 45.83 %  │ 54.17 %   │ 20,028,995 │ 23,675,984 │ 43,704,979 │ 19,675,984 │ 6.937 s ║
╚══════════════════╧══════════╧═══════════╧════════════╧════════════╧════════════╧════════════╧═════════╝

HashDoS mitigation

$ wget https://github.com/user-attachments/files/18551302/ehcache.patch
$ pushd caffeine ; git apply ../ehcache.patch ; popd
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
  -Dcaffeine.simulator.files.format="arc" \
  -Dcaffeine.simulator.maximum-size=4_000_000 \
  -Dcaffeine.simulator.policies.1="product.Ehcache3" \
  -Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤═══════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy           │ Hit Rate │ Miss Rate │ Hits      │ Misses     │ Requests   │ Evictions  │ Time    ║
╠══════════════════╪══════════╪═══════════╪═══════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Ehcache3 │ 21.97 %  │ 78.03 %   │ 9,600,600 │ 34,104,379 │ 43,704,979 │ 30,104,379 │ 39.94 s ║
╚══════════════════╧══════════╧═══════════╧═══════════╧════════════╧════════════╧════════════╧═════════╝

You might notice that the hit rate of Ehcache's sampled LRU decreased from 28% to more closely match Guava's LRU. This is because a database workload has a strong MRU / LFU bias, so the worse distribution randomly selects a better victim. This would be reversed in a recency-biased workload, such as in data streams.

Addendum: While I disclosed this issue in 2015, I do not interact with the Ehcache developers. They publicly called me a fraud and claimed that I faked and manipulated my benchmarks. These were baseless accusations with no technical foundation, intended simply to slander my project before it had much adoption. I find it no surprise that their project has withered instead, perhaps due to that lack of professionalism. While my project certainly had bugs and flaws that were found and addressed, and I have voiced concerns over alternatives' claims, in every instance I have used data and an open, fair analysis to explain my beliefs. As JHipster is one of the few projects that still prefers Ehcache, I think it is worth your being aware of this problem, so you can perhaps make progress in having it resolved, or reconsider the defaults for your users.
@atomfrede
Member

Thanks @ben-manes for the detailed report. I think Ehcache was one of the early decisions that has never been reconsidered much.
I agree with using another one, as I have personally used Guava and Caffeine in other projects.

@atomfrede
Member

@jhipster/developers What do you think? I remember some older discussions about making Caffeine the default choice, but I can't find them anymore.

@mraible
Contributor

mraible commented Jan 30, 2025 via email

@atomfrede
Member

@mraible I can't find any dedicated "primary" recommendation (https://docs.spring.io/spring-boot/reference/io/caching.html or https://docs.spring.io/spring-framework/reference/integration/cache/store-configuration.html) in the docs. Or do we use Ehcache by default because it integrates easily(?) with Hibernate's 2nd level cache?
It looks like we support both (Ehcache and Caffeine) for the Hibernate 2nd level cache via the JCache abstraction. So maybe we can make Caffeine the default (with the next release).

@ben-manes
Author

I am certainly fine if you keep your defaults but are able to wrangle the Ehcache team into fixing this problem and releasing an upgrade. I consider this to be a serious issue and am very disappointed that they do not care. Since it has not been exploited, this may be an overreaction on my part, as I would have resolved it asap in their place (Caffeine's 2.2.7 release).

@atomfrede
Member

I think Caffeine is more versatile and easier to use, so it would be a "win-win".

@jdubois
Member

jdubois commented Jan 30, 2025

I'm the culprit for Ehcache, but that's only because at the time (more than 10 years ago!!) there wasn't much choice! The firm decision for me is to have a 2nd level cache; the implementation is totally up for discussion!
I believe that Caffeine does indeed seem to be technically better, and I've also seen @ben-manes several times in our discussions; the relationship with and support for the 3rd-party libraries we use is also super important (I'd rather work with friendly people who contribute and help, as that's how the project has always been successful).
I'm not sure of the impact of migrating the default cache... There's a "hidden cost" in tools like JHipster Online, as we expose this in the UI. So it would also depend on that work.

@mraible
Contributor

mraible commented Jan 30, 2025

@henri-tremblay Since you're a contributor to Ehcache, do you have any thoughts on this?

@henri-tremblay
Contributor

I am not working on Ehcache anymore and have lost track of its development a bit. Spring Boot detects caches in the following order: https://docs.spring.io/spring-boot/docs/2.1.7.RELEASE/reference/html/boot-features-caching.html#boot-features-caching-provider. Ehcache is second after the generic one and Caffeine is near the bottom. I have no opinion on that, but it's what they do.

If I remember well, Ehcache, at least by default, will indeed be unhappy if the cache is full and you keep adding to it, thus causing evictions to occur. It traverses the cache through a small sample of entries to find the best one to evict. It's not a crazy way to do it, and I think it can be tweaked with some configuration. I remember it as a trade-off: the cache works super well but should never be full. Something like that. But I will let the current Ehcache team chip in to give a full picture. @chrisdennis maybe? Hopefully everyone will stay calm.

@Sanne

Sanne commented Feb 7, 2025

For what it's worth, in Quarkus we only offer two options OOB:

  • Infinispan for distributed / cluster needs. Its eviction strategies are based on Caffeine.
  • A custom "light" Quarkus cache for simple local needs. Also based on Caffeine.

In both cases, big thanks to @ben-manes as he helped us make a really good integration.

I believe that, as in any OSS project, it's not too hard for end users to plug in an alternative implementation, but I'd not recommend that at all; it's very tricky business, and people tend to measure performance under the wrong conditions or look at metrics that are not as critical as the cache's dynamic behaviour under stress.

Definitely a security issue here.

@mshima mshima added the v9 label Feb 7, 2025
@mshima
Member

mshima commented Feb 7, 2025

I've opened a poll: #28630.

@mshima mshima mentioned this issue Feb 7, 2025
@chrisdennis

The Ehcache team are aware that there is scope for performance improvement in the current local caching algorithms used in Ehcache 3. There are plans to address this in the near future, but given our current workload I cannot commit us to any specific timeline beyond "this calendar year". (Ehcache is not the only project that the team maintains, and the team is unfortunately very busy with work items tied to our acquisition by IBM at the moment.)

I also disagree with the severity being applied (or implied) to this security issue, and with the subsequent use of that status as leverage in this case. Further, I would disagree with any suggestion that additional static mixing is a trivial fix for the problem. My opinion is that static hash mixing alone cannot prevent a hash flooding attack; it just amounts to security through obscurity. There are, however, approaches that could be used to randomize the mixing in such a way as to make discovery of the appropriate flooding attack infeasible.

I would also like to take this opportunity to object to the professional smear that @ben-manes appears to want to continue applying, in blanket fashion, to the long list of past and present developers of Ehcache. We are simply people doing our best to contribute toward our little corner of the open source software community. We don't expect or solicit praise, but I feel that I can and must request that we not be labelled as unprofessional or be accused of slander in public forums. If, however, any statement made by a member of the team, whether past or present, was intentionally or unintentionally slanderous, was taken in that manner, or has in any way besmirched his good name, then let me take this opportunity to apologize on behalf of the team.

Regarding the decision of the JHipster team as to their default caching provider, in the interests of transparency: I will make no attempt to lobby for or against any specific implementation, nor will I participate in any poll regarding what the default caching provider for JHipster should be. I am however prepared to answer any factual or technical questions that any community member might have about Ehcache 3 to the best of my abilities.

@ben-manes
Author

I will happily take that as an apology or an agreed misunderstanding, and allow that tension to finally subside. I have consciously stayed far away from your project because of this, as I did not want to feel that aggravated stress. Admittedly, avoidance did not resolve the festering frustration, and it continued to upset me.

I strongly disagree with the technical decisions, given how this can impact users. If there are design decisions that characterize LRU eviction as misuse, then they should be communicated to users for their benefit, so that they can best enjoy otherwise wonderful software. Since trust is the most important characteristic between users and their solutions, the very low importance given to this issue over the years conveys a lot, and those further unknowns are what I find the most worrisome.

While design philosophies may differ, I believe defaults should lead users towards safe, secure, functional, and easily understandable systems. Beyond that, the rest are features, like simplicity or speed or documentation, which are each very valuable. My only ask of the JHipster team is that they understand their defaults in order to best serve their users and treat them empathetically.

@ben-manes
Author

ben-manes commented Feb 23, 2025

Perhaps my explanations focused too much on the problematic execution times rather than on the underlying data structures and why they degrade. That analysis, which I think is actually quite interesting, is what the maintainers should have performed, and hopefully it will make evident how this impacts end users (such as in this 2018 experience report).

The fundamental problem

Typically an object's hashCode() is a light accumulation of its attributes. For example, photos in an album might be hashed by the author id, album id, and photo id. As each of those may be sequence numbers, and two of the attributes are shared by all photos in the album, a simplistic hash function like 31 * result + e.hashCode() will scramble only a few bits in the bottom half of the integer. The hash table selects its array location by a modulus; for example, mod 16 keeps the least significant four bits. This leads to clustering, where entries concentrate in a subset of bins, making some significantly more populated than others due to collisions. When the hash table reaches its load factor (75% full), it resizes to maintain a more uniform distribution.
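
As a sketch of that arithmetic (the key attributes here are hypothetical):

int userId = 1000, albumId = 1, photoId = 1;               // sequence numbers
int hash = 31 * (31 * (31 + userId) + albumId) + photoId;  // Objects.hash-style accumulation
int bin  = hash & (16 - 1);                                // mod 16 keeps the low four bits

Incrementing photoId perturbs only the lowest bits of hash, so consecutive photos land in adjacent bins.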

A traditional hash table mitigates clustering by applying a stronger bit mixer, such as Murmur3, to make the bit patterns of similar keys more distinct. The goal is that with a single bit flip, like the photo id incrementing, the output should appear effectively random. This is known as the avalanche effect, and tools like hash-prospector try to find hash functions with very low bias. We can see this in the output of their 2-round function (adjusted for Java by using an unsigned right shift, as sketched below), using Objects.hash(...) for the key's hashCode(), Integer.toBinaryString(int), and jshell to generate a few examples for a hash table with 1024 bins. The hash code changes only in its 8 least significant bits, and there is no discernible pattern once spread.
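
A minimal sketch of such a 2-round function, using hash-prospector's lowbias32 constants; the exact constants behind the examples below may differ:

static int spread(int x) {
    x ^= x >>> 16;
    x *= 0x7feb352d;   // multiply-xorshift rounds found by exhaustive search
    x ^= x >>> 15;
    x *= 0x846ca68b;
    x ^= x >>> 16;
    return x;
}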

(user, album, photo)  hash                  spread(hash)                      index  spread index
(1000, 1, 1)          11110001111001100111  11110010101000010100110011110100    615           244
(1000, 1, 2)          11110001111001101000  11100001010000100110101101011100    616           860
(1000, 2, 1)          11110001111010000110  00011110100011000110011100110110    646           822
(1000, 2, 2)          11110001111010000111  10001100001001111101101111101000    647          1000

In a traditional hash flooding attack, the concern is that a non-uniform distribution of entries across the hash table causes lookups to degrade to O(n) while searching a linked list. To protect the static hash function from being defeated by specially crafted keys, it is often enhanced by using the application's startup system time (not wall clock) as an XOR seed to force random bit flips, e.g. spread(e.hashCode() ^ SEED).
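
A sketch of that hardening, reusing the spread(...) function above; the seed derivation and names are illustrative:

static final int SEED = (int) System.nanoTime();  // per-process seed from the monotonic clock

static int saltedSpread(int hashCode) {
    return spread(hashCode ^ SEED);  // a precomputed set of colliding keys does not survive a restart
}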

Java's approach was to instead switch from linked-list bins to red-black tree bins so that the worst case is O(lg n), e.g. 1M entries require up to 20 comparisons and 1B require up to 30. This avoids any direct attack on the hash function. The goal of its mixing function, (h ^ (h >>> 16)) & 0x7fffffff, is to spread entropy from the higher bits down into the lower bits, because the bin is selected using the least significant bits. It does not randomize poor hash distributions (no avalanche), which allows for clustering, so key patterns like sequential numbers or low-entropy strings will map to nearby bins. Since the map is designed for fast key lookups this is perfectly fine, because the difference between a more complex mixing function and a few more comparisons is noise, so the authors may have viewed stronger mixing as unnecessary, or as easily misunderstood as security-sensitive like in traditional hash tables.
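
Paraphrased from the JDK's ConcurrentHashMap (renamed here to avoid clashing with the earlier sketch):

static final int HASH_BITS = 0x7fffffff;  // the sign bit is reserved for internal node types

static int jdkSpread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;  // folds the high half down; no avalanche
}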

In either design, iterating over the hash table requires visiting every bin. The load factor means that the array doubles when the map reaches 75% of its capacity in order to reduce collisions into the same bins. Said another way, a load factor of 0.75 means that at least 25% of the table is always available (e.g. empty slots). While less critical for tree-based bins, maintaining free space contributes to near-O(1) performance and, due to per-bin locking, higher write concurrency. Iterating over the hash table requires traversing the entire array and its bins, an O(N + number of bins) operation. This is not a common operation or the primary use case for a Map, especially a large one, so it is expected to have acceptable but not optimal performance. The clustering causes most slots to be empty, but that has little impact on the overall time complexity or performance of Map operations.

The addition of eviction sampling invalidates the above design assumptions. In the provided example, at 4M entries the array length is 8M (2^23), and the non-uniform spreader leaves only 26% of the slots occupied by bins (2,211,582 bins and 6,177,026 null slots). To perform an eviction, a uniformly random start index is chosen in the array, which is then scanned for bins in order to compare the first 8 entries found. As the key hash codes are allowed to cluster, the bins tend to form in groups within the table. In one case when attaching the debugger, where the start index was 1,364,883, the traverser had to scan 967,573 null array slots before it found the next bin. Since the hash table has many large gaps where a search might randomly begin, the better-distributed entries are likely to be found first, so their bins are emptied and the gaps widen. The eviction effectively defragments the array, making the clustering worse, and skews the policy towards MRU because the older items are protected by their neighbors.
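
A hedged sketch of the failure mode (not Ehcache's actual code, and java.util imports are assumed): the sampler starts at a uniformly random slot and must step over every null slot in a gap before it can gather its candidates.

static <K, V> List<Map.Entry<K, V>> sample(Map.Entry<K, V>[] table, int count, Random random) {
    List<Map.Entry<K, V>> candidates = new ArrayList<>(count);
    int start = random.nextInt(table.length);
    for (int i = 0; i < table.length && candidates.size() < count; i++) {
        Map.Entry<K, V> bin = table[(start + i) % table.length];
        if (bin != null) {        // with clustering, ~74% of the slots are null, so
            candidates.add(bin);  // a scan may cross ~1M empty slots between bins
        }
    }
    return candidates;
}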


Can the compiler and hardware help?

One could argue that scanning contiguous memory is well optimized by hardware, so eviction may still be fast despite not being O(1). The problem with that is an over-reliance on the hardware prefetcher and cache. An array length of 8M requires either 32MB or 64MB, depending on whether the garbage collector uses compressed (32-bit) pointers for heaps under 32GB. On a modern high-end machine this might fit into the L3 cache if lucky, but that is very unlikely given the other application data, such as the entries themselves. For smaller workloads the impact is reduced because the CPU cache can absorb it in a simulated environment; in a cloud environment this is less likely to help due to competing work. The result is many cache misses and slower execution times.

Ideally, the prefetcher would assist once it detects a contiguous scan. A sequential lookahead could range from 2-8 KB (32-128 cache lines), and for a distance of ~1M slots that means scanning 4MB or 8MB of memory. Even with aggressive fetching, since each step is a simple null check the traversal is limited by the DRAM latency of 50-100ns. That calculates as ((scan size) / (prefetch size)) * (dram latency), resulting in a range of roughly 25,000ns (25us) to 400,000ns (0.4ms) per run. The Linux thread scheduler uses an adaptive time slice of 1-6ms, allowing approximately 2-240 evictions per thread before it is suspended. For most array traversals the compiler optimizes by loop unrolling, auto-vectorizing, etc., which can improve prefetching and enable larger lookahead sizes.
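
Reproducing those bounds, using the figures assumed above:

long best  = (4L << 20) / (8 << 10) * 50;   // 4 MB span / 8 KB lookahead * 50 ns  =  25,600 ns (~25 us)
long worst = (8L << 20) / (2 << 10) * 100;  // 8 MB span / 2 KB lookahead * 100 ns = 409,600 ns (~0.4 ms)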

Unfortunately, optimization by the hardware and compiler is hindered by the concurrent nature of the data structure. This introduces memory barriers, like volatile reads, which prevent reordering of memory accesses, restrict prefetching, and break the code shape the compiler needs to apply its optimizations. Since the array is scanned sequentially from random offsets, the hardware cache likely won't see much reuse across concurrent evictions as they search different areas, and the CPU cache may aggressively evict those lines due to sequential-streaming detection in an effort to be scan resistant.

Mitigations

Getting back to the original problem, we saw that clustering caused the long traversals when trying to find 8 candidates to evaluate for eviction. There are two obvious solutions that would not require extra memory or a major change.

The sampled eviction assumes that entries are uniformly distributed, as in a traditional hash table, so that its random starting point can find non-empty bins quickly. A stronger hash function restores that characteristic, and the runtime drops dramatically in real, non-malicious workloads. While this would ideally be an internal change, users can premix their hash codes as a workaround, as sketched below.
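
For example, a user might wrap their keys so the table only ever sees premixed hash codes. MixedKey is an illustrative name (not an Ehcache API), and mix64 refers to the Stafford variant 13 sketch earlier:

record MixedKey<K>(K key) {
    @Override
    public int hashCode() {
        long z = mix64(key.hashCode());  // strong avalanche before the table's weak spreader
        return (int) (z ^ (z >>> 32));   // fold the 64-bit result to 32 bits
    }
    // the record's generated equals compares the wrapped key, consistent with hashCode
}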

Another approach would be to reuse the iterator across evictions, relying on its weakly consistent behavior. This would avoid the random starts in the large gaps by galloping through each gap once and spending more time within the clusters. It would require an internal change to serialize the eviction process and use a cyclical iterator, instead of the current concurrent eviction approach; see the sketch below. This approach still suffers the eviction latency cost of a large gap, but reduces how often that penalty is incurred.
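
A hedged sketch of that idea (not Ehcache's code; java.util imports assumed): a single serialized evictor keeps its position between evictions and wraps around when the weakly consistent iterator is exhausted.

class CyclicalSampler<K, V> {
    private final ConcurrentHashMap<K, V> map;
    private Iterator<Map.Entry<K, V>> cursor = Collections.emptyIterator();

    CyclicalSampler(ConcurrentHashMap<K, V> map) {
        this.map = map;
    }

    synchronized List<Map.Entry<K, V>> nextCandidates(int count) {
        List<Map.Entry<K, V>> candidates = new ArrayList<>(count);
        while (candidates.size() < count) {
            if (!cursor.hasNext()) {
                cursor = map.entrySet().iterator();  // wrap around: a new weakly consistent pass
                if (!cursor.hasNext()) {
                    break;                           // the map is empty
                }
            }
            candidates.add(cursor.next());
        }
        return candidates;
    }
}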

Combining the hash mixer and the cyclical eviction iterator would give the best results. While the iterator reuse does not rely on the hash quality, it benefits from the uniform mixer removing the spans it must traverse. If a collision attack against the hash function were successful, its impact would then be minimal, making the effort not worthwhile.
