Security concerns of having Ehcache as the default #28546
Comments
Thanks @ben-manes for the detailed report. I think ehcache was one of the early decisions which has never really been reconsidered.
@jhipster/developers What do you think? I remember some older discussions about making caffeine the default choice, but can't find them anymore.
I think we should use whatever cache the Spring Boot team recommends.
@mraible I can't find any dedicated "primary" recommendation (https://docs.spring.io/spring-boot/reference/io/caching.html or https://docs.spring.io/spring-framework/reference/integration/cache/store-configuration.html) in the docs. Or do we use ehcache by default because it integrates easily(?) with Hibernate's 2nd level cache?
I am certainly fine if you keep your defaults but are able to wrangle the Ehcache team into fixing this problem and releasing an upgrade. I consider this to be a serious issue and am very disappointed that they do not care. Since it has not been exploited, it may be an overreaction on my part, as I would have resolved it asap in their place (Caffeine's 2.2.7 release).
I think caffeine is more versatile and easier to use, so it would be a "win-win".
I'm the culprit for ehcache, but that's only because at that time (more than 10 years ago!!) there wasn't much choice! The strong decision for me is to have a 2nd level cache, then for the implementation that's totally up for discussion!
@henri-tremblay Since you're a contributor to Ehcache, do you have any thoughts on this?
I am not working on Ehcache anymore and have lost track of its development. Spring Boot detects caches in the following order: https://docs.spring.io/spring-boot/docs/2.1.7.RELEASE/reference/html/boot-features-caching.html#boot-features-caching-provider. Ehcache is second, after the generic one, and caffeine is at the bottom. I have no opinion on that, but it's what they do. If I remember well, Ehcache, at least by default, will indeed be unhappy if the cache is full and you keep adding to it, thus causing evictions to occur. It traverses the cache through a little sample of entries to find the best one to evict. It's not a crazy way to do it, and I think it can be tweaked with some configuration. I remember it as a trade-off: the cache works super well but should never be full. Something like that. But I will let the current Ehcache team chip in to give a full picture. @chrisdennis maybe. Hopefully everyone will stay calm.
For what it's worth, in Quarkus we only offer two options OOB:
In both cases, big thanks to @ben-manes as he helped us make a really good integration. I believe, as in any OSS project, it's not too hard for end users to plug in an alternative implementation, but I'd not recommend that at all: it's very tricky business, and people tend to measure performance under the wrong conditions or look at metrics which are not as critical as the cache's dynamic behaviour under stress. Definitely a security issue here.
I've opened a poll: #28630.
The Ehcache team are aware that there is scope for performance improvement in the current local caching algorithms used in Ehcache 3. There are plans to address this in the near future, but given our current workload I cannot commit us to any specific timeline beyond "this calendar year". (Ehcache is not the only project that the team maintains, and the team is unfortunately very busy with work items tied to our acquisition by IBM at the moment.)

I also disagree with the severity being applied (or implied) to this security issue, and the subsequent use of that status as leverage in this case. Further, I would also disagree with any suggestion that additional static mixing is a trivial fix for the problem. My opinion is that static hash mixing alone cannot prevent a hash flooding attack; it just amounts to security through obscurity. There are however approaches that could be used to randomize the mixing in such a way as to make discovery of the appropriate flooding attack infeasible.

I would also like to take this opportunity to object to the professional smear that @ben-manes appears to continue to want to apply in a blanket fashion to the long list of past and present developers of Ehcache. We are simply people doing our best to contribute toward our little corner of the open source software community. We don't expect or solicit praise, but I feel that I can and must request that we not be labelled as unprofessional or be accused of slander in public forums. If however, any statement made by a member of the team, whether past or present, was intentionally or unintentionally slanderous, was taken in that manner, or has in any way besmirched his good name, then let me take this opportunity to apologize on behalf of the team.

Regarding the decision of the JHipster team as to their default caching provider, in the interests of transparency: I will make no attempt to lobby for or against any specific implementation, nor will I participate in any poll regarding what the default caching provider for JHipster should be. I am however prepared to answer any factual or technical questions that any community member might have about Ehcache 3 to the best of my abilities.
I will happily take that as an apology or an agreed misunderstanding, and allow that tension to finally subside. I have consciously stayed far away from your project because of this, as I did not want to feel that aggravated stress. Admittedly, avoidance did not resolve the festering frustration, and it continued to upset me.

I strongly disagree with the technical decisions regarding how this can impact users. If there are design decisions that characterize LRU eviction as misuse, then they should be communicated to users for their benefit, so that they can best enjoy otherwise wonderful software. Since trust is the most important characteristic between users and their solutions, the very low importance given to this issue over the years conveys a lot, and those further unknowns are what I find most worrisome.

While design philosophies may differ, I believe defaults should lead users towards safe, secure, functional, and easily understandable systems. Beyond that, the rest are features, like simplicity or speed or documentation, which are each very valuable. My only ask of the JHipster team is that they understand their defaults in order to best serve their users and treat them empathetically.
Perhaps my explanations focused too much on the problematic execution times rather than on the underlying data structures and why they degrade. That analysis is what the maintainers should have performed, and I think it is actually quite interesting; hopefully it will make evident how this impacts end users (such as in this 2018 experience report).

The fundamental problem

Typically an object's `hashCode` is derived directly from its fields, so similar keys, such as sequentially incremented photo ids, produce similar bit patterns. A traditional hash table mitigates clustering by applying a stronger bit mixer, such as Murmur3, to make the bit patterns of similar keys more distinct. The goal is that with a single bit flip, like the photo id increment, the output should appear effectively random. This is known as the avalanche effect, and tools like hash-prospector try to find hash functions with a very low bias. We can see this in the output of their 2-round function (adjusted for Java by using an unsigned right shift).
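A sketch of that 2-round mixer ported to Java, assuming the published lowbias32 constants (the exact variant the report used may differ):

```java
// Port of hash-prospector's 2-round mixer ("lowbias32") to Java,
// substituting the unsigned right shift for C's shift on a uint32_t.
static int mix(int x) {
    x ^= x >>> 16;
    x *= 0x7feb352d;
    x ^= x >>> 15;
    x *= 0x846ca68b;
    x ^= x >>> 16;
    return x;
}
```

Feeding it sequential keys shows the avalanche effect: `mix(1)` and `mix(2)` differ in roughly half of their 32 bits, while the raw values differ in only two.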
In a traditional hash flooding attack, the concern is that a lack of uniform distribution of entries across the hash table causes it to degrade to an O(n) lookup when searching a linked list. To protect the static hash function from being defeated by specially crafted keys, it is often enhanced by using the application's startup system time (not wall clock) as an XOR seed to force random bit flips.

Java's approach was instead to switch from linked list bins to red-black tree bins, so that the worst case is O(lg n): 1M entries require up to 20 comparisons and 1B up to 30. This avoids any direct attack on the hash function. The goal of its mixing function, `h ^ (h >>> 16)`, is merely to fold the high bits into the low bits, since only the low-order bits select the bin.

In either design, iterating over the hash table requires visiting every bin. The load factor of 0.75 means that the array doubles when the map reaches 75% of its expected capacity, in order to reduce collisions into the same bins. Said another way, at least 25% of the table is always available (e.g. empty bins). While less critical for tree-based bins, maintaining free space contributes to near O(1) performance and, due to per-bin locking, higher write concurrency. Iterating over the hash table requires traversing the entire array and its bins, resulting in an O(N + table length) operation. This is not a common operation or the primary use case for a Map, especially not for large ones, so it is expected to have acceptable but not optimal performance. Clustering causes most of the array slots to be empty, which has little impact on the overall time complexity or performance of Map operations.

The addition of eviction sampling invalidates the above design assumptions. In the provided example, at 4M entries the array length is 8M (2^23) and the non-uniform spreader causes only 26% of those slots to be bins (2,211,582 bins and 6,177,026 null slots). To perform an eviction, a uniformly random start index is chosen in the array, which then scans for bins so as to compare the first 8 entries found. As the key hash codes are allowed to cluster, the bins tend to form in groups within the table. In one case when attaching the debugger, where the start index was 1,364,883, the traverser had to scan 967,573 null array slots before it found the next bin. Since the hash table has many large gaps where a search might randomly begin, the better-distributed entries are likely to be found first, leading to their bins being emptied and the gaps widening. The eviction effectively defragments the array, making the clustering worse, and skews the policy towards MRU because the older items are protected by their neighbors.
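To make this concrete, here is a hypothetical simulation (not Ehcache's actual code; the table and sample sizes mirror the numbers above) of how far a sampled eviction must walk from a random start when keys cluster versus when they are premixed:

```java
import java.util.concurrent.ThreadLocalRandom;

public final class SamplingScanDemo {
    static final int TABLE_LEN = 1 << 23;  // 8M slots holding ~4M entries
    static final int SAMPLE = 8;           // candidates gathered per eviction

    // Walks forward from a random index, counting slots visited until
    // enough occupied bins are found to form an eviction sample.
    static long scanDistance(boolean[] occupied) {
        int start = ThreadLocalRandom.current().nextInt(TABLE_LEN);
        long visited = 0;
        for (int found = 0; found < SAMPLE; visited++) {
            if (occupied[(start + (int) visited) & (TABLE_LEN - 1)]) {
                found++;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        boolean[] clustered = new boolean[TABLE_LEN];
        boolean[] uniform = new boolean[TABLE_LEN];
        for (int key = 0; key < 4_000_000; key++) {
            clustered[key & (TABLE_LEN - 1)] = true;    // sequential ids pack into one region
            uniform[mix(key) & (TABLE_LEN - 1)] = true; // premixed ids spread across the table
        }
        System.out.println("clustered: " + scanDistance(clustered) + " slots scanned");
        System.out.println("uniform:   " + scanDistance(uniform) + " slots scanned");
    }

    static int mix(int x) {  // the 2-round mixer shown earlier
        x ^= x >>> 16; x *= 0x7feb352d;
        x ^= x >>> 15; x *= 0x846ca68b;
        return x ^ (x >>> 16);
    }
}
```

A start index that lands in the empty half of the clustered table scans on the order of millions of null slots, while the uniform table finds its 8 candidates within a few dozen.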
Can the compiler and hardware help?

One could argue that scanning contiguous memory is well optimized by hardware, so eviction may still be fast despite not being O(1). The problem with that is an over-reliance on the hardware prefetcher and cache. An array length of 8M requires either 32MB or 64MB, depending on whether the garbage collector uses compressed (32-bit) pointers for heaps under 32GB. On a modern high-end machine this might fit into the L3 cache if lucky, but that is very unlikely given the other application data, like the entries themselves. For smaller workloads the impact is reduced because the CPU cache can absorb it in a simulated environment; in a cloud environment this is less likely to help due to competing work. This can lead to many cache misses and slower execution times.

Ideally, the prefetcher would assist when detecting a contiguous scan. A sequential lookahead could range from 2-8 KB (32-128 cache lines), and for a distance of ~1M slots that means streaming 4MB or 8MB of memory. Even if it aggressively fetches, since each step is a simple null check, the traversal would be limited by the DRAM latency of 50-100ns. That calculates as roughly 1M slots at 4 bytes each, about 60,000 cache lines, so on the order of 3-6ms for a single eviction's scan (twice that with uncompressed pointers).

Unfortunately, optimization by the hardware and compiler is hindered by the concurrent nature of the data structure. This introduces memory barriers, like volatile reads, which prevent reordering memory accesses, restrict prefetching, and break the code shape that the compiler needs to apply its optimizations. Since the array is scanned sequentially at random offsets, the hardware cache likely won't see much reuse across concurrent evictions, as they search different areas, and the CPU cache may aggressively evict those entries due to sequential-streaming detection in an effort to be scan resistant.

Mitigations

Getting back to the original problem, we saw that clustering caused the long traversals when trying to find 8 candidates to evaluate for eviction. There are two obvious solutions that would not require extra memory or a major change.

The sampled eviction assumes that entries are uniformly distributed, per a traditional hash table, so that its random starting point can find non-empty bins quickly. A stronger hash function restores that characteristic, and the runtime drops dramatically in real, non-malicious workloads. While ideally an internal change, users can premix their hash codes as a workaround (sketched below).

Another approach would be to reuse the iterator across evictions, relying on its weakly consistent behavior. This would avoid the random starts in the large gaps by galloping through them and spending more time within the clusters. It would require an internal change to serialize the eviction process and use a cyclical iterator, instead of the current concurrent eviction approach. This approach still suffers the eviction latency cost of a large gap, but reduces how often that penalty is incurred.

Combining the hash mixer and the cyclical eviction iterator would give the best results, because while the iterator reuse does not rely on the hash quality, it benefits from the uniform mixer removing the spans to traverse through. If a collision attack against the hash function were successful, its impact would be minimal, making the effort not worthwhile.
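As a sketch of that premixing workaround (a hypothetical wrapper, not an Ehcache API; it assumes every cache operation wraps its key):

```java
// Hypothetical key wrapper that premixes the delegate's hashCode with
// Stafford variant 13 (SplittableRandom's 64-bit finalizer) so that
// clustered ids become uniformly distributed in the cache's hash table.
public final class MixedKey<K> {
    private final K key;

    public MixedKey(K key) {
        this.key = key;
    }

    @Override
    public int hashCode() {
        long z = key.hashCode();
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        z ^= (z >>> 31);
        return (int) (z ^ (z >>> 32));  // fold back to 32 bits
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MixedKey)) {
            return false;
        }
        return key.equals(((MixedKey<?>) o).key);
    }
}
```

The cache would then be declared with `MixedKey<Long>` keys and every lookup would wrap its id. Awkward, but it avoids touching the cache's internals.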
From the Ehcache documentation, Using a cache:
Ehcache 3 includes a trivial hash flooding denial of service attack, which was reported against 3.0-M4 in 2015. Unfortunately the authors simply resolved the issue without fixing it, and have ignored community concerns since then. When testing the latest release, 3.10.8, it still exhibits this problem. I do not think this should be the default cache, not because there are bugs, but because it has been 10 years of hiding a severe flaw instead of making a trivial fix.
The performance of their eviction policy degrades as the cache size increases, due to a poor sampling implementation in their custom hash table. This implies that the more critical, performance-sensitive caches are the most likely to be vulnerable and the easiest to exploit, due to their central usage. On my M3 Max laptop, Ehcache takes nearly 20 minutes in a database simulation, whereas Guava and Caffeine each take 15 seconds or less. The runtime will be significantly worse on typical, virtualized cloud compute than on a high-performance laptop.
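A stress test along these lines only needs a size-bounded heap cache and sequential keys. This is a hypothetical sketch, not the author's actual benchmark; the cache alias and sizes are illustrative:

```java
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

public final class EvictionStress {
    public static void main(String[] args) {
        CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
            .withCache("db", CacheConfigurationBuilder.newCacheConfigurationBuilder(
                Long.class, byte[].class, ResourcePoolsBuilder.heap(4_000_000)))
            .build(true);
        Cache<Long, byte[]> cache = manager.getCache("db", Long.class, byte[].class);

        // Sequential ids mimic database primary keys; Long.hashCode leaves them
        // clustered. Once the cache is full, every put triggers sampled eviction,
        // so the second half of this loop may run for a very long time.
        long start = System.nanoTime();
        for (long id = 0; id < 8_000_000; id++) {
            cache.put(id, new byte[0]);
        }
        System.out.printf("%.1f seconds%n", (System.nanoTime() - start) / 1e9);
        manager.close();
    }
}
```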
If we apply a simple patch of using the mix function in `SplittableRandom`, Stafford's variant 13, then the execution time drops to 40 seconds.
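For reference, that mixer is the 64-bit finalizer used internally by `java.util.SplittableRandom`:

```java
// Stafford's "variant 13" 64-bit mixer, as used by SplittableRandom
// to scramble its sequentially generated seeds into uniform outputs.
static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
}
```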
Further improvements could be made by the developers, but users can mitigate this problem if they are careful with their hash codes.

HashDoS demonstration
When running, we can observe the problem by using jstack to see where the eviction threads spend their time.
We can observe that other caches do not have this problem, such as Guava's LRU or Caffeine's W-TinyLFU.
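For comparison, bounding either alternative is a one-liner; a minimal sketch (sizes illustrative):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.google.common.cache.CacheBuilder;

final class BoundedCaches {
    // Guava bounds the cache by size with an LRU policy.
    static final com.google.common.cache.Cache<Long, byte[]> GUAVA =
        CacheBuilder.newBuilder().maximumSize(4_000_000).<Long, byte[]>build();

    // Caffeine bounds the cache by size with the W-TinyLFU policy.
    // Neither library samples a hash table for victims, so clustered
    // hash codes do not degrade their eviction performance.
    static final com.github.benmanes.caffeine.cache.Cache<Long, byte[]> CAFFEINE =
        Caffeine.newBuilder().maximumSize(4_000_000).<Long, byte[]>build();
}
```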
HashDoS mitigation
You might notice that Ehcache's sampled LRU hit rate decreased from 28% to more closely match Guava's LRU. This is because a database workload has a strong MRU / LFU bias, so the worse distribution randomly selects a better victim. This would be reversed in a recency-biased workload, such as in data streams.
Addendum
While I disclosed this issue in 2015, I do not interact with the Ehcache developers. They publicly called me a fraud and claimed that I faked and manipulated my benchmarks. These were baseless accusations with no technical foundation, intended simply to slander my project before it had much adoption. I find it no surprise that their project has withered instead, perhaps due to their lack of professionalism. While my project certainly had bugs and flaws that were found and addressed, and I have voiced concerns over alternatives' claims, in every instance I have used data and an open, fair analysis to explain my beliefs. As JHipster is one of the few projects that still prefers Ehcache, I think it is worth you being aware of this problem, perhaps making progress in having it resolved, or reconsidering the defaults for your users.