Cap eviction effort (CPU under stress) in HyperClockCache

Summary: HyperClockCache is intended to mitigate performance problems
under stress conditions. In LRUCache, the biggest such problem is lock
contention when one or a small number of cache entries become
particularly hot. Regardless of cache sharding, accesses to any
particular cache entry are linearized against a single mutex, which is
held while each access updates the LRU list. All HCC variants are fully
lock/wait-free for accessing blocks already in the cache, which fully
mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded
performance under a different stress condition: when no (or almost no)
entries in a cache shard are evictable (they are pinned). Unlike LRU,
which can find any evictable entries immediately (at the cost of more
coordination / synchronization on each access), CLOCK has to search for
evictable entries. Under the right conditions (almost exclusively
MB-scale caches, not GB-scale), the CPU cost of each cache miss could
fall off a cliff and bog down the whole system.
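
To illustrate the failure mode, here is a toy sketch of a CLOCK scan (illustration only, not RocksDB code): the hand must step past pinned entries, so when (almost) nothing is evictable, every miss can pay a scan bounded only by the shard size.

```cpp
#include <cstddef>
#include <vector>

// Toy CLOCK ring, for illustration only (not RocksDB code).
struct ToyEntry {
  bool pinned = false;         // pinned entries cannot be evicted
  bool recently_used = false;  // CLOCK's "second chance" bit
};

// Scan for a victim starting at the clock hand. When (almost) every entry
// is pinned, each cache miss pays a scan bounded only by the ring size --
// the CPU cliff that eviction_effort_cap bounds in the real implementation.
int FindVictim(std::vector<ToyEntry>& ring, size_t& hand) {
  for (size_t steps = 0; steps < 2 * ring.size(); ++steps) {
    size_t idx = hand;
    hand = (hand + 1) % ring.size();
    ToyEntry& e = ring[idx];
    if (e.pinned) {
      continue;  // wasted effort: seen but not evictable
    }
    if (e.recently_used) {
      e.recently_used = false;  // demote; give a second chance
      continue;
    }
    return static_cast<int>(idx);  // evictable victim found
  }
  return -1;  // gave up: nothing evictable within bounded effort
}
```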

To (IMHO) effectively mitigate this problem, I'm introducing a new
default behavior and tuning parameter for HCC, eviction_effort_cap. See
the comments on the new config parameter in the public API.
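
For illustration only (not part of this change's diff), user code would opt into the new option roughly as follows; the capacity and cap values are arbitrary:

```cpp
#include <memory>

#include "rocksdb/cache.h"

using ROCKSDB_NAMESPACE::Cache;
using ROCKSDB_NAMESPACE::HyperClockCacheOptions;

int main() {
  // estimated_entry_charge == 0 selects AutoHCC; values here are arbitrary.
  HyperClockCacheOptions opts(/*_capacity=*/64 << 20,
                              /*_estimated_entry_charge=*/0);
  // New in this change. Default 30; lower values bound eviction CPU more
  // tightly but allow more memory overshoot under stress (numbers below).
  opts.eviction_effort_cap = 30;
  std::shared_ptr<Cache> cache = opts.MakeSharedCache();
  return cache ? 0 : 1;
}
```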

Test Plan: unit test included

## Performance test

We can use cache_bench to validate no regression (CPU and memory) in
normal operation, and to measure the change in behavior when the cache is
almost entirely pinned. (TODO: I'm not sure why I had to get the pinned
ratio parameter well over 1.0 to see truly bad performance, but the
behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0
cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to
essentially remove jemalloc variance from the results, so that the max
RSS given by /usr/bin/time is essentially ideal (assuming the allocator
minimizes fragmentation and other memory overheads well). Base command
reproducing the bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
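
(For reference, the full invocation with the wrapper described above is roughly the following, reconstructed rather than copied from the original runs; the LRU baseline swaps in `-cache_type=lru_cache`.)

```
MALLOC_CONF="narenas:1" /usr/bin/time ./cache_bench -cache_type=auto_hyper_clock_cache \
  -threads=12 -histograms=0 -pinned_ratio=1.7
```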

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

It looks like `cap=30` is a sweet spot, balancing acceptable CPU and memory
overheads, so it is chosen as the default.
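
For concreteness, the cap condition (see `IsEvictionEffortExceeded` in the clock_cache.cc diff below) restated as a standalone function, with a worked example in the comments:

```cpp
#include <cstddef>

// Paraphrase of BaseClockTable::IsEvictionEffortExceeded from this change:
// useful effort = entries freed; wasted effort = pinned entries stepped over.
bool EvictionEffortExceeded(size_t freed_count, size_t seen_pinned_count,
                            int eviction_effort_cap) {
  return (freed_count + 1) * static_cast<size_t>(eviction_effort_cap) <=
         seen_pinned_count;
}
// With cap=30 and nothing freed yet, a scan gives up after seeing 30 pinned
// entries; each successful eviction earns the scan another 30.
```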

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

AutoHCC's slight CPU advantage over LRU above is preserved with the cap,
and the cap adds no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests repeated with FixedHCC, with similar results.
pdillinger committed Dec 13, 2023
1 parent ebb5242 commit 9b9b784
Showing 6 changed files with 161 additions and 35 deletions.
5 changes: 5 additions & 0 deletions cache/cache_bench_tool.cc
@@ -48,6 +48,10 @@ DEFINE_uint64(cache_size, 1 * GiB,
"Number of bytes to use as a cache of uncompressed data.");
DEFINE_int32(num_shard_bits, -1,
"ShardedCacheOptions::shard_bits. Default = auto");
DEFINE_int32(
eviction_effort_cap,
ROCKSDB_NAMESPACE::HyperClockCacheOptions(1, 1).eviction_effort_cap,
"HyperClockCacheOptions::eviction_effort_cap");

DEFINE_double(resident_ratio, 0.25,
"Ratio of keys fitting in cache to keyspace.");
@@ -391,6 +395,7 @@ class CacheBench {
FLAGS_cache_size, /*estimated_entry_charge=*/0, FLAGS_num_shard_bits);
opts.hash_seed = BitwiseAnd(FLAGS_seed, INT32_MAX);
opts.memory_allocator = allocator;
opts.eviction_effort_cap = FLAGS_eviction_effort_cap;
if (FLAGS_cache_type == "fixed_hyper_clock_cache" ||
FLAGS_cache_type == "hyper_clock_cache") {
opts.estimated_entry_charge = FLAGS_value_bytes_estimate > 0
51 changes: 34 additions & 17 deletions cache/clock_cache.cc
@@ -93,7 +93,8 @@ inline void Unref(const ClockHandle& h, uint64_t count = 1) {
(void)old_meta;
}

inline bool ClockUpdate(ClockHandle& h, bool* purgeable = nullptr) {
inline bool ClockUpdate(ClockHandle& h, BaseClockTable::EvictionData* data,
bool* purgeable = nullptr) {
uint64_t meta;
if (purgeable) {
assert(*purgeable == false);
@@ -125,6 +126,7 @@ inline bool ClockUpdate(bool* purgeable = nullptr) {
(meta >> ClockHandle::kReleaseCounterShift) & ClockHandle::kCounterMask;
if (acquire_count != release_count) {
// Only clock update entries with no outstanding refs
data->seen_pinned_count++;
return false;
}
if ((meta >> ClockHandle::kStateShift == ClockHandle::kStateVisible) &&
@@ -148,6 +150,8 @@
<< ClockHandle::kStateShift) |
(meta & ClockHandle::kHitBitMask))) {
// Took ownership.
data->freed_charge += h.GetTotalCharge();
data->freed_count += 1;
return true;
} else {
// Compare-exchange failing probably
@@ -529,11 +533,7 @@ inline bool BaseClockTable::ChargeUsageMaybeEvictNonStrict(
return true;
}

void BaseClockTable::TrackAndReleaseEvictedEntry(
ClockHandle* h, BaseClockTable::EvictionData* data) {
data->freed_charge += h->GetTotalCharge();
data->freed_count += 1;

void BaseClockTable::TrackAndReleaseEvictedEntry(ClockHandle* h) {
bool took_value_ownership = false;
if (eviction_callback_) {
// For key reconstructed from hash
@@ -550,6 +550,14 @@ void BaseClockTable::TrackAndReleaseEvictedEntry(
MarkEmpty(*h);
}

bool BaseClockTable::IsEvictionEffortExceeded(const EvictionData& data) const {
// Basically checks whether the ratio of useful effort to wasted effort is
// too low, with a start-up allowance for wasted effort before any useful
// effort.
return (data.freed_count + 1) * eviction_effort_cap_ <=
data.seen_pinned_count;
}

template <class Table>
Status BaseClockTable::Insert(const ClockHandleBasicData& proto,
typename Table::HandleImpl** handle,
@@ -692,7 +700,7 @@ FixedHyperClockTable::FixedHyperClockTable(
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed,
const Opts& opts)
: BaseClockTable(metadata_charge_policy, allocator, eviction_callback,
: BaseClockTable(opts, metadata_charge_policy, allocator, eviction_callback,
hash_seed),
length_bits_(CalcHashBits(capacity, opts.estimated_value_size,
metadata_charge_policy)),
@@ -1104,10 +1112,10 @@ inline void FixedHyperClockTable::Evict(size_t requested_charge, InsertState&,
for (;;) {
for (size_t i = 0; i < step_size; i++) {
HandleImpl& h = array_[ModTableSize(Lower32of64(old_clock_pointer + i))];
bool evicting = ClockUpdate(h);
bool evicting = ClockUpdate(h, data);
if (evicting) {
Rollback(h.hashed_key, &h);
TrackAndReleaseEvictedEntry(&h, data);
TrackAndReleaseEvictedEntry(&h);
}
}

@@ -1118,6 +1126,9 @@
if (old_clock_pointer >= max_clock_pointer) {
return;
}
if (IsEvictionEffortExceeded(*data)) {
return;
}

// Advance clock pointer (concurrently)
old_clock_pointer = clock_pointer_.FetchAddRelaxed(step_size);
@@ -1912,7 +1923,7 @@ AutoHyperClockTable::AutoHyperClockTable(
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed,
const Opts& opts)
: BaseClockTable(metadata_charge_policy, allocator, eviction_callback,
: BaseClockTable(opts, metadata_charge_policy, allocator, eviction_callback,
hash_seed),
array_(MemMapping::AllocateLazyZeroed(
sizeof(HandleImpl) * CalcMaxUsableLength(capacity,
@@ -2589,7 +2600,8 @@ using ClockUpdateChainLockedOpData =
template <class OpData>
void AutoHyperClockTable::PurgeImplLocked(OpData* op_data,
ChainRewriteLock& rewrite_lock,
size_t home) {
size_t home,
BaseClockTable::EvictionData* data) {
constexpr bool kIsPurge = std::is_same_v<OpData, PurgeLockedOpData>;
constexpr bool kIsClockUpdateChain =
std::is_same_v<OpData, ClockUpdateChainLockedOpData>;
@@ -2631,7 +2643,7 @@ void AutoHyperClockTable::PurgeImplLocked(OpData* op_data,
assert(home == BottomNBits(h->hashed_key[1], home_shift));
if constexpr (kIsClockUpdateChain) {
// Clock update and/or check for purgeable (under (de)construction)
if (ClockUpdate(*h, &purgeable)) {
if (ClockUpdate(*h, data, &purgeable)) {
// Remember for finishing eviction
op_data->push_back(h);
// Entries for eviction become purgeable
@@ -2718,7 +2730,8 @@ using PurgeOpData = const UniqueId64x2;
using ClockUpdateChainOpData = ClockUpdateChainLockedOpData;

template <class OpData>
void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) {
void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home,
BaseClockTable::EvictionData* data) {
// Early efforts to make AutoHCC fully wait-free ran into too many problems
// that needed obscure and potentially inefficient work-arounds to have a
// chance at working.
@@ -2799,9 +2812,9 @@ void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) {
if (!rewrite_lock.IsEnd()) {
if constexpr (kIsPurge) {
PurgeLockedOpData* locked_op_data{};
PurgeImplLocked(locked_op_data, rewrite_lock, home);
PurgeImplLocked(locked_op_data, rewrite_lock, home, data);
} else {
PurgeImplLocked(op_data, rewrite_lock, home);
PurgeImplLocked(op_data, rewrite_lock, home, data);
}
}
}
@@ -3462,12 +3475,12 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
if (home >= used_length) {
break;
}
PurgeImpl(&to_finish_eviction, home);
PurgeImpl(&to_finish_eviction, home, data);
}
}

for (HandleImpl* h : to_finish_eviction) {
TrackAndReleaseEvictedEntry(h, data);
TrackAndReleaseEvictedEntry(h);
// NOTE: setting likely_empty_slot here can cause us to reduce the
// portion of "at home" entries, probably because an evicted entry
// is more likely to come back than a random new entry and would be
@@ -3495,6 +3508,10 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
if (old_clock_pointer + step_size >= max_clock_pointer) {
return;
}

if (IsEvictionEffortExceeded(*data)) {
return;
}
}
}

48 changes: 35 additions & 13 deletions cache/clock_cache.h
@@ -374,13 +374,25 @@ struct ClockHandle : public ClockHandleBasicData {

class BaseClockTable {
public:
BaseClockTable(CacheMetadataChargePolicy metadata_charge_policy,
struct BaseOpts {
explicit BaseOpts(int _eviction_effort_cap)
: eviction_effort_cap(_eviction_effort_cap) {
eviction_effort_cap = std::max(int{1}, _eviction_effort_cap);
}
explicit BaseOpts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {}
int eviction_effort_cap;
};

BaseClockTable(const BaseOpts& opts,
CacheMetadataChargePolicy metadata_charge_policy,
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback,
const uint32_t* hash_seed)
: metadata_charge_policy_(metadata_charge_policy),
allocator_(allocator),
eviction_callback_(*eviction_callback),
eviction_effort_cap_(opts.eviction_effort_cap),
hash_seed_(*hash_seed) {}

template <class Table>
@@ -409,9 +421,12 @@
struct EvictionData {
size_t freed_charge = 0;
size_t freed_count = 0;
size_t seen_pinned_count = 0;
};

void TrackAndReleaseEvictedEntry(ClockHandle* h, EvictionData* data);
void TrackAndReleaseEvictedEntry(ClockHandle* h);

bool IsEvictionEffortExceeded(const EvictionData& data) const;

#ifndef NDEBUG
// Acquire N references
@@ -450,7 +465,6 @@ class BaseClockTable {
bool ChargeUsageMaybeEvictNonStrict(size_t total_charge, size_t capacity,
bool need_evict_for_occupancy,
typename Table::InsertState& state);

protected: // data
// We partition the following members into different cache lines
// to avoid false sharing among Lookup, Release, Erase and Insert
@@ -484,6 +498,9 @@ class BaseClockTable {
// A reference to Cache::eviction_callback_
const Cache::EvictionCallback& eviction_callback_;

// See HyperClockCacheOptions::eviction_effort_cap
int eviction_effort_cap_;

// A reference to ShardedCacheBase::hash_seed_
const uint32_t& hash_seed_;
};
@@ -517,10 +534,12 @@ class FixedHyperClockTable : public BaseClockTable {
inline void SetStandalone() { standalone = true; }
}; // struct HandleImpl

struct Opts {
explicit Opts(size_t _estimated_value_size)
: estimated_value_size(_estimated_value_size) {}
explicit Opts(const HyperClockCacheOptions& opts) {
struct Opts : public BaseOpts {
explicit Opts(size_t _estimated_value_size, int _eviction_effort_cap)
: BaseOpts(_eviction_effort_cap),
estimated_value_size(_estimated_value_size) {}
explicit Opts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {
assert(opts.estimated_entry_charge > 0);
estimated_value_size = opts.estimated_entry_charge;
}
@@ -803,11 +822,13 @@ class AutoHyperClockTable : public BaseClockTable {
}
}; // struct HandleImpl

struct Opts {
explicit Opts(size_t _min_avg_value_size)
: min_avg_value_size(_min_avg_value_size) {}
struct Opts : public BaseOpts {
explicit Opts(size_t _min_avg_value_size, int _eviction_effort_cap)
: BaseOpts(_eviction_effort_cap),
min_avg_value_size(_min_avg_value_size) {}

explicit Opts(const HyperClockCacheOptions& opts) {
explicit Opts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {
assert(opts.estimated_entry_charge == 0);
min_avg_value_size = opts.min_avg_entry_charge;
}
@@ -906,7 +927,8 @@ class AutoHyperClockTable : public BaseClockTable {
// with proper handling to ensure all existing data is seen even in the
// presence of concurrent insertions, etc. (See implementation.)
template <class OpData>
void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX);
void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX,
EvictionData* data = nullptr);

// An RAII wrapper for locking a chain of entries for removals. See
// implementation.
@@ -916,7 +938,7 @@
// implementation.
template <class OpData>
void PurgeImplLocked(OpData* op_data, ChainRewriteLock& rewrite_lock,
size_t home);
size_t home, EvictionData* data);

// Update length_info_ as much as possible without waiting, given a known
// usable (ready for inserts and lookups) grow_home. (Previous grow_homes
2 changes: 2 additions & 0 deletions cache/compressed_secondary_cache_test.cc
@@ -992,6 +992,8 @@ class CompressedSecCacheTestWithTiered
/*_capacity=*/0,
/*_estimated_entry_charge=*/256 << 10,
/*_num_shard_bits=*/0);
// eviction_effort_cap setting simply to avoid churn in existing test
hcc_opts.eviction_effort_cap = 100;
TieredCacheOptions opts;
lru_opts.capacity = 0;
lru_opts.num_shard_bits = 0;
50 changes: 48 additions & 2 deletions cache/lru_cache_test.cc
@@ -389,12 +389,13 @@ class ClockCacheTest : public testing::Test {
}
}

void NewShard(size_t capacity, bool strict_capacity_limit = true) {
void NewShard(size_t capacity, bool strict_capacity_limit = true,
int eviction_effort_cap = 30) {
DeleteShard();
shard_ =
reinterpret_cast<Shard*>(port::cacheline_aligned_alloc(sizeof(Shard)));

TableOpts opts{1 /*value_size*/};
TableOpts opts{1 /*value_size*/, eviction_effort_cap};
new (shard_)
Shard(capacity, strict_capacity_limit, kDontChargeCacheMetadata,
/*allocator*/ nullptr, &eviction_callback_, &hash_seed_, opts);
@@ -445,12 +446,20 @@ class ClockCacheTest : public testing::Test {
return Slice(reinterpret_cast<const char*>(&hashed_key), 16U);
}

// A bad hash function for testing / stressing collision handling
static inline UniqueId64x2 TestHashedKey(char key) {
// For testing hash near-collision behavior, put the variance in
// hashed_key in bits that are unlikely to be used as hash bits.
return {(static_cast<uint64_t>(key) << 56) + 1234U, 5678U};
}

// A reasonable hash function, for testing "typical behavior" etc.
template <typename T>
static inline UniqueId64x2 CheapHash(T i) {
return {static_cast<uint64_t>(i) * uint64_t{0x85EBCA77C2B2AE63},
static_cast<uint64_t>(i) * uint64_t{0xC2B2AE3D27D4EB4F}};
}

Shard* shard_ = nullptr;

private:
@@ -683,6 +692,43 @@ TYPED_TEST(ClockCacheTest, ClockEvictionTest) {
}
}

TYPED_TEST(ClockCacheTest, ClockEvictionEffortCapTest) {
using HandleImpl = typename ClockCacheTest<TypeParam>::Shard::HandleImpl;
for (int eec : {-42, 0, 1, 10, 100, 1000}) {
SCOPED_TRACE("eviction_effort_cap = " + std::to_string(eec));
constexpr size_t kCapacity = 1000;
// Start with much larger capacity to ensure that we can go way over
// capacity without reaching table occupancy limit.
this->NewShard(3 * kCapacity, /*strict_capacity_limit=*/false, eec);
auto& shard = *this->shard_;
shard.SetCapacity(kCapacity);

// Nearly fill the cache with pinned entries, then add a bunch of
// non-pinned entries. eviction_effort_cap should affect how many
// non-pinned entries remain beyond the cache capacity, despite
// being evictable.
constexpr size_t kCount = kCapacity - 1;
std::unique_ptr<HandleImpl* []> ha { new HandleImpl* [kCount] {} };
for (size_t i = 0; i < 2 * kCount; ++i) {
UniqueId64x2 hkey = this->CheapHash(i);
ASSERT_OK(shard.Insert(
this->TestKey(hkey), hkey, nullptr /*value*/, &kNoopCacheItemHelper,
1 /*charge*/, i < kCount ? &ha[i] : nullptr, Cache::Priority::LOW));
}

// Rough inverse relationship between cap and possible memory
// explosion, which shows up as increased table occupancy count.
int effective_eec = std::max(int{1}, eec) + 1;
EXPECT_NEAR(shard.GetOccupancyCount() * 1.0,
kCount * (1 + 1.4 / effective_eec),
kCount * (0.6 / effective_eec) + 1.0);

for (size_t i = 0; i < kCount; ++i) {
shard.Release(ha[i]);
}
}
}

namespace {
struct DeleteCounter {
int deleted = 0;