Cap eviction effort (CPU under stress) in HyperClockCache

Summary: HyperClockCache is intended to mitigate performance problems
under stress conditions. In LRUCache, the biggest such problem is lock
contention when one or a small number of cache entries become
particularly hot. Regardless of cache sharding, accesses to any
particular cache entry are linearized against a single mutex, which is
held while each access updates the LRU list. All HCC variants are fully
lock/wait-free for accessing blocks already in the cache, which fully
mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded
performance under a different stress condition: when no (or almost no)
entries in a cache shard are evictable (they are pinned). Unlike LRU,
which can find any evictable entries immediately (at the cost of more
coordination / synchronization on each access), CLOCK has to search for
evictable entries. Under the right conditions (almost exclusively
MB-scale caches, not GB-scale), the CPU cost of each cache miss could
fall off a cliff and bog down the whole system.
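
To illustrate the failure mode, here is a toy sketch of a CLOCK scan (illustration only, not RocksDB code): the hand must step past pinned entries, so when (almost) nothing is evictable, every miss can pay a scan bounded only by the shard size.

```cpp
#include <cstddef>
#include <vector>

// Toy CLOCK ring, for illustration only (not RocksDB code).
struct ToyEntry {
  bool pinned = false;         // pinned entries cannot be evicted
  bool recently_used = false;  // CLOCK's "second chance" bit
};

// Scan for a victim starting at the clock hand. When (almost) every entry
// is pinned, each cache miss pays a scan bounded only by the ring size --
// the CPU cliff that eviction_effort_cap bounds in the real implementation.
int FindVictim(std::vector<ToyEntry>& ring, size_t& hand) {
  for (size_t steps = 0; steps < 2 * ring.size(); ++steps) {
    size_t idx = hand;
    hand = (hand + 1) % ring.size();
    ToyEntry& e = ring[idx];
    if (e.pinned) {
      continue;  // wasted effort: seen but not evictable
    }
    if (e.recently_used) {
      e.recently_used = false;  // demote; give a second chance
      continue;
    }
    return static_cast<int>(idx);  // evictable victim found
  }
  return -1;  // gave up: nothing evictable within bounded effort
}
```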

To (IMHO) effectively mitigate this problem, I'm introducing a new
default behavior and tuning parameter for HCC, eviction_effort_cap. See
the comments on the new config parameter in the public API.
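
For illustration only (not part of this change's diff), user code would opt into the new option roughly as follows; the capacity and cap values are arbitrary:

```cpp
#include <memory>

#include "rocksdb/cache.h"

using ROCKSDB_NAMESPACE::Cache;
using ROCKSDB_NAMESPACE::HyperClockCacheOptions;

int main() {
  // estimated_entry_charge == 0 selects AutoHCC; values here are arbitrary.
  HyperClockCacheOptions opts(/*_capacity=*/64 << 20,
                              /*_estimated_entry_charge=*/0);
  // New in this change. Default 30; lower values bound eviction CPU more
  // tightly but allow more memory overshoot under stress (numbers below).
  opts.eviction_effort_cap = 30;
  std::shared_ptr<Cache> cache = opts.MakeSharedCache();
  return cache ? 0 : 1;
}
```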

Test Plan: unit test included

## Performance test

We can use cache_bench to validate no regression (CPU and memory) in
normal operation, and to measure the change in behavior when the cache is
almost entirely pinned. (TODO: I'm not sure why I had to get the pinned
ratio parameter well over 1.0 to see truly bad performance, but the
behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0
cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to
essentially remove jemalloc variance from the results, so that the max
RSS given by /usr/bin/time is essentially ideal (assuming the allocator
minimizes fragmentation and other memory overheads well). Base command
reproducing the bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
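
(For reference, the full invocation with the wrapper described above is roughly the following, reconstructed rather than copied from the original runs; the LRU baseline swaps in `-cache_type=lru_cache`.)

```
MALLOC_CONF="narenas:1" /usr/bin/time ./cache_bench -cache_type=auto_hyper_clock_cache \
  -threads=12 -histograms=0 -pinned_ratio=1.7
```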

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

It looks like `cap=30` is a sweet spot, balancing acceptable CPU and memory
overheads, so it is chosen as the default.
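
For concreteness, the cap condition (see `IsEvictionEffortExceeded` in the clock_cache.cc diff below) restated as a standalone function, with a worked example in the comments:

```cpp
#include <cstddef>

// Paraphrase of BaseClockTable::IsEvictionEffortExceeded from this change:
// useful effort = entries freed; wasted effort = pinned entries stepped over.
bool EvictionEffortExceeded(size_t freed_count, size_t seen_pinned_count,
                            int eviction_effort_cap) {
  return (freed_count + 1) * static_cast<size_t>(eviction_effort_cap) <=
         seen_pinned_count;
}
// With cap=30 and nothing freed yet, a scan gives up after seeing 30 pinned
// entries; each successful eviction earns the scan another 30.
```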

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

AutoHCC's slight CPU advantage over LRU above is preserved with the cap,
and the cap adds no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests repeated with FixedHCC, with similar results.
pdillinger committed Dec 13, 2023
1 parent ebb5242 commit 9b9b784
Showing 6 changed files with 161 additions and 35 deletions.
5 changes: 5 additions & 0 deletions cache/cache_bench_tool.cc
@@ -48,6 +48,10 @@ DEFINE_uint64(cache_size, 1 * GiB,
"Number of bytes to use as a cache of uncompressed data.");
DEFINE_int32(num_shard_bits, -1,
"ShardedCacheOptions::shard_bits. Default = auto");
DEFINE_int32(
eviction_effort_cap,
ROCKSDB_NAMESPACE::HyperClockCacheOptions(1, 1).eviction_effort_cap,
"HyperClockCacheOptions::eviction_effort_cap");

DEFINE_double(resident_ratio, 0.25,
"Ratio of keys fitting in cache to keyspace.");
@@ -391,6 +395,7 @@ class CacheBench {
FLAGS_cache_size, /*estimated_entry_charge=*/0, FLAGS_num_shard_bits);
opts.hash_seed = BitwiseAnd(FLAGS_seed, INT32_MAX);
opts.memory_allocator = allocator;
opts.eviction_effort_cap = FLAGS_eviction_effort_cap;
if (FLAGS_cache_type == "fixed_hyper_clock_cache" ||
FLAGS_cache_type == "hyper_clock_cache") {
opts.estimated_entry_charge = FLAGS_value_bytes_estimate > 0
51 changes: 34 additions & 17 deletions cache/clock_cache.cc
@@ -93,7 +93,8 @@ inline void Unref(const ClockHandle& h, uint64_t count = 1) {
(void)old_meta;
}

inline bool ClockUpdate(ClockHandle& h, bool* purgeable = nullptr) {
inline bool ClockUpdate(ClockHandle& h, BaseClockTable::EvictionData* data,
bool* purgeable = nullptr) {
uint64_t meta;
if (purgeable) {
assert(*purgeable == false);
@@ -125,6 +126,7 @@ inline bool ClockUpdate(bool* purgeable = nullptr) {
(meta >> ClockHandle::kReleaseCounterShift) & ClockHandle::kCounterMask;
if (acquire_count != release_count) {
// Only clock update entries with no outstanding refs
data->seen_pinned_count++;
return false;
}
if ((meta >> ClockHandle::kStateShift == ClockHandle::kStateVisible) &&
@@ -148,6 +150,8 @@
<< ClockHandle::kStateShift) |
(meta & ClockHandle::kHitBitMask))) {
// Took ownership.
data->freed_charge += h.GetTotalCharge();
data->freed_count += 1;
return true;
} else {
// Compare-exchange failing probably
@@ -529,11 +533,7 @@ inline bool BaseClockTable::ChargeUsageMaybeEvictNonStrict(
return true;
}

void BaseClockTable::TrackAndReleaseEvictedEntry(
ClockHandle* h, BaseClockTable::EvictionData* data) {
data->freed_charge += h->GetTotalCharge();
data->freed_count += 1;

void BaseClockTable::TrackAndReleaseEvictedEntry(ClockHandle* h) {
bool took_value_ownership = false;
if (eviction_callback_) {
// For key reconstructed from hash
@@ -550,6 +550,14 @@ void BaseClockTable::TrackAndReleaseEvictedEntry(
MarkEmpty(*h);
}

bool BaseClockTable::IsEvictionEffortExceeded(const EvictionData& data) const {
// Basically checks whether the ratio of useful effort to wasted effort is
// too low, with a start-up allowance for wasted effort before any useful
// effort.
return (data.freed_count + 1) * eviction_effort_cap_ <=
data.seen_pinned_count;
}

template <class Table>
Status BaseClockTable::Insert(const ClockHandleBasicData& proto,
typename Table::HandleImpl** handle,
@@ -692,7 +700,7 @@ FixedHyperClockTable::FixedHyperClockTable(
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed,
const Opts& opts)
: BaseClockTable(metadata_charge_policy, allocator, eviction_callback,
: BaseClockTable(opts, metadata_charge_policy, allocator, eviction_callback,
hash_seed),
length_bits_(CalcHashBits(capacity, opts.estimated_value_size,
metadata_charge_policy)),
@@ -1104,10 +1112,10 @@ inline void FixedHyperClockTable::Evict(size_t requested_charge, InsertState&,
for (;;) {
for (size_t i = 0; i < step_size; i++) {
HandleImpl& h = array_[ModTableSize(Lower32of64(old_clock_pointer + i))];
bool evicting = ClockUpdate(h);
bool evicting = ClockUpdate(h, data);
if (evicting) {
Rollback(h.hashed_key, &h);
TrackAndReleaseEvictedEntry(&h, data);
TrackAndReleaseEvictedEntry(&h);
}
}

@@ -1118,6 +1126,9 @@
if (old_clock_pointer >= max_clock_pointer) {
return;
}
if (IsEvictionEffortExceeded(*data)) {
return;
}

// Advance clock pointer (concurrently)
old_clock_pointer = clock_pointer_.FetchAddRelaxed(step_size);
@@ -1912,7 +1923,7 @@ AutoHyperClockTable::AutoHyperClockTable(
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback, const uint32_t* hash_seed,
const Opts& opts)
: BaseClockTable(metadata_charge_policy, allocator, eviction_callback,
: BaseClockTable(opts, metadata_charge_policy, allocator, eviction_callback,
hash_seed),
array_(MemMapping::AllocateLazyZeroed(
sizeof(HandleImpl) * CalcMaxUsableLength(capacity,
@@ -2589,7 +2600,8 @@ using ClockUpdateChainLockedOpData =
template <class OpData>
void AutoHyperClockTable::PurgeImplLocked(OpData* op_data,
ChainRewriteLock& rewrite_lock,
size_t home) {
size_t home,
BaseClockTable::EvictionData* data) {
constexpr bool kIsPurge = std::is_same_v<OpData, PurgeLockedOpData>;
constexpr bool kIsClockUpdateChain =
std::is_same_v<OpData, ClockUpdateChainLockedOpData>;
@@ -2631,7 +2643,7 @@ void AutoHyperClockTable::PurgeImplLocked(OpData* op_data,
assert(home == BottomNBits(h->hashed_key[1], home_shift));
if constexpr (kIsClockUpdateChain) {
// Clock update and/or check for purgeable (under (de)construction)
if (ClockUpdate(*h, &purgeable)) {
if (ClockUpdate(*h, data, &purgeable)) {
// Remember for finishing eviction
op_data->push_back(h);
// Entries for eviction become purgeable
@@ -2718,7 +2730,8 @@ using PurgeOpData = const UniqueId64x2;
using ClockUpdateChainOpData = ClockUpdateChainLockedOpData;

template <class OpData>
void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) {
void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home,
BaseClockTable::EvictionData* data) {
// Early efforts to make AutoHCC fully wait-free ran into too many problems
// that needed obscure and potentially inefficient work-arounds to have a
// chance at working.
@@ -2799,9 +2812,9 @@ void AutoHyperClockTable::PurgeImpl(OpData* op_data, size_t home) {
if (!rewrite_lock.IsEnd()) {
if constexpr (kIsPurge) {
PurgeLockedOpData* locked_op_data{};
PurgeImplLocked(locked_op_data, rewrite_lock, home);
PurgeImplLocked(locked_op_data, rewrite_lock, home, data);
} else {
PurgeImplLocked(op_data, rewrite_lock, home);
PurgeImplLocked(op_data, rewrite_lock, home, data);
}
}
}
@@ -3462,12 +3475,12 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
if (home >= used_length) {
break;
}
PurgeImpl(&to_finish_eviction, home);
PurgeImpl(&to_finish_eviction, home, data);
}
}

for (HandleImpl* h : to_finish_eviction) {
TrackAndReleaseEvictedEntry(h, data);
TrackAndReleaseEvictedEntry(h);
// NOTE: setting likely_empty_slot here can cause us to reduce the
// portion of "at home" entries, probably because an evicted entry
// is more likely to come back than a random new entry and would be
@@ -3495,6 +3508,10 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
if (old_clock_pointer + step_size >= max_clock_pointer) {
return;
}

if (IsEvictionEffortExceeded(*data)) {
return;
}
}
}

48 changes: 35 additions & 13 deletions cache/clock_cache.h
@@ -374,13 +374,25 @@ struct ClockHandle : public ClockHandleBasicData {

class BaseClockTable {
public:
BaseClockTable(CacheMetadataChargePolicy metadata_charge_policy,
struct BaseOpts {
explicit BaseOpts(int _eviction_effort_cap)
: eviction_effort_cap(_eviction_effort_cap) {
eviction_effort_cap = std::max(int{1}, _eviction_effort_cap);
}
explicit BaseOpts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {}
int eviction_effort_cap;
};

BaseClockTable(const BaseOpts& opts,
CacheMetadataChargePolicy metadata_charge_policy,
MemoryAllocator* allocator,
const Cache::EvictionCallback* eviction_callback,
const uint32_t* hash_seed)
: metadata_charge_policy_(metadata_charge_policy),
allocator_(allocator),
eviction_callback_(*eviction_callback),
eviction_effort_cap_(opts.eviction_effort_cap),
hash_seed_(*hash_seed) {}

template <class Table>
@@ -409,9 +421,12 @@
struct EvictionData {
size_t freed_charge = 0;
size_t freed_count = 0;
size_t seen_pinned_count = 0;
};

void TrackAndReleaseEvictedEntry(ClockHandle* h, EvictionData* data);
void TrackAndReleaseEvictedEntry(ClockHandle* h);

bool IsEvictionEffortExceeded(const EvictionData& data) const;

#ifndef NDEBUG
// Acquire N references
@@ -450,7 +465,6 @@ class BaseClockTable {
bool ChargeUsageMaybeEvictNonStrict(size_t total_charge, size_t capacity,
bool need_evict_for_occupancy,
typename Table::InsertState& state);

protected: // data
// We partition the following members into different cache lines
// to avoid false sharing among Lookup, Release, Erase and Insert
@@ -484,6 +498,9 @@ class BaseClockTable {
// A reference to Cache::eviction_callback_
const Cache::EvictionCallback& eviction_callback_;

// See HyperClockCacheOptions::eviction_effort_cap
int eviction_effort_cap_;

// A reference to ShardedCacheBase::hash_seed_
const uint32_t& hash_seed_;
};
@@ -517,10 +534,12 @@ class FixedHyperClockTable : public BaseClockTable {
inline void SetStandalone() { standalone = true; }
}; // struct HandleImpl

struct Opts {
explicit Opts(size_t _estimated_value_size)
: estimated_value_size(_estimated_value_size) {}
explicit Opts(const HyperClockCacheOptions& opts) {
struct Opts : public BaseOpts {
explicit Opts(size_t _estimated_value_size, int _eviction_effort_cap)
: BaseOpts(_eviction_effort_cap),
estimated_value_size(_estimated_value_size) {}
explicit Opts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {
assert(opts.estimated_entry_charge > 0);
estimated_value_size = opts.estimated_entry_charge;
}
@@ -803,11 +822,13 @@ class AutoHyperClockTable : public BaseClockTable {
}
}; // struct HandleImpl

struct Opts {
explicit Opts(size_t _min_avg_value_size)
: min_avg_value_size(_min_avg_value_size) {}
struct Opts : public BaseOpts {
explicit Opts(size_t _min_avg_value_size, int _eviction_effort_cap)
: BaseOpts(_eviction_effort_cap),
min_avg_value_size(_min_avg_value_size) {}

explicit Opts(const HyperClockCacheOptions& opts) {
explicit Opts(const HyperClockCacheOptions& opts)
: BaseOpts(opts.eviction_effort_cap) {
assert(opts.estimated_entry_charge == 0);
min_avg_value_size = opts.min_avg_entry_charge;
}
@@ -906,7 +927,8 @@ class AutoHyperClockTable : public BaseClockTable {
// with proper handling to ensure all existing data is seen even in the
// presence of concurrent insertions, etc. (See implementation.)
template <class OpData>
void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX);
void PurgeImpl(OpData* op_data, size_t home = SIZE_MAX,
EvictionData* data = nullptr);

// An RAII wrapper for locking a chain of entries for removals. See
// implementation.
@@ -916,7 +938,7 @@
// implementation.
template <class OpData>
void PurgeImplLocked(OpData* op_data, ChainRewriteLock& rewrite_lock,
size_t home);
size_t home, EvictionData* data);

// Update length_info_ as much as possible without waiting, given a known
// usable (ready for inserts and lookups) grow_home. (Previous grow_homes
2 changes: 2 additions & 0 deletions cache/compressed_secondary_cache_test.cc
@@ -992,6 +992,8 @@ class CompressedSecCacheTestWithTiered
/*_capacity=*/0,
/*_estimated_entry_charge=*/256 << 10,
/*_num_shard_bits=*/0);
// eviction_effort_cap setting simply to avoid churn in existing test
hcc_opts.eviction_effort_cap = 100;
TieredCacheOptions opts;
lru_opts.capacity = 0;
lru_opts.num_shard_bits = 0;
50 changes: 48 additions & 2 deletions cache/lru_cache_test.cc
@@ -389,12 +389,13 @@ class ClockCacheTest : public testing::Test {
}
}

void NewShard(size_t capacity, bool strict_capacity_limit = true) {
void NewShard(size_t capacity, bool strict_capacity_limit = true,
int eviction_effort_cap = 30) {
DeleteShard();
shard_ =
reinterpret_cast<Shard*>(port::cacheline_aligned_alloc(sizeof(Shard)));

TableOpts opts{1 /*value_size*/};
TableOpts opts{1 /*value_size*/, eviction_effort_cap};
new (shard_)
Shard(capacity, strict_capacity_limit, kDontChargeCacheMetadata,
/*allocator*/ nullptr, &eviction_callback_, &hash_seed_, opts);
@@ -445,12 +446,20 @@ class ClockCacheTest : public testing::Test {
return Slice(reinterpret_cast<const char*>(&hashed_key), 16U);
}

// A bad hash function for testing / stressing collision handling
static inline UniqueId64x2 TestHashedKey(char key) {
// For testing hash near-collision behavior, put the variance in
// hashed_key in bits that are unlikely to be used as hash bits.
return {(static_cast<uint64_t>(key) << 56) + 1234U, 5678U};
}

// A reasonable hash function, for testing "typical behavior" etc.
template <typename T>
static inline UniqueId64x2 CheapHash(T i) {
return {static_cast<uint64_t>(i) * uint64_t{0x85EBCA77C2B2AE63},
static_cast<uint64_t>(i) * uint64_t{0xC2B2AE3D27D4EB4F}};
}

Shard* shard_ = nullptr;

private:
@@ -683,6 +692,43 @@ TYPED_TEST(ClockCacheTest, ClockEvictionTest) {
}
}

TYPED_TEST(ClockCacheTest, ClockEvictionEffortCapTest) {
using HandleImpl = typename ClockCacheTest<TypeParam>::Shard::HandleImpl;
for (int eec : {-42, 0, 1, 10, 100, 1000}) {
SCOPED_TRACE("eviction_effort_cap = " + std::to_string(eec));
constexpr size_t kCapacity = 1000;
// Start with much larger capacity to ensure that we can go way over
// capacity without reaching table occupancy limit.
this->NewShard(3 * kCapacity, /*strict_capacity_limit=*/false, eec);
auto& shard = *this->shard_;
shard.SetCapacity(kCapacity);

// Nearly fill the cache with pinned entries, then add a bunch of
// non-pinned entries. eviction_effort_cap should affect how many
// non-pinned entries remain beyond the cache capacity, despite
// being evictable.
constexpr size_t kCount = kCapacity - 1;
std::unique_ptr<HandleImpl* []> ha { new HandleImpl* [kCount] {} };
for (size_t i = 0; i < 2 * kCount; ++i) {
UniqueId64x2 hkey = this->CheapHash(i);
ASSERT_OK(shard.Insert(
this->TestKey(hkey), hkey, nullptr /*value*/, &kNoopCacheItemHelper,
1 /*charge*/, i < kCount ? &ha[i] : nullptr, Cache::Priority::LOW));
}

// Rough inverse relationship between cap and possible memory
// explosion, which shows up as increased table occupancy count.
int effective_eec = std::max(int{1}, eec) + 1;
EXPECT_NEAR(shard.GetOccupancyCount() * 1.0,
kCount * (1 + 1.4 / effective_eec),
kCount * (0.6 / effective_eec) + 1.0);

for (size_t i = 0; i < kCount; ++i) {
shard.Release(ha[i]);
}
}
}

namespace {
struct DeleteCounter {
int deleted = 0;