ScyllaDB would like to publicly acknowledge dormando (Memcached maintainer) and Danny Kopping for their contributions to this project, as well as thank them for their support and patience.
Engineers behind ScyllaDB – the database for predictable performance at scale – joined forces with Memcached maintainer dormando to compare both technologies head-to-head, in a collaborative vendor-neutral way.
The results reveal that:
- Both Memcached and ScyllaDB maximized disks and network bandwidth while being stressed under similar conditions, sustaining similar performance overall.
- While ScyllaDB required data modeling changes to fully saturate the network throughput, Memcached required additional IO threads to saturate disk I/O.
- Although ScyllaDB showed better latencies when compared to Memcached for pipelined requests to disk, Memcached latencies were better for individual requests.
This document explains our motivation for these tests, provides a detailed look at the tested scenarios and results, then presents recommendations for anyone who might be deciding between ScyllaDB and Memcached. Along the way, we analyze the architectural differences behind these two solutions and discuss the tradeoffs involved in each.
Bonus: dormando and I will be discussing this project at P99 CONF, a highly technical conference on performance and low latency engineering. It’s free and virtual, October 23 and 24. I invite you to join, and bring your questions for us!
If you already read the blog, skip ahead to the additional details added here.
First and foremost, ScyllaDB invested lots of time and engineering resources optimizing our database to deliver predictable low latencies for real-time data-intensive applications. ScyllaDB’s shard-per-core, shared-nothing architecture, userspace I/O scheduler and internal cache implementation (fully bypassing the Linux page cache) are some notable examples of such optimizations.
Second: performance converges over time. In-memory caches have been (for a long time) regarded as one of the fastest infrastructure components around. Yet, it's been a few years now since caching solutions started to look into the realm of flash disks. These initiatives obviously pose an interesting question: If an in-memory cache can rely on flash storage, then why can't a persistent database also work as a cache?
Third: We previously discussed 7 Reasons Not to Put a Cache in Front of Your Database and recently explored how specific teams have successfully replaced their caches with ScyllaDB.
Fourth: At last year’s P99 CONF, Danny Kopping gave us an enlightening talk, Cache Me If You Can, where he explained how Memcached Extstore helped Grafana Labs scale their cache footprint 42x while driving costs down.
And finally, despite the (valid) criticism that performance benchmarks receive, they still play an important role in driving innovation. Benchmarks are a useful resource for engineers seeking in-house optimization opportunities.
Now, on to the comparison.
Tests were carried out using the following AWS instance types:
- Loader: c7i.16xlarge (64 vCPUs, 128GB RAM)
- Memcached: i4i.4xlarge (16 vCPUs, 128GB RAM, 3.75TB NVMe)
- ScyllaDB: i4i.4xlarge (16 vCPUs, 128GB RAM, 3.75TB NVMe)
All instances can deliver up to 25Gbps of network bandwidth.1
To overcome potential bottlenecks, the following optimizations and settings were applied:
- AWS side: All instances used a Cluster placement strategy, following the AWS Docs:
"This strategy enables workloads to achieve the low-latency network performance necessary for tightly-coupled node-to-node communication that is typical of high-performance computing (HPC) applications."
- Memcached: Version 1.6.25, compiled with Extstore enabled. Except where denoted, run with 14 threads, pinned to specific CPUs. The remaining 2 vCPUs were assigned to CPU 0 (core & HT sibling) to handle Network IRQs, as specified by the sq_split mode in seastar perftune.py. CAS operations were disabled to save space on per-item overhead. The full command line arguments were:
taskset -c 1-7,9-15 /usr/local/memcached/bin/memcached -v -A -r -m 114100 -c 4096 --lock-memory --threads 14 -u scylla -C
- ScyllaDB: Default settings as configured by the ScyllaDB Enterprise 2024.1.2 AMI (ami-id: ami-018335b47ba6bdf9a) on an i4i.4xlarge. This includes the same CPU pinning settings as described above for Memcached.
For Memcached loaders, we used mcshredder, part of Memcached's official testing suite. The applicable stressing profiles are in the fee-mendes/shredders GitHub repository.
For ScyllaDB, we used cassandra-stress, as shipped with ScyllaDB, and specified comparable workloads as the ones used for Memcached.
Note: For a summary of the tests and results, see the blog: We Compared ScyllaDB and Memcached and... We Lost?
ScyllaDB and Memcached tests were split across two groups: In-Memory and On-Disk.
We start by comparing both solutions' cache footprint and how they performed under load during a read-only workload.
Past the cache-only part, we then introduce Extstore, a "Feature for extending memcached's memory space onto flash (or similar) storage".
The more items you can fit into RAM, the better your chance of getting cache hits. More cache hits result in significantly faster access than going to disk. Ultimately, that improves latency.
A GB of RAM is also significantly more expensive than a GB of flash storage in cloud environments. Danny Kopping estimated a cost of roughly USD $4.43/GB for RAM on GCP n2-standard machines during his talk, whereas the price for flash storage was down to USD $0.08/GB.
Memcached's design philosophy is simple: It is a key-value store with no concept of synchronization, broadcasting, or replication. The more servers you have, the more caching space you get. Although you could technically implement replication on your own, the best approach would be to simply fail over (with some caveats) instead and tolerate a few cache misses during a failure situation.
ScyllaDB lives on the other side of the spectrum: It is a persistent wide-column store, with synchronization and replication features built into its architecture. In that regard, more servers don't necessarily mean increased caching space: Aspects such as the replication factor and consistency levels come into play. If you replicate data across nodes, you can tolerate node failures and prevent cache misses on frequently accessed data, but this typically halves your effective cache size.
This project began by measuring how many items we could store in each datastore. Throughout our tests, the key was between 4 and 12 bytes (key0 .. keyN) for Memcached, and 12 bytes for ScyllaDB. The value was fixed at 1000 bytes.
Memcached stored roughly 101M items until eviction started, achieving very high memory efficiency. Out of Memcached’s 114G assigned memory, this is approximately 101G worth of values, without considering the key size and other flags:
Memcached stored 101M items in memory before evictions started
The memcached STATS output shows how many items and bytes it ended up consuming:
STAT bytes 107764057859
STAT curr_items 100978499
It is worth noting that each item in memcached introduces a small overhead. To measure it, memcached provides a sizes utility. Its output on our server machine was as follows:
~/memcached-1.6.25# ./sizes
Slab Stats 64
Thread stats 2352
Global stats 224
Settings 336
Item (no cas) 48
Item (cas) 56
extstore header 12
Libevent thread 488
Connection 496
Response object 1184
Response bundle 32
Response objects per bundle 13
----------------------------------------
libevent thread cumulative 6936
Thread stats cumulative 6448
For our testing (with CAS disabled), each item introduced an overhead of 48 bytes, which translates to roughly 4.85GB RAM.
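To make that accounting concrete, here is a quick back-of-the-envelope check in Python, using only the numbers reported above by STATS and the sizes utility (GB here means 10^9 bytes for simplicity):

```python
# Rough accounting of memcached's memory usage for this test, based on
# STAT curr_items, the fixed 1000-byte values, and the 48-byte
# "Item (no cas)" header from the sizes utility.
curr_items = 100_978_499        # STAT curr_items
value_size = 1_000              # fixed value payload used throughout the tests
item_header = 48                # per-item overhead with CAS disabled

values_gb = curr_items * value_size / 1e9     # ~101 GB of raw values
headers_gb = curr_items * item_header / 1e9   # ~4.85 GB of item headers
print(f"values ~{values_gb:.1f} GB + item headers ~{headers_gb:.2f} GB "
      f"out of the 114 GB assigned via -m 114100")
```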
ScyllaDB stored between 60 and 61M items before evictions started to happen. This is no surprise, given that its protocol requires more data to be stored as part of a write (such as the write timestamp since epoch, row liveness, etc.). ScyllaDB also persists data to disk as you go, which means that Bloom Filters (and optionally Indexes) need to be stored in memory for later efficient disk lookups.
Eviction starts under memory pressure while trying to load 61M rows
Another difference (and source of overhead) relates to supporting a richer wide-column data model. Note how the number of rows is double the number of partitions stored in the ScyllaDB cache. This happens because we require storing range continuity information, as we explained in this article. However, we found out that we don't need to store dummy rows for continuity information on tables with no clustering keys, and scylladb/2972 turns out to be an interesting optimization in that regard.
If you’re reading closely, then you probably realized that we mentioned ScyllaDB stored 60M records, but the provided monitoring snapshot shows evictions happened when we had roughly 52M items within the cache. What happened to the other 8M rows?
ScyllaDB first stores data within memtables; only during a flush is this data populated and merged with its cache contents. A memtable flush required evicting records from the cache to free memory. However, a memtable is by definition an in-memory structure. Thus, we need to account for the rows within the cache until the last flush prior to eviction, plus the memtable contents.
No evictions at 60M items, 53.3M partitions in Cache, whereas the rest are within memtables
Take this Monitoring snapshot, where no evictions happened. In this test, we've populated 60M unique records, whereas in the previous one we populated 61M records. ScyllaDB unfortunately doesn't provide a way to measure the number of partitions within memtables as it is part of the hot write path, but there are ways to estimate that number (such as how we've done here).
ScyllaDB memory is split in two categories:
- LSA (Log Structured Allocator) – Used by the cache and memtables
- Non LSA – Regular memory
Here, we can see that both the cache and memtables are fully utilizing the server's memory, whereas other components (such as Bloom Filters) consume roughly 2.3GB of RAM.
The ideal (though unrealistic) workload for a cache is one where data fits in RAM so that reads don't require disk access and no evictions or misses occur. Both ScyllaDB and Memcached employ LRU (Least Recently Used) logic for freeing up memory: When the system runs under pressure, items get evicted from the LRU's tail, which are typically the least active items.
Taking evictions and cache misses out of the picture helps measure and set a performance baseline for both datastores. It places the focus on what matters most for these kinds of workloads: read throughput and request latency.
In this test, we first warmed up both stores with the same payload sizes used during the previous test. Then, we initiated reads against their respective ranges for 30 minutes.
A great feature of the memcached protocol is request pipelining, which it has supported since its early days. This allows a client to pack multiple requests into a single network write, resulting in fewer network round trips.
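To illustrate the idea, below is a minimal sketch of protocol-level pipelining over a raw socket using the classic text protocol. It assumes a memcached instance listening on 127.0.0.1:11211 with the keys already set, and it is not how mcshredder is implemented (mcshredder's workloads use the newer meta commands), but the principle is the same: many requests, one round trip.

```python
# Minimal sketch of protocol-level pipelining against memcached using the
# classic text protocol. Assumes a local memcached on the default port and
# that values do not themselves contain the "END\r\n" marker.
import socket

keys = [f"key{i}" for i in range(16)]          # pipeline depth of 16, as in the tests

with socket.create_connection(("127.0.0.1", 11211)) as s:
    # Pack 16 independent GET requests into a single write / network round trip.
    s.sendall("".join(f"get {k}\r\n" for k in keys).encode())

    # Each GET response is terminated by "END\r\n"; read until all have arrived.
    buf = b""
    while buf.count(b"END\r\n") < len(keys):
        buf += s.recv(65536)

print("received", buf.count(b"VALUE "), "values for one pipelined write")
```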
For this test, we ran the mcshredder perfrun_metaget_pipe workload with 32 threads, 40 clients and a pipeline depth of 16 items. You can find the (long) output of that run in the GitHub repo.
Memcached achieved an impressive 3 Million Gets per second, fully maximizing AWS NIC bandwidth (25 Gbps)!
Memcached kept a steady 3M rps, fully maximizing the NIC throughput
We then parsed the test output into our modified version of perf-parser.pl, resulting in a histogram showing that the p99.999 of responses completed below 1ms:
--- subtest basic ---
stat: cmd_get :
Total Ops: 5503513496
Rate: 3060908/s
=== timer mg ===
1-10us 0 0.000%
10-99us 343504394 6.238%
100-999us 5163057634 93.762%
1-2ms 11500 0.00021%
2-3ms 144 0.00000%
3-4ms 336 0.00001%
4-5ms 400 0.00001%
11-12ms 32 0.00000%
At this point, it became clear that networking was the bottleneck and that Memcached likely could have delivered even higher throughput over a faster link (or with no networking at all). For example, in this HackerNews thread, dormando mentioned that he could scale it up past 55 million read ops/sec for a considerably larger server.
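A rough calculation shows just how close those numbers sit to the advertised NIC limit (value payload only, ignoring keys, memcached protocol framing and TCP/IP headers, so real wire traffic is somewhat higher than this lower bound):

```python
# Sanity check: ~3M GET/s of 1 KB values against a 25 Gbps NIC.
rate = 3_060_908          # GETs per second, from the mcshredder output above
value_bytes = 1_000       # fixed value size used in the tests

gbps = rate * value_bytes * 8 / 1e9
print(f"~{gbps:.1f} Gbps of value payload alone vs. a 25 Gbps NIC")
```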
As our main testing focused on performance between distributed systems, we decided not to stress memcached locally. However, be aware of the cost optimization opportunities involved: as Memcached requires much less CPU for a similar workload, you can switch to smaller instance sizes for a better cost return.
Unlike Memcached, ScyllaDB – or more specifically the CQL protocol (the subject of our testing) – has no concept of request pipelining. As a result, on top of the overheads discussed in the previous test, achieving a similar throughput under a simple key-value workload turns out to be challenging.
While Memcached's pipelining works by batching many requests under a single network call, the CQL protocol (and to a greater extent, ScyllaDB's shard-per-core architecture), makes this an anti-pattern.
Since memcached pipelining requires fewer client-server round trips (and thus fewer network packets), ScyllaDB clients would require an order of magnitude more requests to achieve the same GET rates seen with Memcached. In turn, this easily saturates the NIC bandwidth and/or the network's maximum packets per second (pps).
It is worth noting that, while the CQL protocol does allow reading multiple keys within a single query (via the SELECT ... IN (...) clause), relying on it wouldn't be optimal either. Selecting multiple keys in a single request would have a high probability of spanning multiple replica shards. A single operation would need to be handled by a single coordinator, thus introducing cross-CPU traffic and potentially elevating latencies.
At this point, we needed to come up with a better data model for ScyllaDB where a single client operation could result in more "GET" requests (or – in ScyllaDB terms, more rows being read per operation). After considering the problem from a different angle for a few minutes, we came up with a simple answer for it: Group 16 rows per partition (our memcached pipeline depth) with the help of a clustering key, where each partition hit would result in the server returning us 16 rows.
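For illustration, the adjusted data model looks roughly like the sketch below. The table and column names are ours for this example only; the actual schema and data in the tests were generated via cassandra-stress. The point is that a single SELECT on the partition key returns all 16 rows.

```python
# Hypothetical sketch of the "16 rows per partition" data model, using the
# Python driver (which works against ScyllaDB). Names are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["scylla-host"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS kv
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS kv.cache (
        pkey  bigint,           -- partition key: one per group of 16 items
        ckey  int,              -- clustering key: 0..15 within the partition
        value blob,             -- fixed 1000-byte payload
        PRIMARY KEY (pkey, ckey)
    )
""")

# A single client operation now fetches 16 rows -- the CQL counterpart of a
# memcached pipeline depth of 16.
rows = list(session.execute("SELECT ckey, value FROM kv.cache WHERE pkey = %s", (42,)))
print(f"{len(rows)} rows returned for one partition read")
```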
A similar Memcached workload would simply associate the data from all rows with one key and scale accordingly. However, in doing so, this important difference between the two solutions wouldn't stand out – effectively showing Memcached as a more performant solution for single and small key-value lookups.
With a clustering key, we could fully maximize ScyllaDB's cache, resulting in a significant improvement in the number of cached rows. We ingested 5M partitions, each with 16 clustering keys, for a total of 80M cached rows.
As a result, the number of records within the cache significantly improved (note that the previous memtable explanation still applies here). We've also confirmed that the number of total rows matched our expectations prior to our read tests. To do that, we ran an efficient full table scan which returned 80M rows.
ScyllaDB Cache utilization after initial data ingestion
With these adjustments, our loaders ran a total of 187K read ops/second over 30 minutes. Each operation resulted in 16 rows getting retrieved.
Like memcached, ScyllaDB also maximized the NIC's throughput. It served roughly 3M rows/second solely from in-memory data:
ScyllaDB Server Network Traffic as reported by node_exporter
Number of read operations (left) and rows being hit (right) from cache during the exercise
ScyllaDB exposes server-side latency information, which is useful for analyzing latency without the network. During the test, ScyllaDB's server-side p99 latency remained within 1ms bounds:
Latency and Network traffic from ScyllaDB matching the adjustments done
The client-side percentiles are, unsurprisingly, higher than the server-side latency. The following graph demonstrates the P99 spanning a period of 5 minutes from the clients' perspective:
cassandra-stress P99 latency histogram
Measuring flash storage performance introduces its own set of challenges, which makes it almost impossible to fully characterize a given workload realistically. During our planning phase, we discussed what a proper "ideal workload" would look like – and the hard truth is that such a thing doesn't exist.
That is, the ideal use case for a cache is one where frequently accessed and recent items live in RAM. As demonstrated earlier, in-memory items have a practically unlimited fetch ceiling and maximize throughput well beyond what modern cloud NICs can offer. Conversely, if an item isn't frequently accessed, it gets offloaded to disk, since the likelihood of it being hit again is lower than for items in memory.
Disks have a much lower fetch rate than RAM. If we were to stress disks to their limits, we would find ourselves testing the opposite of what a proper cache workload is meant for. These characteristics alone brought us to an impasse, with no easy way around it.
Given these constraints, we decided to measure the most pessimistic situation: Compare both solutions serving data (mostly) from block storage, knowing that:
- The likelihood of realistic workloads doing this is somewhere close to zero
- Users should expect numbers in between the previous optimistic cache workload and the pessimistic disk-bound workload in practice.
Prior to any tests, we measured our disks' performance using fio and recorded its output. We used the XFS filesystem with its default settings. We were particularly interested in the randread test, which showed the disk delivering up to 340K IOPS with a bandwidth of 1.3 GB/s at a block size of 4kB. Many small I/O operations match the way datastores dispatch requests to the underlying storage.
Next, we moved on to load data and started a storage-bound workload against both solutions.
Offloading in-memory cache space to flash storage is not necessarily a new thing. Memcached introduced support for Extstore in 2017, and dormando provided a thorough explanation of the case for NVMe in his 2018 article, with insights on the performance and cost tradeoffs that are still relevant today.
The Extstore wiki page provides extensive detail about the solution’s inner workings. At a high-level, it allows memcached to keep its hash table and keys in memory, but store values onto external storage. Keep in mind that using Extstore is currently incompatible with warm restarts, thus a process restart always invalidates all data stored within the filesystem.
The good thing about Extstore's approach is that it is very simple to reason about. Items are written mainly to RAM and stored in a particular Slab class. During memory pressure, rather than evicting records, the item values are asynchronously flushed to storage and its key and other necessary structures are reallocated to a different slab class.2 On top of the per item overhead, memcached also needs an additional 12 bytes per item containing a pointer to the flash location where the item got stored for efficient retrieval later.
Given the in-memory item overhead, it is important to note that even with Extstore, there will still be a limit to the number of items you will be able to store under a single memcached instance. During our tests, we populated memcached with 1.25B items with a value size of 1KB and a keysize of up to 14 bytes:
Evictions started as soon as we hit approximately 1.25B items, despite free disk space
With Extstore, we stored around 11X the number of items compared to the previous in-memory workload until evictions started to kick in (as shown in the right hand panel in the image above). Even though 11X is an already impressive number, the total data stored on flash was only 1.25TB out of the total 3.5TB provided by the AWS instance.
Obviously, larger values consume more storage and, since fewer keys are needed to fill it, result in better memory utilization. It’s important to call out this detail (and Memcached's documentation even mentions that "the smaller the items stored on flash, the more likely you are to saturate the flash device before your network device"). You probably want to save some RAM space for serving hot items, without incurring the penalty of Extstore IO threads frequently hitting your disks.
Speaking of IO threads, Extstore relies on buffered IO (using the pread() and pwrite() syscalls) to dispatch requests to its underlying backing store. The ext_threads parameter controls the number of threads available to Extstore, and finding the optimal thread count is up to the user: too few threads can leave the disk underutilized, whereas too many can introduce thread contention. Support for O_DIRECT and asynchronous I/O is planned on Extstore's roadmap to address some of the shortcomings of relying solely on the Linux page cache.
We slightly modified memcached’s CLI arguments for our performance tests. We introduced Extstore-relevant parameters and no longer performed any CPU pinning for network IRQs, given that the previous fio numbers had already shown that disk-bound traffic would be unable to fully maximize our available network bandwidth. The command line used was:
/usr/local/memcached/bin/memcached -v -A -r -m 114100 \
-c 4096 --threads 16 -u scylla -C -o ext_path=/mnt/file:3400G \
-o ext_threads=<thread_count>,ext_wbuf_size=32
On top of these settings, we also ensured that our backing NVMe store had no defined IO scheduler (none in sysfs), and nomerges was set to 2.
Small Payloads
We started by measuring Extstore's performance with an item size of 1KB (excluding the key size). During these tests, we wanted to keep a delicate balance between available item memory and storage utilization. Considering a per-item overhead of 48 bytes, plus 12 bytes for the flash pointer, and a key size ranging from 5 to 14 bytes, we stored 700M items, consuming approximately 52GB of RAM and roughly 700GB of raw values.
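For reference, here is the rough arithmetic behind that RAM figure, using only the per-item numbers quoted above; the gap to the observed ~52GB is the remaining bookkeeping (hash table buckets, slab structures, and so on):

```python
# Estimate of RAM consumed by 700M Extstore-backed items: each item keeps its
# 48-byte header, a 12-byte flash pointer and its key in memory, while the
# 1 KB value lives on flash.
items = 700_000_000
item_header = 48        # "Item (no cas)" from the sizes utility
flash_ptr = 12          # extstore header per item
avg_key = 10            # keys ranged from 5 to 14 bytes

ram_gb = items * (item_header + flash_ptr + avg_key) / 1e9
values_gb = items * 1_000 / 1e9
print(f"~{ram_gb:.0f} GB of in-memory metadata, ~{values_gb:.0f} GB of values on flash")
```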
Our first test involved loading the data using 32 memcached IO-Threads, showing that the disks were underutilized:
iotop: I/O utilization with 32 IO-threads
Even though all IO threads were serving reads, we weren't able to increase throughput further without observing a latency increase on the client side (matching an increase in memcached's extstore_io_queue metric). Next, we doubled the number of IO threads to 64, resulting in higher throughput:
iotop: I/O utilization under 64 IO-threads
The table below summarizes the numbers we observed while reading with a pipeline depth of 16, as well as with no pipelining:
Test Type | Items per GET | IO Threads | GET Rate | P99 Latency |
---|---|---|---|---|
perfrun_metaget_pipe | 16 | 32 | 188K/s | 4~5 ms |
perfrun_metaget | 1 | 32 | 182K/s | <1ms |
perfrun_metaget_pipe | 16 | 64 | 261K/s | 5~6 ms |
perfrun_metaget | 1 | 64 | 256K/s | 1~2ms |
Larger Payloads
We decided to store 8KB values and target close to 75% disk utilization for our large item tests. Next, we started mcshredder's disk test to read from those items.
Larger payloads shine on Extstore and can achieve a higher disk utilization
Unsurprisingly, beyond lower overhead and higher disk utilization, an added benefit of storing larger objects is an even higher ratio of extended caching space. If we were not using Extstore, we would need close to 25X more RAM to store the same ~350M 8KB objects (2.8TB total). There are clear cost savings to be had by leveraging Extstore within your existing Memcached infrastructure.
As expected, the results demonstrate that latency was higher when reading primarily from disks than from memory. The following table summarizes the results found under different tests while using different thread counts for Extstore:
Test Type | Items per GET | IO Threads | GET Rate | P99 Latency |
---|---|---|---|---|
perfrun_metaget_pipe | 16 | 16 | 92K/s | 5~6 ms |
perfrun_metaget | 1 | 16 | 90K/s | <1ms |
perfrun_metaget_pipe | 16 | 32 | 110K/s | 3~4 ms |
perfrun_metaget | 1 | 32 | 105K/s | <1ms |
The overall disk bandwidth achieved ranged between 1GB/sec (16 IO threads) to 1.2GB/sec (32 IO threads). We noticed that throughput didn't scale linearly along with the thread count. iotop shows that a single Extstore thread achieved ~66MB/s for 16 IO Threads (dark background), whereas with 32 IO threads (white background), the workload didn't dispatch I/O to all available threads – somewhat of a sign that disks are close to saturation.
iotop: I/O utilization under 16 IO Threads, payload of 8KB
iotop: I/O utilization under 32 IO Threads, payload of 8KB. Note some threads are mostly idle.
A final remark: workloads without pipelining observed lower latencies and completed most requests with sub-millisecond response times. In fact, Memcached exposes an extstore_io_queue metric which helps explain the different latency numbers: the I/O queue was considerably higher when using pipelining (about 10X more) than with individual GETs. This makes sense, as multi-key batches require more I/O to complete on a per-request basis.
As opposed to memcached, ScyllaDB is a full-blown database. This entails, among other things, crash recovery mechanisms (the commitlog), support for different compaction algorithms for different workloads, compression for saving storage space, and keeping SSTable components (such as Indexes, Summaries and Bloom Filters) in RAM to minimize I/O operations.
More recently, ScyllaDB also introduced support for SSTable index caching in order to further minimize walking through index files on disk. Prior to this feature, all read requests not present in ScyllaDB's row cache would require issuing disk I/O to lookup the position in the Index file where the corresponding key is located. Therefore, ScyllaDB follows a very different approach than Extstore's key-cache implementation. Although ScyllaDB doesn't implement a key-based cache, discussions around the topic do exist (#194, #363).
Beyond these optimizations, there is also the complexity of supporting complex data models, access patterns, replication and background operations (such as streaming, repairs and compaction). At the same time, ScyllaDB guarantees that a fair share of resources is given to tasks competing for I/O, without impacting end-user latency.
A database like ScyllaDB requires fine-grained control over every I/O operation, beyond what buffered I/O can offer. We explained the reasons why we chose Asynchronous Direct I/O for ScyllaDB a few years back.
One of the challenges involved with AIO/DIO (as noted within the previous article) is that its benefits come with a significant cost: complexity. To overcome this, ScyllaDB uses the userspace Seastar I/O scheduler to prioritize and guarantee quality of service among the various I/O classes that the database has to serve requests for.
Although the implementation is very complex on its own (and we have improved the I/O scheduler across several generations over the years), a notable benefit for users is that getting started with it is particularly simple. During the database setup process, seastar's iotune benchmarks the underlying disks and records the disk bandwidth and IOPS for both reads and writes. This process not only helps the I/O scheduler make informed decisions and keep disk concurrency high; it also removes the need to fine-tune disk-related settings, as we did for Extstore threads.
The following output demonstrates the contents of io_properties.yaml, used by the I/O scheduler once the database starts. As we can see, the numbers aren't very different from the ones recorded within the previous fio outputs.
disks:
- mountpoint: /var/lib/scylla
read_iops: 313289
read_bandwidth: 3116245504
write_iops: 91691
write_bandwidth: 2287052288
While Memcached keeps in RAM a pointer to the storage location of each key, ScyllaDB reads follow a totally different approach:
- The database first selects a list of candidate SSTables likely to contain the data via its Bloom Filters
- Next, each SSTable summary file is looked up to estimate the position of the relevant key
- The estimated positions are then searched through in the actual Index file
- Once the key is found, a read is issued to the Data file to retrieve the actual data
Steps (1) and (2) happen in memory, and step (3) may happen in memory or not, depending on whether the Index in question is cached. In a sense, memcached’s approach favors fast and efficient data retrieval, whereas ScyllaDB favors the ability to store and query large data sets ranging from hundreds of millions to billions of items.
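To make the flow easier to picture, here is a toy, heavily simplified model of those four steps. Everything lives in memory here and the bloom filter is an exact set rather than a probabilistic one; it only illustrates which structure answers which question during a point lookup.

```python
# Toy model of the SSTable read path: bloom filter -> summary -> index -> data.
# Real implementations are on-disk and far more sophisticated; this only shows
# the role each structure plays.
from bisect import bisect_right

SUMMARY_STRIDE = 4          # one summary entry per N index entries

class ToySSTable:
    def __init__(self, rows):                       # rows: {key: value}
        self.keys = sorted(rows)                    # the "index file": sorted keys
        self.data = [rows[k] for k in self.keys]    # the "data file"
        self.bloom = set(self.keys)                 # step 1: membership check (exact here)
        self.summary = self.keys[::SUMMARY_STRIDE]  # step 2: sparse sample of the index

    def get(self, key):
        if key not in self.bloom:                   # (1) in memory: cheap rejection of misses
            return None
        # (2) in memory: the summary narrows the search to one block of the index
        start = max(bisect_right(self.summary, key) - 1, 0) * SUMMARY_STRIDE
        # (3) index lookup within that block (on disk unless the index is cached)
        for pos in range(start, min(start + SUMMARY_STRIDE, len(self.keys))):
            if self.keys[pos] == key:
                return self.data[pos]               # (4) read the data file at that position
        return None

sst = ToySSTable({f"key{i}": f"value{i}" for i in range(100)})
print(sst.get("key42"), sst.get("nope"))
```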
Given this approach, ScyllaDB's default settings aren't tuned for heavy scanning of small partitions from disk. That is, in a key-value workload, every partition read walks through the 4 steps above, inevitably adding a level of overhead. Conversely, wide rows benefit from the default settings: the number of disk seeks to the Index file is often negligible compared to the number of sequential reads done against the actual data file.
ScyllaDB does provide optimization options suitable for key-value workloads. In particular, one possible optimization is to increase the cardinality of its Summary files and therefore enhance the precision when looking up the Index file. This task can be accomplished via the sstable_summary_ratio configuration parameter. The tradeoff is the increased size of the summary files stored in RAM. If you want to read more about this setting, check out how Pagoda optimized their disk performance with ScyllaDB. Another optimization would be to adjust the index_cache_fraction option to minimize Index scans on disk even further.
It is worth keeping an eye on ScyllaDB's Non-LSA memory consumption prior to making such a change. Increasing its footprint effectively reduces the total row-cache size, which may be particularly important for storing more items within the database cache.
Non-LSA Memory
The last overhead relates to the commitlog, a WAL sitting in the hot write path by default. Although ScyllaDB does allow users to disable the commitlog (via the --enable-commitlog option), we didn't explicitly set that option. As a result, every write also incurred additional I/O before being acknowledged back to the client.
Small Payloads
Following the memcached tests, we populated ScyllaDB with 700M items of 1KB each. The schema used was a plain and simple key-value store, where a single partition holds a single row. On top of these changes, the ScyllaDB index_cache_fraction setting was also increased to 0.5 (from its default of 0.2) to mimic (though not quite) a similar hash table allocation as the one observed during the memcached tests.
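For contrast with the earlier wide-partition sketch, the key-value schema here is of the plain form below. Again, the names are illustrative only; the actual table and data were produced by cassandra-stress.

```python
# Hypothetical sketch of the plain key-value schema: one row per partition,
# no clustering key. Reuses the "kv" keyspace from the earlier sketch.
from cassandra.cluster import Cluster

session = Cluster(["scylla-host"]).connect()
session.execute("""
    CREATE TABLE IF NOT EXISTS kv.flat_cache (
        pkey  bigint PRIMARY KEY,   -- one partition == one cached item
        value blob                  -- ~1 KB payload, matching the memcached runs
    )
""")
```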
The results show ScyllaDB achieved an impressive rate of over 19K reads/second per shard with server-side P99 tail-latencies of 2 ms:
Left: Per core (shard) throughput. Right: P99 Latency
At the disk I/O level, note how ScyllaDB required much less bandwidth (but got close to maximizing disk read IOPS) to achieve its throughput, thanks to its AIO/DIO implementation:
ScyllaDB I/O scheduler bandwidth and IOPS
Finally, the client-side reported a throughput rate of 268K ops/s, under a collective P99 latency of 2.4ms:
Client-side latency histogram
Larger Payloads
During the ingestion phase, we populated ScyllaDB with the same 350M records of 8KB in size as we've done for Extstore. That brought the total storage utilization to about 81%, around 6% more than Extstore's consumption.
ScyllaDB storage utilization after ingesting 350M ~8KB records
Our next and final step was, of course, to read from this data. Before doing so, we applied the following optimization settings for reading from this particular key-value workload:
cache_index_pages: true
--io-latency-goal-ms 1
Beyond these settings, we also changed the compaction strategy to Leveled Compaction (a great fit for read-mostly workloads) and turned off SSTable compression, primarily because most of our data was stored as blobs.
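As a rough example, the table-level part of those changes could be expressed in CQL along the lines of the sketch below. The table name is illustrative, the exact syntax for disabling compression varies across versions (check the documentation for yours), and cache_index_pages / --io-latency-goal-ms are node-level settings rather than table options.

```python
# Hypothetical CQL for the table-level adjustments: Leveled Compaction and
# compression turned off. Verify the compression syntax against your
# ScyllaDB version's documentation.
from cassandra.cluster import Cluster

session = Cluster(["scylla-host"]).connect()
session.execute("""
    ALTER TABLE kv.flat_cache
    WITH compaction  = {'class': 'LeveledCompactionStrategy'}
     AND compression = {'sstable_compression': ''}
""")
```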
The results summary shows that ScyllaDB delivered close to 160K reads/sec with a P99 latency below 2ms entirely from disks. The following monitoring snapshot shows that no keys within ScyllaDB's row cache were accessed during the run (except for client reads during startup):
Except during the initial connection stage, all traffic is being directed to disk - Note how most traffic results in a Cache Miss
The following graph demonstrates the P99 spanning a period of 5 minutes from the clients' perspective:
cassandra-stress P99 latency histogram
Despite the increased rate, the IOPS and bandwidth achieved were very similar to memcached's numbers. Note how the starvation time grows over the run; that’s yet another indication that the disks cannot dispatch some requests in a timely manner (thus increasing those requests' latency):
ScyllaDB I/O Scheduler Metrics
Finally, as a comparison with the Extstore results, note how ScyllaDB achieved a slightly higher (though comparable) bandwidth than memcached with 32 IO threads:
iotop: ScyllaDB I/O
Following our previous Disk results, we then compared both solutions in a read-mostly workload targeting the same throughput (250K ops/sec). The workload in question is a slight modification of memcached's 'basic' test for Extstore, with 10% random overwrites. It is considered a "semi-worst case scenario." This does make sense, as Cache workloads naturally shine upon higher read ratios.
Memcached achieved a rate of slightly under 249K ops/sec during the test. Although write rates remained steady throughout the test, we observed that reads fluctuated slightly throughout the run:
Memcached: Read-mostly workload metrics
We also observed slightly elevated extstore_io_queue metrics despite the lowered read ratio, but latencies were still kept low. These results are summarized below:
Operation | IO Threads | Rate | P99 Latency |
---|---|---|---|
cmd_get | 64 | 224K/s | 1~2 ms |
cmd_set | 64 | 24.8K/s | <1ms |
The ScyllaDB test was run using 2 loaders, each targeting half of the total rate. Even though ScyllaDB achieved a slightly higher throughput (259.5K ops/sec), write latencies were kept low throughout the run while read latencies were higher (as with memcached):
ScyllaDB: Read-mostly workload metrics
The table below summarizes the client-side run results across the two loaders:
Loader | Rate | Write P99 | Read P99 |
---|---|---|---|
loader1 | 124.9K/s | 1.4ms | 2.6 ms |
loader2 | 124.6K/s | 1.3ms | 2.6 ms |
Both memcached and ScyllaDB managed to maximize the underlying hardware utilization across all tests and keep latencies predictably low. So which one should you pick? The real answer: It depends.
If your existing workload can accommodate a simple key-value model and it benefits from pipelining, then memcached should be more suitable to your needs. On the other hand, if the workload requires support for complex data models, then ScyllaDB is likely a better fit.
Another reason for sticking with Memcached: it easily delivers traffic far beyond what a NIC can sustain. In fact, in this Hacker News thread, dormando mentioned that he could scale it up past 55 million read ops/sec for a considerably larger server. Given that, you could make use of smaller and/or cheaper instance types to sustain a similar workload, provided the available memory and disk footprint suffice for your workload's needs.
A different angle to consider is the data set size. Even though Extstore provides great cost savings by allowing you to store items beyond RAM, there's a limit to how many keys can fit per GB of memory. Workloads with very small items should observe smaller gains compared to those with larger items. That’s not the case with ScyllaDB, which allows you to store billions of items irrespective of their sizes.
It’s also important to consider whether data persistence is required. If it is, then running ScyllaDB as a replicated distributed cache provides you greater resilience and non-stop operations, with the tradeoff being (and as memcached correctly states) that replication halves your effective cache size. Unfortunately Extstore doesn't support warm restarts and thus the failure or maintenance of a single node is prone to elevating your cache miss ratios. Whether this is acceptable or not depends on your application semantics: If a cache miss corresponds to a round-trip to the database, then the end-to-end latency will be momentarily higher.
With regards to consistent hashing, memcached clients are responsible for distributing keys across your distributed servers. This may introduce some hiccups, as different client configurations will cause keys to be assigned differently, and some implementations may not be compatible with each other. These details are outlined in Memcached’s ConfiguringClient wiki. ScyllaDB takes a different approach: consistent hashing is done at the server level and propagated to clients when the connection is first established. This ensures that all connected clients always observe the same topology as you scale.
So who won (or who lost)? Well… This does not have to be a competition, nor an exhaustive list outlining every single consideration for each solution. Both ScyllaDB and memcached use different approaches to efficiently utilize the underlying infrastructure. When configured correctly, both of them have the potential to provide great cost savings.
We were pleased to see ScyllaDB matching the numbers of the industry-recognized Memcached. Of course, we had no expectations of our database being "faster." In fact, as we approach microsecond latencies at scale, the definition of faster becomes quite subjective. :-)
Question: Why were tests run against a single ScyllaDB node?
- Answer: The typical ScyllaDB deployment starts with a 3-node cluster, usually spread out across different cloud availability zones. In this testing, we wanted to compare ScyllaDB against memcached in an apples-to-apples comparison. Although it is perfectly possible for memcached to also be run in a distributed fashion, we decided to keep the testing simple.
Question: Why did you disable CAS on memcached?
- Answer: To save on per-item overhead. Refer to the memcached (How Much Memory Will an Item Use) page.
Question: Why didn't you introduce elevated write rates?
- Answer: Caches are a great fit for read-intensive workloads. An elevated write ratio introduces eviction (as data eventually can no longer fit within the cache) and may end up introducing an anti-pattern known as Cache Thrashing, not to mention the risk of evicting important cached items due to overcaching. This effectively does more harm than good.
Question: Why haven't you tested <add particular workload> or <add specific configuration>?
- Answer: It is almost impossible to test ALL particularities. This work has been peer reviewed, and we believe the presented scenarios are close enough and relevant to a majority of users. Yes, there will be conditions where one solution will shine more than others. We showed you how it's done; feel free to compare your particular solution/settings there and share your results.
Question: Why Memcached and not Redis, ElastiCache, Memorystore, etc?
- Answer: Testing Memcached was an easy decision because it’s fully open source without vendor specific code. Plus, memcached is really simple to get started with.
Question: Why not use YCSB, nosqlbench, memtier or <add your favorite> instead?
- Answer: Some of these tools are outdated (YCSB is a prime example of this) and/or may not follow each solution's best practices. We'll be happy to use your tool next time if you can prove it can attain similar or better results.
Footnotes
1. Keep in mind that, especially during tests maxing out the promised network capacity, we noticed throttling reducing the bandwidth to the instances' baseline capacity. ↩
2. In reality, memcached documentation states that "If you're writing new data to memcached faster than it can flush to flash, it will evict from the LRU tail rather than wait for the flash storage to flush. This is to prevent many failure scenarios as flash drives are prone to momentary hangs". This shouldn't generally be a concern because cache workloads are suitable for ephemeral data. ↩