Home
The primary goal of the project (besides fun) was to create an implementation of mctop that lost as little traffic as possible. In doing so, I discovered a few libpcap quirks as well as some interesting benchmark data.
I spent a bit of time benchmarking different packet capture strategies as well as different data structures for avoiding a global data lock.
In my head I had a process that looked something like the following (a rough code sketch follows the list):
- Capture loop pushes packets onto a queue
- Consumer(s) pull packets off the queue for parsing
- Each packet is parsed into a memcache command, which is pushed onto a stats queue for processing
- Stats consumer(s) pull parsed memcache commands off the queue and integrate them into a data structure
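Sketched in code, that shape looks roughly like this. The type names and the mutex-guarded placeholder queue are mine for illustration, not from the actual code; as described below, the real implementation replaces the locking queue.

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <utility>

struct Packet    { std::string bytes; };     // raw capture data
struct McCommand { std::string cmd, key; };  // parsed memcache command

// Placeholder: a mutex-guarded queue. The real design swaps this out
// for lock-free SPSC queues to avoid the lock overhead.
template <typename T>
class BlockingQueue {
  std::queue<T> q_;
  std::mutex m_;
 public:
  void push(T v) {
    std::lock_guard<std::mutex> g(m_);
    q_.push(std::move(v));
  }
  bool pop(T &out) {
    std::lock_guard<std::mutex> g(m_);
    if (q_.empty()) return false;
    out = std::move(q_.front());
    q_.pop();
    return true;
  }
};

BlockingQueue<Packet>    packet_queue; // capture -> parser(s)
BlockingQueue<McCommand> stats_queue;  // parser(s) -> stats consumer(s)

// Steps 2 and 3 of the list above: pull, parse, hand off to stats.
void parser_loop() {
  Packet p;
  for (;;) {
    if (!packet_queue.pop(p)) continue; // real code would back off, not spin
    McCommand cmd;                      // cmd = parse(p)
    stats_queue.push(cmd);
  }
}
```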
This design has a few challenges:
- The capture loop must be non-blocking
- A single packet-parsing consumer will fall behind (introducing reporting latency)
- Once interacting with the stats data structure, global locks need to be reduced (again, to avoid reporting latency)
- The stats consumers have the same locking problem as the previous point
Initially I created a concurrent queue so that I could have multiple producers creating packets and multiple consumers pulling (and parsing) packets. With the version of g++ available to me, I was unable to do this in a way that avoided significant lock overhead, so I settled on a lock-free single-producer/single-consumer queue implementation. This avoided the lock overhead but created a consumption bottleneck. That evolved into a striped queue implementation where N threads each own a packet queue, and packets are distributed ~evenly across the queues. With 3 threads, this implementation kept the average latency between producing a packet and consuming it at 1ms; with the concurrent queue, the average latency was several seconds.
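As an illustration of the striped layout, here is a minimal sketch using `boost::lockfree::spsc_queue` as a stand-in for the project's own SPSC queue; the round-robin distribution, drop-on-full policy, and queue capacity are my assumptions.

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <array>
#include <cstddef>

struct Packet { /* raw bytes + capture metadata */ };

const std::size_t kStripes = 3; // one consumer thread per stripe

// One bounded SPSC queue per consumer thread.
typedef boost::lockfree::spsc_queue<Packet, boost::lockfree::capacity<4096> >
    PacketQueue;
std::array<PacketQueue, kStripes> queues;

// Called only from the capture thread: distribute packets ~evenly
// across stripes (round robin; hashing the connection would also work).
void enqueue(const Packet &p) {
  static std::size_t next = 0; // safe: single producer thread
  if (!queues[next % kStripes].push(p)) {
    // queue full: drop the packet so the capture loop never blocks
  }
  ++next;
}

// Each consumer thread drains only the queue it owns.
void consume(std::size_t stripe) {
  Packet p;
  for (;;) {
    while (queues[stripe].pop(p)) {
      // parse(p) and push the result toward the stats engine
    }
  }
}
```

Because only the capture thread pushes and each stripe has exactly one consumer, every queue still sees a single producer and a single consumer, so the SPSC contract holds even with N consumer threads in total.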
The next issue I ran into was similar to the one just described, but for stats aggregation. I used the lock-free queue implementation as a barrier between the capture engine and the stats engine. However, once a memcache command was pulled off the queue, a map still needed to be interacted with.
Need to finish this section.
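As a placeholder until that section is finished, here is one plausible shape for the stats side (all names hypothetical): a parsed command comes off the barrier queue and is folded into a per-key map, which is exactly the structure that needs its locking reduced.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical parsed command; not the project's actual type.
struct McCommand { std::string cmd, key; };

struct KeyStats { unsigned long calls = 0; /* bytes, elapsed time, ... */ };

// The map the stats consumer interacts with after pulling a command
// off the queue; sharing it across threads (and with the display
// loop) is where the global-lock problem shows up.
std::unordered_map<std::string, KeyStats> stats;

void integrate(const McCommand &c) {
  ++stats[c.key].calls;
}
```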
Internally the code uses `pcap_loop` for capturing packets. Doing this means that the capture loop needs to be essentially non-blocking, or else we run the risk of losing packets. I found that if I called `pcap_setnonblock(handle, true, ...)` I basically ate a CPU spinning waiting for data, but without it I lost an unreasonable amount of data. Given that memcached doesn't typically take full advantage of all processor cores, I decided to stick with `pcap_setnonblock(true)`.
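For reference, a minimal self-contained capture setup along these lines is below. The device name, snaplen, and timeout are placeholders, and where the project itself drives capture through `pcap_loop`, this sketch uses `pcap_dispatch` in an explicit loop (the documented pattern for non-blocking mode) to make the spinning behavior visible.

```cpp
#include <pcap.h>
#include <cstdio>

// Capture callback: hand the packet off to the rest of the pipeline.
// This must not block, or packets will be dropped.
static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                      const u_char *bytes) {
  (void)user; (void)hdr; (void)bytes; // e.g. push onto a packet queue
}

int main() {
  char errbuf[PCAP_ERRBUF_SIZE];
  pcap_t *handle = pcap_open_live("eth0", 65535, 1 /* promisc */,
                                  1000 /* read timeout, ms */, errbuf);
  if (handle == nullptr) {
    std::fprintf(stderr, "pcap_open_live: %s\n", errbuf);
    return 1;
  }

  // Non-blocking mode: reads return immediately when no packets are
  // available, so this loop spins (eating a CPU) rather than sleeping.
  if (pcap_setnonblock(handle, 1, errbuf) == -1) {
    std::fprintf(stderr, "pcap_setnonblock: %s\n", errbuf);
    return 1;
  }

  for (;;) {
    pcap_dispatch(handle, -1, on_packet, nullptr);
  }
}
```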
I primarily used `memslap` to drive load on a test box. The load I was able to drive outpaced what I typically see in production. The parameters I used were: `memslap -s 172.16.118.229 -S 2s -t 90s -T 8 -c 128`
Package dependencies are well described in the README.md. That gives you all the prerequisite packages; then you run:
```sh
export CXX=g++44
./autogen.sh
./configure
```
That's it. This also works with g++ 4.7, and it might work with 4.6 as well (I have tested the 4.7 build but not the 4.6 build).