
Goals

The primary goal of the project (besides fun) was to create an implementation of mctop that dropped as little traffic as possible. In doing so, I ran into a few libpcap quirks and collected some interesting benchmark data.

Design, Performance, Concurrency

I spent a bit of time benchmarking different packet capture strategies as well as different data structures for avoiding a global data lock.

In my head I had a process that looked something like:

  1. Capture loop pushes packets onto a queue
  2. Consumer(s) pull packets off the queue for parsing
  3. Parser turns each packet into a memcache command and pushes it onto a stats queue for processing
  4. Stats consumer(s) pull parsed memcache commands off the queue and integrate them into a data structure

This design has a few challenges.

  1. The capture loop must be non-blocking
  2. A single packet-parsing consumer will fall behind (introducing reporting latency)
  3. Interactions with the stats data structure need to minimize global locking (again, to avoid reporting latency)
  4. The stats consumers face the same locking concern as 3

Initially I created a concurrent queue so that I could have multiple producers creating packets and multiple consumers pulling (and parsing) packets. With the version of g++ available to me, I was unable to do this in a way that avoided significant lock overhead. I settled on a lock-free single-producer/single-consumer queue implementation. This avoided the lock overhead but created a consumption bottleneck. It evolved into a striped queue implementation where N threads each own a Packet queue, and packets are ~evenly distributed across the queues. With 3 threads, this implementation kept the average latency between producing a packet and consuming it at 1ms. With the concurrent queue, the average latency was several seconds.
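
To make the striping concrete, here is a minimal sketch of the idea, using a simple ring buffer as the single-producer/single-consumer queue. The names and details are illustrative, not mctop's actual types:

#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal lock-free single-producer/single-consumer ring buffer.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(size_t capacity) : buf_(capacity + 1) {}

    // Called only from the producer thread.
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % buf_.size();
        if (next == tail_.load(std::memory_order_acquire)) return false; // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Called only from the consumer thread.
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false; // empty
        out = buf_[tail];
        tail_.store((tail + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};

struct Packet { /* captured bytes + metadata */ };

// One queue per consumer thread. The capture thread is the only producer
// for every stripe and each parser thread drains exactly one stripe, so
// the single-producer/single-consumer invariant holds per queue.
class StripedQueues {
public:
    StripedQueues(size_t stripes, size_t capacity) {
        for (size_t i = 0; i < stripes; ++i)
            queues_.push_back(std::unique_ptr<SpscQueue<Packet>>(
                new SpscQueue<Packet>(capacity)));
    }

    // Round-robin keeps the stripes ~evenly loaded.
    bool push(const Packet& p) {
        return queues_[next_++ % queues_.size()]->push(p);
    }

    bool pop(size_t stripe, Packet& out) {
        return queues_[stripe]->pop(out);
    }

private:
    std::vector<std::unique_ptr<SpscQueue<Packet>>> queues_;
    size_t next_ = 0; // plain counter: the capture loop is a single thread
};

The key property is that round-robin from a single capture thread means no stripe ever has more than one producer or one consumer, which is what lets each queue stay lock-free.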

The next issue I ran into was similar to the one just described, but for stats aggregation. I used the lock-free queue implementation as a barrier between the capture engine and the stats engine. However, once a memcache command was pulled off the queue, it had to be folded into a shared map.
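
One common way to reduce that contention (a sketch of the general pattern, not necessarily what mctop ended up doing; McCommand here is a made-up stand-in for the parsed command type) is to let each stats consumer accumulate into a private map and only take the global lock when periodically folding its shard into the shared view:

#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the parsed memcache command.
struct McCommand {
    std::string key;
    size_t bytes;
};

// Each stats consumer owns one of these. The hot path (record) touches
// only thread-local state; the global lock is taken once per flush
// interval instead of once per command.
class StatsShard {
public:
    void record(const McCommand& cmd) {
        local_[cmd.key] += cmd.bytes;
    }

    void flushTo(std::unordered_map<std::string, size_t>& global,
                 std::mutex& global_mu) {
        std::lock_guard<std::mutex> guard(global_mu);
        for (auto& kv : local_) global[kv.first] += kv.second;
        local_.clear();
    }

private:
    std::unordered_map<std::string, size_t> local_;
};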

Need to finish this section.

Libpcap Quirks

Internally the code uses pcap_loop for capturing packets. Doing this means the capture loop essentially needs to be non-blocking, or else we run the risk of losing packets. I found that if I called pcap_setnonblock(handle, true, ...) I basically ate a CPU spinning while waiting for data. Without it, however, I lost an unreasonable amount of data. Given that memcached doesn't typically take full advantage of all processor cores, I decided to stick with pcap_setnonblock(true).
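
To illustrate the tradeoff, here is a rough sketch of a non-blocking capture loop. It drives pcap_dispatch in an explicit loop to make the spinning visible (mctop itself uses pcap_loop), and the handler and function names are made up:

#include <cstdio>
#include <pcap.h>

// Hypothetical handler; in mctop's design this would push the packet
// onto a capture queue as quickly as possible and return.
static void on_packet(u_char* user, const struct pcap_pkthdr* hdr,
                      const u_char* bytes) {
    (void)user; (void)hdr; (void)bytes;
}

int run_capture(const char* dev) {
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t* handle = pcap_open_live(dev, 65535 /* snaplen */,
                                    1 /* promisc */, 0 /* to_ms */, errbuf);
    if (handle == nullptr) {
        fprintf(stderr, "%s\n", errbuf);
        return 1;
    }

    // Non-blocking: reads return immediately instead of waiting for data.
    if (pcap_setnonblock(handle, 1, errbuf) == -1) {
        fprintf(stderr, "%s\n", errbuf);
        return 1;
    }

    for (;;) {
        // With nothing pending this returns 0 right away, so the loop
        // spins and pins a core: the cost accepted above in exchange for
        // never sleeping while the kernel's capture buffer fills up.
        pcap_dispatch(handle, -1, on_packet, nullptr);
    }
}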

Benchmarks

I primarily used memslap to drive load on a test box. The load I was able to drive outpaced what I typically see in production. The parameters I used were:

memslap -s 172.16.118.229 -S 2s -t 90s -T 8 -c 128
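
Roughly, per libmemcached's memslap documentation, those flags are: -s the server to drive load against, -S the interval at which statistics are dumped (every 2s), -t the total run time (90s), -T the number of threads (8), and -c the concurrency, i.e. the number of connections (128).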

Building

Package dependencies are described in README.md. Once you have the prerequisite packages installed, you run:

export CXX=g++44
./autogen.sh
./configure

That's it. This should also work with g++ 4.7, and it might work with 4.6 as well (I have tested the 4.7 build but not the 4.6 build).
