[2015/09/11 addendum: replaced fans after receiving
To fix repeated fan "revving"
ipmitool sensor list all | grep FAN
FAN1 | 1400.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
FAN2 | na | | na | na | na | na | na | na | na
FAN3 | 1000.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
# warning: your fans may be different
ipmitool sensor thresh "FAN1" lower 100 200 300
ipmitool sensor thresh "FAN3" lower 100 200 300
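Since fan names differ between boards, it may help to list the fans that actually report a reading before changing any thresholds. The following is a minimal sketch in POSIX sh; the function names `active_fans` and `print_thresh_commands` are ours for illustration, and the 100/200/300 values are simply the ones used above. It only prints the commands for review and does not run them.

```shell
# active_fans: read `ipmitool sensor list` output on stdin and print the
# names of FAN sensors that report a reading (skipping those showing "na").
active_fans() {
    awk -F'|' '
        $1 ~ /FAN/ {
            name = $1; reading = $2
            gsub(/[ \t]/, "", name)
            gsub(/[ \t]/, "", reading)
            if (reading != "na") print name
        }'
}

# print_thresh_commands: print (not run) one threshold command per active fan.
# The lower thresholds 100/200/300 are taken from the commands above and
# are an assumption for any other hardware.
print_thresh_commands() {
    active_fans | while read -r fan; do
        echo "ipmitool sensor thresh \"$fan\" lower 100 200 300"
    done
}

# Example (on the NAS): ipmitool sensor list all | print_thresh_commands
```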
This blog post describes how we built a high-performing NAS server using off-the-shelf components and open source software (FreeNAS). The NAS has the following characteristics:
- total cost (before tax & shipping): $2,631
- total usable storage: 16.6 TiB
- cost / usable GiB: $0.16/GiB
- IOPS: 884 [1]
- sequential read: 1882MB/s [1]
- sequential write: 993MB/s [1]
- double-parity RAID: RAID-Z2
[2014-10-31 We have updated the performance numbers. The old numbers were wrong (they were too low). Specifically, the old performance numbers were generated by a benchmark which was CPU-bound, not disk-bound. We re-generated the numbers by running 8 benchmarks in parallel and aggregating the results]
[caption id="attachment_30838" align="alignnone" width="630"] Our NAS server: 28TB raw data. 1 liter Pellegrino bottle is for scale[/caption]
[Author's note & disclosure: This FreeNAS server is a personal project, intended for my home lab. At work we use an EMC VNX5400, with which we are quite pleased—in fact we are planning to triple its storage. I am employed by Pivotal Labs, which is partly-owned by EMC, a storage manufacturer]
Prices do not include tax and shipping. Prices were current as of September, 2014.
- $375: 1 × Supermicro A1SAi-2750F Mini-ITX 20W 8-Core Intel C2750 motherboard. This motherboard includes the Intel 8-core Atom C2750F. We chose this particular motherboard and chipset combination for its mini-ITX form factor (space is at a premium in our setup) and its low TDP (thermal design power) (better for the environment, lower electricity bills, heat-dissipation not a concern). The low TDP allows for passive CPU cooling (i.e. no CPU fan).
- $372: 4 × Kingston KVR13LSE9/8 8GB ECC SODIMM. 32GiB is a good match for our aggregate drive size (28TB), for ZFS allocates approximately 1GiB RAM for every 1TB raw storage (i.e. we use 28GiB for ZFS, leaving 4GiB for the operating system and L2ARC).
- $1,190: 7 × Seagate 4TB NAS HDD ST4000VN000. According to Calomel.org, 'You do not have the get the most expensive SAS drives … Look for any manufacture [sic] which label their drives as RE or Raid Enabled or "For NAS systems."' Calomel.org goes on to warn against using power saving or ECO mode drives, for they have poor RAID performance.
- $238: 1 × LSI SAS 9211-8i 6Gb/s SAS Host Bus Adapter. Although expensive SAS drives don't offer much value over less-expensive Raid Enabled SATA drives, the SAS/SATA controller makes a big difference, at times more than three-fold. Calomel.org's section "All SATA controllers are NOT created equal" has a convincing series of benchmarks.
- $215: 1 × Crucial MX100 512GB SSD. This drive is intended as a caching drive (ZFS's L2ARC for reads, ZFS's ZIL's SLOG for writes). We chose this drive in particular because it offered power-loss protection (important for synchronous writes); however, Anandtech pointed out in the section "The Truth About Micron's Power-Loss Protection" that "… the M500, M550 and MX100 do not have power-loss protection—what they have is circuitry that protects against corruption of existing data in the case of a power-loss." We encourage the research of alternative SSDs if power-loss protection is desired.
- $110: 1 × Corsair HX650 650 watt power supply. We believe we over-specified the wattage required for the power supply. This poster recommends the Silverstone ST45SF-G, whose size is half that of a standard ATX power supply, making it well-suited for our chassis.
- $100: 1 × Lian Li PC-Q25B Mini ITX chassis. We like this chassis because in spite of its small form factor we are able to install 7 × 3.5" drives, 1 × 2.5" SSD, and the LSI controller.
- $31: 2 × HighPoint SF-8087 → 4 × SATA cables. Although we used a different manufacturer, these cables should work.
[caption id="attachment_30839" align="alignnone" width="630"] The inside view of the NAS. Note the interesting 3.5" drive layout: 5 of them are in a column, and the remaining 2 (along with the SSD) are installed near the LSI controller[/caption]
[caption id="attachment_30840" align="alignnone" width="630"] For easier installation of the power supply, we recommend removing the retaining screw of the controller card (it's not needed). Note that in the photo the screw has already been removed.[/caption]
The assembly is straightforward.
- Do not plug a 4-pin 12V DC power into connector J1—it's meant as an alternative power source when the 24-pin ATX power is not in use
- Use a consistent system when plugging in SATA data connectors. The LSI card has 2 × SF-8087 connectors, each of which fan out to 4 SATA connectors. We counted the drives from the top, so the topmost drive was connected to SF-8087 port 1 SATA cable 1, the second topmost drive was connected to SF-8087 port 1 SATA cable 2, etc… We connected the SSD to SF-8087 port 2 cable 4 (i.e. the final cable).
There are two caveats to the initial power on:
- the unit's power draw is so low that the Corsair power supply's fan will not turn on
- make sure your VGA monitor is turned on and working (ours wasn't)
We download the USB image from here. We follow the OS X instructions from the manual; Windows, Linux, and FreeBSD users should consult the manual for their respective operating system.
If you have previously-formatted your USB drive with GPT (not MBR) partitioning, you will need to wipe the second GPT table as described here. These are the commands we used. Your commands will be similar, but the sector numbers will be different. Be cautious.
# use diskutil list to find the device name of our (inserted) USB drive
diskutil list
# in this case it's "/dev/disk2"
diskutil info /dev/disk2
# "Total Size: 16.0 GB (16008609792 Bytes) (exactly 31266816 512-Byte-Units)"
# Determine the beginning of the final 8 blocks (512-byte blocks, final 4kB):
# 31266816 - 8 = 31266808
# wipe the last 4k bytes
sudo dd if=/dev/zero of=/dev/disk2 bs=512 oseek=31266808
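Because the sector numbers differ for every USB drive, the offset can be computed rather than worked out by hand. The following is a minimal sketch in POSIX sh; the function name `last_blocks_offset` is ours for illustration, and it assumes you pass the "Total Size" in bytes as reported by diskutil info.

```shell
# last_blocks_offset TOTAL_BYTES BLOCK_SIZE N
# Print the block number at which the final N blocks of the device begin.
last_blocks_offset() {
    total_bytes="$1"
    block_size="$2"
    n="$3"
    echo $(( total_bytes / block_size - n ))
}

# For the 16.0 GB drive above (16008609792 bytes, 512-byte blocks, final 8 blocks):
last_blocks_offset 16008609792 512 8
```

The printed value is what we pass to dd's oseek. Double-check the device name before running dd.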
Per the FreeNAS user manual:
cd ~/Downloads
xzcat FreeNAS-9.2.1.8-RELEASE-x64.img.xz > FreeNAS-9.2.1.8-RELEASE-x64.img
# use diskutil list to find the device name of our (inserted) USB drive
diskutil list
# in this case it's "/dev/disk2"
sudo dd if=FreeNAS-9.2.1.8-RELEASE-x64.img of=/dev/disk2 bs=64k
We do the following:
- place the USB key in one of the black USB 2 slots, not one of the blue USB 3 slots (USB 3.0 support is available if needed; check the FreeNAS User Guide for more information)
- connect an ethernet cable to the ethernet port that is closest to the blue USB slots
- turn on the machine: it boots from the USB key without needing modified BIOS settings
We see the following screen:
[caption id="attachment_30843" align="alignnone" width="630"] The FreeNAS console. Many basic administration tasks can be performed here, mostly related to configuring the network. As our DHCP server has supplied network connectivity to the FreeNAS, we are able to configure it via the richer web interface[/caption]
We log into our NAS box via our browser: http://nas.nono.com (we have previously created a DNS entry (nas.nono.com), assigned an IP address (10.9.9.80), determined the NAS's ethernet MAC address, and entered that information into our DHCP server's configuration).
[caption id="attachment_30844" align="alignnone" width="522"] Before anything else, we must first set our root password[/caption]
- click the System icon
- Settings → General
- Protocol: HTTPS
- click Save
We are redirected to an HTTPS connection with a self-signed cert. We click through the warnings.
- click the System icon
- System Information → Hostname
- click Edit
- Hostname: nas.nono.com
- click OK
We enable ssh in order to allow us to install the disk benchmarking package (bonnie++). We enable AFP, for that will be our primary filesharing protocol. We also enable iSCSI for our ESXi host. We enable CIFS for good measure (we don't have Windows clients, but we may in the future).
- click the Services icon
- click the SSH slider to turn it on
- click the wrench next to the SSH slider.
- check Login as Root with password
- click OK
- click the AFP slider to turn it on
- click the CIFS slider to turn it on
- click the iSCSI slider to turn it on
We create one big volume. We choose ZFS's RAID-Z2 [2] :
- click the Storage icon
- click Active Volumes tab
- click ZFS Volume Manager
- Volume name: tank
- under Available disks, click + next to 1 - 4.0TB (7 drives, show) (we are ignoring the 512GB SSD for the time being)
- Volume Layout: RaidZ2 (ignore the non-optimal [3] warning)
- click Add Volume
We create user 'cunnie' for sharing:
- From the left hand navbar: Account → Users → Add User
- Username: cunnie
- Full Name: Brian Cunnie
- Password: some-password-here
- Password confirmation: some-password-here
- click OK
ssh [email protected]
mkdir /mnt/tank/big
chmod 1777 !$
exit
- Click the Sharing icon
- select Apple (AFP)
- click Add Apple (AFP) Share
- Name: big
- Path: /mnt/tank/big
- Allow List: cunnie
- Time Machine: checked
- click OK
- click Yes (enable this service)
- switch to Finder
- press cmd-k to bring up Connect to Server dialog
- Server Address: afp://nas.nono.com
- click Connect
- Name: Brian Cunnie
- Password: some-password-here
We use bonnie++ to benchmark our machine for the following reasons:
- it's a venerable benchmark
- it allows easy comparison to Calomel.org's bonnie++ benchmarks
We use a file size of 80GiB to eliminate the RAM cache (ARC) skewing the numbers—we are measuring disk performance, not RAM performance.
ssh [email protected]
# we remount the root filesystem as read-write so that we
# can install bonnie++
mount -o rw /
pkg_add -r bonnie++
# we add root to sudoers because that will allow us
# to run bonnie++ as a _non-root_ user, which it requires.
cat >> /usr/local/etc/sudoers <<EOF
root ALL=(ALL) NOPASSWD: ALL
EOF
# create a temporary directory to hold bonnie++'s
# scratch files
mkdir /mnt/tank/tmp
chmod 1777 !$
# 9 series of runs, 8 jobs in parallel, median value
# kick off 8 jobs (8 cores) to minimize CPU-bottleneck
foreach I (0 1 2 3 4 5 6 7 8)
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
( sudo -u nobody bonnie++ -m "RAIDZ2_8C" -r 8192 -s 81920 -d /mnt/tank/tmp/ -f -b -n 1; date ) >> /mnt/tank/tmp/bonnie.txt &
wait
sleep 60
end
#
The raw bonnie++ output is available on GitHub. The summary (median scores): (w=993MB/s, r=1882MB/s, IOPS=884)
The IOPS (~884) are respectable. Although well more than four times as fast as a 15k RPM SAS Drive (~175-210 IOPS), it's still much lower than a high-end SSD offers (e.g. an Intel X25-M G2 (MLC) posts ~8,600). We feel that using the SSD as a second-level cache could improve our numbers dramatically.
We never put the SSD to use. We plan to use the SSD as both a L2ARC (ZFS read cache) and a ZIL SLOG (a ZFS write cache for synchronous writes).
Our NAS's performance is severely limited by the throughput of its gigabit interface on its sequential reads and writes. Our ethernet interface is limited to ~111 MB/s, but our sequential reads can reach almost seventeen times that (1882MB/s).
We can partly address that by using LACP (aggregating the throughput of the 4 available ethernet interfaces).
The fans in the case were noisier than expected: not clicking or tapping, but a discernible hum.
The system runs cool. With a room temperature of 23.3°C (74° Fahrenheit), these are the readings we recorded after the machine had been powered on for 12 hours:
- CPU: 30°C
- System: 32°C
- Peripheral: 31°C
- DIMMA1: 30°C
- DIMMA2: 32°C
- DIMMB1: 33°C
- DIMMB2: 34°C
No component is warmer than body temperature. We are especially impressed with the low CPU temperature, doubly so that it's passively cooled.
It would be nice if the system had a hot-swap feature. It doesn't. In the event we need to replace a drive, we'll be powering the system down.
FreeNAS does the right thing: it creates 4kB-aligned pools by default (instead of 512B-aligned pools). This should be more efficient, though results vary. See Calomel.org's section, Performance of 512b versus 4K aligned pools, for an in-depth discussion and benchmarks.
In our follow-on post, we tune our ZFS fileserver for optimal iSCSI performance.
1 These numbers are not terribly exact. To overcome being artificially limited by the CPU, we were forced to run 8 benchmarks in parallel. This had two serious shortcomings:
- The individual benchmarks weren't synchronized—benchmarks finished as much as ten seconds apart. While one benchmark was finishing up its rewriting portion, another had already moved on to the reading portion, causing a distortion in the usage pattern.
- The numbers weren't derived by summing the numbers from a single run of 8 benchmarks. Instead, all the benchmark results were aggregated, and the median 8 values were taken and summed.
For those interested in the raw benchmark data, they can be seen here.
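The aggregation described in the second bullet (sort all results, keep the median 8 values, sum them) can be sketched as follows. This is an illustration in POSIX sh and awk, not necessarily the script used to produce the published numbers; the function name `median_k_sum` is hypothetical, and it assumes one numeric result per line on stdin.

```shell
# median_k_sum K
# Read one number per line on stdin, sort them numerically, keep the K
# values in the middle of the sorted list, and print their sum.
# Assumes at least K values are supplied.
median_k_sum() {
    k="$1"
    sort -n | awk -v k="$k" '
        { v[NR] = $1 }
        END {
            start = int((NR - k) / 2) + 1
            for (i = start; i < start + k; i++) sum += v[i]
            print sum
        }'
}

# Example: ten results, keep the middle four (4 + 5 + 6 + 7)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | median_k_sum 4
```

With 9 series of 8 parallel jobs there are 72 values per metric, and K would be 8.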
2 We feel that double-parity RAID is a safer approach than single-parity (e.g. RAID 5). Adam Leventhal, in his article for the ACM, describes the challenges that large capacity disks pose to a RAID 5 solution. A NetApp paper states, "… in the event of a drive failure, utilizing a SATA RAID 5 group (using 2TB disk drives) can mean a 33.28% chance of data loss per year" (italics ours).
3 We aren't concerned about a non-optimal configuration (i.e. the number of disks (less parity) should optimally be a power of 2)—we have reservations about the statement, "the number of disks should be a power of 2 for best performance". A serverfault post states, "As a general rule, the performance boost of adding an additional spinddle [sic] will exceed the performance cost of having a sub-optimal drive count". Also, we are enabling compression on the ZFS volume, which means that the stripe size will be variable rather than a power of 2 (we are guessing; we may be wrong), which de-couples the stripe size from the disks' block size.
Calomel.org has one of the most comprehensive sets of ZFS benchmarks and good advice for maximizing the performance of ZFS, some of it not obvious (e.g. the importance of a good controller).
This blog post describes how we tuned and benchmarked our FreeNAS fileserver for optimal iSCSI performance.
For most workloads (except ones that are extremely sequential-read intensive) we recommend using L2ARC, SLOG, and the experimental iSCSI kernel target.
Of particular interest is the experimental iSCSI driver, which increased our IOPS 334% and increased our sequential write performance to its maximum, 112MB/s (capped by the speed of our ethernet connection). On the downside, there was a 45% decrease in sequential read speed.
[2014-11-6 We have added a third round of benchmarks]
Using an L2ARC also improved performance (IOPS increased 46%, sequential write improved 14%, and sequential read decreased 4%).
We also experimented with three ZFS sysctl variables, but they were a mixed bag (they improved some metrics to the detriment of others).
Here is the summary of our results in a chart format:
[caption id="attachment_31381" align="alignnone" width="630"] Summary of Benchmark results. Note that Sequential Write and Read use the left axis (MB/s), and that IOPS is measured against the logarithmic right axis. Higher is better.[/caption]
There is no optimal configuration; rather, FreeNAS can be configured to suit a particular workload:
- to maximize sequential write performance, use the experimental kernel iSCSI target and an L2ARC
- to maximize sequential read performance, use the default userland iSCSI target and no L2ARC
- to maximize IOPS, use the experimental kernel iSCSI target, L2ARC, enable prefetching tunable, and aggressively modify two sysctl variables.
We describe the hardware and software configuration in a previous post, A High-performing Mid-range NAS Server. Highlights:
- FreeNAS 9.2.1.8
- Intel 8-core Avoton C2750
- 32GiB RAM
- 7 x 4TB disks
- RAIDZ2
- 512GB SSD (unused)
- 4 x 1Gbe
We use bonnie++ to measure disk performance. bonnie++ produces many performance metrics (e.g. "Sequential Output Rewrite Latency"); we focus on three of them:
- Sequential Write ("Sequential Output Block")
- Sequential Read ("Sequential Input Block")
- IOPS ("Random Seeks")
We use an 80G file for our bonnie++ tests. We store the raw output of our benchmarks in a GitHub repo.
Our FreeNAS server provides storage (data store) via iSCSI to VMs running on our ESXi server. This post does not cover setting up iSCSI and accessing it from ESXi; however, Steve Erdman has written such a blog post, "Connecting FreeNAS 9.2 iSCSI to ESXi 5.5 Hypervisor and performing VM Guest Backups"
Although we have measured the native performance of our NAS (i.e. we have run bonnie++ directly on our NAS, bypassing the limitation of our 1Gbe interface), we don't find those numbers terribly meaningful. We are interested in real-world performance of VMs whose data store is on the NAS and which is mounted via iSCSI.
We want to know what our upper bounds are; this will be important as we progress in our tuning: once we hit a theoretical maximum for a given metric, there's no point in additional tuning for that metric.
The 1Gb ethernet interface places a hard limit on our sequential read and write performance: 111MB/s.
For comparison we have added the performance of our external USB hard drive (the performance numbers are from a VM whose data store resided on a USB hard drive). Note that the external USB hard drive is not limited by gigabit ethernet throughput, and thus is able to post a Sequential Read benchmark that exceeds the theoretical maximum.
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
Theoretical Maximum | 111 | 111 | |
External 4TB USB3 7200 RPM | 33 | 159 | 121.8 |
The raw benchmark data is available here.
L2ARC is ZFS's secondary read cache (ARC, the primary cache, is RAM-based).
Using an L2ARC can increase our IOPS "8.4x faster than with disks alone."
ZIL (ZFS Intent Log) SLOG (Separate Intent Log) is a "… separate logging device that caches the synchronous parts of the ZIL before flushing them to slower disk".
Typically an SSD drive is used as secondary cache; we use a Crucial MX100 512GB SSD.
We will implement L2ARC and SLOG and determine the improvement.
L2ARC sizing is dependent upon available RAM (L2ARC exacts a price in RAM), available disk (we have a 512GB SSD), and average buffer size (the L2ARC requires 40 bytes of RAM for each buffer, and buffer sizes vary).
We first determine the amount of RAM we have available:
ssh [email protected]
# determine the amount of RAM available
top -d 1 | head -6 | tail -3
Mem: 250M Active, 3334M Inact, 26G Wired, 236M Cache, 467M Buf, 929M Free
ARC: 24G Total, 2073M MFU, 20G MRU, 120K Anon, 1303M Header, 574M Other
Swap: 14G Total, 23M Used, 14G Free
We see we have 5GiB RAM at our disposal for our L2ARC (32GiB total - 1GiB Operating System - 26GiB "Wired" = 5GiB L2ARC).
We arrive at our L2ARC sizing experimentally: we note that when we use a 200GB L2ARC, we see that ~200MiB of swap is used. We prefer not to use swap at all, so we know that we want to reduce our L2ARC RAM footprint by 200MiB (i.e. instead of 5GiB RAM, we only want to use 4.8GiB). We find that a 190GB L2ARC meets that need.
For our configuration, we need 1GiB of RAM for every 38GB of L2ARC.
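The ratio can be turned into a quick sizing estimate. The following is a rough sketch in POSIX sh and awk; the function name `l2arc_max_gb` is ours for illustration, and the default of 38GB per GiB of RAM is specific to our configuration and workload, so treat it as an assumption for any other system.

```shell
# l2arc_max_gb RAM_GIB [GB_PER_GIB]
# Estimate the largest L2ARC (in GB) that a given amount of spare RAM
# (in GiB) can support. GB_PER_GIB defaults to 38, the ratio we observed.
l2arc_max_gb() {
    awk -v ram="$1" -v ratio="${2:-38}" 'BEGIN { printf "%.0f\n", ram * ratio }'
}

# With 5GiB of RAM available for the L2ARC:
l2arc_max_gb 5
```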
We use this forum post to determine the size of our SLOG:
- The SLOG "must be large enough to hold a minimum of two transaction groups"
- A transaction group is sized by either RAM or time, i.e. "In FreeNAS, the default size is 1/8th your system's memory" or 5 seconds
- Based on 32GiB RAM, our transaction group is 4GiB
- We will triple that amount to 12GB and use that to size our SLOG (i.e. our SLOG will be able to store 3 transaction groups)
We note that we most likely over-spec'ed our SLOG by a factor of 12: "I can't imagine what sort of workload you would need to get your ZIL north of 1 GB of used space".
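The sizing arithmetic above can be sketched as follows. This is a minimal illustration in POSIX sh; the function name `slog_size_gib` is hypothetical, and the 1/8th-of-RAM transaction group size is the FreeNAS default quoted from the forum post, not something we measured.

```shell
# slog_size_gib RAM_GIB TXGS
# Size a SLOG (in GiB) to hold TXGS transaction groups, assuming each
# transaction group is 1/8th of system memory.
slog_size_gib() {
    ram_gib="$1"
    txgs="$2"
    echo $(( ram_gib * txgs / 8 ))
}

# 32GiB of RAM, room for 3 transaction groups:
slog_size_gib 32 3
```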
We use a combination of sysctl and diskinfo to determine our disks:
foreach DISK ( `sysctl -b kern.disks` )
diskinfo $DISK
end
da8 512 16008609792 31266816 0 0 1946 255 63
da7 512 4000787030016 7814037168 4096 0 486401 255 63
da6 512 4000787030016 7814037168 4096 0 486401 255 63
da5 512 4000787030016 7814037168 4096 0 486401 255 63
da4 512 512110190592 1000215216 4096 0 62260 255 63
da3 512 4000787030016 7814037168 4096 0 486401 255 63
da2 512 4000787030016 7814037168 4096 0 486401 255 63
da1 512 4000787030016 7814037168 4096 0 486401 255 63
da0 512 4000787030016 7814037168 4096 0 486401 255 63
We see that da4 is our 512G SSD (and da8 is our 16GB bootable USB stick and the remaining disks are our 4TB Seagates which make up our RAID Z2).
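To make the diskinfo output easier to read, the size column (the third field, in bytes) can be converted to GB. The following is a small sketch in POSIX sh and awk; the function name `disk_sizes_gb` is ours for illustration, and it assumes the diskinfo column layout shown above.

```shell
# disk_sizes_gb: read diskinfo output on stdin and print each device
# name followed by its size in GB (decimal, rounded to the nearest GB).
disk_sizes_gb() {
    awk '{ printf "%s %.0fGB\n", $1, $3 / 1000000000 }'
}

# Example with two of the lines above:
printf '%s\n' \
    'da8 512 16008609792 31266816 0 0 1946 255 63' \
    'da4 512 512110190592 1000215216 4096 0 62260 255 63' | disk_sizes_gb
```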
We use gpart to initialize da4. Then we create a 190GB partition which we align on 4kB boundaries (-a 4k):
gpart create -s GPT da4
da4 created
gpart add -s 190G -t freebsd-zfs -a 4k da4
da4p1 added
Create a 12GB SLOG:
gpart add -s 12G -t freebsd-zfs -a 4k da4
We add our new L2ARC and SLOG partitions:
zpool add tank cache da4p1
zpool add tank log da4p2
zpool status
We perform 7 runs and take the median values for each metric (e.g. Sequential Write). The L2ARC provides us a 14% increase in write speed, a 4% decrease in read speed, and a 46% increase in IOPS.
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
200G L2ARC | 67 | 71 | 145.7 |
Theoretical Maximum | 111 | 111 | |
The raw benchmark data can be seen here.
FreeNAS 9.2.1.6 includes an experimental kernel-based iSCSI target. We enable the target and reboot our machine.
- We browse to our FreeNAS server: https://nas.nono.com
- log in
- click the Services icon at the top
- click the "wrench" icon
[caption id="attachment_31317" align="alignnone" width="564"] To modify the iSCSI services settings and enable the experimental kernel driver, click the wrench icon[/caption]
- check the Enable experimental target checkbox
[caption id="attachment_31318" align="alignnone" width="630"] Check the "Enable experimental target" to activate the kernel-based iSCSI target [/caption]
- click Save
- we see a message: Enabling experimental target requires a reboot. Do you want to proceed?. Click Yes
- our FreeNAS server reboots
After reboot we notice that our iSCSI service has been disabled (bug?). We re-enable it:
- We browse to our FreeNAS server: https://nas.nono.com
- log in
- click the Services icon at the top
- click the iSCSI slider so it turns on
We perform 9 runs and take the median values for each metric (e.g. Sequential Write). The experimental iSCSI target provides us a 67% increase in write speed (hitting the theoretical limit), a 45% decrease in read speed, and a 334% increase in IOPS.
The decrease in read speed is curious; we hope it's a FreeBSD bug that has been addressed in 10.0.
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
200G L2ARC | 67 | 71 | 145.7 |
L2ARC + Experimental kernel-based iSCSI target | 112 | 39 | 633.0 |
Theoretical Maximum | 111 | 111 | |
The raw benchmark data is available here.
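The percentage changes quoted for each round are computed relative to the previous row of the table. A minimal sketch of that calculation in POSIX sh and awk follows; the function name `pct_change` is ours for illustration.

```shell
# pct_change BEFORE AFTER
# Print the percentage change from BEFORE to AFTER, rounded to the
# nearest whole percent (negative means a decrease).
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

# L2ARC row versus L2ARC + experimental kernel-based iSCSI target row:
pct_change 67 112      # sequential write
pct_change 71 39       # sequential read
pct_change 145.7 633   # IOPS
```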
We want to aggressively use the L2ARC. The FreeBSD ZFS Tuning Guide suggests focusing on 3 tunables:
vfs.zfs.l2arc_write_boost
vfs.zfs.l2arc_write_max
vfs.zfs.l2arc_noprefetch
We ssh into our NAS to determine the current settings:
ssh [email protected]
sysctl -a | egrep -i "l2arc_write_max|l2arc_write_boost|l2arc_noprefetch"
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
The FreeBSD ZFS Tuning Guide states, "Modern L2ARC devices (SSDs) can handle an order of magnitude higher than the default". We decide to increase the amount from 8 MB/s to 201 MB/s (we increase it 25 times):
- on the left hand navbar, navigate to System → Sysctls → Add Sysctl
- Variable: vfs.zfs.l2arc_write_max
- Value: 201001001
- Comment: 201 MB/s
- click OK
- click Add Sysctl
- Variable: vfs.zfs.l2arc_write_boost
- Value: 201001001
- Comment: 201 MB/s
- click OK
vfs.zfs.l2arc_noprefetch is interesting: it allows us to cache streaming data. Unfortunately, it must be set before the ZFS pool is imported (i.e. it can't be set in /etc/sysctl.conf; it must be set in /boot/loader.conf). That means we must set this variable as a tunable rather than as a sysctl:
- on the left hand navbar, navigate to System → Tunables → Add Tunable
- Variable: vfs.zfs.l2arc_noprefetch
- Value: 0
- Comment: disable no_prefetch (enable prefetch)
- click OK
Reboot (browse the lefthand navbar of the web interface and click Reboot). Click Reboot when prompted.
We run our tests and note the following results:
- Sequential write performance drops 35% from 112 MB/s to 73
- Sequential read performance increases 36% from 39 MB/s to 53
- IOPS performance more than doubles (120%) from 633 to 1392.
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
200G L2ARC | 67 | 71 | 145.7 |
L2ARC + Experimental kernel-based iSCSI target | 112 | 39 | 633.0 |
L2ARC + Experimental kernel-based iSCSI target + tuning | 73 | 53 | 1392 |
The raw benchmark data can be seen here. The results are a mixed bag: we like the improved read and IOPS performance, but we're dismayed by the drop in the write performance, which is our most important metric (our workload is write-intensive).
We disable pre-fetch.
- vfs.zfs.l2arc_noprefetch=1
This requires us to reboot our NAS to take effect.
We also use a more rigorous approach to setting the two remaining variables by determining the write throughput of the SSD by copying a large file to a raw partition; we determine that the SSD can sustain a write throughput of 193 MB/s:
ssh [email protected]
# add a 20G for benchmark testing
gpart add -s 20G -t freebsd-zfs -a 4k da4
da4p3 added
# let's copy an 11G file to benchmark the SSD raw write speed
dd if=/mnt/tank/big/iso/ML_2012-08-27_18-51.i386.hfs.dmg of=/dev/da4p3 bs=1024k
dd: /dev/da4p3: Invalid argument
11460+1 records in
11460+0 records out
12016680960 bytes transferred in 62.082430 secs (193560094 bytes/sec)
# our SSD write throughput is 193 MB/s
# remove the unneeded device
gpart delete -i 3 da4
- vfs.zfs.l2arc_write_max=193560094
- vfs.zfs.l2arc_write_boost=193560094
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
200G L2ARC | 67 | 71 | 145.7 |
L2ARC + Experimental kernel-based iSCSI target | 112 | 39 | 633.0 |
L2ARC + Experimental kernel-based iSCSI target + tuning | 73 | 53 | 1392 |
L2ARC + Experimental kernel-based iSCSI target + tuning Round 2 | 65 | 35 | 1178 |
This has been a step backwards for us—every metric performed worse. We suspect that disabling pre-fetch was a mistake.
The raw data is available here.
We re-enable pre-fetch:
- vfs.zfs.l2arc_noprefetch=0
We drop the value of the remaining sysctls from 193 MB/s to 81 MB/s (the FreeBSD Tuning Guide suggested an order-of-magnitude increase from the default of 8 MB/s; we increase by 10× (one order of magnitude) rather than by 23×):
- vfs.zfs.l2arc_write_max=81920000
- vfs.zfs.l2arc_write_boost=81920000
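To compare a candidate value against the default, it helps to express it as a multiple of the default 8388608 bytes (the value sysctl reported earlier). The following is a minimal sketch in POSIX sh and awk; the function name `l2arc_write_multiple` is ours for illustration.

```shell
# l2arc_write_multiple VALUE [DEFAULT]
# Print VALUE as a multiple of DEFAULT (both in bytes), to one decimal
# place. DEFAULT is 8388608, the value our NAS reported before tuning.
l2arc_write_multiple() {
    awk -v v="$1" -v d="${2:-8388608}" 'BEGIN { printf "%.1f\n", v / d }'
}

l2arc_write_multiple 81920000    # the Round 3 setting
l2arc_write_multiple 193560094   # the Round 2 setting
```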
We run our benchmark again:
Configuration | Sequential Write (MB/s) (higher is better) | Sequential Read (MB/s) (higher is better) | IOPS (higher is better) |
---|---|---|---|
Untuned | 59 | 74 | 99.8 |
200G L2ARC | 67 | 71 | 145.7 |
L2ARC + Experimental kernel-based iSCSI target | 112 | 39 | 633.0 |
L2ARC + Experimental kernel-based iSCSI target + tuning | 73 | 53 | 1392 |
L2ARC + Experimental kernel-based iSCSI target + tuning Round 2 | 65 | 35 | 1178 |
L2ARC + Experimental kernel-based iSCSI target + tuning Round 3 | 97 | 36 | 1633 |
This configuration has achieved the highest IOPS score of any we've benchmarked—a sixteen-fold increase from the untuned configuration. It approaches the IOPS of an SSD (~8,600).
This also posts the second-highest sequential write throughput, a quite-respectable 97 MB/s.
The sequential read is disappointing; the only good thing to say is that it's not the absolute worst (but it is the second-worst). To re-iterate, we hope that this is a kernel iSCSI bug that's addressed in a future release of FreeBSD.
The raw data is available here.
There's no "magic bullet" to ZFS performance tuning that improves all metrics.
For most workloads (except ones that are extremely sequential-read intensive) we recommend using L2ARC, SLOG, and the experimental iSCSI kernel target.
We chose the final configuration (best IOPS, second-best sequential write, second-worst sequential read) for our setup. Our workload is write-intensive and IOPS-intensive.
- Subtle inconsistencies: some tests were run with a 200GB L2ARC and no SLOG, and later tests were run with a 190GB L2ARC and a 12GB SLOG
- Not completely-dedicated NAS: the NAS was not completely dedicated to running benchmarks—it was also acting as a Time Machine backup during the majority of the testing. It is possible that some of the numbers would be slightly higher if it was completely dedicated
- The size of the test file (80G) was very specific: it was meant to exceed the ARC but not exceed the L2ARC. We ran three tests with file sizes smaller than the ARC, and the results (unsurprisingly) were excellent (4GB file, 8GB, and 16GB file)
- The test took almost 24 hours to run; this impeded our ability to run as many tests as we would have liked
- We would have liked to have run some of the benchmarks a second time to eliminate the possibility that our testbed changed (e.g. intense benchmarking may have caused the SSD performance to diminish towards the end of the testing)
- We would have liked to have been able to eliminate the limitation of the 1 gigabit ethernet link; it would be interesting to see the performance with a 10Gbe link
- The scope of the tests was very narrow (e.g. iSCSI-only, a very specific server hardware configuration). It would be overreaching to generalize these numbers to all ZFS fileservers or even all protocols (e.g. AFP, CIFS).
I'm seeing the following errors in dmesg:
MCA: Bank 10, Status 0x8c000047000800c1
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50663, APIC ID 0
MCA: CPU 0 COR (1) MS channel 1 memory error
MCA: Address 0x12a65a9540
MCA: Misc 0x908404000400e8c
mcelog was not useful, identifying it as a CPU error:
Hardware event. This is not a software error.
CPU 0 BANK 10
MISC 908404000400e8c ADDR 12a65a9540
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS 8c000047000800c1 MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 86
28 of them over a 12 hour period:
dmesg | grep ^MCA: | sort | uniq -c
3 MCA: Address 0x12a65a9540
25 MCA: Address 0x12a65a9580
3 MCA: Bank 10, Status 0x8c000047000800c1
12 MCA: Bank 7, Status 0x8c00004000010091
13 MCA: Bank 7, Status 0xcc00008000010091
3 MCA: CPU 0 COR (1) MS channel 1 memory error
12 MCA: CPU 0 COR (1) RD channel 1 memory error
13 MCA: CPU 0 COR (2) OVER RD channel 1 memory error
28 MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
11 MCA: Misc 0x15020a086
7 MCA: Misc 0x150222286
2 MCA: Misc 0x15024a486
2 MCA: Misc 0x15030b086
3 MCA: Misc 0x150323286
3 MCA: Misc 0x908404000400e8c
28 MCA: Vendor "GenuineIntel", ID 0x50663, APIC ID 0
Taking the output of dmidecode, it appears the DIMM, based on the starting address and the ending address, is the Micron with serial number 16F4A457:
Handle 0x001F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0019
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2133 MT/s
Manufacturer: Micron
Serial Number: 16F4A457
Asset Tag: (Date:17/19)
Part Number: 36ASF4G72PZ-2G1B1
Rank: 2
Configured Clock Speed: 2133 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 0.003 V
Handle 0x0020, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x017FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x001F
Memory Array Mapped Address Handle: 0x001A
Partition Row Position: 1
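Comparing an MCA address against a DIMM's mapped range by eye is error-prone, so here is a small sketch of the comparison in POSIX sh and awk. The function names `addr_in_range` and `addr_gib` are mine for illustration; the sketch assumes a shell with 64-bit arithmetic and takes the hex values exactly as dmesg and dmidecode print them.

```shell
# addr_in_range ADDR START END
# Succeed (exit status 0) if ADDR lies within START..END inclusive.
# Arguments may be hex (0x...) or decimal.
addr_in_range() {
    addr=$(( $1 ))
    start=$(( $2 ))
    end=$(( $3 ))
    [ "$addr" -ge "$start" ] && [ "$addr" -le "$end" ]
}

# addr_gib ADDR
# Print ADDR converted to GiB, to one decimal place.
addr_gib() {
    awk -v b="$(( $1 ))" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Check the most frequent MCA address against the range of handle 0x001F:
addr_gib 0x12a65a9580
if addr_in_range 0x12a65a9580 0x01000000000 0x017FFFFFFFF; then
    echo "address is inside this DIMM's mapped range"
else
    echo "address is outside this DIMM's mapped range"
fi
```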
The address, 0x12a65a9580, is 80100365696 in decimal, and is roughly 75GiB, which means the error is probably somewhere in the first two DIMMs (the Crucial?).