What we need to know about Applications to design ZFS-based SAN

The importance of knowing the applications in the environment cannot be overstated. Without understanding the complex relationships between applications, their peak usage times and their overall usage patterns, we cannot reliably scope and design a storage solution that will meet the needs of the environment.

Our storage system has to respond correctly to the needs of the applications in our environment, and for that to happen we have to understand exactly what our applications do and how they do it. First we have to accept that, by their very nature, SANs aggregate many different workloads onto shared storage, and because of this mixing it is extremely difficult to tell how the storage will behave at any specific point in time. In all but a few environments we deploy SANs to centralize, pool and unify storage, taking full advantage of being able to easily share data, reduce the waste of local disks sitting mostly idle in individual machines, reduce heat and power consumption, and increase data redundancy and protection. As we do this, though, we have to remain cognizant of the fact that applications that used to run on local storage directly attached to a server now run against disks at the other end of a wire, competing with other applications that are just as hungry, if not more so, for attention from those disks.

We sometimes wonder why an application that ran extremely well on just a few disks attached to a server consistently under-performs after being moved to a SAN with many more disks and large caches. Ultimately we realize that we took for granted the fact that the local disks were completely dedicated to this application, while on the SAN it becomes one of many applications in a big pond competing for time from the disks. It is not possible to tell the SAN to prefer one application over another, at least not in any conventional sense. There are many gears and knobs in ZFS, but they can only go so far when the hardware configuration simply does not match what it is expected to do.

It is not enough to understand how one or two applications work if we expect ten or more applications to share our storage solution. Even nominally identical applications have unique footprints in different environments, and those footprints may differ tremendously from the application's typical behaviour elsewhere. It is not meaningful to simply assume, based on some reference document, that we know what our applications will do and what their IOPS requirements are. We need to analyze each application carefully, first in isolation against a blank background, to understand what it does with regard to I/O when there is absolutely no contention, i.e. on local disks. For example, how much I/O does the application require at any given moment? Are there times when I/O requirements change rapidly, fluctuating because of other applications in the environment or because of actions users take, such as heavy reporting at specific times of the week or month? A contrived example is a batch processing application with a nightly load stage lasting several hours that requires far more IOPS than any other task in the application. We always have to look at the worst case scenarios and make our decisions based on them. This is not always easy, but it is feasible to derive best case and worst case numbers for an application, whether for I/O requirements, daily capacity requirements and so on, and to extrapolate reasonably well the typical worst case scenario with which the SAN will have to deal.
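As a rough illustration of turning raw measurements into best and worst case figures, the sketch below summarizes a set of per-second IOPS samples for a single application. The sample values, the collection method and the percentile choices are all hypothetical; it simply shows the kind of summary worth producing for each application before it moves onto shared storage.

```python
# Hypothetical per-second IOPS samples for one application, collected
# while it ran on dedicated local disks (e.g. exported from iostat logs).
import statistics

def summarize_iops(samples):
    """Return best-case, typical and worst-case IOPS figures for one application."""
    samples = sorted(samples)
    q = statistics.quantiles(samples, n=100)   # 99 cut points; q[94] ~ 95th percentile
    return {
        "min":  samples[0],                    # best case: quietest observed second
        "mean": statistics.mean(samples),      # typical load
        "p95":  q[94],                         # sustained busy periods
        "p99":  q[98],                         # e.g. the nightly batch load stage
        "max":  samples[-1],                   # absolute worst case observed
    }

# Example: mostly idle, with a nightly load stage driving ~1500 IOPS.
samples = [120] * 3000 + [400] * 500 + [1500] * 120
print(summarize_iops(samples))
```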

As we research our applications we want to get some idea of whether each one is biased towards reads or writes, as this will play a role in our final solution. Read-biased environments benefit more from read cache than write-biased environments do, for obvious reasons. Few environments are extremely biased in one direction or the other, but understanding the bias helps us decide how much caching to buy on day one and how much we may add over time to further improve performance of the SAN. Write-heavy environments may be good candidates for multiple dedicated ZIL devices, perhaps two or more mirrors of high-performance SSDs, and perhaps 10K or 15K drives instead of lower-performing 7.2K drives to further improve write latency. Writes always require real IOPS from the disks, whereas reads may be served from cache and require no real disk I/O, or only a small amount of it, to satisfy the request.
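The arithmetic behind that last point can be sketched as below. This is a deliberately simplified model with made-up figures: it ignores how ZFS batches writes into transaction groups and handles the ZIL, and only shows why a high cache hit ratio shields reads, but never writes, from the disks.

```python
def disk_iops_after_cache(total_iops, read_fraction, cache_hit_ratio):
    """Rough estimate of the IOPS that actually reach the disks.

    Writes always hit the disks; reads only do so when they miss the cache.
    Illustrative only -- real behaviour depends on the caching layers involved.
    """
    reads = total_iops * read_fraction
    writes = total_iops * (1 - read_fraction)
    return writes + reads * (1 - cache_hit_ratio)

# A 5000 IOPS workload that is 80% reads with a 90% cache hit ratio
# needs roughly 1400 IOPS from the disks...
print(disk_iops_after_cache(5000, 0.80, 0.90))   # -> 1400.0
# ...while the same 5000 IOPS at 80% writes still needs about 4100.
print(disk_iops_after_cache(5000, 0.20, 0.90))   # -> 4100.0
```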

Once we have a good idea of how individual applications behave, we need to look at the bigger picture to understand what the aggregated figures are. The goal is to get insight into operating averages as well as the outliers, those times when IOPS are unusually high. Averages are commonly used to make decisions, but it is important to remain aware that there may be seasonal changes in the environment that result in far larger requirements, even if only for short periods, than the average figures suggest. The more IOPS data we collect, the more accurate our averages will be.
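A minimal sketch of that aggregation step follows, using hypothetical per-application IOPS series aligned on the same sampling intervals. The application names and numbers are invented; the point is only that the SAN must be sized for the combined peak, which can be several times the combined average.

```python
# Hypothetical per-application IOPS time series, one sample per interval,
# all collected over the same window (e.g. a busy business day).
app_series = {
    "oltp_db":   [900, 950, 1200, 3000, 1100],
    "file_srv":  [200, 250,  300,  400,  350],
    "reporting": [ 50,  50, 2500, 2600,  100],   # month-end reporting spike
}

combined = [sum(vals) for vals in zip(*app_series.values())]
average = sum(combined) / len(combined)
peak = max(combined)

print(f"average aggregate IOPS: {average:.0f}")   # what averages alone would suggest
print(f"peak aggregate IOPS:    {peak}")          # what the SAN must actually absorb
```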

Another critical consideration, one which requires understanding of the working set, is whether caching will be effective in our environment. Caching is of course most effective where the same blocks of data are requested over and over. In environments that process unique transactions and continuously handle new data, caching is less effective; but the typical reason for having a SAN is consolidation, which means workloads are aggregated and some may prove more cache-friendly than others. Again, understanding our applications is essential. For example, if we have a large virtual environment and maintain and deploy new machines by cloning or by building from a template, we may find that a small number of blocks are accessed over and over, resulting in a small working set and an extremely high cache hit ratio. At the other extreme, we may be processing large datasets with unique queries and shipping the results somewhere else, away from the SAN. In this scenario we may find that caching helps little, simply because we continually read unique blocks.
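One way to reason about this before buying cache is to replay a block-access trace against a simple cache model. The sketch below uses a plain LRU cache as a rough stand-in for the real caching behaviour, with synthetic traces standing in for the two extremes described above; all sizes and access patterns are assumptions for illustration.

```python
from collections import OrderedDict
import random

def lru_hit_ratio(trace, cache_blocks):
    """Replay a block-access trace against an LRU cache of a given size
    and return the hit ratio. A crude stand-in for real read-cache behaviour."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)            # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)       # evict least recently used
    return hits / len(trace)

random.seed(1)
# Cloned VMs: most reads land on a small set of shared template blocks.
clone_trace = [random.randint(0, 5_000) for _ in range(100_000)]
# Unique analytic queries: reads scattered across a huge dataset.
scan_trace = [random.randint(0, 5_000_000) for _ in range(100_000)]

print(lru_hit_ratio(clone_trace, cache_blocks=10_000))   # close to 1.0
print(lru_hit_ratio(scan_trace, cache_blocks=10_000))    # close to 0.0
```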

In every environment there are a number of applications: some are critical, others less so; some are really I/O hungry with low tolerance for latency, others far less so. While we cannot focus only on the critical applications, we have to start with the most demanding ones and plan for a solution that ensures that, even under the worst case scenario, these applications will have enough capacity and IOPS to continue running. The takeaway is that we should never plan to deploy a SAN, no matter the design, with IOPS capability lower than the baseline requirement of these key consumers.
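As a final back-of-the-envelope check, the sketch below compares the summed worst-case IOPS of the key consumers against a very rough estimate of what a pool of spinning disks can deliver. The per-disk figure, the mirrored write penalty and the application numbers are all assumptions, and the model deliberately ignores caching, ZIL devices and raidz parity costs; it only illustrates the kind of baseline comparison the previous paragraph calls for.

```python
def pool_iops_capability(disks, iops_per_disk=150, write_fraction=0.5, write_penalty=2):
    """Very rough estimate of sustained pool IOPS from spinning disks alone.

    Assumes ~150 IOPS per 10K drive and a mirrored layout where every
    logical write costs `write_penalty` disk operations.
    """
    raw = disks * iops_per_disk
    return raw / (write_fraction * write_penalty + (1 - write_fraction))

# Hypothetical worst-case IOPS figures for the key consumers.
critical_apps_worst_case = {"oltp_db": 3000, "mail": 1200, "vdi": 2500}
required = sum(critical_apps_worst_case.values())

for disks in (24, 48, 72):
    capable = pool_iops_capability(disks)
    status = "OK" if capable >= required else "below baseline"
    print(f"{disks} disks: ~{capable:.0f} IOPS ({status} vs required {required})")
```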