When configuring your RAID controller, you have a choice between hardware and software. As with all parity-based RAID, the two most important considerations are how the parity is computed and the strip size. A hardware controller has the advantage in parity computation, as it usually has a specialized hardware assist that frees up the host processor. On the downside, many hardware controllers impose a 64K minimum strip size. With an array of four data disks, that makes your minimum stripe size 256K, which is not optimal. Look for a controller that supports a minimum strip size of 16K.
The other downside of hardware is the lack of visibility into the controller. Visibility is a requirement for optimization: without it, you can't easily tell when the IO did not match the stripe size, or whether alignment caused additional IO, because it's all handled transparently by the controller. Used correctly, hardware RAID offers the best RAID performance. Software RAID has none of these disadvantages, but adds processing time on the host system, which is minimal with today's processors.
The strip size is the most critical performance decision when configuring an array, but it's also the most misunderstood parameter. Most vendors do a clever job of hiding the resulting stripe size and talk only about strip size. The stripe size is the product of the strip size (the strip/segment/chunk, i.e., the data on one device) and the number of data disks. If your writes match the stripe size, you avoid the read-modify-write, because you are writing the entire stripe. As an example, a four-disk (data drives only) array with a strip size of 16K yields a 64K stripe size. A hardware controller limited to a 64K minimum strip size severely restricts your choice of RAID configurations and increases your writes to the disk as well. Reads are a bit more forgiving, as parity is not accessed during reads.
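The stripe arithmetic above can be sketched in a few lines of Python. The helper names here are hypothetical, purely for illustration of the relationship between strip size, data-disk count, and full-stripe writes:

```python
def stripe_size(strip_kb: int, data_disks: int) -> int:
    """Full-stripe size in KB: strip size times the number of data disks."""
    return strip_kb * data_disks

def is_full_stripe_write(offset_kb: int, length_kb: int,
                         strip_kb: int, data_disks: int) -> bool:
    """True when a write covers whole, aligned stripes, so no parity
    read-modify-write is needed."""
    stripe = stripe_size(strip_kb, data_disks)
    return offset_kb % stripe == 0 and length_kb % stripe == 0

# Four data disks with 16K strips give the 64K stripe from the text.
print(stripe_size(16, 4))                  # 64
print(is_full_stripe_write(0, 64, 16, 4))  # True
# A controller forcing 64K strips makes the stripe 256K; the same 64K
# write is now a partial-stripe write and triggers read-modify-write.
print(is_full_stripe_write(0, 64, 64, 4))  # False
```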
The most important factor when configuring RAID is understanding your application. You will optimize based on the size and kind of IO done against the different files.
Here are the basics:
By allocating multiple smaller RAID volumes instead of a single large volume, you can better match the IO requirements of the chosen files, and in particular better match the stripe size to the size of the writes.
Let's use the two primary sets of files in MySQL as an example; their IO sizes differ. The redo log does almost exclusively sequential writes, while the database files do random reads and writes.
The redo log writes are single-threaded and typically 128K bytes. The log files are relatively small and have a relatively short life. A good choice for the redo log is RAID 1: the data is protected by mirroring, which doubles the space requirement but avoids the performance penalties of a parity-based configuration.
The redo log is the most latency-sensitive of all the files used in MySQL. Because it is single-threaded, we want a device that gives good performance with one thread. For this reason, a mirrored SSD/NVMe-based device is a good choice. The stripe size should be 128K, matching the write size. After choosing the strip size, you select the other properties. These options are typical of most HW RAID controllers and may not apply if you are using SW RAID. The read policy should be normal (read-ahead): when the log is read, it's read sequentially, and reading ahead should increase performance. The write policy should be write back with BBU. This lets the controller acknowledge the write immediately while it has battery backup and fall back to write through if the BBU is unavailable, as we don't want to lose critical data in a power failure.
The I/O policy should be disabled. This policy applies only to reads, and since this is a log disk, once the data is read it likely won't be read again. The disk cache policy should be set to enabled, provided any cache on the drive remains consistent through a power failure.
With the database files, the concern is not latency but throughput, or the total IOPS being processed. Because the data has a long life, cost/GB should also be a consideration. RAID 5/50 or 6/60 is a better choice for the database files.
The next factor is the IOPS requirement of the workload. Because you will have multiple reads and writes in flight, you can overlap IOs. Depending on the needed IOPS rate, you have a choice between SSD and HDD. SSD is faster, but HDD is still significantly cheaper for the same capacity. Since the database files are neither latency-sensitive nor single-threaded like the redo log, you have the option of using multiple HDDs and spreading the IO across many spindles. The IOPS of a single SSD is more than 10x that of a single HDD.
HDD is still cheaper when it comes to cost/GB. Many databases' IOPS requirements can be met by HDDs, so don't get carried away going SSD-only; the workload and capacity will drive your decision. The database file IO is random, consisting of 64K-byte writes and 16K-byte reads, so a stripe size of 64K is the optimal choice. The read policy should be no read-ahead: because you are accessing the data randomly, there is little to be gained by reading ahead.
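The SSD-versus-spindles trade-off above can be sketched with some rough arithmetic. The per-device rates here are illustrative assumptions, not vendor figures: on the order of 180 random IOPS for an enterprise HDD and 20,000 for a typical SATA SSD.

```python
import math

HDD_IOPS = 180     # assumed random IOPS for one 10K-RPM HDD
SSD_IOPS = 20_000  # assumed random IOPS for one SATA SSD

def drives_needed(target_iops: int, per_drive_iops: int) -> int:
    """How many devices are needed to spread a target IOPS load."""
    return math.ceil(target_iops / per_drive_iops)

target = 5_000  # hypothetical workload requirement
print(drives_needed(target, HDD_IOPS))  # 28 spindles to spread the load
print(drives_needed(target, SSD_IOPS))  # 1 SSD covers it alone
```

Whether 28 spindles or one SSD is cheaper then comes down to the capacity you also need, which is the cost/GB point made above.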
For the write policy, the best choice is write back with BBU; let the RAID controller acknowledge writes when they hit the controller rather than waiting for the disk. The I/O policy should be cached, and the disk cache policy should be enabled for SSDs and disabled for HDDs: if a power failure hits with data in an HDD's cache, that data may be lost.
Another key decision is choosing between the different RAID configurations. This choice is really more about reliability than performance. With the drives available today, RAID is not always the answer. For example, a large RAID 5 built from today's high-capacity drives can cause problems: a common one is that a RAID 5 rebuild encounters an unrecoverable read error (URE). Common consumer-grade HDDs have a theoretical URE rate of one error per 10^14 bits, or roughly one failure for every twelve terabytes of reads. That translates to a 12-terabyte (3 x 4TB drives) array having a 73 percent chance of hitting a URE during a rebuild and failing. Moving to an enterprise drive increases the chance of success to 88%. Either rate is alarming, as customers are looking for 100%. Most people are not aware of this issue.
Typical unrecoverable read error rates are roughly one per 10^14 bits read for consumer-grade HDDs, one per 10^15 for enterprise HDDs, and one per 10^16 or better for enterprise SSDs.
A handy, simple tool is available at http://www.raid-failure.com/raid5-failure.aspx or http://magj.github.io/raid-failure/. One way to deal with this problem is to reduce the number and/or size of the drives in a RAID 5/6 array to no more than four drives, and then stripe across multiple RAID 5/6 arrays to form a RAID 50/60 array.
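The rebuild-failure odds can be approximated with a simple per-bit model: the probability of at least one URE while reading N bytes at a given error rate. Note this is a sketch; the calculators linked above make slightly different assumptions, so the results will not match the 73%/88% figures in the text exactly.

```python
def rebuild_ure_probability(bytes_read: float, ure_rate_bits: float) -> float:
    """Chance of hitting at least one URE while reading `bytes_read`,
    given one error per `ure_rate_bits` bits read."""
    bits = bytes_read * 8
    p_clean = (1 - 1 / ure_rate_bits) ** bits  # chance every bit reads cleanly
    return 1 - p_clean

TB = 1e12
# Reading 12 TB during a rebuild, consumer (10^14) vs enterprise (10^15):
print(round(rebuild_ure_probability(12 * TB, 1e14), 2))  # ~0.62
print(round(rebuild_ure_probability(12 * TB, 1e15), 2))  # ~0.09
```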
An SSD's life expectancy is based on the number of writes. If you aren't matching your writes to your stripe, you are increasing the number of writes to the drive. With hardware RAID controllers that limit you to a 64K strip size, you increase the likelihood of read-modify-writes; in addition to paying a performance penalty, you are shortening the usable life of the SSD. A set of enterprise SSDs configured as RAID 0 is many times more reliable than a RAID 6 configuration of HDDs. If you are using replication, you might consider RAID 0 with SSDs: if you have a failure, you can fail over to the slave. If you take this approach, be diligent about taking more frequent backups. One-hour increments are recommended, so you will not lose more than an hour's worth of data if both systems were to fail.
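The write-amplification effect described above can be made concrete with a rough model of a parity array: an aligned full-stripe write costs the data strips plus one parity strip, while a partial-stripe write pays a read-modify-write, rewriting both data and parity for each strip it touches. This is a simplified sketch that ignores controller caching.

```python
def writes_per_logical_kb(write_kb: int, strip_kb: int, data_disks: int) -> float:
    """Physical KB written per logical KB for one write (simplified model)."""
    stripe = strip_kb * data_disks
    if write_kb % stripe == 0:
        # Full stripes: data strips plus one parity strip per stripe.
        stripes = write_kb // stripe
        physical = stripes * (stripe + strip_kb)
    else:
        # Partial stripe: each touched strip rewrites new data + new parity.
        strips = -(-write_kb // strip_kb)  # ceiling division
        physical = strips * strip_kb * 2
    return physical / write_kb

# 64K database writes on 4 data disks: 16K strips align (1.25x physical
# writes), while a controller forcing 64K strips doubles the writes (2.0x),
# eating into the SSD's endurance budget.
print(writes_per_logical_kb(64, 16, 4))  # 1.25
print(writes_per_logical_kb(64, 64, 4))  # 2.0
```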