When it comes to creating resiliency to ensure system availability, choosing the right level of redundancy for your virtual infrastructure is crucial, but figuring out exactly what degree of redundancy you need can be tricky. For some organizations, an N+1 redundancy permutation is more than enough to create resiliency and ensure robust performance. Others may require an even greater degree of redundancy and may opt for an N+2 or N+1+1 permutation.
So how exactly can you gauge what level of redundancy is right for your data center? We reached out to industry experts and asked them to share their thoughts on what you should take into consideration when making your decision.
Alastair Cooke, independent analyst and consultant
Redundancy can be built into multiple layers of any application and infrastructure. Generally, the closer the redundancy is to the application, the higher the availability that can be delivered. A dynamic, load balanced web server farm is more available than a single web server in a VM. The issue is that each application has a different resiliency method and tool set. Providing resiliency in the lower hardware and infrastructure levels allows the same tools to be used across multiple applications. The value of application level resiliency needs to be weighed against the cost of managing that resiliency.
Jim O'Reilly, Volanto
Data integrity requires some form of storage redundancy. In the era of spinning disks, that meant RAID arrays, but they suffered drawbacks from the start. Controller boards weren't super reliable, so most arrays used two controllers, which required very expensive dual-port enterprise drives. All of this drove the price of data integrity into the stratosphere.
Most users configured RAID as either a mirror with two copies or as RAID5 with an extra parity drive. This worked well enough until drives got into the terabyte range, where rebuild times for a failed drive became so long that another drive in the RAID set was likely to fail. RAID6, featuring two parity drives, was introduced to solve this problem, but ultimately led to an overall drop in performance.
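The single-parity idea behind RAID5, and the reason rebuild times grow with drive size, can be illustrated with a minimal Python sketch. It assumes simple XOR parity over equal-sized blocks. The helper names (`xor_blocks`, `rebuild_hours`) and the 4 TB and 100 MB/s figures are hypothetical, chosen only for illustration; real controllers rotate parity across drives and rarely sustain a best-case rebuild rate.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Best-case hours to rewrite a whole drive at a sustained rate (decimal units)."""
    seconds = (capacity_tb * 1e12) / (rebuild_mb_per_s * 1e6)
    return seconds / 3600


# Three data blocks plus one parity block, as in a four-drive RAID5 stripe.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive and rebuilding it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])  # True

# Hypothetical figures: a 4 TB drive rebuilt at a sustained 100 MB/s.
print(round(rebuild_hours(4, 100), 1))  # 11.1
```

Every surviving drive must be read in full during that window, which is why a second failure in the same set becomes a real concern as capacities grow.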
Solid-state drives killed the RAID array, simply because RAID controllers couldn't keep up. Nowadays, the sweet spot in the storage industry is the 12-drive 2U box, using an x86 or ARM controller. It doesn't require redundant power or controllers, and doesn't use RAID parity. Instead, it replicates data between boxes, so a whole appliance can die and still be recovered. It often uses 3-way replication and keeps an extra copy at a distant site to protect against disasters. Amazon Web Services' vaunted Simple Storage Service (S3) uses this method.
We're now looking at erasure coding (EC) approaches that add extra code to each stripe of data, similar to RAID, but with as many as 20 data and six EC sectors in each stripe written to storage. Although EC takes a lot of compute power, any 20 of the 26 drives can deliver data. The elegance is that the 26 drives can be spread over many appliances and, in the example given, can tolerate six drives or six appliances failing.
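The trade-off between replication and the 20+6 erasure coding example can be checked with some simple arithmetic. The sketch below assumes a code in which any 20 of the 26 units are enough to recover the data, as described above. The function names are hypothetical and the figures ignore metadata and rebuild traffic.

```python
def replication_profile(copies):
    """Raw-to-usable storage ratio and tolerated losses for N-way replication."""
    return {"overhead": float(copies), "failures_tolerated": copies - 1}


def erasure_profile(data_units, coding_units):
    """Same figures for erasure coding, assuming any `data_units` of the
    total units are sufficient to reconstruct the stripe."""
    total = data_units + coding_units
    return {"overhead": total / data_units, "failures_tolerated": coding_units}


three_way = replication_profile(3)
ec_20_6 = erasure_profile(20, 6)

print(three_way)  # {'overhead': 3.0, 'failures_tolerated': 2}
print(ec_20_6)    # {'overhead': 1.3, 'failures_tolerated': 6}
```

Under these assumptions, the 20+6 layout stores 1.3 bytes of raw capacity per byte of data and survives six losses, while 3-way replication stores 3 bytes per byte and survives two. The cost is the extra compute needed to encode and reconstruct.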
So what should you use? If speed is the issue, replication is your best bet. In fact, replicating to two appliances or servers is the option of choice. Conversion of colder data to secondary storage involves creating the ECs -- in the background -- and then writing the stripes out using Ceph or some other software.
Many all-flash arrays build in EC. Data written to them is journaled in a mirror file, and then processed to be placed in a permanent space. This is where EC, together with data compression, can be used to save space.
Object storage using the representational state transfer (REST) access method tends to use replication, just as in S3, but the same idea of journaling and background processing to erasure-coded data is becoming more popular since it uses about half the storage space.
One thing is certain: data redundancy is an absolute requirement for almost all computing. Not providing it means jobs must restart from scratch and, of course, important data may be lost.
Brian Kirsch, Milwaukee Area Technical College
Redundancy poses a particular problem for many organizations; it's absolutely necessary, but exactly what level of redundancy do you really need? Years ago, I heard a CIO state that he wanted "redundant everything" for a tier-one application, but when he saw the price tag for the second storage area network and fiber network, the "redundant everything" direction changed dramatically. In today's data center, redundancy needs to align with the business' goals and costs. It's essential to know what level of redundancy your business needs, because virtual infrastructure redundancy can cover a very wide range of technologies and costs.
For many virtualized infrastructures, traditional hardware redundancies with additional network and power connections serve as the baseline. These are often combined with N+1 capacity on the hosts, which leaves room for one host to be down for a failure or a maintenance window. While it's possible to increase this ratio for greater redundancy, you'll be faced with underutilized resources.
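The cost of moving from N+1 to N+2 shows up as a lower ceiling on average host utilization. The following is a rough sketch, assuming identical hosts and a workload that can be spread evenly across them. The function names and the load and capacity figures are hypothetical.

```python
import math


def hosts_required(peak_load, host_capacity, spares=1):
    """Hosts needed to carry the peak load (N) plus the chosen number of spares."""
    n = math.ceil(peak_load / host_capacity)
    return n + spares


def max_utilization(total_hosts, spares=1):
    """Highest average utilization that still lets the surviving hosts
    absorb the load after `spares` hosts are lost."""
    return (total_hosts - spares) / total_hosts


# Hypothetical cluster: peak demand of 350 units, 100 units per host.
n_plus_1 = hosts_required(350, 100, spares=1)
n_plus_2 = hosts_required(350, 100, spares=2)

print(n_plus_1, max_utilization(n_plus_1, spares=1))  # 5 0.8
print(n_plus_2, round(max_utilization(n_plus_2, spares=2), 2))  # 6 0.67
```

In this example, the extra spare raises the host count from five to six and drops the usable ceiling from 80% to about 67%, which is the underutilization the paragraph above warns about.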
The necessary level of redundancy will be in a constant state of flux in order to accommodate your business needs. One key thing to remember when working with a virtualized environment is that the system you are protecting is not a single system -- it may hold dozens of systems and applications within it, so you should use a higher level of hardware redundancy than you would for a traditional server.
On the software side, you might use fault-tolerant VMs with a distributed resource scheduler for some workloads and high availability for others. The nice thing about a virtualized environment is that you have the flexibility to do both in the same infrastructure, but it may introduce additional licensing costs.
One other thing to consider is the redundancy of your virtualized environment's management system. Management often gets overlooked in favor of hosts and guests, which is a huge mistake, since management is essential in the event of failure. Your virtualized environment's overall redundancy should always conform to your stated service-level agreement, and you should ensure you include all of the pieces to make that happen, including management.