Networking infrastructure equipment achieves greater reliability when it is architected for high availability (HA) and combines commercial off-the-shelf (COTS) hardware with commercial and open source software components. Systems at the core and the edge of the network, once highly dependent on custom and proprietary platforms, today build on standards-based carrier-grade OSes, Service Availability Forum APIs and AdvancedTCA hardware, and boast five and six nines of availability.
By combining key HA technologies and practices with virtualization, data centers can also realize the benefits of higher availability on existing mainstream data center hardware and software platforms. This tip explains the essentials of HA and how to use high-availability methods to increase data center availability.
High availability defined and measured
Availability is commonly expressed as the ratio of acceptable system uptime to the total time in a given period, most often in one year. So, if your installation can tolerate one day of downtime in the course of 365, then your required availability equals 364/365 or 99.73%.
Systems offering high degrees of availability promote themselves in terms of the number of nines supported. Highly available systems boast four, five or six nines.
[TABLE]
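The relationship between a count of nines and the downtime it permits can be worked out directly. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative and do not come from any HA product.

```python
# Sketch: yearly downtime budget implied by an availability of N nines.
# Assumption: a 365-day year. All names here are illustrative only.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability_for_nines(nines: int) -> float:
    """Return the availability fraction for a count of nines (3 -> 0.999)."""
    return 1.0 - 10.0 ** (-nines)


def downtime_seconds_per_year(nines: int) -> float:
    """Return the permitted downtime in seconds per 365-day year."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in range(2, 7):
        minutes = downtime_seconds_per_year(n) / 60
        print(f"{n} nines: {availability_for_nines(n) * 100:.4f}% "
              f"allows about {minutes:.2f} minutes of downtime per year")
```

Under this assumption, five nines leaves a budget of a little over five minutes per year, and six nines roughly half a minute.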
In the real world, availability is estimated from statistically obtained values for mean time to failure (MTTF). Just as important is the time needed to repair a fault, the mean time to repair (MTTR).
Availability, then, is calculated as:
Availability = MTTF / (MTTF + MTTR)
If a system or component offers 50,000 hours MTTF, and it takes on average 15 minutes to repair or replace it (e.g., to find and swap out a disk or a blade), then availability for that system would equal 99.9995%, or five nines.
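The worked example above can be checked with a few lines of code. This is a minimal sketch of the formula as stated; the function name and the choice of hours as the unit are assumptions made for illustration.

```python
# Sketch: availability from MTTF and MTTR, both expressed in hours.
# The function name and the unit choice are illustrative assumptions.


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return MTTF / (MTTF + MTTR) as a fraction between 0 and 1."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # 50,000 hours MTTF and a 15-minute (0.25-hour) repair time.
    a = availability(50_000, 15 / 60)
    print(f"Availability: {a * 100:.4f}%")
```

Both inputs must use the same time unit; mixing hours and minutes is the most common way to get a misleading result from this formula.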
Using this formula, it is easy to see how architects can enhance total availability by using more reliable hardware and software components – thereby increasing MTTF – and/or by reducing the duration and impact of faults – by decreasing MTTR.
HA: Not one size fits all
Laypeople tend to think about catastrophic IT equipment failures lasting hours or days. By contrast, networked data and voice infrastructure systems are optimized to tolerate many frequent short outages, each often less than one second, and to recover quickly and gracefully.
In datacom and telecom, HA capability builds on a mix of specialized and COTS hardware and software. Today that mix includes AdvancedTCA blades, redundant Ethernet, RAID, Carrier Grade Linux, journaling file systems and HA middleware. Data centers and other enterprise IT locales can also improve availability with more conventional hardware and software.
Deploying these and other technologies helps effect greater availability by
HA system architects achieve this first design goal primarily through redundancy, in particular by provisioning spare hardware and software in varying states of readiness:
In general, the hotter the spare, the more expensive the solution.
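The availability gain from redundancy can be estimated with a simple model. The sketch below assumes that units fail independently and that failover to a spare is instantaneous and always succeeds; neither assumption holds exactly in practice, and the model ignores the difference in takeover time between hotter and colder spares. The function name is illustrative.

```python
# Sketch: availability of N redundant units where any one unit suffices.
# Assumptions (not from the article): failures are independent and failover
# is instantaneous and perfect. Real spares add takeover time to MTTR.


def redundant_availability(unit_availability: float, units: int) -> float:
    """Return 1 - (1 - a)^n, the chance that at least one unit is up."""
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = redundant_availability(0.99, n)
        print(f"{n} unit(s) at 99% each: {a * 100:.4f}% combined")
```

Under these idealized assumptions, pairing two 99% units yields 99.99%, which is why even modest components can reach high availability when provisioned redundantly.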
The second design goal – accelerating fault detection, isolation and resolution – can build on existing fault detection mechanisms, like device driver time-outs and protocol retry. The following technologies increase availability by streamlining failover, periodically polling the state of running applications, and backing up and synchronizing state information for running hardware and software:
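As one illustration of periodic polling, a heartbeat monitor can declare a fault after several consecutive missed health checks and then trigger failover. This is a minimal sketch, not the API of any particular HA middleware: the class name, the `check` and `on_fault` callbacks and the `max_missed` threshold are all hypothetical.

```python
# Sketch: heartbeat-style fault detection with a consecutive-miss threshold.
# All names (HeartbeatMonitor, check, on_fault, max_missed) are hypothetical.
from typing import Callable


class HeartbeatMonitor:
    """Declare a fault after `max_missed` consecutive failed health checks."""

    def __init__(self, check: Callable[[], bool],
                 on_fault: Callable[[], None], max_missed: int = 3) -> None:
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self.check = check
        self.on_fault = on_fault
        self.max_missed = max_missed
        self.missed = 0
        self.failed_over = False

    def poll_once(self) -> bool:
        """Run one health check; return True if this check succeeded."""
        if self.failed_over:
            return False  # failover already triggered; stop checking
        try:
            healthy = self.check()
        except Exception:
            healthy = False  # a check that raises counts as a missed beat
        if healthy:
            self.missed = 0  # any success resets the miss counter
            return True
        self.missed += 1
        if self.missed >= self.max_missed:
            self.failed_over = True
            self.on_fault()  # e.g. promote a standby instance
        return False


if __name__ == "__main__":
    # Simulated health-check results; a real monitor would sleep between
    # polls and probe a running application or virtual machine.
    simulated = iter([True, False, True, False, False, False])
    monitor = HeartbeatMonitor(lambda: next(simulated),
                               lambda: print("fault declared: failing over"))
    for _ in range(6):
        monitor.poll_once()
```

Requiring several consecutive misses trades detection speed for fewer false alarms; a lower threshold shortens MTTR but makes spurious failovers more likely.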
Leveraging virtualization for high availability
The traditional locus of increasing availability in enterprise IT has been clustering, in which multiple systems or blades are loosely coupled together to act as a single system. Clustering solutions, unfortunately, have suffered from highly proprietary and intrusive implementations, and from conflicting design goals.
Clustering paradigms tend to force both independent software vendors and end users to customize deployments to fit the architectures and APIs specific to vendors and their particular solutions. While unmodified production and legacy code do benefit from simple rehosting on clustered environments, the greatest benefits are realized through more thoroughgoing, intrusive and costly migration. Moreover, most clustering solutions tend to focus first on performance and load balancing, and second on enhancing availability; those that start with availability as a design goal usually offer lackluster performance.
As an alternative, virtualization can provide an economical platform for higher availability, hosting multiple redundant virtual instances of critical systems and resources rather than provisioning additional hardware. IT managers can gain availability explicitly, by deploying redundant systems and applications in virtual machines, or implicitly, as pointed out by Fadi Nasser of embedded virtualization supplier VirtualLogix: "Virtualization lets enterprise appliances achieve higher availability with software techniques that inexpensively mimic traditional dedicated hardware-centric HA systems."
With minimal, incremental investments, IT managers can use virtualization as an HA platform through:
Virtualization and a little scripting can be used to implement traditional HA constructs:
The gotchas
Some HA techniques and technologies, however, outstrip the capabilities of virtualization platforms:
Conclusion
IT managers and architects can look to a rich and varied toolbox containing both commercial and community resources for enhancing availability. They gain new tools by combining COTS virtualization with HA techniques, platforms and middleware. Enterprise virtualization platform suppliers like VMware are starting to offer basic HA functionality in their product lines, while embedded virtualization suppliers that cater to networking infrastructure are taking more aggressive approaches. You can also leverage commercial and open source middleware for health monitoring, heartbeating and failover, where the managed objects are no longer physical blades or interfaces but virtual machines, guest OSes and the applications running on them.
A good place to start is your own installation's history of faults and costly downtime. Make incremental investments to protect your most critical resources, like redundant provisioning across virtual machines and abstraction/virtualization of key network interfaces.
Ultimately, virtualization is just another tool for enhancing availability and reliability. The heuristics and mechanisms described in this article will not by themselves guarantee better uptime and faster fault resolution unless they are integrated into a comprehensive policy regime.