If your organization has embarked on using virtualization but now struggles with more advanced deployments, you’re not alone. Like others, you’ve probably incurred benefits from your first phase of deployment, but the second phase presents its own challenges.
Many companies first delved into server virtualization by consolidating servers to achieve cost savings and reduce power consumption, conserve space, and eliminate idle physical resources. But now they’ve expanded deployment to encompass new goals like improved disaster recovery (DR). Indeed, according to the Data Center Decisions 2008 Purchasing Intentions Survey of more than 600 IT professionals, the second most popular use of server virtualization deployment was for disaster recovery, with more than 40% of respondents using virtualization technologies for DR.
Now, with server virtualization, organizations can create virtual machines (VMs) that serve as backup servers when primary servers fail or in the event of a data center disaster. Virtualization enables organizations to meet shorter recovery time objectives (RTOs) using these virtual failover servers without incurring the cost and space concerns created by idle physical servers. Virtualization may also improve systems management along the way by leveraging seamless data and system image movement across physical platforms.
In the context of VMware Inc.’s server virtualization—the company currently holds 60% of the virtualization market—data center managers can implement DR in production environments and exploit features such as VMware High Availability (HA) to protect against single-system failure. With VMware virtualization technology, when a host server fails, the virtual machines running on that host are restarted on another ESX host using the Virtual Machine File System (VMFS), which gives multiple ESX hosts read-and-write access to VM files on shared storage. Because of the pervasiveness of VMware’s platform, this discussion focuses on DR strategies with VMware.
Nonetheless, disaster recovery in virtual environments is hardly without caveats—particularly in the area of data management and recovery. With large volumes of data, recovery time delays can occur, and some backup recovery methods may not be granular enough to retrieve data. As virtualization backup and management tools become more sophisticated, these tools have begun to address some of the nagging challenges. This article examines the benefits and pitfalls of a virtualized disaster recovery infrastructure while emphasizing that no one-size-fits-all architecture can suit every company’s needs.
Why virtualize DR?
When you consider the physical and financial resources that are required to be prepared for massive server failure or site-wide disaster, the reasons for virtualizing DR quickly become obvious.
For any data center with a significant number of servers and a recovery time objective of 24 hours to 48 hours, it’s clear that such RTOs cannot possibly be met in the event of a site-wide disaster unless a significant number of stand-by systems are ready to take over at an alternate location. In most cases, if new systems have to be purchased to achieve this goal, the initial hardware procurement delay alone would fail to meet the designated RTO, let alone the time needed to rebuild systems and restore data. And the procurement cost for physical standby systems simply does not compare with the costs associated with VMs designated for disaster recovery.
A few years ago, for example, a major recovery site provider informed a large Canadian oil corporation that it could not install and configure the company’s 800 servers within 48 hours of a disaster. The prospect of purchasing 800 servers to recover from a disaster that might never occur drove the company to seek alternate technology options and, ultimately, to adopt a virtualization-based DR strategy, along with virtualization in its production environment.
Reducing power consumption costs is another motivation for virtual DR strategies. Virtualizing disaster recovery extends the power and cooling savings gained from server consolidation. Based on the current average cost of power of 10 cents per kWh in the U.S. commercial sector, it would cost more than $2,000 to power a single 750-watt server over three years if we also factor in a likely rate increase. If 10 250-watt physical servers were virtualized onto a single 750-watt ESX host, a 10-to-1 server reduction, the savings over the same period would be approximately $4,700, or 70%, not including the cooling savings.
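The arithmetic behind these figures can be sketched as follows. The $0.10/kWh rate, wattages and three-year horizon come from the text; the code simply multiplies them out (rate increases over the period are not modeled, which is why the printed totals land slightly below the article’s rounded figures):

```python
# Back-of-the-envelope power cost comparison for server consolidation.
# Assumptions from the article: $0.10/kWh, 3-year horizon, servers run 24x7.
HOURS_PER_YEAR = 24 * 365
RATE_PER_KWH = 0.10
YEARS = 3

def power_cost(watts, years=YEARS, rate=RATE_PER_KWH):
    """Electricity cost of running a server continuously for `years` years."""
    kwh = watts / 1000 * HOURS_PER_YEAR * years
    return kwh * rate

host = power_cost(750)          # one 750-watt ESX host running 10 VMs
legacy = 10 * power_cost(250)   # ten 250-watt physical servers
savings = legacy - host

print(f"ESX host:   ${host:,.0f}")      # ~$1,971 (over $2,000 with rate increases)
print(f"10 servers: ${legacy:,.0f}")    # ~$6,570
print(f"Savings:    ${savings:,.0f} ({savings / legacy:.0%})")  # ~$4,599, ~70%
```

The 70% ratio holds regardless of the exact electricity rate, since the rate cancels out of the comparison.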
And of course, an aging power distribution infrastructure provides yet another reason to ensure a solid DR strategy. The increasing demand for power makes it more likely that data loss will occur because data centers have run out of capacity and now suffer from more frequent power outages and shortages.
A large North American parking services project epitomizes the benefits of virtualizing DR. The company virtualized 40 soon-to-be-replaced physical servers plus an additional 20, deploying them on two VMware ESX Server hosts, and implemented NetApp’s SnapMirror technology to replicate application data to another city more than 600 miles away. In the event of a disaster at the main site, the DR site has another two ESX hosts ready to take over. This strategy saved the company nearly $500,000 in hardware replacement costs, reduced network card and port counts, and cut cooling and power costs. For “gold level” systems, recovery time went from days to less than one hour.
What to virtualize
Most data centers have numerous systems running applications that are not especially resource-intensive, and these systems are ideal candidates for virtualization. They can be consolidated as multiple VMs onto a single physical system while maintaining the single-application-per-server arrangement that may be driven by security requirements or imposed by legacy applications running on different versions of an operating system.
There is also nothing wrong with running a single VM on an ESX host to take advantage of VMware HA and other features for a more resource-intensive application. This approach somewhat defeats the financial benefits of server consolidation and green computing, but it is still very much aligned with the DR capabilities that server virtualization offers.
But most virtualization deployments strive for higher server consolidation and cost reduction ratios. Let’s look at an example of a project where the right servers were consolidated for the right reasons: A U.S. health-care provider consolidated physical servers as nearly 100 virtual machines running on six ESX hosts. This consolidation effort saved the company $260,000 in hardware costs alone, not including power savings. Combining its multisite storage area network (SAN) with VMware, the company used synchronous data replication between two data centers in a campus setting and VMware HA to provide automated failover and reduce application downtime to seconds (where synchronous replication enables more rigorous recovery by backing up data from the exact moment of system failure). Regular DR testing has demonstrated seamless failover.
Server virtualization mitigates procurement delays and high costs typically associated with deploying a physical standby DR infrastructure. But having hundreds of virtual machines already deployed at an alternate site does not help much without business records. So we need to give serious consideration to the protection of an ever-growing volume of data.
Consider a medium-sized data center that houses 500 servers, physical or virtual, each with approximately 50 GB of data on average. These conservative numbers add up to 25 TB of data, which is not uncommon these days. With traditional methods, the thought of having to back up this amount of data, let alone ever having to restore it all at once, is unpleasant. So let’s have a look at some of the options.
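A rough restore-window estimate shows why traditional full restores struggle at this scale. The 25 TB figure is the article’s; the link speeds and the 70% throughput-efficiency factor are illustrative assumptions, not from the text:

```python
# Rough estimate of a full restore window for 25 TB over various links.
# The 70% effective-throughput figure is an assumption to account for
# protocol overhead and contention; real numbers vary widely.
def restore_hours(data_tb, link_mbps, efficiency=0.7):
    """Hours to move `data_tb` terabytes over a link of `link_mbps` Mbps."""
    data_bits = data_tb * 10**12 * 8              # decimal TB -> bits
    effective_bps = link_mbps * 10**6 * efficiency
    return data_bits / effective_bps / 3600

for link in (100, 1000, 10000):  # Mbps
    print(f"{link:>6} Mbps: {restore_hours(25, link):,.0f} hours")
```

At 100 Mbps the transfer alone takes on the order of a month; even a full gigabit link leaves the raw copy well outside a 24- to 48-hour RTO, before any rebuild or verification time is counted.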
Backup agents on all VMs. This method can be considered status quo, because it does not take advantage of virtualization beyond offering a reduced software licensing cost in cases where vendors use a CPU-based model. The I/O-intensive nature of backups can also quickly deplete all the resources on an ESX host during backups.
VMware VCB. VMware Consolidated Backup (VCB) technology is part of VMware and enables the integration of leading backup products with the virtualized infrastructure. Among other tasks, it provides a proxy mechanism that moves the backup I/O load to another server sharing storage with the VMs requiring backup. It backs up VMs by copying virtual machine disk files, or vmdk files. These files can be copied to a remote site via a wide area network (WAN) to offer DR protection.
Vizioncore vRanger. Vizioncore’s vRanger further automates the backups of entire VMs while removing the load from the ESX server. Like VCB, it backs up vmdk files but can also back up a physical system to a virtual machine file by using a conversion process known as physical-to-virtual (P2V) migration. This provides the ability to back up a physical system in a production environment and restore it to a VM at an alternate recovery site.
Veeam Backup and esXpress. Veeam Backup and esXpress are other software products providing backup services specifically tailored to a VMware infrastructure. Veeam Backup leverages Windows Volume Shadow Copy Service (VSS) functionality to ensure consistent application backups, and esXpress offers differential backups for vmdk files. With the footprint of VMs constantly growing and as backup software providers continue to improve VMware integration with existing products, we are likely to see other specialized backup tools emerge.
Nonetheless, all the backup methods described above involve copying large amounts of data. The traditional method of using backup agents on VMs offers no improvement in the ability to meet RTOs and recovery point objectives (RPOs): large restore operations can still take far longer than the established RTOs, and daily backups may not be granular enough to meet stringent RPOs.
While VCB and other vmdk file backups can improve the ability to meet RTOs by copying the files across a WAN to a remote site, they may still fail to meet RPOs for certain transaction-sensitive applications and, therefore, result in potential data loss. Moreover, vmdk file backups are point-in-time copies that may not be as granular as some applications require. For example, a point-in-time copy of a database may not be complete unless transaction logs are also captured, which is not always possible when disaster strikes.
Data replication offers the next level of data protection that, when applied in real time, can help satisfy tighter RPOs than can point-in-time copies. Beyond point-in-time replication, real-time data replication technology can be divided into two groups: synchronous and asynchronous. Synchronous replication writes changed data to the local and target disk in a synchronized fashion. It is fundamentally a mirror image of the source data. Asynchronous replication, on the other hand, allows some changes to be buffered in memory if delays in copying changed data to the target storage device are introduced because of distance, network latency and other factors.
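The difference between the two modes can be sketched with a toy model. This illustrates only the acknowledgment ordering that distinguishes them; it is not any vendor’s implementation:

```python
# Toy model of synchronous vs. asynchronous replication semantics.
class SynchronousReplica:
    """A write is acknowledged only after BOTH copies are updated,
    so the target is always an exact mirror of the source."""
    def __init__(self):
        self.source, self.target = [], []

    def write(self, block):
        self.source.append(block)
        self.target.append(block)   # remote write completes first...
        return "ack"                # ...then the application sees the ack


class AsynchronousReplica:
    """A write is acknowledged immediately; changed blocks are buffered
    and drained to the target later, so the target may lag behind."""
    def __init__(self):
        self.source, self.target, self.buffer = [], [], []

    def write(self, block):
        self.source.append(block)
        self.buffer.append(block)   # queued for later transmission
        return "ack"                # the application is not held up by latency

    def drain(self):
        """Ship buffered changes once distance/latency permits."""
        while self.buffer:
            self.target.append(self.buffer.pop(0))


sync, asyn = SynchronousReplica(), AsynchronousReplica()
for b in ("b1", "b2", "b3"):
    sync.write(b)
    asyn.write(b)
print(sync.target)   # always current with the source
print(asyn.target)   # empty until drain() runs; a crash now loses the buffer
asyn.drain()
```

The tradeoff is visible in the model: synchronous mode never loses an acknowledged write but makes every write wait on the remote copy, while asynchronous mode keeps applications fast at the cost of a window of buffered, unreplicated data.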
Data replication technologies can be further divided into the categories of host-based replication and storage array-based replication.
Host-based data replication. In the context of a virtualized infrastructure, host-based replication leverages a software tool installed on a physical server, a VM or a proxy and replicates data to another local VM or, for complete DR protection, a VM at a remote site. One of the advantages of host-based replication is that the process is hardware-agnostic. For example, data can be replicated from a physical system attached to an EMC SAN to a VM using NetApp storage.
Some of the real-time replication tools available on the market include Double-Take, Neverfail, Veritas Replication Exec and CA’s XOsoft. Examples of point-in-time replication tools include Vizioncore’s vReplicator and PlateSpin’s Forge. It should be noted that the frequency of the replication schedule has a direct impact on the ability to meet RPOs with point-in-time replication software. Frequent replication throughout the day increases the granularity of recovery-point capabilities. Conversely, with respect to meeting RPOs, a daily replication cycle offers little improvement over a traditional backup product.
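The relationship between replication frequency and recovery-point exposure is simple but worth making explicit. The change rate and schedule tiers below are hypothetical:

```python
# Worst-case data loss equals the data generated since the last
# successful replication cycle. Change rate and intervals are assumptions.
def worst_case_loss_gb(interval_hours, change_rate_gb_per_hour):
    """Upper bound on data at risk for a given replication interval."""
    return interval_hours * change_rate_gb_per_hour

CHANGE_RATE = 2  # GB of new/changed data per hour (assumed)
for label, hours in (("hourly", 1), ("every 6 hours", 6), ("daily", 24)):
    print(f"{label:>14}: up to {worst_case_loss_gb(hours, CHANGE_RATE)} GB at risk")
```

A daily cycle therefore caps exposure at a full day of changes, which is exactly what a traditional nightly backup already offers.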
Some host-based replication software providers also offer the ability to provide an automated failover mechanism. This is the case with Double-Take, Neverfail and XOsoft, which offer integrated high availability combined with their respective replication components. In the event of a system failure, this feature allows a VM to fail over to another VM regardless of the hardware or storage platform.
Array-based data replication. Array-based replication is handled at the storage device level and does not necessarily rely on a server for functionality. Beyond the initial copy, only changed data blocks are sent to a remote array and applied to the remote copy. Like host-based replication, array-based replication can be point-in-time or real-time synchronous or asynchronous. Traditionally, array-based replication could take place only between similar storage arrays, but cooperation between storage providers has now improved the ability to replicate between dissimilar arrays. Because replication takes place at the array level, array-based replication usually offers better performance than its host-based counterpart, using the arrays’ hardware resources to handle processing and buffering overhead.
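The changed-block idea can be illustrated with a toy model. The hash-based change detection here is a generic illustration of “only changed blocks cross the wire,” not any particular vendor’s mechanism:

```python
import hashlib

def block_hash(data: bytes) -> str:
    """Fingerprint a block so changes can be detected cheaply."""
    return hashlib.sha256(data).hexdigest()

def changed_blocks(source, target_hashes):
    """Return only the blocks whose content differs from the remote copy."""
    return {i: data for i, data in source.items()
            if target_hashes.get(i) != block_hash(data)}

# Initial full copy is already done; the remote side holds block hashes.
source = {0: b"alpha", 1: b"bravo", 2: b"charlie"}
remote_hashes = {i: block_hash(d) for i, d in source.items()}

source[1] = b"BRAVO-v2"   # one block is later modified on the source

delta = changed_blocks(source, remote_hashes)
print(sorted(delta))      # only block 1 needs to be sent to the remote array
```

Shipping only the delta is what keeps ongoing array-based replication traffic proportional to the rate of change rather than to the total data set size.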
Although data replication can bring a VMware-based recovery strategy much closer to meeting most recovery objectives, it has its limitations, which become evident in the event of data corruption or deletion. Replication is a “no questions asked” process: any transaction that results in corrupt or deleted data on the source storage will be replicated to the target storage. This is where a combination of real-time and point-in-time copies can save the day. If data becomes corrupt, it can be rolled back to an earlier point-in-time copy. Whether host- or array-based, another caveat with basic data replication is that it sometimes lacks application awareness and requires additional modules. This is the case when replicating database data; the replication tool must be integrated with the database to avoid inconsistencies.
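A minimal sketch of why the combination works: real-time mirroring faithfully propagates a bad write, while a retained point-in-time snapshot provides a rollback target. This is an illustrative model only:

```python
import copy

class ProtectedVolume:
    """Toy volume with real-time mirroring plus point-in-time snapshots."""
    def __init__(self):
        self.source = {}
        self.mirror = {}     # real-time replica: tracks every change, good or bad
        self.snapshots = []  # retained point-in-time copies

    def write(self, key, value):
        self.source[key] = value
        self.mirror[key] = value     # "no questions asked": corruption replicates too

    def snapshot(self):
        """Capture a point-in-time copy of the source."""
        self.snapshots.append(copy.deepcopy(self.source))

    def rollback(self):
        """Recover the source from the most recent snapshot."""
        self.source = copy.deepcopy(self.snapshots[-1])


vol = ProtectedVolume()
vol.write("record", "good data")
vol.snapshot()                     # last known-good point-in-time copy
vol.write("record", "CORRUPTED")   # a bad write...
assert vol.mirror["record"] == "CORRUPTED"   # ...is mirrored in real time
vol.rollback()
print(vol.source["record"])        # restored from the snapshot
```

The real-time mirror alone could not have recovered the record; only the point-in-time copy preserved a version from before the corruption.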
Some of the mainstream storage vendors—EMC, IBM, NetApp, HP and Hitachi, to name a few—have developed reliable and sophisticated data replication software suites that integrate with applications. Examples of this capability are the NetApp Snap Manager for Oracle or SQL utilities that ensure consistent application backups.
Site Recovery Manager. VMware’s Site Recovery Manager (SRM) provides a new twist on storage array-based replication. Managed from VMware’s VirtualCenter, SRM ties into existing array-based replication tools to replicate virtual machines using a special adapter known as a storage replication adapter, or SRA. The adapter is special in the sense that it must be listed on the VMware Hardware Compatibility List; currently, storage array vendors are responsible for developing compatible adapters, and the supported storage arrays must also be of the same model. SRM is a promising first-generation technology that offers valuable features such as failover capability for DR testing and actual recovery. It should be noted, however, that SRM focuses on the recovery of VMs and storage; it does not currently integrate with the applications discussed earlier, so additional steps should be taken to protect them. SRM also requires access to an Oracle or SQL database to keep track of the metadata and VM configuration files (i.e., vmsd, vmx and vmxf) in a data store.
As administrators work with replication technologies, they often have questions about what should be backed up. While it may appear simple enough to say that whatever is on storage array one should be replicated to array two, VMware requires additional considerations.
Most replication technologies integrated with VMware replicate vmdk files on VMware’s VMFS. If some VMs are allocated logical unit numbers on a storage array to store application data outside of a vmdk file, that data must be replicated by other means and made available to the replicated VM at the remote site. This is the case when SRM is used to interface with the array-based replication utility. Conversely, if non-VMware-aware replication technology is used, care must be taken to ensure VM configuration files are also replicated.
The next level of virtualization in a disaster recovery strategy is storage virtualization. The technology was initially developed to provide centralized storage management by creating a layer of abstraction that enables heterogeneous storage arrays to be managed holistically as a single storage pool. It is that same layer of abstraction that makes storage virtualization an interesting DR option. Because storage arrays in the pool can occupy different locations, data movement and replication can take place transparently to provide a remote DR copy. While at first glance storage virtualization may seem similar to array-based data replication, it offers far more flexibility and manageability than simple mirroring.
As we’ve seen, data considerations are probably the most complex aspect of developing a recovery strategy with VMware. There might not be a one-size-fits-all data protection strategy for VMware unless we have access to unlimited funding that allows us to deploy top-of-the-line replication technology for everything. It might be necessary to implement different tiers of data protection, ranging from real-time synchronous replication for top-tier critical applications to VMware-integrated daily backups for lower-tier systems.
With all these data replication options, there are nonetheless limitations to the level of data protection they provide. High rates of data change, long distances between source and target, and latency are all elements that may introduce data replication delays. Even with buffering and redundancies, data loss is still possible when data replication is interrupted unexpectedly, which can definitely happen in the event of a disaster.
Although virtual servers may be ready to take over processing duty at an alternate site, data consistency must be ensured for specific applications such as databases. The key to more seamless recoverability is a clear understanding of the impact of inconsistent or lost data on applications and business processes.
As virtual machines are replicated to a remote site for DR purposes, user access and authentication become important considerations. Services such as Domain Name System (DNS) and Active Directory must also be available at the alternate site to allow user access to the alternate infrastructure in the event of a disaster. As obvious as the requirement may seem, user connectivity in the face of disaster is often one of the last items considered.
Virtualization can be further exploited to extend recoverability all the way to the desktop with VMware’s Virtual Desktop Infrastructure (VDI) technology. VDI is built on a concept with which Citrix Systems’ users are familiar. VDI alleviates the need to deploy physical desktops at a recovery site to recreate a work environment. The main components include a thin client or Web browser for end users; a connection broker, such as VMware’s Virtual Desktop Manager or a third-party product such as Citrix’s; and, finally, the ESX host where virtual desktops run as VMs.
When end users work from the same facility where the production data center is located, this kind of recovery strategy can be vital. If server virtualization is to ensure the rapid recovery of a data center, it only makes sense that the same recovery capabilities should be extended to end users. While applications will easily resume processing on VMs at a remote site, displacing an entire workforce involves a human factor that is far more complex and unpredictable. With the increasing popularity of telecommuting, workplaces may have already implemented some form of virtual desktop, so integrating this approach into DR planning and infrastructure is a natural fit.
You still need a plan
A few skilled IT staff members at a small organization might claim they could recover their entire IT infrastructure without much documentation; data center IT staffers cannot make the same claim. Even when server virtualization, data replication and automated failover are implemented at an alternate site, complete and accurate configuration documentation of the technology in place must be created and maintained on a regular basis.
Unavoidable changes to an IT infrastructure are a disaster recovery strategy’s worst enemy. After a configuration change or the addition of new IT components, what worked yesterday might no longer work. A disaster recovery plan that is well integrated with change management can capture those changes, and frequent plan testing ensures that the necessary revisions to the plan are made. A comprehensive DR plan should also include the manual processes that cannot be automated, such as disaster alert management, notification, situational assessment and disaster declaration, but that are necessary to the timely and successful resumption of regular business activity.
To protect against hardware failure or site-wide disaster, virtualizing a disaster recovery infrastructure has numerous advantages over maintaining physical standby resources. And as data centers confront rising energy costs and a shaky economy, finding ways to economize on data center operations will become only more common, if not mandated. In addition to cost savings on hardware and energy consumption, creating virtual machines and designating them for disaster recovery enables organizations to recover more quickly and seamlessly from failure. Additional capabilities provided by methods such as data replication and storage virtualization enable organizations to move data transparently between sites to create remote copies for disaster recovery purposes.
Still, ever-growing volumes of data, long distances between facilities, and the challenges associated with protecting applications and data can introduce problems that are not always addressed by a virtualized disaster recovery infrastructure. And of course, there is no one-size-fits-all framework that can protect all organizations’ data and applications. Moreover, organizations need to consider protecting not only primary-site assets but also employees’ access to those resources; approaches such as desktop virtualization need to be part of any thorough DR equation. Finally, organizations need to undertake diligent documentation practices to capture the manual processes that are part of any effective DR strategy.
About the Author
Pierre Dorion is the Data Center Practice director and a senior consultant at Long View Systems Inc. in Phoenix, Ariz. Dorion is also a Certified Business Continuity Professional specializing in the areas of business continuity and disaster recovery planning. Dorion has appeared as a guest speaker on the subject of disaster recovery and IT resilience at several conferences, including Storage Decisions, ARMA, AFCOM and CIPS.