Compared with old-fashioned disaster recovery (DR) strategies, using virtualization with automation tools like VMware Inc.'s Site Recovery Manager (SRM) as part of your DR strategy has a lot of benefits: greater flexibility, quicker and easier testing, and faster data recovery. But when you add virtualization to disaster recovery processes, there are potential pitfalls, so traditional backup methods should not be thrown away but used in concert with SRM, a virtualization expert advised.
Caveats aside, IT managers who have added virtualization to their DR toolkit are thrilled. Take the San Diego Data Processing Corporation (SDDPC), a nonprofit IT organization that handles the city's IT needs; it recently implemented VMware SRM -- which integrates with VMware Virtual Infrastructure (VI), VMware VirtualCenter and storage replication software from third-party vendors to automate failover and recovery -- and found it to be a godsend.
SDDPC consists of 25 hosts running 300 VMware virtual machines (VMs) in production, said Rick J. Scherer, a systems administrator for SDDPC. To back everything up, SDDPC also virtualized its DR environments, one in San Diego some miles from its primary location, and another in Chicago. For hardware, SDDPC used repurposed servers from its primary location, and for replication, it used SnapMirror software, which came with its NetApp storage arrays, Scherer said.
SRM, meanwhile, is essentially a tool within VMware VI that handles the recovery plan, which is stored within VMware VirtualCenter (VC). The plan contains the order in which virtual guests need to be brought up and the steps to get the protected site running at the recovery site. In a recent article on VMware SRM, virtualization expert David Davis explained how SRM synchronizes VC data (guest VM configurations) between the primary and backup sites.
By synchronizing all VM guest configurations, including the VM guest networking, CPU, RAM and VM hierarchy within VirtualCenter, VMware's SRM performs automatic failover of VMs, changes IP addressing to provide for failover, exports a Domain Name System (DNS) script to update DNS registries, and prioritizes VM startups based on resource availability, Davis wrote.
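To make the idea of an ordered recovery plan concrete, here is a minimal sketch in Python. It is not SRM's API or file format; the `VmStep` type, the `startup_order` function, the VM names and the IP addresses are all hypothetical, chosen only to illustrate a plan that records a startup priority and a recovery-site address for each guest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmStep:
    """One entry in an illustrative recovery plan (hypothetical, not SRM's schema)."""
    name: str         # hypothetical VM name
    priority: int     # lower number is powered on first
    recovery_ip: str  # address the guest would use at the recovery site


def startup_order(steps):
    """Return VM names in the order they would be powered on."""
    ordered = sorted(steps, key=lambda s: (s.priority, s.name))
    return [s.name for s in ordered]


plan = [
    VmStep("web-01", 3, "10.20.0.31"),
    VmStep("dns-01", 1, "10.20.0.11"),
    VmStep("db-01", 2, "10.20.0.21"),
]

print(startup_order(plan))  # ['dns-01', 'db-01', 'web-01']
```

Infrastructure services such as DNS come up first in this example so that the guests depending on them can resolve names when they start.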
Ultimately, testing a DR plan using SRM is less labor-intensive and quicker than testing without it because it enables users to run automated tests of their DR plans within an isolated testing environment using the recovery plan for an actual failover. This way, there are no disruptions to the environment, according to VMware.
Using virtualization for its DR operations was light years better than SDDPC's old setup. "It was more efficient and easier than having a backup location with everything backed up to a tape system. The problem of trying to get those tapes to a secondary site and hopefully being able to recover them -- the complexity and time it takes to recover from tape -- was a problem," Scherer said. "We would have to install an OS to bring everything back online, and it could take days."
Though SDDPC hasn't experienced any disasters that put the new virtualized DR plan to the test, it has tested that plan. Compared with testing its old physical DR environment, testing the virtual one with SRM was a snap, Scherer said. "We did a DR test just a few weeks ago, and it only took half a day, whereas before it took us a full week."
Replication considerations for DR sites
While new disaster recovery tools within virtualization software, like SRM, have made it easier than ever for businesses like SDDPC to set up a DR site with virtualization, SRM isn't the be-all and end-all for DR, said Mike Laverick, a VMware and Citrix Certified Instructor who wrote a book on VMware's Site Recovery Manager.
"In the past, VMware left it up to us to engineer scripted solutions to handle the storage, registration, power-on and other events that allowed [disaster recovery] to happen. Now we have VMware's SRM, those problems and hassles have been taken away, or at least much reduced," Laverick said. "But SRM is not a silver bullet cure-all for your DR ills. It's only a small, but important, part of an overall DR strategy."
While array-based replication plus SRM are great for a catastrophic loss of the production site, they're not designed for minor, nondisaster-type losses, like a single server or small amounts of data, Laverick said, and are not a replacement for traditional file-level backups. "First, if an individual file or database or mail store became corrupt, it would be infinitely faster and less intrusive to restore the lost or corrupt data via conventional backups, [because] array-based replication is not granular enough to deal with this scenario."
Nor does replicating your data to a remote site eliminate the need to test backups. "If your data in the production location is riddled with a damaging virus, there's a good chance that virus would be merely replicated at the DR location," Laverick said. Garbage in, garbage out, as it were. "Whilst it's true that triggering DR with SRM is faster and quicker than restoring potentially terabytes of data from slow media like tape, it doesn't mean you don't have to back up and test your restores."
Poor data and storage planning can also be disastrous to virtualization in a DR environment, Laverick said. Someone needs to sit down and work out how many logical unit numbers (LUNs) or volumes they need and what properties those LUNs/volumes will have in terms of:
- LUN/volume size
- Number of spindles and the amount of I/O they can take
- RAID levels
- Replication levels (synchronous, or asynchronous with latency), and whether these levels of replication will deliver on your recovery point objectives (RPOs) or recovery time objectives (RTOs)
These properties are important, "because when you replicate a LUN/volume, everything in that LUN/volume gets replicated, and you want to make darn sure you're not replicating a VM you did not need at the DR location. That's a waste of time, bandwidth and space," Laverick said. "Those three things also cost money in some shape or form."
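The planning exercise Laverick describes can be sketched as a simple audit. The Python below is illustrative only: the `Lun` type, the `audit` function, the LUN and VM names and the 60-minute RPO are hypothetical, and it makes the rough assumption that worst-case data loss on an asynchronously replicated LUN is about one replication interval. It flags VMs that are being replicated without being needed at the DR site, and replicated LUNs whose interval is longer than the RPO.

```python
from dataclasses import dataclass, field


@dataclass
class Lun:
    """A hypothetical LUN/volume record for planning purposes."""
    name: str
    replicated: bool
    interval_min: int  # minutes between replication passes; 0 means synchronous
    vms: list = field(default_factory=list)


def audit(luns, needed_at_dr, rpo_min):
    """Return (wasted, rpo_misses) for a storage layout.

    wasted:     VMs on replicated LUNs that are not needed at the DR site
    rpo_misses: replicated LUNs whose interval exceeds the RPO (in minutes)
    """
    wasted = []
    rpo_misses = []
    for lun in luns:
        if not lun.replicated:
            continue
        # Everything on a replicated LUN is copied, wanted or not.
        wasted.extend(vm for vm in lun.vms if vm not in needed_at_dr)
        # Assumption: worst-case data loss is roughly one replication interval.
        if lun.interval_min > rpo_min:
            rpo_misses.append(lun.name)
    return wasted, rpo_misses


luns = [
    Lun("lun-prod-01", True, 15, ["db-01", "web-01", "scratch-01"]),
    Lun("lun-prod-02", True, 240, ["mail-01"]),
    Lun("lun-test-01", False, 0, ["dev-01"]),
]

wasted, rpo_misses = audit(luns, {"db-01", "web-01", "mail-01"}, rpo_min=60)
print(wasted)      # ['scratch-01']
print(rpo_misses)  # ['lun-prod-02']
```

In this made-up layout, `scratch-01` consumes replication bandwidth and DR storage for no benefit, and `lun-prod-02` replicates only every four hours, which would not meet a one-hour RPO.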
IT managers can also avoid backup issues by ensuring that those who create VMs put them on the right level of storage to avoid LUN/volume issues, he said. When users create VMs, they rarely consider whether that LUN is being replicated and how frequently it is being duplicated to the DR location, Laverick said.
The DR site as production environment gateway
While virtualization can be used for DR even if production environments aren't virtualized, this is often short-lived; the conservative IT folks who resist using virtualization in production usually get hooked after seeing it work in a DR environment, Laverick said.
"Quite frequently my customers say to me, 'We're going to use virtualization in our test and dev environments first, and then roll it out to our DR environment before we go live in production.' I can understand the rationale around these statements. It's about being cautious. … But once you have virtual machines in DR, it is almost inevitable you will want them in production," Laverick said.
Keeping virtual and physical environments in sync can also be problematic. Tools for physical-to-virtual migration from vendors like PlateSpin Ltd. can help, but generally speaking, "it is such a royal PITA," he said. "The quicker the production and DR both run on virtual, the quicker you can just replicate the production VMs to the DR location and have done with it."