LAS VEGAS -- Virtual machine portability between sites is a hot topic at VMworld 2011, with some users envisioning new non-disruptive disaster recovery scenarios using stretched clusters.
However, experts warned, there are a number of caveats -- and some misconceptions -- about the feasibility of stretched clusters for disaster recovery (DR).
New developments in vSphere 5 spurred much of this talk about virtual machine (VM) portability between sites. vSphere 5 now supports Metro vMotion over network links with latencies up to 10 milliseconds (provided users have the Enterprise Plus licensing level). A preview of a new technology called VXLAN, which would provide VMs the isolation and segmentation benefits of layer 3 networks, also brought VM mobility to the forefront.
Some attendees’ thoughts turned to DR in discussions about stretched clusters, which are typically set up over two sites separated by a distance up to 100 kilometers. These clusters are managed by a single Virtual Center (vCenter) and occupy a single IP address space, so VMs can be migrated by vMotion non-disruptively.
The ultimate vision is non-disruptive DR, because VMware’s Site Recovery Manager (SRM) still requires a “cut-over” period -- typically about 15 to 20 minutes. SRM also typically fails over an entire site rather than individual components or VMs.
“Not having to fail over the entire site and just run a component of services where people are working is appealing, and uses resources more effectively,” said one virtualization architect working for a county in the Midwest, who asked not to be named. Severe storms hit the county in July, cutting communication to critical systems such as 911 and emergency management services.
“Stretch clusters might have made communication possible at individual locations even if the primary site was unavailable,” he said.
Stretched clustering lessons learned
Financial services company TIAA-CREF has been working on stretched clustering using vSphere 4.1 for the last 14 months or so, according to senior IT engineers speaking at a VMworld session Wednesday.
The company started looking for an alternative to SRM’s full-site failover to provide high availability for each component of the infrastructure -- storage, networking and compute.
“What that means to us is [that] any specific part of the stack…needs to be able to survive on its own,” said senior IT engineer Glenn Walker. “If one fails, we don’t want to have to move the entire site over…that’s a pretty big hammer approach. Businesses usually don’t like that too much.”
But a stretched cluster is not a replacement for traditional DR. In fact, TIAA-CREF still uses a disaster recovery site in addition to the data centers involved, and the stretched cluster has yet to make it out of the proof-of-concept stage for a number of reasons, the engineers said.
Among the caveats around stretched clusters is the potential for a “split brain” scenario, in which the two sites lose network connectivity to each other, but both remain “alive.” To work around this, TIAA-CREF cobbled together a kind of “quorum” node using its DR site to monitor connectivity between the sites. This is done using a Perl script from NetApp called TieBreaker.
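The decision a third-site “quorum” node has to make can be sketched in a few lines of Python. This is only an illustration of the general tiebreaker idea, not the logic of NetApp’s TieBreaker script; the `WitnessView` fields, the `decide` function and the notion of a “preferred” site are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WitnessView:
    """What a third-site witness can observe (all fields are hypothetical)."""
    site_a_reachable: bool    # witness can reach site A
    site_b_reachable: bool    # witness can reach site B
    inter_site_link_up: bool  # sites A and B can still reach each other


def decide(view: WitnessView, preferred: str = "A") -> str:
    """Return which site(s) should keep serving: 'both', 'A', 'B' or 'none'."""
    if view.site_a_reachable and view.site_b_reachable:
        if view.inter_site_link_up:
            return "both"  # healthy stretched cluster, nothing to do
        # Split brain: both sites are alive but cut off from each other.
        # Keep exactly one side active so the two copies cannot diverge.
        return preferred
    if view.site_a_reachable:
        return "A"
    if view.site_b_reachable:
        return "B"
    return "none"  # witness itself is isolated, so it takes no action


# Both sites answer the witness, but the link between them is down.
print(decide(WitnessView(True, True, False)))
```

The point of the sketch is that the witness, sitting at a third location, is the only party that can tell a dead site from a merely disconnected one.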
TIAA-CREF has also had to develop its own custom plug-ins for vCenter Orchestrator to monitor the resources at each site so VMs don’t get separated from their associated storage. Such a split can cause performance problems.
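The kind of check such monitoring performs might look like the sketch below. It does not use the vCenter Orchestrator plug-in API or any real VMware call; the inventory is modeled as two plain dictionaries, and the function name, VM names and site labels are hypothetical.

```python
def find_split_vms(vm_host_site, vm_datastore_site):
    """Return sorted names of VMs whose host and datastore sit at different sites.

    Both arguments map a VM name to a site label; VMs missing from the
    datastore map are skipped rather than guessed at.
    """
    return sorted(
        vm
        for vm, host_site in vm_host_site.items()
        if vm in vm_datastore_site and vm_datastore_site[vm] != host_site
    )


# Hypothetical inventory: 'app02' runs at site B but its storage is at site A.
hosts = {"app01": "A", "app02": "B", "db01": "B"}
stores = {"app01": "A", "app02": "A", "db01": "B"}
print(find_split_vms(hosts, stores))
```

A VM flagged this way reads and writes its disk across the inter-site link, which is where the performance penalty comes from.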
Other issues remain unresolved. For example, a stretched cluster is managed by a single vCenter, which leaves that server as a potential single point of failure. TIAA-CREF uses vCenter Server Heartbeat and has clustered the SQL database underlying vCenter to try to get around this. The company is also looking at ways to make its Cisco Nexus 1000V Virtual Supervisor Module available at both sites.
“Today there’s no real solution for that. There’s some published documentation around how to do that, but it’s really keeping both nodes in a pair within one site,” said Andy Daniel, senior IT engineer for TIAA-CREF.
And that is not to mention the amount of specialized “semi-production” gear the company acquired for this proof of concept. That includes Cisco’s entire Nexus switch line, mirrored high-end arrays from NetApp, and specialized networking and storage services such as Cisco’s Overlay Transport Virtualization and NetApp’s MetroCluster.
Disaster recovery vs. disaster avoidance
The idea of failover without downtime is appealing, but VMware and EMC Corp. officials emphasized in another session that disaster recovery and disaster avoidance scenarios are often misconstrued.
High availability between sites really only works when IT admins can see an outage coming, said Lee Dilworth, specialist system engineer in northern Europe for VMware.
Traditional DR methods typically respond to an unexpected event that has already occurred. But “customers get involved in trying to decide which one of these solutions they want very quickly without really thinking about what’s driving them to make that decision,” he said. “What is the business case for doing it? If people are spending a lot of money on network infrastructure to make [a stretched cluster] happen…it may be simpler to just buy extra capacity at both sites…and put a DR solution in between.”
Dilworth’s co-presenter, Chad Sakac, vice president of the VMware strategic alliance for parent company EMC, said there are use cases for stretched clusters, such as planned migration at a hospital or similar organization that cannot tolerate downtime. But he estimated that this population represents about 10% of VMware’s install base. There are about 5,000 SRM users in the world, Sakac said, and “many tens to very low hundreds of happy, functioning stretch-cluster users.”
Roadmap to bring DR and avoidance closer together
Despite the warnings about stretched clustering limitations today, VMware officials said the company is working on ways to make it easier. Site affinity rules may come to VMware HA (now called Fault Domain Manager, or FDM) in the future, according to Tom Stephens, senior technical marketing architect for VMware.
“We’re also looking at ways of having multiple vCenters in one site,” he said.
Currently, FDM doesn’t monitor the health of physical resources on the ESXi host such as the host bus adapter or the network interface card, but both Stephens and Dilworth said component protection, targeted at stretched cluster scenarios, is in the works so workloads are not placed on hosts that may be failing.
There may also be new topologies supported in the future that bring together stretched clusters and longer-distance DR between three sites, with sites A and B representing a stretched cluster for HA, and site C with SRM’s asynchronous replication for DR.
In practice, if not technically, users “want to merge disaster avoidance and DR together,” Dilworth acknowledged. “That will come eventually.”
Beth Pariseau is a senior news writer for SearchServerVirtualization.com. Write to her at firstname.lastname@example.org.