A new product announced by VMware parent company EMC Corp. this week could lay the groundwork for live migration of virtual servers over large geographic distances. But storage is just one part of that battle and some experts question whether the cost of long-distance vMotion will be justifiable even when the technology is ready for prime time.
Distance vMotion, a concept discussed in VMware circles since at least 2009, already exists in the wild today, and VPlex Geo is third in a line of storage federation products EMC introduced a year ago to support it. The first two models, VPlex Local and VPlex Metro, are available and used in production for federated storage access across distances of up to 100 kilometers.
The introduction of VPlex Geo means it’s now technically possible to perform live migrations of VMs and their associated data across thousands of miles. A VPlex Global model, which EMC says will support cross-continental live migrations, is also still waiting in the wings.
These products could, at least theoretically, change disaster recovery and automated high availability in virtual environments if they take off. The ability to federate storage with transactionally consistent caching over distance (meaning data centers that are miles apart can be pooled and used as one resource, a step beyond replication-driven HA and failover between sites) also has potential uses for data mobility (“cloud bursting”) and multi-site collaboration.
Devil’s in the details
However, three factors hamper long-distance vMotion: its low tolerance for network latency; networking bandwidth requirements; and cost.
Scott Lowe, VMware-Cisco solutions principal for EMC, emphasized in tweets on Tuesday that the existence of VPlex Geo does not mean immediate availability of fully supported, production-ready vMotion over asynchronous distances. “Not yet,” Lowe said. “Need to watch vMotion [Round Trip Time] limits.” On Wednesday, he added, “see VMware’s support statements regarding requirements for long-distance vMotion.”
According to the document, “Virtual Machine Mobility with VMware VMotion and Cisco Data Center Interconnect Technologies,” the maximum latency between the two VMware vSphere servers cannot exceed 5 milliseconds (ms) in any case. This could change, of course, but has yet to do so. No official support statement for vMotion over asynchronous distances has been issued by either VMware or EMC.
Meanwhile, the VMware / Cisco support document for distance vMotion also includes the requirement that “the IP subnet on which the virtual machine resides must be accessible from both the source and destination VMware ESX servers.” This is also known as Layer 2 adjacency, and the current support document calls for an IP network with a minimum bandwidth of 622 Mbps between sites to accommodate it. Such a configuration over distance also involves logically “stretching” the Layer 2 domain, through technologies such as Cisco’s Overlay Transport Virtualization (OTV).
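To make those limits concrete, here is a rough Python sketch that checks a hypothetical inter-site link against the three requirements quoted above: the 5 ms latency ceiling, the 622 Mbps bandwidth floor and Layer 2 adjacency. The `SiteLink` class, its field names and the helper functions are illustrative assumptions, not part of any VMware, Cisco or EMC tool. The figure of roughly 200 km per millisecond for light in optical fiber is a common approximation, and treating the 5 ms limit as a round-trip figure is also an assumption, based on Lowe’s reference to round-trip time.

```python
from dataclasses import dataclass

# Limits quoted from the VMware / Cisco support document cited in the article.
MAX_RTT_MS = 5.0
MIN_BANDWIDTH_MBPS = 622.0

# Assumption: light covers roughly 200 km per millisecond in optical fiber.
FIBER_KM_PER_MS = 200.0


@dataclass
class SiteLink:
    """Hypothetical description of the network link between two sites."""
    distance_km: float
    measured_rtt_ms: float
    bandwidth_mbps: float
    shared_subnet: bool  # True if the VM's IP subnet is reachable at both sites


def best_case_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * distance_km / FIBER_KM_PER_MS


def check_link(link: SiteLink) -> list[str]:
    """Return the requirements the link fails; an empty list means none."""
    problems = []
    if link.measured_rtt_ms > MAX_RTT_MS:
        problems.append("round-trip latency exceeds 5 ms")
    if link.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        problems.append("bandwidth is below 622 Mbps")
    if not link.shared_subnet:
        problems.append("no Layer 2 adjacency between sites")
    return problems


if __name__ == "__main__":
    metro = SiteLink(distance_km=100, measured_rtt_ms=1.5,
                     bandwidth_mbps=1000, shared_subnet=True)
    geo = SiteLink(distance_km=3000, measured_rtt_ms=35.0,
                   bandwidth_mbps=1000, shared_subnet=True)
    for name, link in (("metro", metro), ("geo", geo)):
        print(name, best_case_rtt_ms(link.distance_km), check_link(link))
```

Under these assumptions, propagation delay alone uses up the full 5 ms budget at about 500 km of fiber, before any switching or routing overhead is counted. That is why a link spanning thousands of kilometers cannot meet the current limit, however much bandwidth is purchased.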
If vendors build it, will users pay?
With time, the technical and support barriers to performing live migrations with local data access can easily be broken down. But that’s where Wikibon analyst David Floyer points out that high costs, especially for network bandwidth, will figure heavily into any business justification for the technology.
VPlex Local and Metro give users a healthy return on investment, Floyer pointed out. For example, VPlex Metro used with Oracle’s Real Application Clusters (RAC) can allow effective stretched clustering well over the application’s native tolerance for 1 kilometer’s worth of latency, without requiring 622 Mbps of bandwidth between sites. But when it comes to mobility over longer distances, especially for high availability and disaster recovery, the cost of the bandwidth required to overcome the low tolerance for latency in vMotion today is out of the reach of most users, Floyer said.
Moreover, Floyer questioned the use case. “Why would you do [live migration over distance]?” he asked. If it’s load-balancing between data centers more than 100 km apart, simply adding processing capacity where it’s needed would still be cheaper and less complex than migrating over big network links. Similarly, existing replication-based disaster recovery technologies will probably work well for the majority of enterprises, he said.
“It’s better to keep data where it is, and move the workload to it when you have to,” according to Floyer. Long-distance vMotion “is a lovely theory, but the cost of doing it is just not going to be practical.”
Update 2 pm ET 5-12-11: This post has been changed to correct information about stretched clustering that appeared in the original version.