Wouldn’t it be great if a virtual machine (VM) were back online only a few minutes after a major failure?
A quick restore time is certainly ideal for VM recovery, but it’s not always possible. For faster restores, you need a strong VM backup method.
Many administrators want complete VM recovery within minutes, whether they have small servers or those with terabytes of data. But the file-copy process can take hours when data size approaches the terabyte range. Obviously, a VM recovery method that has to wait for files to copy from the VM backup service won’t work with a terabyte-sized server.
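To put a rough number on that wait, here is a back-of-the-envelope calculation in Python. The throughput figures are illustrative assumptions, not measurements from any particular backup product or network.

```python
def copy_time_hours(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Return the hours needed to copy size_bytes at a sustained throughput."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_sec / 3600


TERABYTE = 10**12  # decimal terabyte

# Assumed sustained transfer rates, chosen only for illustration.
ASSUMED_RATES = {"100 MB/s": 100e6, "500 MB/s": 500e6}

for label, rate in ASSUMED_RATES.items():
    print(f"1 TB at {label}: {copy_time_hours(TERABYTE, rate):.1f} hours")
# prints:
# 1 TB at 100 MB/s: 2.8 hours
# 1 TB at 500 MB/s: 0.6 hours
```

Even at the faster assumed rate, a full copy is measured in tens of minutes at best, which is why both approaches below avoid waiting for it.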
To solve this problem and get a server back online in minutes, there are two potential approaches. Both methods should take only a few minutes, but one trades slightly more downtime for less up-front data storage waste.
VM backup as a failover method
The first approach essentially treats your VM backup method as a server failover solution. Many disk-based backup services use a file system filter driver to gather data from the backed-up server. Rather than looking at changes to individual files, a file system filter driver watches for changes to individual disk blocks. When a disk block changes, its new contents (a relatively small amount of data) are copied to the backup server and cataloged with the other changed blocks.
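The block-level idea can be sketched in a few lines of Python. This is a user-space simulation only: a real filter driver runs inside the operating system and intercepts writes there. The names `TrackedDisk` and `BackupCatalog` and the 4 KB block size are assumptions made for illustration, not any vendor's design.

```python
BLOCK_SIZE = 4096  # assumed block size; real products choose their own


class BackupCatalog:
    """Stands in for the backup server: keeps the latest copy of each changed block."""

    def __init__(self):
        self.blocks = {}  # block index -> most recent contents
        self.log = []     # block indexes in the order changes arrived

    def record(self, index, contents):
        self.blocks[index] = contents
        self.log.append(index)


class TrackedDisk:
    """A simulated disk that reports every block touched by a write."""

    def __init__(self, size_blocks, on_block_changed):
        self._data = bytearray(size_blocks * BLOCK_SIZE)
        self._on_block_changed = on_block_changed

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self._data):
            raise ValueError("write falls outside the disk")
        if not payload:
            return
        self._data[offset:offset + len(payload)] = payload
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        # Ship whole blocks, not files: only the blocks this write touched.
        for index in range(first, last + 1):
            start = index * BLOCK_SIZE
            self._on_block_changed(index, bytes(self._data[start:start + BLOCK_SIZE]))


catalog = BackupCatalog()
disk = TrackedDisk(size_blocks=1024, on_block_changed=catalog.record)
disk.write(4090, b"hello world")  # this write straddles blocks 0 and 1
print(sorted(catalog.blocks))     # [0, 1]
```

An 11-byte write sends two 4 KB blocks to the catalog, no matter how large the file containing those bytes is.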
Such a file system filter driver could be installed inside the VM, or even on the virtual host, to monitor for changes. Disk blocks are backed up as they’re modified, so this approach to VM recovery no longer relies on a VM backup window. Rather than gathering the interim data -- the changes that happened since the last VM backup -- at a scheduled time, this driver just keeps updating the VM backup in almost real time.
With changed blocks cataloged to a backup server, it’s feasible to restore them to a second powered-off VM at the same time. What results is a kind of data stream between the two servers. The process starts when a disk block changes on the production server. Then, the file system filter driver captures the change and sends it to the backup server. Finally, the backup server transfers the change to the redundant server.
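The three-step flow just described (production server, then backup server, then redundant server) can be modeled as a small Python sketch. The class names `ProductionDisk`, `BackupServer` and `Replica` are hypothetical, and the in-process method calls stand in for what would be network transfers in a real deployment.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


class Replica:
    """The powered-off standby: its virtual disk passively receives blocks."""

    def __init__(self, size_blocks):
        self.disk = bytearray(size_blocks * BLOCK_SIZE)

    def apply(self, index, contents):
        start = index * BLOCK_SIZE
        self.disk[start:start + BLOCK_SIZE] = contents


class BackupServer:
    """Catalogs each changed block, then forwards it to the replica."""

    def __init__(self, replica):
        self.catalog = {}
        self.replica = replica

    def receive(self, index, contents):
        self.catalog[index] = contents       # step 2: catalog the change
        self.replica.apply(index, contents)  # step 3: pass it on


class ProductionDisk:
    """The production server's disk; every block write is captured."""

    def __init__(self, size_blocks, backup):
        self.size_blocks = size_blocks
        self.disk = bytearray(size_blocks * BLOCK_SIZE)
        self.backup = backup

    def write_block(self, index, contents):
        if not 0 <= index < self.size_blocks:
            raise ValueError("block index out of range")
        if len(contents) != BLOCK_SIZE:
            raise ValueError("contents must be exactly one block")
        start = index * BLOCK_SIZE
        self.disk[start:start + BLOCK_SIZE] = contents
        self.backup.receive(index, bytes(contents))  # step 1: capture and send


replica = Replica(size_blocks=8)
backup = BackupServer(replica)
production = ProductionDisk(size_blocks=8, backup=backup)

production.write_block(2, b"A" * BLOCK_SIZE)
production.write_block(5, b"B" * BLOCK_SIZE)
print(production.disk == replica.disk)  # True
```

Because the replica's disk already matches production, recovery in this model is a power-on operation, not a copy operation.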
These two servers are therefore loosely synchronized, which means that VM recovery after a failure requires little more than powering on the VM’s other half. That other half could be a physical server somewhere, or it could be another VM. Synchronization can run in either direction, so returning to the primary server simply requires reversing the flow of data, powering down the secondary server and powering on the primary.
This failover approach is a great starting point for getting servers operational quickly, but it does require an extra copy of data waiting around for a failure. If the server you’re protecting is a terabyte in size, the cost of keeping two copies can be excessive.
Data prioritization for VM recovery
There’s another approach to VM recovery that can get servers back online in minutes, but without costly data duplication. This VM backup method uses data prioritization during the VM recovery and restore process.
Using the same file system filter driver, this VM backup method keeps only a single copy of data on the disks attached to the backup server. Should a production server fail, the first step to VM recovery is booting a replacement with a DVD or other media. On that DVD is enough of an operating system and associated application code to kick off the restore from the VM backup. You can then restore the core OS and critical applications.
The quantity of data consumed by the OS and applications is relatively small compared to the databases. (Think “tens of gigs” as opposed to “thousands of gigs.”) So restoring those core pieces doesn’t take much time -- perhaps only a few minutes. Once the core pieces of that server are operational, the larger portion of the data then begins its restore.
It’s at this point in the VM recovery process that things get interesting. A disk-based backup service that uses a file system filter driver can randomly access any piece of data in its catalog. That data can be prioritized according to what users need. Once the application is restored and ready for use, high-priority data can be restored before the rest -- as users request it. The complete set of data might not get restored for a long time -- say, the amount of time required to copy that terabyte of data -- but the server and its applications are functioning within minutes. If a user needs a piece of data that isn’t already restored, that piece can be immediately prioritized.
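One way to picture this prioritization is as a priority queue that a background task drains while user requests jump the line. The Python sketch below is an assumption-laden illustration: the class `PrioritizedRestore`, the file names and the priority numbers are all hypothetical, and real products track blocks, not named items.

```python
import heapq
import itertools

DEFAULT_PRIORITY = 5  # assumed default; lower numbers restore sooner


class PrioritizedRestore:
    """Restores items in priority order, and on demand when a user asks."""

    def __init__(self, catalog, priorities):
        self._catalog = catalog  # item name -> data held on the backup server
        self.restored = {}       # item name -> data now on the recovered server
        self._counter = itertools.count()  # tie-breaker keeps ordering stable
        self._heap = []
        for name in catalog:
            priority = priorities.get(name, DEFAULT_PRIORITY)
            heapq.heappush(self._heap, (priority, next(self._counter), name))

    def request(self, name):
        """A user needs this item now: restore it immediately if it is missing."""
        if name not in self.restored:
            self.restored[name] = self._catalog[name]
        return self.restored[name]

    def restore_next(self):
        """Background step: restore the highest-priority item still outstanding."""
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            if name not in self.restored:  # skip items a user already pulled in
                self.restored[name] = self._catalog[name]
                return name
        return None


catalog = {
    "orders.db": b"orders",
    "archive-2009.db": b"archive",
    "customers.db": b"customers",
}
priorities = {"orders.db": 0, "customers.db": 1, "archive-2009.db": 9}

restore = PrioritizedRestore(catalog, priorities)
print(restore.restore_next())       # orders.db
restore.request("archive-2009.db")  # a user request jumps the queue
print(restore.restore_next())       # customers.db
print(restore.restore_next())       # None
```

The low-priority archive would normally have been restored last, but the user request brought it in ahead of schedule, and the background task skips it afterward.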
Various vendors offer VM backup and restore services that use these approaches. You’ll find that some tools can get terabyte servers back online faster than others. With the combination of disk-based backups and a file system filter driver, all you need is a bit of extra management code at the backup server to accomplish these VM recovery tasks.
This was first published in February 2011