Prepare and prevent: Virtual recovery and backup strategies
Storage and virtual machine snapshot blunders can derail a virtual infrastructure. Normally, snapshots can undo problems that arise in a virtual environment. But, if you’re not careful, storage and virtual machine snapshots can cause issues of their own.
As long as you’re vigilant and know what to look for, these common storage and virtual machine snapshot pitfalls won’t be a challenge at all.
Virtual machine snapshot difficulties
The biggest danger at the virtualization level is that snapshot delta files grow over time. If left unchecked, they will eventually fill the data store where they are located.
To stay ahead of the growth, savvy virtualization admins use a large number of “health check” style scripts that run on a daily or weekly basis and build a report to list which VMs use snapshots, when they were engaged and how big they are.
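As a rough illustration of what such a health-check report might contain, here is a minimal Python sketch. It does not call any hypervisor API: the `SnapshotInfo` record, the sample inventory and the age and size thresholds are all hypothetical, and a real script would pull this data from whatever management tooling you use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SnapshotInfo:
    """Hypothetical record for one VM snapshot (not a real hypervisor API type)."""
    vm_name: str
    snapshot_name: str
    created: datetime
    size_gb: float


def build_report(snapshots, now, max_age_days=7, max_size_gb=20.0):
    """Return one row per snapshot, largest first, flagging old or large ones.

    The thresholds are illustrative assumptions, not vendor recommendations.
    """
    rows = []
    for snap in sorted(snapshots, key=lambda s: s.size_gb, reverse=True):
        age_days = (now - snap.created).days
        flagged = age_days > max_age_days or snap.size_gb > max_size_gb
        rows.append((snap.vm_name, snap.snapshot_name, age_days, snap.size_gb, flagged))
    return rows


def format_report(rows):
    """Render the rows as a plain-text table."""
    lines = [f"{'VM':<10}{'Snapshot':<18}{'Age (days)':>12}{'Size (GB)':>12}  Flag"]
    for vm, name, age, size, flagged in rows:
        mark = "CHECK" if flagged else ""
        lines.append(f"{vm:<10}{name:<18}{age:>12}{size:>12.1f}  {mark}")
    return "\n".join(lines)


# Sample inventory, made up for illustration only.
NOW = datetime(2024, 1, 15)
inventory = [
    SnapshotInfo("web01", "pre-patch", datetime(2024, 1, 14), 2.5),
    SnapshotInfo("db01", "before-upgrade", datetime(2023, 12, 1), 40.0),
]
report = build_report(inventory, NOW)
print(format_report(report))
```

Sorting by size puts the snapshots most likely to cause trouble at the top of the report, which is usually where an admin wants to look first.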
But what if a VM snapshot grows to 40 GB in size and needs to be deleted and merged into the virtual disks? Generating that much I/O on multiple machines at once could easily saturate the storage layer, leading to disk time-outs and potentially a loss of data. Also, there is a palpable hit on performance as virtualization-level snapshot delta files grow. How large that performance hit will be depends on a gazillion variables, best summed up in the phrase “it depends.”
VMware has supported per-VM snapshots for some years. It supports multiple snapshots on a single VM, which creates a parent-and-child relationship between them. The “go to” button allows the VMware admin to roll back the VM to a previous state. The “delete” button merges the snapshot contents into the virtual disk and then removes the snapshot on completion.
When you revert to a VM snapshot, all changes made since that snapshot will be lost unless you take another snapshot first. So there is the chance of data loss unless virtualization admins prepare for the possibility ahead of time.
There’s also an assumption that going back in time will fix all the VM's ills. That’s quite a big assumption because who knows exactly when the problem occurred? In the worst-case scenario, you might find that the VM is in a worse state after the “revert” than it was before. Snapshots have no understanding of the problem you are trying to undo—a situation best described with the phrase “garbage in, garbage out.”
The bottom line is that VM snapshots should be used sparingly and, in some cases, limited to test and development environments or as part of a virtual disk backup strategy.
Common storage snapshot issues
Storage-level snapshots can suffer from similar issues. But, because they have been around for much longer than VM snapshots, storage-level snapshots are much more functional and robust.
In general, storage vendors have enforced their policies and settings upon administrators to avoid worst-case scenarios. The downside of most snapshots from storage vendors is that they “snap” an entire volume or LUN and don’t have the same granularity as VM snapshots. That said, it probably won’t be long before storage vendors also offer some level of per-VM control in their systems.
So how much space should be reserved for snapshots? Most vendors do allow the reserve to be adjusted at any time, but most storage admins would like to set and forget this reservation and not have to revisit the issue. Different storage vendors appear to have different opinions on what the default snapshot reserve should be.
For example, when creating a new volume in NetApp’s System Manager product (see Figure 1), the default percentage reservation of space for snapshots is 20%. In contrast, the reserve for Dell’s EqualLogic storage product is 100%.
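To put those two defaults in perspective, the arithmetic can be sketched in a few lines of Python. The 500 GB volume size and the function name are assumptions made for illustration; the sketch only sizes the reserve as a percentage of the volume and does not model how each vendor accounts for it (for instance, whether the reserve is carved out of the volume or allocated alongside it).

```python
def snapshot_reserve_gb(volume_gb, reserve_percent):
    """Size of the snapshot reserve, expressed as a percentage of the volume size."""
    if volume_gb < 0 or reserve_percent < 0:
        raise ValueError("volume size and reserve percentage must be non-negative")
    return volume_gb * reserve_percent / 100.0


# Hypothetical 500 GB volume, using the two default percentages quoted above.
VOLUME_GB = 500
reserve_at_20 = snapshot_reserve_gb(VOLUME_GB, 20)    # 20% default
reserve_at_100 = snapshot_reserve_gb(VOLUME_GB, 100)  # 100% default

print(f"20% reserve:  {reserve_at_20:.0f} GB")
print(f"100% reserve: {reserve_at_100:.0f} GB")
```

On the same volume, the gap between the two defaults is a fivefold difference in space set aside for snapshots, which is why the choice deserves a deliberate decision rather than an accepted default.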
Support for thin provisioning
It’s worth noting that in both cases the vendors support thin provisioning of the volume if required. By default, both vendors fully allocate the volume and its snapshot reserve on disk, and then it’s up to the storage admin to decide whether to enable thin provisioning. Presumably, this is a protection for those who unthinkingly use thin-provisioned volumes without considering the consequences.
The NetApp product and the Dell product represent two different approaches to managing the space that snapshots consume. Reserving 100% of disk space for snapshots could be seen by some as pessimistic: the concern is that a volume could potentially have every block within it changed.
The impact of this can vary massively, depending on the block size that each storage vendor uses. Generally, the larger the block’s size, the easier it can be for the array to absorb the larger volume of expected churn.
Much depends on how you have laid out the virtual disks of your VM. If you exclude temporary files, such as the hypervisor swapfile and the guest operating system's own swap or page file, you can reduce churn significantly. By including those temporary files as part of the snapshot, you could see churn increase by an additional 10%. This sort of optimization can improve performance and reduce the amount you replicate to another site, but it does add a level of complexity.
So storage and virtual machine snapshots can fluctuate quite a bit. Even within storage-level snapshots, the default settings surrounding them vary. In fairness to vendors, it is difficult to set a good default because they have no idea what each admin’s change rate might be. To that end, the snapshot itself can be a good indication of the rate of “churn” and can assist admins in decisions about snapshot reservations.
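One way to turn observed snapshot sizes into a reservation figure is sketched below in Python. This is a simplified model under stated assumptions: it treats each daily snapshot's consumed space as that day's changed data, ignores blocks that are overwritten more than once, and applies an arbitrary headroom factor. The sample figures, the function names and the cap at 100% are all hypothetical.

```python
def daily_churn_percent(volume_gb, daily_delta_gb):
    """Average daily change as a percentage of the volume size."""
    if not daily_delta_gb:
        raise ValueError("need at least one observed snapshot size")
    average_delta = sum(daily_delta_gb) / len(daily_delta_gb)
    return 100.0 * average_delta / volume_gb


def suggested_reserve_percent(churn_percent, retention_days, headroom=1.5):
    """Reserve needed to hold `retention_days` of churn, with safety headroom.

    Capped at 100% here, matching the "every block changed" worst case;
    the headroom factor is an assumption, not a vendor recommendation.
    """
    return min(100.0, churn_percent * retention_days * headroom)


# Made-up observations: space consumed by four daily snapshots on a 1,000 GB volume.
VOLUME_GB = 1000
observed_deltas_gb = [20, 30, 25, 25]

churn = daily_churn_percent(VOLUME_GB, observed_deltas_gb)
reserve = suggested_reserve_percent(churn, retention_days=7)

print(f"Average daily churn: {churn:.2f}% of the volume")
print(f"Suggested reserve for 7 days of snapshots: {reserve:.2f}%")
```

The longer you observe before settling on a number, the less a single unusually busy day will skew the estimate.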
The important thing is that although storage and virtual machine snapshots can be incredibly useful, they aren’t particularly granular. They are excellent if you have large amounts of data that you need to roll back, but they are a bit of a blunt instrument when used for more conventional backup purposes such as restoring a 1 KB file that was deleted by a user. So relying on storage and virtual machine snapshots alone for recovery purposes can be a mistake.