Although backing up VMs is usually a straightforward and reliable process, things sometimes go wrong. Here are some of the most common problems that can occur with VM backups.
The backup target unexpectedly runs out of space
It's easy to scoff at the idea of a backup target running out of space. After all, a good backup administrator meticulously monitors available storage to ensure this problem doesn't happen. Even so, backup targets can unexpectedly run out of space.
As with any other type of backup, when a VM backup runs out of space, it's usually the result of unexpected data growth. This problem tends to be somewhat more prevalent in virtualized environments because VMs are highly dynamic. An administrator might, for example, live-migrate several VMs to a different host, thereby changing the volume of data that resides on the host. Likewise, an admin might create a collection of new VMs without stopping to think about the effect the new VMs will have on the backup.
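One simple way to stay ahead of unexpected data growth is to project how long the backup target's remaining capacity will last at the current growth rate. The sketch below is a hypothetical helper, not part of any backup product, and the figures in the example are made up:

```python
def days_of_headroom(free_bytes, daily_growth_bytes):
    """Estimate how many days remain before the backup target fills,
    assuming data keeps growing at the recent average daily rate."""
    if daily_growth_bytes <= 0:
        return float("inf")  # flat or shrinking: no projected exhaustion
    return free_bytes / daily_growth_bytes

# Example: 2 TB free on the target, backups growing by 150 GB per day
TB, GB = 1024**4, 1024**3
remaining = days_of_headroom(2 * TB, 150 * GB)
print(f"{remaining:.1f} days until the target is full")  # roughly 13.7 days
```

Running a projection like this after every backup job makes a sudden change -- say, a batch of newly migrated VMs -- show up as a sharp drop in headroom rather than a surprise failure.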
Low disk space within a host or VM
While it's fairly obvious why running low on disk space on a backup target would cause problems when backing up VMs, dwindling storage space on a virtualization host or even within a VM can also cause problems for backups. There are a few different reasons for this, but the main one has to do with the way the Volume Shadow Copy Service (VSS) works.
By default, VSS has 120 seconds to prepare a shadow copy. If the process doesn't complete within the designated time, VSS generates a flush writes timeout error (0x80042313). If a system is running low on available disk space, the NTFS file system can run much more slowly, and solid-state drives also slow down as they begin to fill up. These factors can cause VSS to time out before a shadow copy can be created -- never mind that a disk that's nearly full might lack the capacity required by the VSS buffer.
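Before a backup window opens, it can be worth confirming that the volumes involved have some breathing room. A minimal sketch follows, assuming a 10% free-space threshold -- a rule of thumb for illustration, not a documented VSS requirement:

```python
import os
import shutil

def volume_is_snapshot_ready(path, min_free_fraction=0.10):
    """Return True if the volume holding `path` has at least
    `min_free_fraction` of its capacity free.  Nearly full volumes
    slow NTFS and SSDs down and can starve VSS of buffer space."""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) >= min_free_fraction

# Check the system volume before the backup job starts
root = "C:\\" if os.name == "nt" else "/"
if not volume_is_snapshot_ready(root):
    print("Warning: low free space -- VSS snapshots may time out")
```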
Low memory within a VM
The Windows OS is designed to use a pagefile to overcome physical memory shortages. If memory begins to run low, then memory pages are swapped between memory and disk. Excessive swapping leads to a condition called thrashing, in which the pagefile is being bombarded with swap requests, and the machine runs very slowly as a result. Incidentally, this condition isn't unique to VMs. It can happen to physical machines, as well.
As previously noted, VSS has a limited amount of time within which to create a snapshot. Excessive virtual memory paging can slow the VM to the point that the VSS snapshot process isn't able to complete within the allotted time. It's worth noting that excessive swapping isn't the only potential cause of this condition. Any excessive I/O load could potentially cause problems for VSS.
Complex interdependencies between VMs
When backing up or restoring VMs, problems can also occur as a result of the complex interdependencies that so often exist in a virtualized environment. A multi-tier application might depend on Active Directory, the domain name system (DNS) and various database servers. The only way to adequately protect such an application is to back up all of the dependency resources.
What can make this tricky is that dependency resources can be scattered among hosts, and they might even reside in the public cloud or within nested virtualized environments that the backup can't see. This is why recovery testing is so important. Testing is the only way to know for sure whether a complex application is properly protected.
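To make the restore order explicit, the dependency graph can be topologically sorted so that each service comes back online only after everything it depends on. The service names below are illustrative; Python's standard-library graphlib handles the ordering:

```python
from graphlib import TopologicalSorter

# Illustrative dependency map: each service lists what it depends on.
dependencies = {
    "web-frontend":     {"app-server"},
    "app-server":       {"sql-database", "active-directory"},
    "sql-database":     {"active-directory"},
    "active-directory": {"dns"},
    "dns":              set(),
}

# static_order() yields dependencies before dependents, which is the
# order the tiers must be restored in for the application to come up.
restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)
```

The exact ordering of independent services can vary, but DNS will always precede Active Directory, which precedes the database, and so on up the stack -- which is exactly the property a restore plan needs.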
VMs are omitted from the backup
Another common problem with backing up VMs is that, sometimes, not all of the VMs residing on a particular host are backed up. This problem might occur if the organization performs guest-level backups and neglects to add newly created VMs to a backup job. It can also occur if you perform host-level backups, but the backup software isn't configured to automatically detect and protect new VMs.
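A simple inventory comparison can catch this gap: anything the hypervisor reports that the backup job doesn't list is unprotected. The VM names below are hypothetical; in practice the two lists would come from the hypervisor's API and the backup software's job configuration:

```python
def unprotected_vms(host_inventory, backup_job_vms):
    """Return VMs present on the host but missing from the backup job."""
    return sorted(set(host_inventory) - set(backup_job_vms))

# Hypothetical inventories for illustration
on_host = ["dc01", "sql01", "web01", "web02"]
in_job  = ["dc01", "sql01", "web01"]
print(unprotected_vms(on_host, in_job))  # ['web02']
```

Scheduling a check like this to run after every backup turns a silently omitted VM into an actionable alert.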
VMs fail during backup
Although I've never seen this problem happen, I've heard stories of Hyper-V VMs failing during backup on high-density hosts. This happens because some Hyper-V VMs must be briefly placed into a saved state in order to be backed up. This is especially true for VMs running an OS that doesn't support Hyper-V Integration Services.
If a neighboring VM is configured to use dynamic memory, then it's theoretically possible for that neighbor to claim some of the suspended VM's memory while it sits in a saved state. When the VM is taken out of the saved state, there might not be enough memory available for it to start.
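The arithmetic behind this failure mode is simple to sketch. The numbers below are hypothetical; real figures would come from the Hyper-V host's memory counters:

```python
def can_resume(saved_state_mb, host_free_mb, neighbor_growth_mb):
    """Roughly check whether a saved-state VM can resume after
    dynamic-memory neighbors have grown by `neighbor_growth_mb`."""
    return host_free_mb - neighbor_growth_mb >= saved_state_mb

# An 8 GB VM is placed in a saved state for backup.  The host had
# 10 GB free, but a dynamic-memory neighbor ballooned by 4 GB
# during the backup window, leaving only 6 GB for the resume.
print(can_resume(8192, 10240, 4096))  # False -- the VM cannot restart
```

This is why some administrators reserve a memory buffer on high-density hosts, so that dynamic-memory growth during a backup window can't consume the headroom a suspended VM needs to resume.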