Simple rules to avoid VM snapshot problems

You can avoid a lot of trouble if you simply use virtual machine snapshots as they're intended. A VM snapshot is not a backup.

In the computer world we have backups and undos for applications and servers. However, a snapshot is neither of those. Many people confuse the purpose of snapshots and get themselves into a lot of trouble with them. To understand what a snapshot is, we have to define what it isn't.

Snapshots are not backups, period. There's no debate here. A backup is a copy kept in long-term storage that you can restore from in case of data loss or file corruption. An undo function is something familiar to many who have used Office applications. This feature allows you to step backwards through what you have done and pick the exact spot where you would like to be. But a VM snapshot is not an undo function. So, now that we know what a snapshot isn't, what, exactly, is it?

A snapshot is a moment in time for a virtual machine. Snapshots are not normally automated events; they are manually created to mark a moment in time that can be jumped back to. When this "moment in time" is created, the VM continues to run and the administrator can continue with the task at hand, such as a software upgrade. If the upgrade fails, the administrator can revert the VM to the moment in time before the upgrade. This sounds very similar to an undo feature, except that the only point you can go back to is the one you created. There are typically two types of snapshots available:

Traditional snapshot. The traditional VM snapshot captures a moment in time but does not capture what is currently in memory. This type of snapshot usually executes quickly, as it does not contain active items. The running VM is not suspended or paused during the snapshot creation. If the administrator chooses to revert the VM back to the snapshot, the VM comes back powered off and has to boot up again, because no memory state was saved.

Snapshot with memory. This type of snapshot captures the active memory of the VM at the time of the snapshot, as well as what is going on with other activities (disk, I/O, networking). This snapshot takes longer to execute, but it comes with an added bonus: If you revert back to this snapshot, the VM will go back to the exact moment when it was created. If the machine was running and performing tasks when the snapshot was created, it will pick up where it left off at the exact point where the snapshot was created.

What happens during a VM snapshot?

Both types of snapshots can be created while the VM is running, but only a traditional (static-state) snapshot can be created if the VM is powered off. The two types have different benefits and downsides. To understand those, it's key to understand what happens when you take a snapshot. The snapshot process does not create a clone or copy of the existing VM. Rather, it freezes the existing VMDK (VMware virtual disk file) and creates a new file called a delta change file. Any changes that occur to the VM, regardless of whether it was a traditional snapshot or a snapshot with memory, are now recorded in the delta change file rather than in the VMDK file (which is kept frozen in its current state).

Freezing the VMDK file is what allows the administrator to revert to the snapshot they created: all changes in the delta change file are thrown out in favor of the frozen VMDK base file. If the administrator decides the upgrade has worked well, they can remove the snapshot point. During this process, the changes in the delta change file are applied to the frozen VMDK file in the background, "catching it up" with the current active state of the VM. Once the VM's VMDK file is caught up, the delta change file is removed and the VM continues to move on.
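The freeze, revert and remove steps described above can be illustrated with a toy model. This is only a sketch: the ToyDisk class and its method names are hypothetical, a disk is reduced to a dictionary of blocks, and nothing here reflects VMware's actual on-disk format. It also assumes the snapshot point stays in place after a revert.

```python
class ToyDisk:
    """Toy model of a base disk plus one delta change file (illustrative only)."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # stands in for the base VMDK
        self.delta = None         # stands in for the delta change file

    def take_snapshot(self):
        if self.delta is not None:
            raise RuntimeError("this toy model supports one snapshot at a time")
        self.delta = {}           # from here on the base is treated as frozen

    def write(self, block, data):
        # With a snapshot present, writes land in the delta, not the base.
        target = self.base if self.delta is None else self.delta
        target[block] = data

    def read(self, block):
        # The delta wins if it holds a newer copy of the block.
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base.get(block)

    def revert(self):
        # Throw away every change recorded since the snapshot.
        if self.delta is None:
            raise RuntimeError("no snapshot to revert to")
        self.delta = {}

    def delete_snapshot(self):
        # Apply the delta to the base ("catching it up"), then drop the delta.
        if self.delta is None:
            raise RuntimeError("no snapshot to delete")
        self.base.update(self.delta)
        self.delta = None


disk = ToyDisk({0: "v1 binaries", 1: "config"})
disk.take_snapshot()
disk.write(0, "v2 binaries")   # the upgrade writes to the delta only
disk.revert()                  # upgrade failed: discard the delta
assert disk.read(0) == "v1 binaries"
```

Note how delete_snapshot does work proportional to the size of the delta, while revert simply discards it. That asymmetry is the reason large snapshots are slow to remove.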

When things go wrong

Snapshots are a very powerful feature that can go very wrong if used incorrectly. Now that we know what they are and how they work, let's look at a few things you shouldn't do with them.

The never-ending VM snapshot. Snapshots are moments in time that are meant to be temporary. Now, the definition of temporary varies from person to person, and that is where the problem lies. The delta change file records all of the changes to the VM: If it's a busy VM, the delta file can grow to excessive sizes. The delta change file in a database server, for example, can grow to 60 GB in less than eight hours. If your data store has capacity, you might be OK, but what if you have an Active Directory server that had a snapshot taken a year ago (that someone forgot about) that has grown to 600 GB? The problem with these growing snapshots is that they can fill the data store until it runs out of space, which typically results in corruption of the VM.
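To put the database example in perspective, a rough back-of-the-envelope calculation helps. The sketch below assumes the delta file grows at a constant rate, which real workloads rarely do, and the function name and the 200 GB of free space are made up for illustration.

```python
def hours_until_full(free_gb, delta_growth_gb_per_hour):
    """Estimate hours until a data store fills, assuming linear delta growth."""
    if delta_growth_gb_per_hour <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / delta_growth_gb_per_hour


# The article's example: 60 GB of delta growth in eight hours.
rate = 60 / 8                      # 7.5 GB per hour
# Hypothetical data store with 200 GB free.
print(round(hours_until_full(200, rate), 1))  # 26.7
```

At that rate, a forgotten snapshot on a busy server would exhaust the hypothetical 200 GB in a little over a day.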

Removing a large VM snapshot. Deleting a large or long-lived snapshot is a process in which all of the changes recorded in the delta change file have to be applied to the base VMDK file. For smaller delta change files, this process is relatively painless, but with large snapshots, it can cause the running VM to pause or hiccup for seconds to minutes as the changes are applied to the base VMDK.

Nested snapshots. A benefit and curse of snapshots is the ability to nest them one after the other. This gives the administrator the chance to jump back to multiple points in time for the same VM – creating an experience more similar to an undo function. The problem is, the more snapshots you have, the more delta change files you have, and the greater your chance for corruption.
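A simplified way to picture nesting is as a chain of delta layers stacked on the base disk. The sketch below is an illustration under that assumption, not a description of how any hypervisor implements it; read_block and revert_to are hypothetical helper names, and deltas are plain dictionaries ordered oldest to newest.

```python
def read_block(block, base, deltas):
    """Return the current content of a block by walking the chain newest-first."""
    for delta in reversed(deltas):
        if block in delta:
            return delta[block]
    return base.get(block)


def revert_to(deltas, index):
    """Go back to snapshot number `index` (0 is the first snapshot taken).

    Deltas recorded before that snapshot are kept, everything from that
    snapshot onward is discarded, and a fresh empty delta takes new writes.
    """
    if not 0 <= index < len(deltas):
        raise IndexError("no such snapshot")
    return deltas[:index] + [{}]


base = {0: "a", 1: "b"}
# Three nested snapshots, each followed by one change.
deltas = [{0: "a1"}, {1: "b2"}, {0: "a3"}]

print(read_block(0, base, deltas))   # a3
deltas = revert_to(deltas, 1)        # back to the second snapshot
print(read_block(0, base, deltas))   # a1
print(read_block(1, base, deltas))   # b
```

Every layer in the chain must be intact for a read to return the right answer, which is one way to see why more nested snapshots mean more places for corruption to creep in.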

Snapshots are a wonderful tool that can be a lifesaver when you're performing upgrades or software maintenance. In a virtual environment, they give you something close to a manual undo that can rescue a bad upgrade. The key to using snapshots successfully is to not abuse them. Here are a few key guidelines to remember to keep your snapshots in check:

  • Specify a defined window on how long they can exist. A four- or six-hour window should be plenty for most upgrades, and it sets expectations.
  • Ensure you leave 50 GB to 75 GB of drive space in your LUN for the growth of the delta files.
  • Run daily reports on which snapshots are currently active.
  • Create alarms to alert you of snapshots growing to excessive sizes.
  • Be very cautious when you use snapshots with Active Directory controllers. When you revert to a snapshot, you take the VM back in time until its clock updates. Active Directory controllers with Kerberos do not handle this time jump well.
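The reporting and alarm guidelines above could be automated along the following lines. This is a minimal sketch: the SnapshotRecord type and flag_snapshots function are hypothetical, the records would have to come from whatever inventory tooling you use, and the six-hour and 50 GB thresholds are illustrative defaults rather than recommendations from any vendor.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    """Hypothetical record describing one active snapshot."""
    vm_name: str
    created: datetime
    delta_size_gb: float


def flag_snapshots(records, now, max_age=timedelta(hours=6), max_size_gb=50.0):
    """Return (vm_name, reasons) for snapshots that break either limit."""
    flagged = []
    for rec in records:
        reasons = []
        if now - rec.created > max_age:
            reasons.append("older than window")
        if rec.delta_size_gb > max_size_gb:
            reasons.append("delta too large")
        if reasons:
            flagged.append((rec.vm_name, reasons))
    return flagged


now = datetime(2024, 1, 1, 12, 0)
records = [
    SnapshotRecord("db01", datetime(2024, 1, 1, 2, 0), 60.0),   # 10 hours old
    SnapshotRecord("web01", datetime(2024, 1, 1, 10, 0), 1.0),  # 2 hours old
]
for name, reasons in flag_snapshots(records, now):
    print(name, reasons)   # db01 ['older than window', 'delta too large']
```

Running a check like this on a schedule covers both the daily report and the size alarm with one pass over the snapshot list.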

If you follow these guidelines and always remember that snapshots are temporary, they can be a tremendous benefit to your organization.
