Once you've determined that you are dealing with VM sprawl, you can formulate a plan to eliminate it. While the initial response would be to delete unnecessary VMs, even a sprawled VM once served a purpose and may still contain valuable data even if the resource itself is no longer in use. This presents the administrator with the challenge of removing the resource without deleting the associated data.
One of the first steps in VM sprawl management is to tag each sprawled VM so you know what you're working with. The previous installment of this two-part series addressed identification; if you do not label what you have found, you are likely to lose it again. For this process, VMware tagging is your best option. Tagging helps identify servers, and it can also provide a validation check. Tags don't need to contain only a name or owner. By adding a validation date, you can ensure that the server in question doesn't become a relic that no one has checked on in years. Tagging will also help you find machines that don't belong, although this is where things can get a bit tricky.
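The validation-date idea can be sketched in a few lines of Python. This is only an illustration: the `VmTag` record, the `stale_vms` helper and the one-year threshold are assumptions made for the example, not part of VMware's tagging feature, and the tag data would in practice come from an export of your own inventory.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class VmTag:
    """Hypothetical tag record: name, owner and last validation date."""
    vm_name: str
    owner: str
    validated_on: date  # last time someone confirmed the VM is still needed


def stale_vms(tags, today, max_age_days=365):
    """Return names of VMs whose validation date is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [t.vm_name for t in tags if t.validated_on < cutoff]


# Made-up sample data for illustration only.
tags = [
    VmTag("app-web-01", "alice", date(2023, 1, 10)),
    VmTag("test-db-07", "bob", date(2020, 6, 1)),
]
print(stale_vms(tags, today=date(2023, 6, 1)))
```

Any VM the check returns is a candidate for a conversation with its owner, not for automatic deletion.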
Methods for VM sprawl management
Sprawled VMs fall into two categories: ones that should be removed and ones that can't be removed. Let's start with the former. When you run into a VM that can be removed, there are a few steps you'll need to follow to ensure a successful removal. Always start by powering off the VM and then watch what happens. I'd recommend leaving it down for a minimum of 30 days before deleting it, though 90 days is ideal. This may seem like a lot of time, but it's better than having to rebuild or restore the VM a month later when someone decides they need it. Of course, the admin could always say "too bad" and remove it right away, but customer experience is critical, and the delete key is not customer friendly. With today's budget cuts and outsourcing, IT can't afford to be on the wrong end of bad service. Keeping a VM around for 90 days isn't really an issue for CPU and memory, as long as it is powered off and you can guarantee that it will not start up during a high availability event.
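The 30-day minimum and 90-day ideal can be expressed as a simple status check. The sketch below assumes you record the date each VM was powered off; the function name and the three status labels are made up for the example.

```python
from datetime import date

MIN_HOLD_DAYS = 30    # minimum time powered off before deletion is considered
IDEAL_HOLD_DAYS = 90  # preferred time powered off before deletion


def removal_status(powered_off_on, today):
    """Classify a powered-off VM by how long it has been down."""
    days_off = (today - powered_off_on).days
    if days_off < MIN_HOLD_DAYS:
        return "hold"      # too early; keep waiting for complaints
    if days_off < IDEAL_HOLD_DAYS:
        return "eligible"  # past the minimum, short of the ideal
    return "delete"        # past the ideal window


print(removal_status(date(2024, 1, 1), date(2024, 1, 15)))  # 14 days off
print(removal_status(date(2024, 1, 1), date(2024, 2, 15)))  # 45 days off
print(removal_status(date(2024, 1, 1), date(2024, 4, 30)))  # 120 days off
```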
The issue is the storage: A powered-off VM is dead weight on your expensive shared storage. An easy fix is to migrate the VM to a lower storage tier. If you want to preserve space on that lower tier, you can always use a local drive on the host instead. Local server drives are fairly cheap and can be found in abundance in most IT shops. It's not an ideal fix, as you have limited redundancy, but the data we are targeting is slated to be removed anyway.
The more complex challenge in VM sprawl management is a VM that has become a necessary resource and can no longer be given up. This is more than a simple technical issue; it is a political one, and a problem many people are not familiar with. This doesn't mean IT folks have to understand the whole political landscape, but they do have to know how to navigate it well enough to make it to the other side. One of the reasons people bypass the normal procedures for requesting machines is to avoid paperwork. If the VM in question is truly necessary, the paperwork should still be completed. If the pain of paperwork doesn't go away when people circumvent the rules, then what is the benefit? While this helps to ensure compliance, it acts more as a deterrent against skirting the rules in the future.
Once the correct paperwork has been completed, the VM has to be evaluated to confirm it was properly sized. While this would normally be done when the VM was requested, it now has to be done after the fact and, as a result, the process changes. Originally, the requester could specify resources, so long as they were reasonable. Now, the admin has to evaluate how the machine is using its resources to make sure it actually needs what it has been assigned. If the VM is not using its resources, they should be lowered to a more appropriate level. This process change isn't mandatory, though it is critical to VM sprawl control. Very few things will scare an application owner more than the words "reducing resources." This should be a fair but strong enough deterrent for most application owners, as they will be worried you may look into the other machines they are running.
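As a rough illustration of the sizing check, the sketch below compares what a VM has been assigned with its observed peak usage. The 25% headroom and the function name are assumptions chosen for the example, and the peak figures would have to come from your own monitoring data.

```python
import math


def recommend_size(assigned, peak_used, headroom=0.25):
    """Suggest an allocation of peak usage plus headroom.

    The result is never above the current assignment and never below 1.
    Works for any whole-unit resource, such as vCPUs or GB of memory.
    """
    target = math.ceil(peak_used * (1 + headroom))
    return max(1, min(assigned, target))


# Made-up examples: (assigned units, observed peak usage)
print(recommend_size(8, 1.5))   # heavily over-provisioned
print(recommend_size(16, 14))   # already close to its assignment
```

A recommendation lower than the current assignment is a starting point for the conversation with the application owner, not a figure to apply blindly.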
Sprawled VMs have come along with the free-server mentality that many people now embrace. The infrastructure resources are normally taken care of by someone else, so for application owners, the sky is the limit. Bringing perception and expectations back to earth is a challenge, but if we don't do it, we will leave behind a trail of abandoned VMs that will become our anchor as we strive to be flexible, agile and cost effective.