Although there are many x86 virtualization platforms (VMware ESX, Xen and Microsoft Virtual Server 2005 R2, to name a few), monitoring is much the same from one to the next. In this article, I will discuss what is involved in monitoring a virtual infrastructure, including monitoring the physical host servers, the virtual machine monitors (VMMs) / hypervisors, the virtual machines (VMs) and the applications running inside the VMs. I will conclude by looking at how to understand the performance metrics being gathered.
Physical host servers
Monitoring the physical host servers in a virtual infrastructure is extremely important. Because a single physical host server can host tens of VMs, it must remain healthy. My intent is not to scare people away from virtualizing their infrastructures but to make everyone aware of how important it is not to forget about the physical hardware on which your virtual infrastructure resides.
The first place to look for server monitoring tools is the vendor. Dell offers tools such as OpenManage and IT Assistant, and HP offers its OpenView software. In many cases, the hardware vendors' monitoring solutions are the best choice for monitoring hardware, because these tools are obviously designed and supported by the same company that made the hardware.
But you'll also find no shortage of third-party solutions at your disposal. Both Dell and HP provide management packs that plug in to Microsoft Operations Manager (MOM). If money is an issue, check out Nagios. Nagios is an open source monitoring program for hosts, services and networks. One of the environments I work in uses Nagios, and I am quite pleased with the program's capabilities. Not only is Nagios free, but it gives many pay-for products a run for their money.
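To give a feel for how Nagios can be extended to watch a host server, here is a minimal sketch of a check written in Python. It relies on the standard Nagios plugin convention of one status line on standard output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function names, the choice of disk usage as the metric and the 80%/90% thresholds are my own illustrative assumptions, not anything Nagios prescribes.

```python
"""Illustrative Nagios-style check: disk usage on one mount point."""
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, status_line) pair."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    percent_used = 100.0 * usage.used / usage.total
    return evaluate(percent_used, warn, crit)


def main():
    """Print the status line and return the exit code for the caller to use."""
    code, message = check_disk("/")
    print(message)
    return code
```

To use this as an actual plugin, the script would end by passing the return value of `main()` to `sys.exit()`, so that Nagios sees the exit code.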
The process of monitoring physical hardware in a virtual infrastructure is nearly identical to that of monitoring the physical hardware in a traditional server infrastructure. But because of the tens of VMs that depend on them, maintaining the health of x86-based physical servers is more important than ever.
Virtual machine monitors / hypervisors
A lot of people ask me what the difference is between a VMM and a hypervisor. The answer is, "Nothing." A VMM does exactly what the name suggests; it monitors and manages virtual machines. The term "hypervisor" is a play on the name of another computing component, the kernel. When kernels were a new thing, they were known as "supervisors" because they supervised the machine; hence, the term hypervisor refers to a VMM that supervises many machines, albeit virtual.
Unlike the software that monitors the underlying hardware, the software that monitors the hypervisor depends on the type of hypervisor you are using. If you are using VMware ESX, you have several options. Just as with monitoring hardware, the best place to start looking for virtual monitoring solutions is the vendor. VMware includes a Web-based management/monitoring interface to ESX called the Management User Interface (MUI) that, in addition to managing ESX, can tell you the current utilization of the VMM.
The MUI has a very nice availability-reporting feature. From the console in ESX, you can enable another Web-based reporting tool called vmkusage. While the MUI requires the user to authenticate, vmkusage provides a read-only, anonymous view of the state of the ESX VMM. While you are logged into the console, you can also run a tool called esxtop. Esxtop is similar to the standard top command, but unlike the top command, esxtop will also show the real-time utilization of the different ESX worlds, including the VMM.
VMware also produces a separate management/monitoring solution called VirtualCenter. Although VirtualCenter does not provide any additional monitoring information, it does let you set up events and alarms that notify you when a resource falls below a lower limit or rises above an upper limit. Of the third-party ESX monitoring solutions, just one stands out: NetIQ AppManager for VMware.
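The idea behind limit-based alarms is simple enough to sketch in a few lines of Python. This is not VirtualCenter's API or data model; the `Alarm` class, the metric names and the limits are hypothetical and only illustrate the lower/upper threshold logic described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    """Hypothetical alarm: fires when a metric leaves the range [lower, upper]."""
    metric: str
    lower: Optional[float] = None  # fire if the reading drops below this
    upper: Optional[float] = None  # fire if the reading rises above this

    def check(self, value: float) -> Optional[str]:
        """Return a message if the reading violates a limit, else None."""
        if self.upper is not None and value > self.upper:
            return f"{self.metric} above upper limit: {value:.0f}% > {self.upper:.0f}%"
        if self.lower is not None and value < self.lower:
            return f"{self.metric} below lower limit: {value:.0f}% < {self.lower:.0f}%"
        return None


def evaluate(alarms, sample):
    """Return the messages for every alarm tripped by a dict of metric readings."""
    messages = []
    for alarm in alarms:
        if alarm.metric in sample:
            message = alarm.check(sample[alarm.metric])
            if message:
                messages.append(message)
    return messages


alarms = [Alarm("cpu", lower=20, upper=90), Alarm("memory", upper=85)]
print(evaluate(alarms, {"cpu": 95.0, "memory": 60.0}))
```

A lower limit may look odd for a monitoring alarm, but it is a handy way to flag chronically idle hosts, which matters for the utilization argument later in this article.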
All of the monitoring solutions for Microsoft's Virtual Server 2005 R2 VMM come from Microsoft. You can use the standard Windows event logs to monitor the VMM, an approach already used by many Windows systems administrators. Virtual Server 2005 R2 also installs Windows performance counters that can track the utilization of the VMM. If you do not want to develop a custom utilization monitor with Windows Management Instrumentation (WMI), Microsoft Operations Manager (MOM) already leverages the Virtual Server 2005 R2 performance counters to provide a robust monitoring solution.
A few open source Xen monitoring solutions are worth mentioning. Libvirt is an open source toolkit designed to interact with Xen and other open source virtualization platforms. Also, Argo the Xen Monitor is a framework for managing and monitoring Xen. Commercial Xen solutions provide their own monitoring tools. XenSource's XenEnterprise has a monitoring solution that provides a real-time view of the VMM's performance. VirtualIron's Xen package comes with a management and monitoring solution called VirtualizationManager.
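As a rough illustration of what working with libvirt looks like, the sketch below uses the libvirt Python bindings to list the guests on a host along with their vCPU count, memory and accumulated CPU time. It assumes the third-party `libvirt` Python package is installed and that the host is reachable at the URI `xen:///system`; both are assumptions on my part, and the helper names `summarize_domains` and `main` are hypothetical.

```python
def summarize_domains(conn):
    """Return one dict per domain visible on an open libvirt connection.

    `conn` is expected to behave like a libvirt.virConnect object.
    """
    rows = []
    for dom in conn.listAllDomains():
        # virDomain.info() returns [state, maxMem KiB, memory KiB, vCPUs, cpuTime ns].
        state, max_mem_kib, mem_kib, vcpus, cpu_time_ns = dom.info()
        rows.append({
            "name": dom.name(),
            "vcpus": vcpus,
            "memory_mib": mem_kib // 1024,
            "cpu_seconds": cpu_time_ns / 1e9,
        })
    return rows


def main(uri="xen:///system"):
    """Print a summary for one hypervisor; the default URI is an assumption."""
    import libvirt  # third-party bindings, imported here so the helper above has no hard dependency

    conn = libvirt.openReadOnly(uri)
    try:
        for row in summarize_domains(conn):
            print(row)
    finally:
        conn.close()
```

Calling `main()` on a machine with the bindings installed prints one line per guest. A read-only connection is enough for monitoring and avoids granting the script any control over the VMs.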
All current VMMs require some sort of host OS or privileged control OS. For VMware ESX and Xen, this is Linux, which means that the control OS can use native Linux monitoring tools to gauge the utilization and state of the VMM. A perfect example of the KISS methodology is the syslog daemon. You can configure the syslog process to copy its logs to a dedicated log server so that they are available in the event of a catastrophic failure. One of my favorite tools is a product called Splunk. The creators of Splunk had the simple yet ingenious idea that logs are more useful when they are compared with similar logs from around the world. The Unix/Linux system management tool monit can also be used to watch your VMM processes.
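Forwarding by the syslog daemon itself is set up in the daemon's own configuration file. As a complementary sketch, a home-grown monitoring script can send its messages straight to the same dedicated log server using Python's standard `logging.handlers.SysLogHandler`, which sends over UDP by default. The function name, the logger name `vmm-monitor` and the host name in the usage note are placeholders I made up for illustration.

```python
import logging
import logging.handlers


def build_remote_logger(host, port=514, name="vmm-monitor"):
    """Return a logger that forwards its records to a syslog server over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Prefix each message with the logger name so it is easy to filter on the log server.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With a log server at, say, `loghost.example.com`, a script would call `build_remote_logger("loghost.example.com")` once and then log as usual, for example `logger.warning("host cpu high")`. Because UDP is connectionless, messages are silently lost if the log server is down, so this complements local logging rather than replacing it.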
Think of the hypervisor as your brain. Your body (the VMs) can be perfectly healthy, but if your brain fails you, then your body does not know how to function. Even though hypervisors, like our brains, are designed to "just work," active monitoring is necessary to prevent possible total system shutdown.
Virtual machines
The VMs are analogous to your old servers -- they are running software to fulfill a business purpose. The fact that your servers are now virtual does not remove the need for adequate monitoring. Luckily, this is quite easy because the VMM monitoring solutions almost always have the capability to monitor the VMs. For a list of these solutions, please refer to the previous section.
Applications
Monitoring applications running inside VMs is no different from monitoring applications running on a physical server -- the same software can be used and it is as necessary as ever. I have met too many IT professionals who are under the mistaken impression that an application hosted virtually is not subject to traditional stress and rigor. Although ideas about application monitoring should stay the same, ideas about application and system utilization must change, and that is the subject of the next and final section.
Understanding the performance metrics
Suppose that, after combing through all of the data gathered by the different pieces of monitoring software, we see that at any given time the virtual infrastructure is running at only 37% utilization. The first response of many engineers and IT professionals is that this is a good thing; it means that the physical servers can grow to meet increased demand and handle the occasional resource spike. Unfortunately, although that way of thinking has suited most people quite well over the last decade, it no longer applies when dealing with a virtual infrastructure. The goal for a virtual infrastructure is to have around 80-85% utilization at all times.
I know my numbers seem high, almost ridiculous, but stop and think about it with me for a moment. One of the goals of implementing a virtual infrastructure is to consolidate the number of underutilized and overengineered physical servers in a given data center. Then why desire virtual servers that run at 20% utilization when the reason those virtual servers exist is first and foremost to help reduce costly underutilization? The answer is that it does not make sense; we have just been trained to think that it does. We must un-train ourselves away from this mindset and embrace rich utilization.
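To put a number on the consolidation argument, here is a back-of-the-envelope estimate in Python. It assumes that all servers have identical capacity and it ignores virtualization overhead, so treat the result as a rough lower bound rather than a sizing tool. The function name and the example figures are hypothetical.

```python
import math


def hosts_needed(server_count, avg_utilization_pct, target_utilization_pct=80):
    """Estimate how many equally sized hosts carry the same load at the target.

    Utilization values are whole percentages, e.g. 20 for 20%.
    """
    # Total load expressed in "percent of one server" units.
    total_load = server_count * avg_utilization_pct
    return math.ceil(total_load / target_utilization_pct)


# Ten physical servers idling at 20% each, consolidated toward 80% utilization.
print(hosts_needed(10, 20))
```

Ten servers at 20% represent the work of two fully loaded machines; spread across hosts running at 80%, that load needs 2.5 hosts, which rounds up to 3.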
If some of you are still not sure, that's okay. Let's look at it this way. Two reasons why people feel more comfortable with low system utilization (around 20-35%) are that the system will be able to handle spikes and that it will be able to scale as demand necessitates. With virtual machines, these problems go away. Spikes still exist, but depending on your virtualization platform, the hypervisor can detect that your VM needs more resources and allocate unused resources from other VMs to the VM in need. Contrast this with two physical Web servers that are running at 35% utilization. One of the Web servers could see a prolonged but finite increase in memory utilization. It would be nice if the spiking Web server could borrow some memory from the other Web server that is sitting happily at 35%. With physical hardware, this is not possible. Virtualization enables you to fully exploit your hardware, and that makes good business sense.
In addition, if your service is successful, it is likely to need more resources in the future. Allocating more resources to a service generally entails increasing physical resource capacity on a single server or re-architecting a service into a cluster or farm. Either scenario involves a significant amount of work -- purchasing new hardware, installing it, and possibly having to install a new system. The bulk of this time is saved with virtualization. You simply either allocate more resources to a single VM or clone the VM to begin work on a cluster or farm. So the need to overallocate resources to physical hardware in order to avoid costly upgrades is no longer a factor.
As you can see, in a virtual infrastructure, the argument for low system utilization falls apart. So what does this mean? Using the data gathered by the monitoring solutions above, it is possible to gauge your virtual infrastructure's overall utilization. If the performance metrics show an average utilization of 45%, you can still stand to increase load by 35 to 40 percentage points. But if the metrics show that the average utilization is between 80% and 85%, be happy: you are making the most of your hardware!
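The arithmetic above is easy to automate once your monitoring tools report an average utilization figure. The following Python sketch compares that figure with the 80-85% target band; the function names and the wording of the messages are hypothetical.

```python
TARGET_BAND = (80.0, 85.0)  # target utilization range from the text, in percent


def headroom(avg_utilization, band=TARGET_BAND):
    """Return (low, high) percentage points of load that can still be added.

    Negative values mean utilization is already above that edge of the band.
    """
    low, high = band
    return low - avg_utilization, high - avg_utilization


def assess(avg_utilization, band=TARGET_BAND):
    """Classify an average utilization figure against the target band."""
    low, high = band
    if avg_utilization < low:
        low_room, high_room = headroom(avg_utilization, band)
        return f"underutilized: room for {low_room:.0f}-{high_room:.0f} more points of load"
    if avg_utilization <= high:
        return "on target"
    return "over target: consider adding capacity"


print(assess(45.0))
```

For an average of 45%, this reports room for 35-40 more points of load, matching the figures above.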
In conclusion, it is important to adequately monitor your virtual infrastructure to guarantee its health and ensure that you are not losing money due to underutilization.
About the author:
Andrew Kutz is deeply embedded in the dark, dangerous world of virtualization. Andrew is an avid fan of .NET, open source, Terminal Services, coding and comics. He is a Microsoft Certified Application Developer (MCAD) and Microsoft Certified Solutions Developer (MCSD). Andrew graduated from the University of Texas at Austin with a BA in Ancient History and Classical Civilization and currently lives in Austin, Tex. with his wife Mandy and their two puppies, Lucy and CJ.