Private cloud performance monitoring is crucial, not only for troubleshooting problems, but also for ensuring that service levels are being met as services are centralized. To be successful, follow a few key practices.
Constantly collect data for private cloud performance monitoring
Private clouds are about process, automation, people and centralization. Some private clouds use virtualization as well as physical hosts, so no matter which tools you select for performance monitoring, you should gather data from all of your hosts. Collect data all the time, not just when you are consolidating, centralizing or troubleshooting.
Often, customers and monitoring systems don’t notice a problem when it starts. They only notice when it gets bad enough for users to complain. With historical data, you can see when a problem started. Perhaps that CPU load problem you were just notified about started a week ago with an antivirus scanner update. You would be able to see that easily in your historical data, and the people working on the problem could find it quickly, fix it and return to more productive work.
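The always-on, bounded collection described above can be sketched in a few lines. This is an illustrative example, not a specific tool's API: the metric reader, sampling interval and window size are all assumptions.

```python
import time
from collections import deque

def make_collector(read_metric, max_samples=86_400):
    """Return a collector that keeps a rolling window of samples.

    86,400 samples is roughly one day of history at one sample per
    second; deque(maxlen=...) discards the oldest sample automatically,
    so collection can run continuously without unbounded growth.
    """
    history = deque(maxlen=max_samples)

    def collect(timestamp=None):
        # Record (timestamp, value) each time the scheduler fires.
        history.append((timestamp if timestamp is not None else time.time(),
                        read_metric()))
        return history

    return collect

# Usage: wire collect() to a periodic timer. The history is then on hand
# when a user complains a week after the problem actually began.
collect = make_collector(lambda: 42)  # stand-in for a real CPU probe
```

Because the window is bounded, this runs all the time, not just during consolidation or troubleshooting, without growing without limit.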
Private cloud performance monitoring has additional nontechnical benefits. Services you want to centralize, such as a series of departmental Web servers, often don’t have much monitoring in place. When a server is “down” or “slow,” someone just walks over to it and reboots it. And that's the wrong thing to do.
If you promote a centralized service by saying that it is monitored for both availability and performance, you make it harder for departments to resist. After all, you’re doing it right, whereas they were not.
Transparency is important, too. Make private cloud performance data available to developers and application administrators so they can see the effects of their configuration choices. For a cloud based on virtualized infrastructure, such choices might be good for an application but bad for the environment in general. Everything in IT is a tradeoff, including performance. An app’s performance goals should be well-documented so they can be met but not exceeded. Exceeding these goals would require additional expenditures of money and time.
Choose relevant data points for private cloud performance monitoring
When implementing a private cloud performance monitoring system, gather data on as many relevant metrics as possible and from the right places. Don’t ask a guest OS in a virtual environment what the CPU load is—it won’t know the right answer. You can get that data accurately from the virtualization platform. The same is true for memory usage, network I/O, storage I/O and so on.
By contrast, application performance is best measured at the individual server level, which will help you see things like a cluster member about to be overloaded.
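To illustrate why per-server measurement matters (with made-up load numbers), a cluster-wide average can look healthy while one member is nearly saturated:

```python
def cluster_cpu_summary(per_member_load):
    """Report both the cluster average and the hottest member.

    The average alone hides a node that is about to be overloaded;
    only the per-member figures reveal it.
    """
    avg = sum(per_member_load) / len(per_member_load)
    return avg, max(per_member_load)

avg, hottest = cluster_cpu_summary([95.0, 20.0, 15.0])
# avg is about 43.3% and looks fine; hottest is 95.0%, the member in trouble
```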
In addition, gather data at the highest resolution you can afford. Many performance monitoring tools show historical data as five-, 15- and 60-minute averages, which smooths out the load peaks in the graphs. That smoothing is deceptive, because the spikes in load are very important.
When an application goes to do work, it doesn’t do it slowly. It uses all of the CPU it has available and gets that work done as fast as it can, which appears as a 100% CPU spike on a graph. The width of that spike is important because it often represents how slow an application feels to a user—in other words, the latency between a request and the result.
If the performance monitoring software averages those spikes with idle time, you might see the server as 50% loaded and arrive at the false conclusion that it has enough capacity. Network and storage connections work the same way. If the link is 100% busy for a minute, and 0% busy the next, the average is 50% busy, which may not seem like a problem. Digging deeper by using a higher-resolution graph is very useful in these situations. Of course, keeping a lot of data and collecting higher-resolution data also consumes CPU, memory, network and storage resources, so you want to strike a balance.
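The effect of averaging is easy to demonstrate with synthetic one-second samples (the numbers below are illustrative):

```python
# One minute of one-second CPU samples: a 30-second burst at 100%,
# followed by 30 seconds of idle time.
samples = [100] * 30 + [0] * 30

minute_average = sum(samples) / len(samples)   # 50.0 -- looks comfortable
peak = max(samples)                            # 100  -- what users feel
spike_width_s = sum(1 for s in samples if s >= 90)  # 30 s of saturation
```

The one-minute average of 50% suggests spare capacity; the high-resolution data shows 30 seconds of full saturation, which is what the user experiences as latency.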
This was first published in July 2012