Appropriately sizing virtual machines can be a difficult process with many unknowns. Allocating too few resources can starve a VM and lead to poor performance. Administrators wary of this potential problem may take the safer approach and allocate more resources than a VM needs. However, this overprovisioning wastes resources that other VMs could use.
Capacity planning tools can help organizations identify consolidation opportunities, allowing them to scale back overprovisioned VMs and save money. In this interview, we talk with Andrew Hillier, the CTO and co-founder of CiRBA, about how big of a problem overprovisioning is, how to minimize risks associated with higher consolidation and how IT pros can convince application owners to do more with less.
Why is overprovisioning a problem and what's behind it?
Andrew Hillier: Overprovisioning is a huge and very pervasive problem, and I think it's because it is one of the only ways people have to manage risk in their IT environment. If you have an unknown -- you don't know what your application is going to do or you don't know exactly what you'll need -- overprovisioning is the traditional way to go about it. In virtual and cloud environments, it just keeps on propagating. In virtual environments, if you have a performance problem, you can just throw more hardware at it, and that becomes the default workaround rather than digging deeper. In clouds, people buy cloud instances because they don't know what they need. Sometimes it's the most prudent way to go for someone, but we're getting to the point that this isn't something we should tolerate. There are ways to fix it that don't cost a whole lot of money. In the past, maybe it was necessary, but now it's not.
Isn't it safer to overprovision than risk slow performance or an outage?
Hillier: We like to use an analogy to a game of Tetris. Workloads come in different shapes and sizes, and when you add them together, it starts to jumble up to the point where servers look like they're full. But when you play Tetris more cleverly and move those blocks around, you can defrag capacity and get a lot more out of it. Sometimes people are doing all the right things with the tools they have at their disposal, but they can't fight this because they don't have anything that can help them play Tetris better. I wouldn't characterize overprovisioning as people doing anything wrong; it's just that they don't have the analytics at their disposal to fix it. So, we see a lot of people buying more hardware before they really need to. If you analyze things more carefully, you can go farther with what you have and not increase risk, just by sorting things out so they don't collide.
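To make the Tetris analogy concrete, here is a minimal Python sketch of one way workloads with different "shapes" could be fitted together. It is not CiRBA's method. It assumes each workload is a hypothetical list of CPU demand per time slot, that every host has the same capacity, and it uses a simple first-fit-decreasing heuristic. The names `fits` and `place` and all the numbers are illustrative.

```python
from typing import List


def fits(host_load: List[int], workload: List[int], capacity: int) -> bool:
    """True if adding the workload keeps every time slot within capacity."""
    return all(h + w <= capacity for h, w in zip(host_load, workload))


def place(workloads: List[List[int]], capacity: int) -> List[List[int]]:
    """Assign workloads to hosts with a first-fit-decreasing heuristic.

    Workloads are considered from the highest peak to the lowest, and each
    one goes on the first host where it does not collide with what is
    already there. Returns, per host, the indexes of its workloads.
    """
    order = sorted(range(len(workloads)),
                   key=lambda i: max(workloads[i]), reverse=True)
    host_loads: List[List[int]] = []
    host_members: List[List[int]] = []
    for i in order:
        demand = workloads[i]
        if max(demand) > capacity:
            raise ValueError(f"workload {i} exceeds the capacity of one host")
        for load, members in zip(host_loads, host_members):
            if fits(load, demand, capacity):
                for slot, value in enumerate(demand):
                    load[slot] += value
                members.append(i)
                break
        else:
            # No existing host has room in every time slot: open a new one.
            host_loads.append(list(demand))
            host_members.append([i])
    return host_members


# Hypothetical demand in two time slots (day, night), host capacity of 10.
profiles = [[8, 2], [2, 8], [7, 3], [3, 7]]

# Sizing each workload by its peak alone treats it as busy all the time.
peak_only = [[max(p)] * len(p) for p in profiles]

print(len(place(peak_only, 10)))  # hosts needed when sizing by peak
print(place(profiles, 10))        # placement when shapes are considered
```

In this made-up data set, day-heavy and night-heavy workloads share a host because their peaks never coincide, so the shape-aware placement needs fewer hosts than sizing by peak alone.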
How big of a problem is overprovisioning?
Hillier: We see a lot of that firsthand. If you look at the density of a virtual environment, most organizations stall right around two-thirds full. That's when you start seeing workloads conflicting with each other. In studies, we've seen you can increase workload density on average by 48%. So, if I have an environment that's running a bunch of VMs, I can put almost half as much in there again, if I'm smart about it.
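The 48% figure can be read in two ways: more VMs on the same hosts, or the same VMs on fewer hosts. The short Python sketch below works through both. It assumes the average gain applies uniformly across an environment, which a real environment may not match. The function names and the example sizes are hypothetical.

```python
import math

DENSITY_GAIN = 0.48  # average increase in workload density quoted above


def extra_vms(current_vms: int, gain: float = DENSITY_GAIN) -> int:
    """Additional VMs the same hosts could hold after the density gain."""
    return round(current_vms * gain)


def hosts_needed(current_hosts: int, gain: float = DENSITY_GAIN) -> int:
    """Hosts required to run the same VMs at the higher density."""
    return math.ceil(current_hosts / (1 + gain))


# Hypothetical environment: 100 VMs spread across 10 hosts.
print(extra_vms(100))      # VMs that could be added to the same 10 hosts
print(hosts_needed(10))    # hosts needed to run the original 100 VMs
print(hosts_needed(1000))  # the same calculation at a larger scale
```

Under that assumption, a 48% density gain means roughly one-third fewer hosts for the same workload.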
How much are companies leaving on the table, and what kind of incentive is there to pull back?
Hillier: If I talk to someone who's responsible for 10 servers, and I say, 'Did you know you can run everything you're doing on six or seven of those servers,' they might yawn. To them, having three extra servers around isn't a big deal. They may be more concerned about avoiding operational issues or making sure their pager doesn't go off in the middle of the night.
If I talk to someone who has 1,000 servers and I tell them they can run on 700 instead, that's quite a bit of money. If I go to a CIO and say, 'You have 10 data centers and you could fit everything in seven of them,' that's a huge difference and a huge cost savings. If you look at an entire enterprise through this lens, there's an astounding amount of savings that can be had. It can get to hundreds of millions of dollars. It can also cut down software costs. And it's not just about hardware. When you have that many extra instances, usually they're running multiple licenses.
Limiting overprovisioning will get you higher levels of consolidation, but consolidation comes with challenges. How do you balance the cost of adding hardware with the risk of a hardware failure potentially taking out more workloads in a consolidated environment?
Hillier: It all comes down to how you define overprovisioning. If I'm running a critical production environment, I might want two servers totally empty for failover purposes. I might want a bunch of capacity sitting idle for disaster recovery purposes. I might not want my servers going above half capacity for safety reasons. The way you approach that is to define your operational parameters, including safety margins and dependencies, and that defines when capacity is full -- not whether CPU use is at 100%. It really comes down to properly capturing operational policies, which means defining what spare capacity you want to have. Then, everything beyond that is a waste.
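As a rough illustration of policy-defined capacity, the Python sketch below treats an environment as "full" once demand reaches what the policy allows, not when CPU hits 100%. It models only two of the parameters mentioned, spare failover hosts and a utilization ceiling. The `CapacityPolicy` class, its fields and the example numbers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CapacityPolicy:
    """Hypothetical operational policy for one cluster."""
    spare_hosts: int            # hosts kept totally empty for failover
    utilization_ceiling: float  # e.g. 0.5 means hosts stay at or below 50%


def usable_capacity(hosts: int, host_capacity: float,
                    policy: CapacityPolicy) -> float:
    """Capacity available to workloads once the policy is honored."""
    active_hosts = max(hosts - policy.spare_hosts, 0)
    return active_hosts * host_capacity * policy.utilization_ceiling


def excess_capacity(hosts: int, host_capacity: float,
                    policy: CapacityPolicy, demand: float) -> float:
    """Capacity beyond both demand and the policy's reserves."""
    return max(usable_capacity(hosts, host_capacity, policy) - demand, 0.0)


# Hypothetical cluster: 10 hosts of 100 capacity units each, two hosts
# kept empty for failover, and no host allowed above half capacity.
policy = CapacityPolicy(spare_hosts=2, utilization_ceiling=0.5)

print(usable_capacity(10, 100, policy))               # policy-defined "full"
print(excess_capacity(10, 100, policy, demand=250))   # capacity beyond policy
```

In this example the raw capacity is 1,000 units, but the policy defines full as 400. The failover hosts and the headroom below the ceiling are deliberate reserves, and only the remainder above current demand counts as waste.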
How do you address the challenge of trying to convince application owners that they can live with fewer resources?
Hillier: That's a great question. What we find is there's a huge appetite up the stack for this type of visibility. But, it really comes down to who's paying for it. If you have a line of business that is running applications on central IT infrastructure and they aren't paying with some type of chargeback model, they might be hard pressed to give up some of those resources because they're not paying for them. If IT is footing the bill, they care about the density. If you're a cloud or chargeback customer, you care about what you're paying. So, it's a discussion that would go differently depending on who's footing the bill.
We see organizations where IT is footing the bill and still getting lines of business to tighten things up a bit. The way they do that is to address new deployments. If I'm IT, when you ask for new capacity, I'm not going to give it to you if you're wasting the capacity you have. Of course, it's not always quite that simple, but that's the type of leverage IT has.