In the aftermath of the infamous bug in the latest release of VMware ESX, VMware CEO Paul Maritz has released a letter that apologizes for the incident, explains what went wrong, and describes how the company is committed to ensuring it never happens again.
For customers who were affected by the widespread problem with ESX 3.5 Update 2, released several weeks ago, are VMware’s apology and promise to improve its processes enough? Or will it leave lingering doubt in the minds of some that may inspire them to look at other virtualization products?
The letter provided an explanation of what happened:
The issue was caused by a piece of code that was mistakenly left enabled for the final release of Update 2. This piece of code was left over from the pre-release versions of Update 2 and was designed to ensure that customers are running on the supported generally available version of Update 2.
And why it happened:
I am sure you’re wondering how this could happen. We failed in two areas:
- Not disabling the code in the final release of Update 2; and
- Not catching it in our quality assurance process.
And finally what they will do to ensure it never happens again:
We are doing everything in our power to make sure this doesn’t happen again. VMware prides itself on the quality and reliability of our products, and this incident has prompted a thorough self-examination of how we create and deliver products to our customers. We have kicked off a comprehensive, in-depth review of our QA and release processes, and will quickly make the needed changes.
Despite it all, VMware still has a great enterprise product that is robust and mature, and it remains the virtualization software of choice for most Fortune 500 companies. Even so, this incident could easily have been prevented by following the established process for promoting a beta build to a final build. In addition, VMware’s QA processes, which are designed to ensure a quality product, failed to detect that the time-bomb code was still present and active.
Will VMware learn from this incident? Absolutely. Sometimes it takes a big event like this to inspire changes and improvements in a company that may have been set in its ways and wasn’t paying attention to details.
One area that many users were critical of was VMware’s communication on the matter. The company was initially slow to issue public communications and to proactively contact customers about the issue. The thread started on the issue in the VMware Technology Network (VMTN) forums became the rallying point for many of the users experiencing problems as a result of the bug. VMware employees did post some updates to the thread, letting users know the company was aware of the bug, but did not provide much other information until much later in the day. Another breakdown was VMware’s knowledge base, which had information on the bug and is often the first place users go when experiencing a problem: it became so overwhelmed by the number of requests that it was unavailable for over six hours.
VMware delivered the fix for the problem fairly quickly, making it available roughly 24 hours after the problem was first reported. Many users were hoping to get it sooner than that, but VMware needed time to package and test the fix before releasing it. VMware also communicated well later in the day, with detailed updates and emails sent to customers.
So is VMware’s apology enough? In my mind it is. Yes, it was an unfortunate incident that caused many customers a good deal of grief. But in the end, VMware responded quickly and effectively, and the episode will serve as a lesson they won’t soon forget, one that should make their products and processes stronger going forward.