Because of a bug in ESX 3.5 Update 2, users who shut down their machines on Aug. 11 received an error message that read, "A General System error occurred: Internal error" when they tried on Tuesday, Aug. 12, to restart their virtual machines (VMs) or to VMotion VMs running on ESX 3.5 Update 2 servers. Virtual machines that were already running were unaffected.
VMware also pulled the ESX 3.5 Update 2 bits from the download pages so that no additional customers could download the broken build.
VMware advised users not to install ESX 3.5 U2 if it had been downloaded prior to Aug. 12, 2008. To work around the issue, VMware suggested setting the host time to a date prior to Aug. 12.
Unfortunately, VMware's workaround probably didn't work for most production environments.
"This workaround has a number of very serious side effects that could threaten production environments. Any VMs that sync time with an ESX host and serve time-sensitive applications will be broken. These include, but are not limited to, database servers, mail servers, and domain administration systems," VMware reported on the Knowledge Base site.
ESX admin fallout
The ESX 3.5 Update 2 bug may have affected far more VMware shops than a bug in an earlier version of ESX would have, because Update 2 is the basis for the newly free ESXi, whose price was cut last month.
And those users made their grievances known, complaining about the bug on forums like the Ars Technica Server Room.
One blogger wrote, "It's pretty bad all around. I would hate to be in an environment with super-strict change management right now."
Another blogger on the forum wrote, "All of my hosts are in a production DRS [Distributed Resource Scheduler] cluster, and thus all of the hosts are populated with some number of guests. I am going to have to down at least one full host's worth of guests to apply this patch. I would guess that this is the situation most ESX admins will face."
On the blog, users who run VMware upgrades in their test-and-development environments for a few weeks before moving them into production thanked their lucky stars, and some users sang new praises for Microsoft Hyper-V.
Dan Buchanan, a senior Microsoft engineer at a major global financial services provider and a longtime VMware user, said VMware's slow reaction to the bug is unacceptable.
"It took VMware up until 1 p.m. [on Tuesday, Aug. 12] to post an official statement on the issue, and they still have not reached out to their customers," Buchanan said in the afternoon. [Editors' note: Later in the day, VMware CEO Paul Maritz did in fact issue an apology about the bug on a VMware blog.] "They said they expect to have the issue resolved within 36 hours. That is unacceptable to users."
Turning back the system clock to a date before Aug. 12 proved tricky for Buchanan, because many of the systems at the financial institution are time-sensitive.
"We had to quickly change the date to Aug. 10, get the VMs running again and work as fast as we could to change the time to Aug. 12 again. Once the VMs are running, it works fine. It's when we shut down that is the issue," Buchanan said.
Luckily, Buchanan implemented the update only in his test environment, where he expected to let it "bake in" for a few weeks. Now he may not implement the update at all.
"I won't move the update into production unless they prove it is solid. If they fix it, I won't move it into production for 90 days," Buchanan said, adding that if he had put the update into his production environment, he would have been fired.
Update 2, released just last month, adds support for Windows 2008 and Solaris as guest operating systems, and for additional hardware like 8 Gb Fibre Channel and 10 Gb iSCSI initiators. It also includes support for full server Health Status in ESX and ESXi; Red Hat Enterprise Linux (RHEL) 3.0 U9; live cloning of VMs; and enhancements to VirtualCenter alarms.