Bad luck sometimes follows you. Then again, I don’t believe in luck, or its evil twin Murphy’s Law.
Still, bad stuff happens, and in IT, you’re usually there to see it. This happened at an undisclosed location near a super secret military installation next to a classified roadside diner in an unidentified town west of an unnamed river. In other words, my office, the place where I’m both the head VMware and Citrix person and the IT Director (which may give me multiple personality disorder someday, but at least I won’t rust out!).
Arriving at 8:15AM, as I’m wont to do in order to get a jump start on the day, I get an angry mob assaulting me with torches and pitchforks before I’ve even reached my office. That usually means a big problem, and in this case, nothing’s running. No problem, probably just a power outage knocking things around again. We have frequent outages here, and since we’re not a 24×7 shop, the UPSes sometimes run out of juice and I have to restart things by hand. Normally everything shuts down gracefully, but on this occasion, the whole place was a mess.
A quick check of server health and I see that all of my physical boxes, including my SAN and VI hosts, are up and running and have been all night. I log into VC to see exactly what I had expected: lots of powered-off servers. Now had there been an event, these servers should have VMotioned off to other servers, and failing that, come back up automatically in a set order. They didn’t. That’s usually a SAN-related issue.
Ok, I check the SAN and see that it’s fine now, but showing uptime since a little past midnight. It’s looking more like the power outage was long enough to take down the UPS that the SAN was on, but not the VMware servers, which shouldn’t be the case unless there are battery problems I’m not being alerted to by APC’s software. Guess what? Yep, that’s it. Ok, let’s restart some machines. The Linux servers all come up wonderfully. Some of the Windows servers, not so much. The blue screens of death are visible as far as the eye can see: so blue, I thought I was in the Caribbean, except for the stress and lack of rum-based drinks.
Most of them are reporting disk and kernel related problems. Most of the error messages relate to a missing %WINDIR%\SYSTEM32\CONFIG\SYSTEM file. Another common one reports that NTOSKRNL.EXE is missing. Great, that’s a huge part of the registry and the system kernel. This is gonna take days to fix if I have to pull backups and restore from them. Well, maybe not if I’m lucky and it’s just some corrupted space on the virtual disks.
Treating them like physical machines, the next step is to boot the recovery console from CD (in this case an ISO file) and run chkdsk with the /p and /f switches as a first step to troubleshooting. Except, of course, that there’s no hard disk to be found by the Windows installer, which causes a brief, although painful, heart flutter at the thought of pulling from backup. It’s one thing to do a quarterly test, it’s another when it’s real. Successful tests or not, massive documentation and howtos or not, the worst starts to flash through your head. You question your methodology: no matter how thorough you tried to be, you begin to think that maybe the test methods were somehow flawed and the backups will fail. Ok, the problem at hand is that I can’t get the disks to be seen by the Windows installer CD. Time to focus and forget doubt.
And this is where it becomes about virtualization, in case you thought I was going off-topic.
Fear pushed aside, it’s time to look at this from a hardware point of view, but also to remember that the hardware is all virtual. A common pattern emerges: the machines in trouble were all converted from an existing VMware Server 1.x install that we P2Ved some time ago. It wasn’t something I noticed right away, as a few of those machines were never P2Ved, some were P2Ved by hand (i.e., just rebuilt), and our Linux VMs from that same VS box are all fine.
The common denominator is that all of the problem boxes had the BusLogic SCSI controller. None of them would see the virtual hard drives. Switching over to the LSI Logic controller and accepting the change allowed me to run the recovery console, as the Windows installer then saw the disks. It was a quick fix: from the recovery console, I ran the disk check and the machines recovered with no further steps needed.
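If you want to find the BusLogic boxes before the next outage finds them for you, here is a minimal sketch in Python. It assumes the controller type is recorded in each VM's .vmx file as a line like scsi0.virtualDev = "buslogic", which is how I'd expect it to look but is not something I pulled from the machines above. The function names and the sample .vmx text (including the display name) are made up for illustration. Also note that a controller with no virtualDev line at all won't be reported, so treat a clean result as a hint, not a guarantee.

```python
import re

# Matches lines such as: scsi0.virtualDev = "buslogic"
# (assumed .vmx key format; adjust if your files differ)
_VIRTUAL_DEV = re.compile(
    r'^\s*(scsi\d+)\.virtualDev\s*=\s*"([^"]*)"\s*$', re.IGNORECASE
)


def scsi_controllers(vmx_text):
    """Return {controller_name: controller_type} for each scsiN.virtualDev line."""
    found = {}
    for line in vmx_text.splitlines():
        match = _VIRTUAL_DEV.match(line)
        if match:
            found[match.group(1).lower()] = match.group(2).lower()
    return found


def buslogic_controllers(vmx_text):
    """Return the sorted names of controllers explicitly set to BusLogic."""
    controllers = scsi_controllers(vmx_text)
    return sorted(name for name, kind in controllers.items() if kind == "buslogic")


# Hypothetical .vmx contents, for illustration only.
SAMPLE_VMX = '''\
displayName = "converted-win2003"
scsi0.present = "TRUE"
scsi0.virtualDev = "buslogic"
scsi1.present = "TRUE"
scsi1.virtualDev = "lsilogic"
'''

if __name__ == "__main__":
    # Lists the controllers that would need switching to LSI Logic.
    print(buslogic_controllers(SAMPLE_VMX))
```

To use it for real, read each .vmx file's text and pass it to buslogic_controllers; anything it returns is a candidate for the controller switch described above. It only reads and reports, so the actual change is still yours to make (and to test) by hand.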
So, I get the Windows boxes back up, curse Murphy for his Law, and vow that in all future conversions, I will change over the controller before the converted box goes into production. Oh, that and I will have some people in to look at the UPS situation. Now, where are those rum-based drinks?