Effectively communicate and troubleshoot an upgrade failure

When you're dealing with an upgrade failure, it's extremely important to not only communicate with end users and management, but also to understand the limitations of backups.

In an ideal situation, computer systems and applications would always work as intended. Conflicts and upgrade failures wouldn't occur, and patches would install smoothly without causing any issues. While that's the goal, we have to face facts and realize that, unfortunately, it's just not the case today. Not everything is doom and gloom, though; a few tips and best practices can help you survive an upgrade failure.

A key thing to remember about upgrades is that they aren't patches. This might be common sense to IT people, but end users and management don't always understand the difference. While both upgrades and patches can cause issues, an upgrade often takes much longer, involves more extensive changes and has higher labor and software costs. Upgrades also carry more risk.

The importance of communication

Communication is the first and most important step of an upgrade. The more informed your organization is ahead of time, the easier it becomes to handle an upgrade failure. Now, the key is to not overwhelm users and management with all of the technical details, as that will cause blank stare syndrome. Rather than outlining the details of the technical piece, ensure they understand the scope of what will be done from their point of view. Going over the steps needed to upgrade data center hardware isn't the same as telling them that the main application(s) they use will be offline. When you draft your communication to users and executives, put yourself in their shoes and frame the explanation for them. As the old motto goes, always remember your audience, even in IT.

The second critical piece of communication is having a point person during the upgrade; this should not be the person doing the upgrade itself. When an upgrade fails, the person doing the upgrade has one job: to get the system back online. If they have to stop troubleshooting to give status reports, that takes valuable time and concentration away from the troubleshooting process. Pick a person on the team who's in the area, so she can hear what's going on. She doesn't need to keep asking the engineer working on the issue for updates; she will get them from the conversations between team members, and she can then filter and compress the information for management.

Having and using backups

One of the topics that often comes up during an upgrade, and more specifically an upgrade failure, is backups. The line often heard during meetings and preparations is, "In the event the upgrade fails, we have a backup." However, what does it take to use that backup? Having the backup and being able to use it are two very different things, and that difference isn't always accounted for.

In the best case scenario, your backup is used to restore a missing or corrupt file. In the worst case scenario, you have to restore an entire server. This may require you to rebuild enough of the server to get a working backup agent on it before you can even attempt to restore the data. While the backup is valuable, and you absolutely should have one, it's not a magic solution to an upgrade gone bad. It's a tool that you can use to help reverse the change, and its limitations and timeframe need to be tested and communicated ahead of time.

Now, VMs and snapshots are a little different, as they can truly revert an entire server. This gives you the magic undo button, but you have a limited window to use it; otherwise, you risk the snapshot growing out of control. Additionally, certain servers, such as databases and Active Directory domain controllers, can have issues with data corruption, Flexible Single Master Operation (FSMO) rollbacks and other timing issues, so caution must be exercised.
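To make the restore timeframe concrete before you communicate it, a rough estimate can be worked out from the amount of data, the restore throughput seen in a test restore and any rebuild time needed to get a backup agent running again. The following is a minimal sketch under those assumptions; the `RestorePlan` class, its field names and the sample figures are hypothetical and do not come from any particular backup product.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    """Hypothetical inputs for a rough restore-time estimate."""
    data_gb: float                # amount of data to restore, in GB
    throughput_mb_s: float        # sustained throughput seen in a test restore, in MB/s
    rebuild_minutes: float = 0.0  # time to rebuild the server and reinstall the backup agent

    def estimated_minutes(self) -> float:
        """Return rebuild time plus data transfer time, in minutes."""
        if self.throughput_mb_s <= 0:
            raise ValueError("throughput_mb_s must be positive")
        transfer_seconds = (self.data_gb * 1024) / self.throughput_mb_s
        return self.rebuild_minutes + transfer_seconds / 60


# Illustrative figures only: a single-file restore versus a full server rebuild.
single_file = RestorePlan(data_gb=0.5, throughput_mb_s=50)
full_server = RestorePlan(data_gb=500, throughput_mb_s=100, rebuild_minutes=90)

for name, plan in [("single file", single_file), ("full server", full_server)]:
    print(f"{name}: about {plan.estimated_minutes():.0f} minutes")
```

An estimate like this is only as good as the throughput figure behind it, which is one more reason to run a test restore ahead of the upgrade rather than rely on a vendor's rated speed.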

Troubleshooting a failed upgrade

Once your communication plan is set and your backups are ready to go, you have to look at the scope of the change as you start to troubleshoot a failed upgrade. One of the first questions to ask is: Was yours the only upgrade occurring? While this sounds simple, multiple upgrades at once happen a lot more often than you'd expect. Once a server is tagged for an upgrade, there is a set outage window, so someone may decide to add other changes to it. This might be something as simple as patches or as extensive as a database schema change. While the scope of change should have included the additional pieces, these things often occur under the radar when everything's going smoothly. Finding out what changed in addition to what was scheduled can truly be a detective process, but in this case the logs can shed a lot of light on what happened.
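As one way to approach that detective work, the sketch below filters a list of change-log entries down to those that fall inside the outage window but were not part of the scheduled change. It assumes the log has already been parsed into `(timestamp, change_id, description)` tuples; that record layout, the `changes_in_window` function and the change IDs are all hypothetical, and real logs will need their own parsing.

```python
from datetime import datetime


def changes_in_window(entries, start, end, scheduled):
    """Return entries inside the outage window whose change ID was not scheduled."""
    unexpected = []
    for timestamp, change_id, description in entries:
        if start <= timestamp <= end and change_id not in scheduled:
            unexpected.append((timestamp, change_id, description))
    # Sort by timestamp so the sequence of events is easy to read.
    return sorted(unexpected)


# Illustrative data only: one scheduled upgrade plus changes that rode along.
window_start = datetime(2024, 1, 13, 22, 0)
window_end = datetime(2024, 1, 14, 2, 0)
log = [
    (datetime(2024, 1, 13, 22, 15), "CHG-100", "Application upgrade"),
    (datetime(2024, 1, 13, 23, 5), "CHG-121", "Database schema change"),
    (datetime(2024, 1, 13, 22, 40), "CHG-117", "OS patches applied"),
    (datetime(2024, 1, 14, 9, 0), "CHG-130", "Routine config update"),
]

unexpected = changes_in_window(log, window_start, window_end, scheduled={"CHG-100"})
for timestamp, change_id, description in unexpected:
    print(f"{timestamp:%Y-%m-%d %H:%M}  {change_id}  {description}")
```

Anything this turns up is a lead to follow rather than a verdict; an unscheduled change inside the window still has to be confirmed as the cause of the failure.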

Ideally, an upgrade failure would never occur, but in reality it's a part of life for IT. The key to it is to do your research beforehand and prepare a solid communication plan that addresses all stages of the process -- before, during and after. Have a solid recovery plan that includes both recovery point objective and recovery time objective estimates for different levels of failures. Ensure that the scope of your change doesn't get by you or anyone else. And above all, communicate -- it's worth mentioning twice. No one, especially management, likes to get blindsided. Issues with upgrades happen; it's how you deal with them that's the key.
