In a perfect world, our applications and servers would run smoothly, day in and day out. There would be no conflicts or resource issues, and harmony in the data center would be visible to all. Unfortunately, we don't live in that world. All of the documentation and controls we put on our virtualized environments may help, but despite our best intentions, we're still faced with the rogue server and the unplanned change it brings. This problem is often the result of IT employees working against tight deadlines. Without a streamlined process for change, more than one person has to get involved to get anything done, and someone under pressure may go around the process rather than wait.
Though the reasons for going around the process may be justified, the result is an unplanned change to the environment. A single change may not be enough to bring down an entire virtualized infrastructure, but a large number of changes or additions can. It can be frustrating for administrators who traditionally own key systems to learn that changes were made without their knowledge or permission, but remember that such changes are rarely made out of malice. As virtual admins, it's important for us to take a step back and remove our personal feelings from the equation; it may sound like therapy 101, but a clear head is critical to the overall process.
Digging through logs
The first step in correcting these changes is to investigate and figure out exactly what's going on. The source of the problem can be something as small as resource additions or changes, or as large as new VMs and sweeping configuration changes. The goal here is to pinpoint the change and start asking what, when and who. You should not reverse any change without a proper investigation. Even if a system was put in place without authorization, removing it could compound the issue. As an IT person, one of your first duties is to keep things online, so as much as you might want to, you can't just pull the plug.
Figuring out what changes were made can be a little tricky, but if there's one thing computers are great at doing, it's logging. Sometimes the change will be immediately apparent, but, more often than not, admins have to dig deep into logs to find out what the change was and when it was made. Logs can also tell you who made the change, provided you follow the rule of never using a generic login for key systems and infrastructure. Changes leave footprints that you can trace. When going through logs, pay particular attention to the date and time of each change, because the timestamp is what lets you correlate a logged change with the symptoms you're seeing.
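As a rough illustration of the what, when and who questions, the sketch below filters change events down to those logged shortly before a symptom appeared. The pipe-delimited log format, the sample entries, the user and VM names, and the 24-hour window are all assumptions made for the example; real hypervisor and management logs use their own formats and would need their own parser.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    timestamp: datetime  # when the change was logged
    user: str            # who made it (only useful if logins are not generic)
    action: str          # what was done
    target: str          # affected VM or host


def parse_line(line: str) -> ChangeEvent:
    """Parse one line of the hypothetical 'timestamp|user|action|target' format."""
    stamp, user, action, target = (field.strip() for field in line.split("|"))
    return ChangeEvent(datetime.fromisoformat(stamp), user, action, target)


def changes_near(events, symptom_time, window=timedelta(hours=24)):
    """Return events logged within `window` before the symptom, oldest first."""
    start = symptom_time - window
    hits = [e for e in events if start <= e.timestamp <= symptom_time]
    return sorted(hits, key=lambda e: e.timestamp)


# Made-up sample data, for illustration only.
SAMPLE_LOG = """\
2024-03-01T09:15:00|jsmith|vm.create|build-test-07
2024-03-04T22:40:00|akhan|vm.reconfigure|db-prod-02
2024-03-05T01:05:00|jsmith|vm.create|build-test-08
"""

events = [parse_line(line) for line in SAMPLE_LOG.splitlines()]

# Suppose the resource problem was first noticed at 08:00 on March 5.
suspects = changes_near(events, datetime(2024, 3, 5, 8, 0))
for e in suspects:
    print(f"{e.timestamp.isoformat()} {e.user} {e.action} {e.target}")
```

Widening or narrowing the window is a judgment call: too narrow and you miss a change whose effects took time to surface, too wide and you end up reviewing everything.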
To remove or to keep
Once you've determined the what, when and who, you can decide on a course of action to address the issue. This can be a little tricky, depending on the politics of who owns the server. You can either remove or keep the rogue server, but both options have drawbacks. While removing the server seems easy, it's not. Consider this scenario: One of the higher-ups at your organization tells you to remove the server. You go through the proper channels to verify this and then proceed with the removal. Two weeks later, you get a call saying that something important was left on the server, and your higher-ups need it immediately. Of course, by this point it's completely gone, leaving you and your organization in a bind.
If a server was created outside the normal channels, chances are not everyone has a clear picture of what's on it, so before removing anything, shut down the server and store its data on a separate SATA disk. I'd also recommend waiting a few weeks between receiving the initial order and removing the server, just in case. Consider it a simple insurance policy.
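The waiting period is easy to enforce with a simple date check. The sketch below is only an illustration: the three-week hold is an assumed reading of "a few weeks", and the function names and the `archive_verified` flag are hypothetical rather than part of any particular tool.

```python
from datetime import date, timedelta

# Assumption: "a few weeks" is taken here as three weeks.
HOLD_PERIOD = timedelta(weeks=3)


def earliest_removal_date(order_received: date, hold: timedelta = HOLD_PERIOD) -> date:
    """Return the first day on which the server may be deleted."""
    return order_received + hold


def safe_to_remove(order_received: date, today: date, archive_verified: bool,
                   hold: timedelta = HOLD_PERIOD) -> bool:
    """Allow removal only if the data was archived and the hold has elapsed."""
    return archive_verified and today >= earliest_removal_date(order_received, hold)


order_date = date(2024, 3, 1)
print(earliest_removal_date(order_date))
print(safe_to_remove(order_date, date(2024, 3, 15), archive_verified=True))
```

Tying the check to both conditions means a server is never deleted on the strength of the calendar alone; the archived copy has to exist first.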
As if removing a rogue server weren't difficult enough, keeping it is even more challenging. Even if the server was created from your base templates, you still have to do a health check. Does the server have proper security, backups and monitoring configured? What about proper resource management, naming and addressing? All of these things need to be verified as soon as possible. Doing so can be a challenge if the system is already online and in use, but that is not a reason to delay.
The other key component is the approval/request form. When it comes to a rogue server, it's unlikely that the person responsible for its creation ever completed the appropriate paperwork. To resolve the issue, you'll need them to fill out the proper approval/request form, even if the server in question is removed. This form creates an audit trail that could come in handy should there ever be a security issue or outage due to the rogue server. Completing the paperwork after the fact can also be a wake-up call for infrastructure abusers and help them realize that they can't skirt the rules. After all, proper channels and rules exist for a reason -- they reduce the headache of cleaning up problems later.
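The health check and paperwork questions above can be tracked as a simple checklist. The following is a minimal sketch, assuming each item is recorded by hand as a yes/no value; the `ServerRecord` fields and the example server are hypothetical, and nothing here queries a real hypervisor or monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerRecord:
    """Manually recorded answers to the health-check questions (hypothetical)."""
    name: str
    security_configured: bool
    backups_configured: bool
    monitoring_configured: bool
    resource_limits_set: bool
    follows_naming_standard: bool
    address_documented: bool
    request_form_filed: bool


# Attribute name mapped to the human-readable item it represents.
CHECKS = {
    "security_configured": "security",
    "backups_configured": "backups",
    "monitoring_configured": "monitoring",
    "resource_limits_set": "resource management",
    "follows_naming_standard": "naming",
    "address_documented": "addressing",
    "request_form_filed": "approval/request form",
}


def failed_checks(server: ServerRecord) -> List[str]:
    """Return the checklist items that still need attention, in checklist order."""
    return [label for attr, label in CHECKS.items() if not getattr(server, attr)]


rogue = ServerRecord(
    name="build-test-08",
    security_configured=True,
    backups_configured=False,
    monitoring_configured=False,
    resource_limits_set=True,
    follows_naming_standard=True,
    address_documented=True,
    request_form_filed=False,
)

print(f"{rogue.name} still needs: {', '.join(failed_checks(rogue))}")
```

Keeping the approval/request form in the same list as the technical checks makes it harder to sign off on a server that is healthy but still undocumented.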