Many people who express interest in hyper-converged infrastructure also worry about how to troubleshoot the stack and apportion tickets to the right vendor, be it hardware or software. Rather than meet problems head on, vendors tend to blame each other; it doesn't help that issues with a hyper-converged system are notoriously difficult to resolve. Those problems loom large when administrators are responsible for the entire infrastructure.
Any reliable hyper-converged system vendor will act as the first port of call for any issues, help troubleshoot and, if necessary, open tickets on your behalf. This eliminates the need to deal with separate hardware and software vendors and leaves only one throat to choke.
Hyper-convergence also puts the administrator squarely in the frame for any issues on the customer side because all services now run under the same set of hyper-converged boxes.
Evaluate hyper-converged platforms
Before you invest in a hyper-converged system, check whether it's on the hardware compatibility list (HCL) of the hypervisor you're running. All top-tier systems will be certified. However, if you go off-brand, be sure to check the HCL yourself, and buy complete, certified platforms. If the hyper-converged system you want to purchase isn't on the HCL, many vendors will try to assist on a best-effort basis, though such attempts aren't always successful.
In my opinion, anyone who uses a hyper-converged system should leave the hardware to the hyper-converged infrastructure (HCI) vendors. The reason I say this is that, when you set up your own bespoke HCI, you lose the ability to use the one-stop shop that most vendors provide. This is especially true if the administrator is new to hyper-converged system management.
Vendors have special teams that deal with hyper-converged systems and are trained in both the hypervisor and the HCI nuts and bolts. For example, when I have a potential hardware problem with Nutanix, I ring Dell, my hardware manufacturer of choice for a hyper-converged system. Dell then assists in getting to the root cause. If needed, Dell will open a call with VMware for third-line support. This is usually included in the price.
When you buy a Nutanix, SimpliVity or EVO:RAIL platform, it comes with everything you need, including premium hardware and software support for five years.
Troubleshoot hyper-convergence issues
So, what can administrators do to make their lives easier? The first and most important step is to have extensive monitoring and logging in place. Most vendors offer quality analytics and dashboards to help identify -- and, hopefully, resolve -- hyper-converged system issues quickly. Health check dashboards are a great first place to look for issues. They also provide useful operating data, such as deduplication and compression ratios.
Access to vRealize Operations can make troubleshooting much easier due to its superior inbuilt analytics. Several vendors have gone so far as to create content packs that you can import into vRealize to provide enhanced analytics.
According to Duncan Epping, chief technologist for storage and availability at VMware, the key to troubleshooting a hyper-converged environment is to familiarize yourself beforehand with the recovery tools at your disposal so that, in the event of a problem, you know what to use and where to look.
Epping had some additional advice to impart to hyper-converged users:
- Don't suffer from lack of information. Set up and use logging that is external to the cluster. This means that, even if the cluster is in an inoperable state, the administrator can still access the logs and use them to troubleshoot.
- Understand that HCI activities, like rebooting, have a knock-on effect with other HCI nodes. Unlike classic clusters, the HCI node participates in other operations, such as the vSAN storage array, so tread carefully. It's probably a good idea to open a support call and verify before you reboot a host from a degraded cluster.
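The caution about rebooting can be expressed as a simple pre-flight check. The following is a minimal sketch, not a vendor tool: `NodeStatus`, `reboot_blockers` and the node names are hypothetical, and it assumes you can already obtain each node's health and resync state from your own monitoring. It only lists reasons to pause; it does not replace a support call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical snapshot of one HCI node, built from your own monitoring data."""
    name: str
    healthy: bool            # node is up and serving storage
    resyncing: bool = False  # node is still rebuilding or resyncing data


def reboot_blockers(nodes: list[NodeStatus], target: str) -> list[str]:
    """Return reasons not to reboot `target`; an empty list means no blocker was found."""
    if target not in {n.name for n in nodes}:
        return [f"{target} is not a member of this cluster"]
    blockers = []
    for node in nodes:
        if node.name == target:
            continue  # the node being rebooted does not block itself
        if not node.healthy:
            blockers.append(f"{node.name} is unhealthy; cluster is already degraded")
        elif node.resyncing:
            blockers.append(f"{node.name} is still resyncing data")
    return blockers


# Invented example cluster: one node down, one still resyncing.
cluster = [
    NodeStatus("node-a", healthy=True),
    NodeStatus("node-b", healthy=False),
    NodeStatus("node-c", healthy=True, resyncing=True),
]
for reason in reboot_blockers(cluster, "node-a"):
    print("Do not reboot yet:", reason)
```

Because every other node contributes to shared storage, the check looks at the whole cluster rather than only the host you intend to restart.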
In addition to Epping's suggestions, it's also a good idea to pay attention to alerts. Though an alert may seem trivial, it can be symptomatic of a larger underlying problem. For example, a recent alert indicated a massive I/O spike during a hypervisor upgrade. The alert appeared as a warning in the analytics window. Further investigation showed that this huge I/O spike had caused most of the cluster's VMs to become read-only. Take every alert or warning seriously, and make sure each one is resolved promptly; hyper-converged systems can be very sensitive to the infrastructure not working as it should.
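One way to keep warnings from slipping through is to review every unresolved alert, ordered by severity, instead of filtering on critical ones only. The sketch below is illustrative: the `Alert` class, the severity names and the sample alerts are assumptions (only the I/O spike warning mirrors the example above), and a real platform would expose its alerts through its own dashboard or API.

```python
from dataclasses import dataclass

# Hypothetical severity levels; a lower rank is more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Alert:
    source: str
    severity: str
    message: str
    resolved: bool = False


def open_alerts(alerts: list[Alert]) -> list[Alert]:
    """Return every unresolved alert, most severe first; warnings are never filtered out."""
    pending = [a for a in alerts if not a.resolved]
    # Unknown severities sort first so they get looked at rather than ignored.
    return sorted(pending, key=lambda a: SEVERITY_RANK.get(a.severity, -1))


# Invented sample data for illustration.
alerts = [
    Alert("analytics", "warning", "I/O spike during hypervisor upgrade"),
    Alert("node-b", "info", "firmware update available", resolved=True),
    Alert("node-c", "critical", "disk offline"),
]
for alert in open_alerts(alerts):
    print(f"[{alert.severity}] {alert.source}: {alert.message}")
```

The design choice here is that nothing unresolved is hidden: severity changes only the order in which alerts are reviewed, not whether they are reviewed.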
In summary, hyper-converged platforms provide a lot of savings and capabilities but come with complexities as well. Hardware and software vendors try to make using hyper-converged systems as easy as possible by providing services and software to help troubleshoot HCI issues as they occur. The next step is to make sure that the administrator has the understanding and capability to efficiently use these tools.