Not so long ago, disaster recovery was a costly proposition with a lot of redundant hardware sitting idle, waiting for the moment when the DR environment might be called into action.
Testing your disaster recovery (DR) environment was also time-consuming and fraught with danger. Fortunately, times have changed, at least in the virtual world. Doing everything in a virtual environment means the cost and complexity of DR have fallen substantially. As long as you have spare capacity at the destination site, the process is more manageable for the administrator.
In the past, you needed proprietary disk replication technology, such as EMC Symmetrix Remote Data Facility, to replicate storage at the lowest level. Bringing up infrastructure at the other side was slow because it involved breaking replication, presenting disks at the storage level, scanning them into VMware, and then powering on and configuring each server. All of these steps take time that you can ill afford in a disaster scenario.
Contrast this with the new virtualization-aware DR products that support major hypervisors. If necessary, an administrator can fail over tens of servers as easily as clicking three or four buttons. Most of the products even allow prepopulation of failover IP and domain name system (DNS) settings to reduce the workload at the most crucial point in a DR setting.
Virtualization enables better DR
Technology has evolved to the point where several companies offer products that make failover and replication easy. The big names in this technology are VMware, Zerto and Veeam Software.
The first generation of virtual DR was VMware Site Recovery Manager. It came with a lot of baggage and supporting infrastructure. It gained the reputation of being difficult to use in real-world scenarios and was very prescriptive in how things would work.
The second generation of DR technologies came from Zerto and Veeam Software. What makes products from these companies very useful is that they do not require burdensome raw devices -- at least in the majority of cases. You don't need to have an approved storage area network infrastructure with all manner of plug-ins. Perhaps the most useful function is the ability to do full nondisruptive failover tests. The testing is usually done within an isolated network bubble or isolated virtual local area network, so as not to risk impacting the live implementation.
In a purely virtual environment, these products offer a lot of advantages. When the servers are virtual, you can group the VMs together into what is known as a protection group. The beauty of a protection group is that it provides a crash-consistent copy of all the VMs in the group.
In other words, all the VMs within the group will migrate from the primary site to the DR site at the same point in time. This only works in fully virtual environments or where the servers involved in the application tier are virtual.
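One way to picture a protection group is as a set of VMs that always share a single checkpoint timestamp. The following is a minimal Python sketch of that idea only; the class, method and VM names are hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProtectionGroup:
    """Hypothetical model of a protection group (not a vendor API)."""
    name: str
    vm_names: list[str] = field(default_factory=list)

    def checkpoint(self, at: datetime) -> dict[str, datetime]:
        # Every VM in the group is stamped with the same point in time,
        # which is what keeps the copies consistent with one another.
        return {vm: at for vm in self.vm_names}


group = ProtectionGroup("payroll", ["web01", "app01", "db01"])
marks = group.checkpoint(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
```

Because all three illustrative VMs carry the same timestamp, the web, application and database tiers would come up at the DR site reflecting the same moment.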
The technologies these companies use work in a broadly similar manner. All the products essentially have a driver that sits in the ESXi storage stack. When the data is written to disk, the driver intercepts all the writes to the disk and mirrors them across the wide area network to the DR site, and then they are written to the DR disk.
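The write path described above can be modeled in a few lines. This is a toy Python sketch, not how any real ESXi driver is implemented: the `WriteSplitter` class, its in-memory "disks" and the queue standing in for the WAN link are all assumptions made for illustration.

```python
from collections import deque


class WriteSplitter:
    """Toy model of a replication driver; all names are hypothetical."""

    def __init__(self) -> None:
        self.primary_disk: dict[int, bytes] = {}  # block -> data, primary site
        self.dr_disk: dict[int, bytes] = {}       # block -> data, DR site
        self.in_flight: deque = deque()           # writes queued for the WAN

    def write(self, block: int, data: bytes) -> None:
        # The write lands on the primary disk and a copy is queued for DR.
        self.primary_disk[block] = data
        self.in_flight.append((block, data))

    def ship(self) -> int:
        # Apply queued writes at the DR site in their original order.
        shipped = 0
        while self.in_flight:
            block, data = self.in_flight.popleft()
            self.dr_disk[block] = data
            shipped += 1
        return shipped


splitter = WriteSplitter()
splitter.write(7, b"old")
splitter.write(7, b"new")
splitter.write(9, b"log")
shipped_count = splitter.ship()
```

Replaying the writes in order matters: block 7 is written twice, and the DR copy only matches the primary if the later write is applied last.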
If you have a disk I/O-intensive application, this will be reflected in the replication bandwidth requirements, although many products do employ caching and compression algorithms to reduce this.
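A rough feel for the bandwidth involved comes from simple arithmetic on the rate of changed data. The Python sketch below is a back-of-the-envelope estimate under stated assumptions: decimal units (1 GB = 8,000 Mbit), a steady change rate, and a single hypothetical `compression_ratio` parameter standing in for whatever a given product achieves. The figures are invented for illustration.

```python
def required_mbps(changed_gb_per_hour: float,
                  compression_ratio: float = 1.0) -> float:
    """Estimate sustained WAN bandwidth in Mbit/s for replication traffic.

    Assumes decimal units (1 GB = 8,000 Mbit). A compression_ratio of 2.0
    means the replicated data shrinks to half its original size.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    megabits_per_hour = changed_gb_per_hour * 8000
    return megabits_per_hour / 3600 / compression_ratio


uncompressed = required_mbps(45)
compressed = required_mbps(45, compression_ratio=2.0)
```

With these made-up numbers, 45 GB of changed data per hour works out to 100 Mbit/s uncompressed and 50 Mbit/s at 2:1 compression. Real workloads are bursty, so a sustained average like this is a floor rather than a sizing guarantee.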
Perhaps one of the most overlooked features is the ability to choose the recovery point in time from hours' or days' worth of restore points. It means that if someone made a bad change or you got hit by crypto ransomware, you could choose a point in time to restore to. These uses of recovery points are perhaps not the best way to tackle the mentioned issues, but they give some insight into how useful virtual DR can be.
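Choosing a recovery point amounts to finding the latest restore point taken before the bad change. The Python sketch below illustrates that selection with the standard `bisect` module; the function name and the sample timestamps are hypothetical, and real products expose this choice through their own interfaces.

```python
from bisect import bisect_left
from datetime import datetime


def pick_restore_point(points: list[datetime], incident: datetime) -> datetime:
    """Return the latest restore point strictly before the incident time."""
    ordered = sorted(points)
    # bisect_left gives the number of points that fall before the incident.
    idx = bisect_left(ordered, incident)
    if idx == 0:
        raise LookupError("no restore point before the incident")
    return ordered[idx - 1]


points = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 10, 0),
]
chosen = pick_restore_point(points, datetime(2024, 1, 1, 11, 30))
```

Here an incident at 11:30 selects the 10:00 restore point, so anything written between 10:00 and 11:30 would be lost in exchange for a clean copy.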
Creating an affordable virtual DR plan
Although the overall cost associated with DR has fallen, there is still a licensing cost involved for virtual DR products, as well as a cost to maintain the necessary storage space. Administrators can help mitigate these costs by resisting the temptation to include every server in their DR plan. Not only is it excessively costly to include every server, but it is also burdensome when there is a real disaster.
A better way to plan your virtual DR is to sit down with the interested parties -- IT administrators, application owners and users -- and decide which applications are critical, termed tier one. Once the list is created, document what was decided and plan around it. Make sure that all information pertinent to failover is included. Being in a real DR situation and realizing some key information is missing is a nightmare scenario.
Applications don't live in isolation, and as such, an administrator needs to look at the supporting infrastructure, such as Active Directory, DNS and time servers.
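One way to make sure supporting infrastructure is not forgotten is to record dependencies and walk them. The Python sketch below computes everything a tier-one application needs at the DR site; the function name, the server names and the dependency map are all hypothetical examples.

```python
def protection_set(apps: list[str],
                   depends_on: dict[str, list[str]]) -> set[str]:
    """Return the apps plus everything they depend on, directly or not."""
    needed: set[str] = set()
    stack = list(apps)
    while stack:
        node = stack.pop()
        if node in needed:
            continue  # already visited; also guards against cycles
        needed.add(node)
        stack.extend(depends_on.get(node, ()))
    return needed


# Hypothetical dependency map for one tier-one application.
depends_on = {
    "payroll-app": ["payroll-db", "active-directory"],
    "payroll-db": ["time-server"],
    "active-directory": ["dns", "time-server"],
}
result = protection_set(["payroll-app"], depends_on)
```

In this example a single tier-one application pulls in four supporting systems, which is the kind of detail worth capturing in the DR documentation before a failover is ever attempted.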
The costs, while reduced, are still not negligible. The cost factors involved include:
- The licensing cost per protected node -- most products are licensed on a per-node basis.
- The cost of maintaining the bandwidth for the data replication.
- Spare capacity to bring those protected nodes up in the event of a failure.
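The three cost factors above can be combined into a simple annual estimate, which also shows why protecting only tier-one servers helps. This Python sketch uses invented figures and a hypothetical `DrCostInputs` structure; actual pricing models vary by product and should be taken from vendor quotes.

```python
from dataclasses import dataclass


@dataclass
class DrCostInputs:
    """Hypothetical yearly cost inputs; every figure here is illustrative."""
    protected_nodes: int
    license_per_node: float         # license cost per protected node, per year
    bandwidth_per_month: float      # replication link cost per month
    spare_capacity_per_year: float  # standby compute and storage at the DR site


def annual_dr_cost(c: DrCostInputs) -> float:
    # Sum of licensing, twelve months of bandwidth, and spare capacity.
    licensing = c.protected_nodes * c.license_per_node
    bandwidth = c.bandwidth_per_month * 12
    return licensing + bandwidth + c.spare_capacity_per_year


everything = annual_dr_cost(DrCostInputs(100, 500.0, 2000.0, 30000.0))
tier_one_only = annual_dr_cost(DrCostInputs(20, 500.0, 800.0, 8000.0))
```

With these made-up inputs, protecting all 100 servers comes to 104,000 per year, while protecting 20 tier-one servers comes to 27,600.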
All of these costs would need to be addressed, and as noted they may not be seen as cheap, but they provide proven protection in both test and real DR. At the end of the day, protecting your key infrastructure and the revenue it brings should be the number one priority, because companies that do suffer significant data losses or outages stand a one-in-three chance of going out of business.