Are we witnessing the death of local instance storage?

There are several ways to make local instance storage robust, and the right one depends on the use case.

While searching for bigger, better and faster public cloud instances at the best possible price, it’s easy to lose track of what really matters. What really matters is data: protecting it, keeping it available and disposing of it properly.

For those of you still in the dark ages, local instance storage is one or more hard drives or SSDs in the virtualized server, with part of their capacity allocated to your instance. Think of local instance storage (LIS) as a place to park data, just as direct-attached storage (DAS) is used on a server.

LIS has the same drawback as DAS. If you don’t use RAID, the data isn’t protected. AWS, for example, warns that “data in the instance store is lost under the following circumstances: the underlying disk drive fails, the instance stops, or the instance terminates.”

The problem is partially fixable by using an instance type with more volumes and then applying software RAID to them. But that option is more of a Band-Aid than a true fix. If the server fails, the LIS is irretrievably lost and the data won’t reappear on the replacement instance. Moreover, LIS can’t be migrated between virtual instances, so if degradation occurs due to a failed drive, the choice is to either live with it (not advised) or go through a process of copying data out to permanent storage and then building a new instance.

It’s worth reflecting on how we got to this point. Networked storage and LAN connections that seemed adequate for stand-alone server environments are often badly under-powered for virtual environments. Sharing IO among many tenants dilutes the performance each one sees, and the large base of slow hard-drive networked storage, even in the cloud, doesn’t help.

LIS is a desperate attempt to gain some performance back, and it’s especially important as we try to move big data applications into the cloud. With that said, we have to make LIS robust under all circumstances, and that puts the burden on the tenant to get it right. How this is done depends on the use case.

For use cases where data written is essentially streamed out to disk, users should create a mirrored journal file in LIS and write a daemon to move the data to networked storage regularly. This makes the networked operations more efficient, because many small writes are batched into fewer, larger sequential transfers.
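The journal-and-daemon approach can be sketched roughly as follows. This is a minimal illustration, not a production design: the class name `MirroredJournal`, its methods and the file naming are hypothetical, the two local paths stand in for files on separate LIS volumes, and `network_dir` is assumed to be a directory on a mounted networked volume.

```python
import os
import shutil

class MirroredJournal:
    """Append records to two local journal files, then drain them in bulk."""

    def __init__(self, primary_path, mirror_path, network_dir):
        # Hypothetical layout: the two paths would sit on different LIS volumes.
        self.paths = [primary_path, mirror_path]
        self.network_dir = network_dir
        self.batch = 0

    def append(self, record: bytes):
        # Write the same record to both copies and force each to disk.
        for path in self.paths:
            with open(path, "ab") as f:
                f.write(record + b"\n")
                f.flush()
                os.fsync(f.fileno())

    def drain(self):
        """Copy the journal to networked storage in one pass, then truncate it."""
        primary = self.paths[0]
        if not os.path.exists(primary) or os.path.getsize(primary) == 0:
            return None
        self.batch += 1
        dest = os.path.join(self.network_dir, f"journal-{self.batch:06d}.log")
        tmp = dest + ".tmp"
        shutil.copyfile(primary, tmp)   # one large sequential transfer
        os.replace(tmp, dest)           # rename so readers never see a partial file
        for path in self.paths:
            open(path, "wb").close()    # truncate local copies only after the copy
        return dest
```

A daemon would call `drain()` on a timer or once the journal reaches a size threshold. Anything appended since the last drain exists only in LIS, so the drain interval sets how much data is at risk if the instance is lost.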

Things get complicated if the IO pattern is random. One option that simplifies operation might be to use cache software, but there's a trade-off. Write-through cache mode guarantees a write operation is flagged complete only when data has reached the network storage device. That means writes can take milliseconds to complete, but it’s the only way to ensure that a fully-synced copy of data is available if the server fails.
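To make the write-through trade-off concrete, here is a minimal sketch. The class `WriteThroughCache` is hypothetical, and a plain dictionary stands in for the networked storage device; real cache software works at the block level and is considerably more involved.

```python
class WriteThroughCache:
    """Toy write-through cache: a write completes only after the backing store has it."""

    def __init__(self, backing_store):
        self.backing = backing_store  # dict-like stand-in for networked storage
        self.cache = {}               # stand-in for the fast local (LIS) copy

    def write(self, key, value):
        # Slow path first: if this raises, the write is never reported complete.
        self.backing[key] = value
        self.cache[key] = value
        return True

    def read(self, key):
        # Reads are served locally when possible, which is where the speedup comes from.
        if key in self.cache:
            return self.cache[key]
        value = self.backing[key]
        self.cache[key] = value
        return value
```

Because every acknowledged write is already in the backing store, a replacement instance can start with an empty cache and still see all completed writes.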

For less critical data, a transaction sequence number of some sort may be an option. Run the cache in its fastest write-back mode, but keep a tally of which transactions have been issued and which have completed. As writes hit the networked store they are marked complete, so the difference between the issued and completed lists for a server can be used to recover and redo operations.
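One way to picture the sequence-number scheme is the sketch below. All names (`WriteBackCache`, `transactions_to_redo`, the two logs) are hypothetical, a dictionary again stands in for the networked store, and the sketch assumes both logs are kept somewhere that survives the loss of the instance.

```python
from collections import deque

class WriteBackCache:
    """Toy write-back cache that tags every write with a sequence number."""

    def __init__(self, backing_store, issued_log, completed_log):
        self.backing = backing_store        # stand-in for networked storage
        self.issued_log = issued_log        # assumed to survive instance loss
        self.completed_log = completed_log  # assumed to survive instance loss
        self.cache = {}
        self.pending = deque()              # writes not yet on networked storage
        self.next_seq = 1

    def write(self, key, value):
        # Fast path: acknowledge as soon as the write is logged and cached locally.
        seq = self.next_seq
        self.next_seq += 1
        self.issued_log.append((seq, key, value))
        self.cache[key] = value
        self.pending.append((seq, key, value))
        return seq

    def flush(self, limit=None):
        # Background path: push pending writes out in order and mark them complete.
        flushed = 0
        while self.pending and (limit is None or flushed < limit):
            seq, key, value = self.pending.popleft()
            self.backing[key] = value
            self.completed_log.append(seq)
            flushed += 1
        return flushed

def transactions_to_redo(issued_log, completed_log):
    """Return issued transactions that never reached the networked store."""
    done = set(completed_log)
    return [entry for entry in issued_log if entry[0] not in done]
```

After a failure, replaying the result of `transactions_to_redo` in sequence order against the networked store brings it back in line with what the application believed it had written.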

A read-mostly application may not need these precautions, especially if the flow can be disconnected until the infrequent writes complete on the networked drive. In this situation, all writes hit LIS and the networked store before the complete status is flagged.

With data protection in place, the next issue is what happens when an instance is disconnected. Reboot shouldn’t kill the instance store, but closing down the instance, whether planned or accidental, surely will.

A planned shut-down of an instance requires more than just copying any dirty blocks out of the cache and into the networked storage. Leaving data behind is a bit dangerous. On a hard-drive LIS, erasing any data should work, since there is a 1:1 relationship of logical blocks to physical blocks. However, this is not true on SSD.

On SSD, when a block is overwritten, the old data is merely tagged as invalid; nothing is physically overwritten. A new block is used for the new data, and the old block is placed in a pool to be gathered up by a background operation and re-initialized at a later date. In this situation, your old data may reside in the spare block pool for quite a while and be readable by any tenant with a utility that can look at the bare SSD.
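For the hard-drive case, an in-place overwrite before shut-down might look like the sketch below. The function name `overwrite_with_zeros` is hypothetical, and the sketch assumes a filesystem that rewrites blocks in place. For the reasons just described, it does not reach stale copies on an SSD.

```python
import os

def overwrite_with_zeros(path, chunk_size=1024 * 1024):
    """Overwrite a file's contents in place with zeros; return bytes written."""
    size = os.path.getsize(path)
    zeros = b"\x00" * chunk_size
    with open(path, "r+b") as f:      # r+b: rewrite existing bytes, do not truncate
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())          # push the zeros past the OS page cache
    return size
```

Deleting the file afterwards removes the directory entry, but it is the overwrite that clears the data blocks on a hard drive.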

Although quite a number of hard-drive supporters still haven’t embraced SSD, the much higher performance of SSD is better aligned with the needs of big instances in most use cases. That fact, and the knowledge that in two years' time we will be at or near parity in SSD capacity and pricing against bulk hard drives, means that the best choice for LIS will move rapidly to SSD. Knowing the dynamics of the death of an instance is critical to your peace of mind.
