Q

How should I address memory errors in a virtualized system?

Many hypervisors have features that can prevent memory-related problems from escalating into devastating server crashes.

Server reliability depends on identifying and recovering from errors that might otherwise crash a system -- crashing every VM on the server at the same time. Memory errors are one important example. Many memory errors are "soft" errors, which are corrected and not easily repeated, but which could still crash an entire server without mitigation.

With advances in memory subsystem design, memory errors are recorded to small logs on serial presence detect chips located on each DIMM. The system can then use that error data to identify addresses that might be problematic, and avoid using those addresses or the pages that contain them. One example is hot spare capability, where a fault on one DIMM causes its content to be switched to a backup DIMM already installed and standing by on the system -- the server simply stops using the questionable DIMM and alerts a technician that a spare DIMM has been invoked.
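As a rough illustration of how logged error addresses translate into pages to avoid, here is a minimal Python sketch. The 4 KiB page size, the function names and the sample addresses are all assumptions made for the example; they are not taken from any particular server or hypervisor.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages; real systems may use other sizes


def pages_to_avoid(error_addresses, page_size=PAGE_SIZE):
    """Map physical addresses from an error log to the page frames containing them."""
    return {addr // page_size for addr in error_addresses}


def usable_pages(total_pages, error_addresses, page_size=PAGE_SIZE):
    """Return the page frame numbers that remain available for allocation."""
    bad = pages_to_avoid(error_addresses, page_size)
    return [pfn for pfn in range(total_pages) if pfn not in bad]


# Hypothetical addresses read from an error log.
logged_addresses = [0x1008, 0x1FF0, 0x5000]
print(sorted(pages_to_avoid(logged_addresses)))  # two addresses fall in the same page
print(usable_pages(8, logged_addresses))
```

Several logged addresses can fall inside one page, so the set of pages to avoid is often much smaller than the list of raw addresses.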

Hypervisors can also read memory error logs and make intelligent determinations about questionable memory addresses. For example, a memory address that reports an unusual number of corrected errors -- such as errors fixed using error correcting code -- might signal an impending hard fault of the DIMM. Hypervisors like VMware's ESXi can stop using pages with problematic addresses, preventing memory errors from escalating and possibly disrupting a VM or the entire system.
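The idea of retiring a page after "an unusual number" of corrected errors can be sketched as a simple threshold over a time window. This is only an illustration: the class name, the threshold of three errors and the window length are assumptions, and this is not a description of how ESXi or any other hypervisor actually makes the decision.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Retire a page once it reports too many corrected errors within a time window."""

    def __init__(self, threshold=3, window_seconds=3600.0):
        self.threshold = threshold          # assumed policy value
        self.window = window_seconds        # assumed policy value
        self.events = defaultdict(deque)    # page -> timestamps of recent corrected errors
        self.retired = set()

    def record(self, page, timestamp):
        """Record one corrected error; return True if this call retires the page."""
        if page in self.retired:
            return False
        recent = self.events[page]
        recent.append(timestamp)
        # Drop errors that are older than the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            self.retired.add(page)
            del self.events[page]
            return True
        return False


monitor = CorrectedErrorMonitor(threshold=3, window_seconds=60)
for page, when in [(7, 0), (7, 10), (9, 20), (7, 30), (9, 100)]:
    if monitor.record(page, when):
        print(f"page {page} retired at t={when}")
```

In this run only page 7 is retired. Page 9 reports two errors, but they are too far apart to count as a pattern under the assumed window.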

At the same time, such error isolation behavior can report error findings to hypervisor event logs and even trigger administrative alerts for further investigation. This allows the server to continue running until a technician can migrate VMs to other servers and take the troubled system offline for detailed troubleshooting and repair. Even when memory troubleshooting tests are inconclusive, the questionable DIMMs might be replaced preemptively as a matter of course.

Memory is a critical resource for virtualization -- and is often the limiting resource for server consolidation -- but memory technologies are constantly improving. Hypervisors have long supported overcommitment, which can identify and reallocate idle memory. Emerging systems can share common memory content across multiple virtual machines, while compression can cache idle pages without the need for disk swapping. All of these developments contribute to better resource utilization, greater consolidation levels, fewer memory errors and more reliability for virtual machine environments.
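Sharing common memory content across VMs can be pictured as storing each distinct page once and pointing every identical guest page at that single copy. The sketch below is a simplified model under stated assumptions: the `SharedPageStore` class and its methods are hypothetical, pages are identified by a SHA-256 digest, and a write simply remaps the page, which stands in for copy-on-write. Production implementations typically also compare page contents byte for byte before sharing.

```python
import hashlib


class SharedPageStore:
    """Toy model of content-based page sharing across VMs."""

    def __init__(self):
        self.frames = {}     # digest -> page contents (one stored copy per distinct page)
        self.refcount = {}   # digest -> number of guest pages mapped to that copy
        self.mappings = {}   # (vm, guest_page) -> digest

    def write(self, vm, guest_page, data):
        """Map a guest page to the frame holding `data`, reusing an identical frame if present."""
        key = (vm, guest_page)
        self._release(key)
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.frames:
            self.frames[digest] = bytes(data)
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        self.mappings[key] = digest

    def _release(self, key):
        """Drop the old mapping for a guest page and free its frame if unused."""
        digest = self.mappings.pop(key, None)
        if digest is None:
            return
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.frames[digest]
            del self.refcount[digest]

    def read(self, vm, guest_page):
        return self.frames[self.mappings[(vm, guest_page)]]

    def frames_in_use(self):
        return len(self.frames)


store = SharedPageStore()
zero_page = bytes(4096)
store.write("vm1", 0, zero_page)
store.write("vm2", 0, zero_page)          # identical content, so no new frame is needed
print(store.frames_in_use())
store.write("vm2", 0, b"\x01" * 4096)     # vm2 modifies its page and gets its own frame
print(store.frames_in_use())
```

The saving comes from the first two writes: two guest pages are backed by one stored frame until one of the VMs changes its copy.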

This was last published in March 2016


Join the conversation

2 comments


How do you handle memory errors in your virtualized environment?
Memory errors are a big headache for data centers. The memory used in the vast majority of servers today is DDR3 memory. This memory suffers from a known failure mechanism called Row Hammer (Google: DDR Memory Row Hammer). ECC can only detect and correct single-bit errors. It can detect double-bit errors, but beyond that the corruption can go undetected! We have become too reliant on these servers for critical applications. Not enough is being done to create a robust memory subsystem in these servers. I should know... I test these servers and DIMM modules. Scary stuff...
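The ECC limits described in this comment can be illustrated with a toy single-error-correct, double-error-detect (SECDED) code. The sketch below uses an extended Hamming (8,4) code, chosen only because it is small enough to follow by hand; it is an assumption for illustration, since server DIMMs protect much wider words with different codes, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword; index 0 holds the overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over the whole word
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                 # treat as a single-bit error and flip it back
        status = "corrected"
    else:
        return "uncorrectable", None        # nonzero syndrome with even parity: two bits flipped
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flipped bit
print(decode(flip(word, 2, 6)))     # two flipped bits
print(decode(flip(word, 1, 2, 3)))  # three flipped bits
```

With one flipped bit the decoder recovers the original data, and with two it reports an uncorrectable error. With three, this toy code reports "corrected" but returns the wrong data, which is the silent corruption the comment warns about.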