Errors…Corrections Made in an Imperfect World

- written by Brian Cook, Dataram Memory Blog Team Member

Nothing in life is error-free, and computers are no exception.  Computer systems are designed to be very reliable and to give years of trouble-free, dependable service.  However, hiccups can and do happen occasionally.  In the world of computer memory, errors can occur.  Fortunately, many errors are harmless and do not repeat.  They are corrected by the memory controller using error-correcting code (ECC) algorithms and technologies.  Furthermore, memory is one of the most reliable components in a computer system, much more reliable than its mechanical cousin, the hard disk drive.

Working in the Technical Support department at a memory company, you encounter pretty much every type of system and memory problem you can imagine.  Sometimes there is an easy solution to an error message (reseat or swap a DIMM), and sometimes you can troubleshoot a system for hours, only to find out that the problem isn’t memory-related at all!

At Dataram, we go to great lengths in our ISO 9001 certified facility to design, manufacture, and test reliable, compatible memory that delivers great performance.  However, on the rare occasion that you do come across a memory-related error, the following information may be useful to you.

Basically, you can break memory issues down into two types.  Installation failures occur right after you install or upgrade the memory.  Operational failures occur farther down the road, after the system has been running reliably for some time.  Let’s take a look at some examples of failures and how to address them:

Problem:  The system will not pass POST (no video or beeps).

Solution checklist:

  1. Ensure that you have the correct memory for the system you have installed it in.
  2. Check the user manual or contact Dataram Product Support to ensure that the memory is installed in the correct slots.  (Sometimes larger capacity quad-rank DIMMs need to be installed in a certain order.)
  3. If you are mixing capacities within a system, make sure you follow the user manual to ensure the DIMMs are installed in the correct slots.
  4. If you are sure the DIMMs are configured correctly, reseat (remove and re-install) the memory.
  5. Ensure you have the latest revision of the system firmware, BIOS, or OpenBoot PROM.

Problem:  The system passes POST but the BIOS does not count or recognize all of the memory.

Solution checklist:

  1. Reseat the memory.
  2. Ensure you have the latest revision of the system firmware, BIOS, or OpenBoot PROM.  Older firmware can cause systems to fail to recognize brand-new 4GB or larger DIMMs, because those DIMMs were not available when the original firmware was written.
  3. If unsuccessful, try solutions 2 and 3 in the next problem.

Problem:  The BIOS counts the memory correctly and passes POST, but a DIMM or bank of DIMMs is deallocated or disabled by the system processor or OS.

Solution checklist:

  1. Reseat the memory.  (I am starting to sound like a broken record.)
  2. Once you boot to the OS, you should see which slot(s) are deconfigured or disabled.  Replace the DIMM(s) in those slots with spares; that should correct the problem.
  3. If your OS does not display deconfigured memory, you will need to replace one bank at a time until you find the suspect bank.
    Note:  Banks can also be referred to as quads, channels, or groups of DIMMs.
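
The bank-by-bank swap in step 3 is a simple linear search, and it can help to write it down before pulling DIMMs.  The sketch below is only an illustration of that procedure: `find_suspect_bank` and the `passes_with_spare_in` callback are hypothetical names, and the callback stands in for the manual work of swapping in spares, booting, and checking whether the fault is gone.

```python
from typing import Callable, Optional, Sequence


def find_suspect_bank(
    banks: Sequence[str],
    passes_with_spare_in: Callable[[str], bool],
) -> Optional[str]:
    """Swap known-good spares into one bank at a time.

    passes_with_spare_in(bank) is a hypothetical callback: replace that
    bank's DIMMs with spares, boot, and report whether the fault is gone.
    The original DIMMs go back in before the next bank is tried.
    """
    for bank in banks:
        if passes_with_spare_in(bank):
            return bank  # fault disappeared, so this bank held the bad DIMM
    return None  # no single bank cleared the fault


# Example run with a simulated fault in bank "B".
suspect = find_suspect_bank(["A", "B", "C"], lambda bank: bank == "B")
print(suspect)
```

If no bank clears the fault, the problem may involve more than one bank, or it may not be memory-related at all.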

Problem:  The system has run for an extended period of time, but now has an issue or crashes.

Solution checklist:

  1. Check the error logs to see if a DIMM location has been identified with an error.  Error types include:
    • Correctable Error (ECC).  The system is still operational, but the event is logged.
    • Correctable Error count has exceeded the set threshold (types include intermittent, persistent, and sticky).  This DIMM should be replaced, as its behavior indicates it is approaching end-of-life.
    • Chipkill Error.  This is a multiple-bit error that has been corrected.  If repeated, this DIMM should also be replaced.
    • Uncorrectable Error.  None of the existing error-correcting technologies could correct the event.  This DIMM should also be replaced.
  2. To keep your system error-free and extend its life, ensure it is receiving sufficient airflow and cooling.  The system fans must be functioning properly, and the airflow vents must not be blocked.  If the system is equipped with an airflow baffle, make sure it is properly installed.  Monitor the various areas and racks of the datacenter to ensure there are no “hot spots”.  Make sure the inside of the system and the airflow vents are clean.  Dust particles and hair strands can lodge in a DIMM socket, causing failures.
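
The error types listed in the checklist above lend themselves to a simple triage rule.  The following is a minimal sketch, not a Dataram tool: the `DimmErrors` record, the `triage` function, and the threshold of 24 correctable errors are all assumptions made for illustration.  Real thresholds are set by the system vendor, and real counts come from your system's own error logs.

```python
from dataclasses import dataclass

# Hypothetical threshold; real systems use a vendor-defined value.
CE_THRESHOLD = 24


@dataclass
class DimmErrors:
    """Error counts for one DIMM slot, as gathered from the error logs."""
    slot: str
    correctable: int = 0    # errors fixed by ECC and logged
    chipkill: int = 0       # multiple-bit errors that were corrected
    uncorrectable: int = 0  # errors no error-correcting technology fixed


def triage(d: DimmErrors, ce_threshold: int = CE_THRESHOLD) -> str:
    """Return 'replace', 'monitor', or 'ok' for one DIMM."""
    if d.uncorrectable > 0:
        return "replace"  # uncorrectable error: replace the DIMM
    if d.chipkill > 1:
        return "replace"  # repeated Chipkill corrections
    if d.correctable > ce_threshold:
        return "replace"  # correctable count exceeded the threshold
    if d.correctable > 0 or d.chipkill > 0:
        return "monitor"  # corrected and logged; system still operational
    return "ok"


# Example with made-up slot names and counts.
for dimm in [
    DimmErrors("DIMM_A1", correctable=3),
    DimmErrors("DIMM_A2", correctable=30),
    DimmErrors("DIMM_B1", uncorrectable=1),
    DimmErrors("DIMM_B2"),
]:
    print(dimm.slot, triage(dimm))
```

The order of the checks matters: the most serious condition is tested first, so a DIMM with both correctable and uncorrectable errors is always flagged for replacement.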

These tips and solutions will fix most of the problems you encounter.  However, if you’re still having trouble, contact our Product Support team at support@dataram.com or give us a call at 1-800-599-0071, or 609-897-7014 for our international friends.
