
Memory testing: A quick look
NAND Flash, like mechanical storage devices, is by default assumed to be unreliable – an unusual situation in the electronics world. Its unreliability is dealt with by using dedicated controllers. DRAM, on the other hand, is deemed "pretty" reliable. Servers often have error detection (and possibly correction) circuitry, but consumer and commercial machines rarely do. I’ll focus on DRAM here.
While most fine-geometry electronics is subject to radiation-induced SEUs (single event upsets), DRAM has the extra disadvantage of being, at heart, an analog technology. Billions of tiny capacitors must maintain their charge without fail, or at least until the next refresh. If one of those caps is a bit leakier or smaller than it should be, it could result in unreliable operation. What’s maddening is that the failure mechanism can be intermittent, or data-related.
As memory interfaces get ever faster, it behooves us to run eye tests just as we would on a gigabit serial channel. For example, here's a DDR3 eye at 1.33 Gb/s:

[Figure: DDR3 data eye diagram at 1.33 Gb/s]

What test data should we run, whether software- or hardware-generated, to exercise the eye? Pseudorandom is a good start.
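For a flavor of what that pseudorandom data might look like, here is a minimal sketch (my own illustration, not any particular controller's test logic) of an LFSR-based PRBS-15 generator in C. The polynomial x^15 + x^14 + 1 is just one common choice; the same structure maps directly to a handful of flip-flops and an XOR gate if you want the pattern generated in hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* PRBS-15 generator (x^15 + x^14 + 1), one common choice for link and
 * memory stress patterns. Each call shifts in one new pseudorandom bit. */
static uint16_t prbs15_next(uint16_t *state)
{
    uint16_t bit = ((*state >> 14) ^ (*state >> 13)) & 1u;   /* taps 15 and 14 */
    *state = (uint16_t)(((*state << 1) | bit) & 0x7FFFu);    /* keep 15 bits  */
    return bit;
}

/* Pack 16 PRBS bits into each data word we would drive onto the bus. */
static uint16_t prbs15_word(uint16_t *state)
{
    uint16_t w = 0;
    for (int i = 0; i < 16; i++)
        w = (uint16_t)((w << 1) | prbs15_next(state));
    return w;
}

int main(void)
{
    uint16_t state = 0x7FFF;            /* any nonzero seed works */
    for (int i = 0; i < 8; i++)
        printf("0x%04X\n", (unsigned)prbs15_word(&state));
    return 0;
}
```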
My interest in memory testing dates all the way back to my first computer. After suffering with 512 bytes of memory for a year, I added a 32kB DRAM board! Whether due to bad board design or really questionable DRAM chip quality, memory errors were frequent, and sometimes subtle. Simple patterns would detect the worst of the errors, but it wasn't until I implemented a pseudorandom test that the rarest errors appeared.
Were these errors due to bad eyes? Probably not. More likely, they arose from pattern sensitivity, where data bits on the DRAM chip interact in some way known only to the chip’s designers. Though given the state of PCB design ca. 1980, it’s also possible that poor power integrity was the root cause.
As end users of DRAM-bearing machines, we can still perform thorough memory testing through software-only means, such as the excellent Memtest86 program, which includes pseudorandom among its many tests. The bottom line, in my experience, is that pseudorandom is the best way to track down flaky errors that don’t show up using other methods.
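To make the software-only idea concrete, here is a minimal sketch of a pseudorandom read-back test: fill a region from a seeded generator, then re-seed and compare, so the expected data never has to be stored anywhere. The xorshift64 generator and the sizes are my own choices for illustration, not Memtest86's actual algorithm.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* xorshift64: a small, fast, repeatable PRNG; any seedable generator would do. */
static uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

/* Fill the region from a seeded PRNG, then re-seed and compare on read-back.
 * Returns the number of mismatching words. */
static size_t prandom_test(uint64_t *mem, size_t nwords, uint64_t seed)
{
    uint64_t s = seed;
    for (size_t i = 0; i < nwords; i++)
        mem[i] = xorshift64(&s);

    s = seed;                           /* regenerate the same sequence */
    size_t errors = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t expect = xorshift64(&s);
        if (mem[i] != expect) {
            printf("mismatch at word %zu: got 0x%016llx, expected 0x%016llx\n",
                   i, (unsigned long long)mem[i], (unsigned long long)expect);
            errors++;
        }
    }
    return errors;
}

int main(void)
{
    size_t nwords = (size_t)1 << 20;               /* 8 MB test region */
    uint64_t *buf = malloc(nwords * sizeof *buf);
    if (!buf)
        return 1;
    printf("%zu error(s)\n", prandom_test(buf, nwords, 0x243F6A8885A308D3ull));
    free(buf);
    return 0;
}
```

Run from user space like this, the test mostly exercises the caches and whichever physical pages the OS happens to hand out; a tool like Memtest86 boots bare-metal precisely so it can walk nearly all of physical memory.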
As designers of a memory subsystem, we can go beyond this approach and get probing. Driving a pseudorandom pattern not only creates a proper eye diagram for our scope, it will also likely uncover any pattern-sensitivity problems (though pattern testing is more of a per-unit check, whereas you'll only be eye testing during development).
What kind of memory failures have you encountered? And what tests have you implemented for verification and production?
