Benjamin Hodgens

A little bit of Web 2.0 nothing.

The Fallacy of Flash Memory Based Storage Longevity (for Storage and Backup)



4-10-2009

So-called netbooks, with their flash-based storage (called SSDs), are quite the rage right now. I've always been interested in small, portable, quiet, low-power computing devices, and so I've been watching this area of technology develop. (More casual technologists might know flash media as a USB pen drive, Compact Flash, SD, microSD, and so on.) I have, in fact, been using an older NEC MobilePro 780 for roughly 6 months with a swap partition on CF with no ills (so far). But...

There have been a number of threads on Slashdot lately which have delved into flash storage media: longevity, speed, and so on. It got me thinking about testing various flash media, so I started researching how best to write a small program to test the real-world longevity of flash storage under long-term use.

Largely due to its relatively recent introduction to the 'cheap' segment of the consumer market and its use in many trendy devices such as MP3 players, netbooks, and so on, there appear to be a lot of misconceptions about flash memory's utility. Some of the misstatements I've heard about flash-based storage are:

This last claim - that it'll fail gracefully - is the claim which is most likely to cause problems for users. So, as part of my tests, the key things I wanted to determine were:

Unfortunately, as I started writing my program, I ran into quite a bit of information suggesting that it's all but pointless to test something like repeated writes to a specific sector of the memory to see how long it takes to fail. There are a number of reasons, derived from a myriad of blog and forum posts:

Basically, flash memory devices are too heavily abstracted from the actual memory for the operating system to usefully track things like failures or error rates.
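To make that concrete, a naive write-and-verify loop of the sort discussed above might look roughly like the Python sketch below. This is an illustration only, not the program mentioned earlier: the function name `write_verify_cycles` and its parameters are made up for the example, and it targets an ordinary file rather than a raw sector.

```python
import os


def write_verify_cycles(path, cycles=100, block_size=4096):
    """Rewrite `path` with random data `cycles` times; count read-back mismatches.

    Hypothetical helper for illustration. It only sees logical file contents,
    never the physical flash blocks behind them.
    """
    mismatches = 0
    for _ in range(cycles):
        data = os.urandom(block_size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            # Ask the OS to push the data to the device. The controller may
            # still place it on whichever physical block it likes.
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            # Caveat: this read may be served from the OS page cache
            # instead of the flash medium itself.
            read_back = f.read()
        if read_back != data:
            mismatches += 1
    return mismatches
```

Even if something like this ran for weeks against a file on a flash card, the controller would be free to map each rewrite to a different physical block, so the loop says very little about the endurance of any one cell.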

I did actually run into the results of a similar test on flash storage to the one I had intended to run. Ultimately, I don't think this approach to testing is effective due to the multitude of different and inconsistent methods flash media vendors use for wear leveling. If they were consistent, this test would hold more weight outside of that specific vendor and card, but as it stands, I do not think so.

It'll certainly be worth looking at again when there's a standard interface for flash memory controllers so the OS can access the actual hardware.

I have personally never had flash memory fail on me, but then I have always viewed it with a skeptical eye and have had little use for it (networking, baby!). However, here are some of the different ways I've heard of them failing:

I haven't seen a correlation between failures and brand, age, quality/price, or anything like that. I'm not saying it doesn't exist, I just haven't been able to reach anything conclusive: I'd need funding to pursue that avenue further. :P

Here are some choice quotes from various write-ups I've found which speak to this issue:

From "To SSD or not to SSD?":
Another fascinating aspect of flash-based SSD is that you don't seem to get any report of checksum failures on corruption - at least I haven't seen one in the three confirmed cases of flash corruption I've seen. I don't know if this is because the device isn't reporting it or because the OS driver isn't listening for it, but it's what happens.

At present time, many disks still have higher bandwidth than many SSDs. SSDs still have a performance penalty for non-sequential IO vs. sequential IO - not as high as a disk seek, but enough to drop throughput by a factor of two or so. They also have high overhead for small random writes due to the need to erase the entire erase block the target block is located in. So SSD will beat the pants off of a disk on an uncached random read workload (e.g., system boot-up), but disks have the advantage on streaming reads and both streaming and random writes, generally speaking.
When it comes to flash, manufacturers are handicapped when predicting long-term failure rates for a number of reasons. First, it's hard to extrapolate failure rates under stress tests to long-term failure rates. In particular, the failure mode of a flash cell is that the charge leaks out of the cell - slowly, over time. Stress testing by writing to the cell a lot and then reading the data back is not going to test this situation. In general, we simply don't have a lot of experience with testing flash and it will take a few years to build it up.
From further down in the comments on the same thread, by Anon:
Both JFFS2 and UBIFS are designed for use on raw NAND, but the chip industry trend is toward "managed NAND" - SSDs and similar schemes like eMMC and LBA-NAND. The reason for the trend is because the market is pushing for ever greater sizes, which translates in hardware to smaller process geometries, larger FLASH page sizes, and wider ECC. All of those factors end up requiring different interface chips, which hurts the ability of the manufacturers to deploy their new chips in existing designs. With the "managed NAND" approach, the hardware details can be hidden behind a built-in microprocessor. The same sort of thing happened 25 years ago in the rotational disk arena. "Raw" disk interfaces like ST-506 gave way to "smart" interfaces like IDE and SCSI. Once that trend started, the raw interfaces disappeared very quickly. While it's true that many Flash Translation Layer implementations suck, that's not the same as saying that they all suck. Hopefully, with the ever-increasing importance of Flash storage, the sucky implementations will start to go by the wayside.
A PDF by Micron on wear leveling:
Wear leveling can help extend the useful life of NAND Flash devices and is often necessary to ensure that the devices reach the specified endurance rating by equalizing the wear of good blocks. The use of wear-leveling techniques is imperative in NAND Flash devices, regardless of the individual device’s endurance rating. The most effective wear-leveling method is static wear leveling because it typically provides more uniform block usage than dynamic wear leveling. Although dynamic wear leveling is typically inferior to static wear leveling, this method is easier to implement and can still provide enough wear leveling to meet the needs of many applications.
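
The difference between the two approaches in that last quote can be shown with a toy simulation. The sketch below is a simplified model under my own assumptions, not Micron's algorithm or any real controller's firmware: `simulate_wear` and its parameters are invented for the example, some blocks are assumed to hold "cold" data that is never rewritten, and every write is assumed to cost exactly one block erase.

```python
def simulate_wear(num_blocks, cold_blocks, writes, static):
    """Return per-block erase counts after `writes` writes (toy model).

    Dynamic leveling (static=False) only rotates writes among blocks that
    do not hold cold data. Static leveling (static=True) may also pick a
    cold block, relocating its data to the most-worn hot block first.
    """
    if not 0 < cold_blocks < num_blocks:
        raise ValueError("need at least one cold and one hot block")
    erase_counts = [0] * num_blocks
    cold = set(range(cold_blocks))  # blocks currently holding cold data
    for _ in range(writes):
        if static:
            candidates = range(num_blocks)
        else:
            candidates = [b for b in range(num_blocks) if b not in cold]
        # Always write to the least-worn candidate block.
        target = min(candidates, key=lambda b: erase_counts[b])
        if target in cold:
            # Move the cold data onto the most-worn hot block, which costs
            # one extra erase there, and free up the lightly worn block.
            hot = [b for b in range(num_blocks) if b not in cold]
            dest = max(hot, key=lambda b: erase_counts[b])
            erase_counts[dest] += 1
            cold.remove(target)
            cold.add(dest)
        erase_counts[target] += 1
    return erase_counts


dynamic = simulate_wear(10, 5, 1000, static=False)
static = simulate_wear(10, 5, 1000, static=True)
print("dynamic spread:", max(dynamic) - min(dynamic))
print("static spread: ", max(static) - min(static))
```

With 10 blocks, 5 of them cold, and 1000 writes, the dynamic policy leaves the cold blocks at zero erases while the five hot blocks take 200 each. The static policy keeps all blocks within a couple of erases of one another, at the price of extra erases for relocating cold data. Real controllers are far more involved, and vary by vendor, which is the whole problem.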

Conclusion

As far as a conclusion, or anything like that? To each their own, but I will be staying away from flash storage for some time to come, I think. Oh, I'll be using it, but I will not be using it for serious storage, i.e., something I expect to Reasonably Not Die. If it's important, I'll back it up on a different type of media, and never have just that one copy at any time. I suggest everyone else do the same.

Basically, I'm deciding to view flash storage, particularly the USB variety, as the floppy drive replacement. This is how most people use pen drives, anyway: The New Floppy. They're much, much more useful than a floppy ever was (aside from the irritating bootability problems which aren't fully sorted out). But they're not yet at a place where they're predictably dependable, and there's just too much variance right now to make any sort of bet in favor of the medium.
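
Since flash corruption reportedly doesn't always surface as an error (see the first quote above), it seems worth checking backup copies yourself instead of trusting the device. Here is a minimal sketch of a copy-then-verify step in Python; the helper names `sha256_of` and `backup_and_verify` are my own for this example, and the paths are whatever you pass in.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(src, dst):
    """Copy `src` to `dst` and report whether the two digests match."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

One caveat: right after a copy, the read-back may come from the operating system's cache and not from the medium, so a check like this is more meaningful when repeated later, or after remounting the device.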

Update:

Here is a pastebin of an Asus Eee's secondary SSD failing with the ext3 filesystem. From several accounts I've seen, ext3 will sometimes report write failures (depending on the specific flash media used).

Here is a Slashdot thread where I ask this question and get a fair amount of concurrence with the above information.