The Fallacy of Flash Memory Based Storage Longevity (for Storage and Backup)
4-10-2009
So-called netbooks, with their flash-based storage (called SSDs), are quite the rage right now. I've always been interested in small, portable, quiet, low-power computing devices, and so I've been watching this area of technology develop. (More casual technologists might know flash media as a USB pen drive, Compact Flash, SD, microSD, and so on.) I have, in fact, been using an older NEC MobilePro 780 for roughly 6 months with a swap partition on CF with no ills (so far). But...
There have been a number of threads on Slashdot lately which have delved into discussion of flash storage media - longevity, speed, and so on. It got me thinking about testing various flash media, so I started to do some research on how best to write a small program to test the real-world longevity of flash storage under long-term use conditions.
Largely due to its relatively recent introduction to the 'cheap' segment of the consumer market and its use in many trendy devices such as MP3 players, netbooks, and so on, there appear to be a lot of misconceptions about flash memory's utility. Some of the misstatements I've heard about flash-based storage are:
- Flash storage is much faster than a hard drive (it is, but only in some conditions)
- It's solid-state, so it won't fail like a hard drive (no, it'll fail differently)
- It is non-volatile, and it will fail gracefully, so you will know when it's failing and can simply stop trying to write to it, and recover the data whenever you feel like it.
- Is there any veracity in the claims that flash shouldn't be used for swap?
- Will flash really hold up to the number of writes advertised?
- When flash memory fails, how will it do so - gracefully, violently, unexpectedly, transparently?
- Flash drives do wear leveling, which distributes writes over the device.
- Flash drives all do wear leveling differently (and some very poorly), so the results could not be even loosely used to determine "flash memory" failure rates.
- Even if you determined that X card or pendrive from vendor Y fails at a high rate, it would have little to do with other products from even the same vendor, as they change their memory vendors frequently.
- When flash memory fails, it fails in unpredictable ways.
- By far, the most likely apparent way for flash storage to fail is through human interaction: handling, dropping, inserting, and general physical-wear related death. The most likely cause is electrostatic shock. In these cases, it would seem that the device is dead or the filesystem is wiped.
- A silent, catastrophic failure seems to be the second most likely. Data starts getting corrupted, or is simply "not written", long before errors are reported. This may, or may not, be related to wear leveling designed specifically for FAT32 filesystems, as I've heard of this happening in both Windows and Linux (with ext2/3).
- A gradual, graceful failure, where writes start to fail but the data on the device remains intact. I've only heard of this once or twice.
Another fascinating aspect of flash-based SSD is that you don't seem to get any
report of checksum failures on corruption - at least I haven't seen one in the
three confirmed cases of flash corruption I've seen. I don't know if this is
because the device isn't reporting it or because the OS driver isn't listening
for it, but it's what happens.
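If the device and driver stay silent, one way to notice this kind of corruption yourself is to keep your own checksums and compare them on read-back. The following is only a minimal sketch of that idea, not a real test harness: the 4 KiB block size, the file name, and the function names are all assumptions made for illustration. Note also that a read-back done right after the write is likely to be served from the OS page cache rather than the flash itself, so a meaningful test would have to remount or power-cycle the device before verifying.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size for this sketch, not a property of any device


def write_blocks(path, num_blocks):
    """Write random blocks to path and return the SHA-256 digest of each block."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(num_blocks):
            block = os.urandom(BLOCK_SIZE)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    return digests


def find_corrupt_blocks(path, digests):
    """Re-read the file and return the indexes of blocks whose digest no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(digests):
            block = f.read(BLOCK_SIZE)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(index)
    return bad


if __name__ == "__main__":
    # Demo on a temporary file; to test real media, use a path on the flash device.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "flash_test.bin")
        digests = write_blocks(path, 8)
        print(find_corrupt_blocks(path, digests))
```

Keeping the digests somewhere other than the device under test matters here, since a failing card could just as easily corrupt the checksums as the data.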
At present time, many disks still have higher bandwidth than many SSDs. SSDs
still have a performance penalty for non-sequential IO vs. sequential IO - not
as high as a disk seek, but enough to drop throughput by a factor of two or so.
They also have high overhead for small random writes due to the need to erase
the entire erase block the target block is located in. So SSD will beat the
pants off of a disk on an uncached random read workload (e.g., system boot-up),
but disks have the advantage on streaming reads and both streaming and random
writes, generally speaking.
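The small-random-write overhead can be put into rough numbers. This is a back-of-the-envelope sketch only: it assumes the worst case, where every small write forces the whole erase block to be rewritten, and the 128 KiB erase block and 4 KiB write sizes are illustrative figures that do not come from the comment above. Real controllers buffer and remap writes, so actual overhead varies.

```python
def write_amplification(write_bytes, erase_block_bytes):
    """Worst-case ratio of bytes physically rewritten to bytes the host asked to write.

    Assumes the write is aligned to the start of an erase block and that every
    erase block it touches must be erased and rewritten in full.
    """
    blocks_touched = -(-write_bytes // erase_block_bytes)  # ceiling division
    return blocks_touched * erase_block_bytes / write_bytes


if __name__ == "__main__":
    KIB = 1024
    print(write_amplification(4 * KIB, 128 * KIB))    # small random write: 32.0
    print(write_amplification(128 * KIB, 128 * KIB))  # full-block streaming write: 1.0
```

Under these assumptions a 4 KiB write costs 32 times its own size in flash traffic, while a write that fills a whole erase block costs nothing extra.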
When it comes to flash, manufacturers are handicapped when predicting long-term
failure rates for a number of reasons. First, it's hard to extrapolate failure
rates under stress tests to long-term failure rates. In particular, the failure
mode of a flash cell is that the charge leaks out of the cell - slowly, over
time. Stress testing by writing to the cell a lot and then reading the data
back is not going to test this situation. In general, we simply don't have a
lot of experience with testing flash and it will take a few years to build it
up.
From further down in the comments on the same thread, by Anon:
Both JFFS2 and UBIFS are designed for use on raw NAND, but the chip industry
trend is toward "managed NAND" - SSDs and similar schemes like eMMC and
LBA-NAND. The reason for the trend is because the market is pushing for ever
greater sizes, which translates in hardware to smaller process geometries,
larger FLASH page sizes, and wider ECC. All of those factors end up requiring
different interface chips, which hurts the ability of the manufacturers to
deploy their new chips in existing designs. With the "managed NAND" approach,
the hardware details can be hidden behind a built-in microprocessor.
The same sort of thing happened 25 years ago in the rotational disk arena.
"Raw" disk interfaces like ST-506 gave way to "smart" interfaces like IDE and
SCSI. Once that trend started, the raw interfaces disappeared very quickly.
While it's true that many Flash Translation Layer implementations suck, that's
not the same as saying that they all suck. Hopefully, with the ever-increasing
importance of Flash storage, the sucky implementations will start to go by the
wayside.
A PDF by Micron on wear leveling:
Wear leveling can help extend the useful life of NAND Flash devices and is
often necessary to ensure that the devices reach the specified endurance
rating by equalizing the wear of good blocks. The use of wear-leveling
techniques is imperative in NAND Flash devices, regardless of the individual
device’s endurance rating. The most effective wear-leveling method is static
wear leveling because it typically provides more uniform block usage than
dynamic wear leveling. Although dynamic wear leveling is typically inferior
to static wear leveling, this method is easier to implement and can still
provide enough wear leveling to meet the needs of many applications.
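To get a feel for the difference Micron describes, here is a toy simulation of the two approaches. It is a deliberately crude model and not how any real controller works: the block counts, the rule that "cold" blocks hold data that never changes, and the relocation policy are all assumptions of mine. Dynamic leveling here only rotates writes among blocks that do not hold cold data, while static leveling is also allowed to move cold data out of a lightly worn block so that block can take new writes.

```python
def simulate(num_blocks, cold_blocks, writes, static):
    """Return per-block erase counts after `writes` hot-data writes.

    Toy model: the first `cold_blocks` blocks start out holding cold data that
    the host never rewrites; every hot write costs one erase on its target block.
    """
    erases = [0] * num_blocks
    cold = set(range(cold_blocks))
    for _ in range(writes):
        if static:
            # Static leveling: consider every block, including ones holding cold data.
            target = min(range(num_blocks), key=lambda b: erases[b])
            if target in cold:
                # Move the cold data onto the most-worn hot block (one extra erase),
                # which frees the lightly worn block for the new write.
                dest = max((b for b in range(num_blocks) if b not in cold),
                           key=lambda b: erases[b])
                erases[dest] += 1
                cold.remove(target)
                cold.add(dest)
        else:
            # Dynamic leveling: only blocks without cold data are candidates.
            target = min((b for b in range(num_blocks) if b not in cold),
                         key=lambda b: erases[b])
        erases[target] += 1
    return erases


if __name__ == "__main__":
    dynamic = simulate(num_blocks=10, cold_blocks=8, writes=1000, static=False)
    static = simulate(num_blocks=10, cold_blocks=8, writes=1000, static=True)
    print("dynamic: most-worn block has", max(dynamic), "erases")
    print("static:  most-worn block has", max(static), "erases")
```

In this model the dynamic scheme piles all the wear onto the few blocks not holding cold data, while the static scheme spreads it across the whole device at the cost of some extra erases for relocation. That matches the trade-off in the Micron text, though the actual numbers mean nothing outside the toy.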
Conclusion
As far as a conclusion, or anything like that? To each their own, but I think I will be staying away from flash storage for some time to come. Oh, I'll be using it, but I will not be using it for serious storage - i.e., something I expect to Reasonably Not Die. If it's important, I'll back it up on a different type of media, and never have just that one copy. I suggest everyone else do the same. Basically, I'm deciding to view flash storage, particularly the USB variety, as the floppy drive replacement. This is how most people use pendrives, anyway: The New Floppy. They're much, much more useful than a floppy ever was (aside from the irritating bootability problems which aren't fully sorted out). But they're not yet at a place where they're predictably dependable, and there's just too much variance right now to make any sort of bet in favor of the medium.
Update: Here is a pastebin of an Asus Eee's secondary SSD failing with the ext3 filesystem. From several accounts I've seen, ext3 will sometimes report write failures (depending on the specific flash media used). Here is a Slashdot thread where I ask this question and get a fair amount of concurrence with the above information.
(c) 2009 Benjamin Hodgens