Investigation: Is Your SSD More Reliable Than A Hard Drive?

AngryParsley · on Sept 19, 2011

One thing the article didn't address is performance over time. Even with TRIM support, SSDs get slower as they're used. Occasionally, cells die prematurely causing spare capacity to decrease. This won't affect read speeds much but it will hurt writes. These older SSDs will still be faster than hard drives but they won't be as fast as you'd expect them to be.

Now for some of my own data. Here's an Intel X25-M G2 after a lot of usage: http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Soli... and here it is after an ATA secure erase: http://abughrai.be/pics/ssd_erase/Screenshot-160%20GB%20Soli...

The G2 has TRIM support and this drive was used on an ext4 filesystem with TRIM enabled. After the erase, performance was almost back to that of a pristine drive. There was a 6GB swap partition on the drive as well as the ext4 partition. I'm pretty sure swap partitions aren't trimmed, so that could have been the reason for the performance degradation.

Artagra · on Sept 19, 2011

I agree completely that, even with TRIM, the benchmarked performance of an SSD decreases over time. However, one thing to remember is that the real-world performance remains very similar, and that the average person won't be able to tell the difference. I recently had two identical Macbook Pro's next to each other, one with a Intel X25-M G1 drive in it that had been used for nearly 18 months, and the second with a fresh out the box X25-M G2. Even taking into account the fact that the G2 was factory fresh, and the G1 (a drive that does not support TRIM) was significantly slower benchmark wise, the experience of using the two machines was almost identical. There were differences, but they were very slight. I doubt I could have reliably picked which was which in a double blind test.

So for desktop use, I wouldn't worry about the performance over time of a good SSD.

caf · on Sept 19, 2011

The kernel should really issue a large TRIM for most of the swap partition at swapon time.

caf · on Sept 20, 2011

...and in fact, it does so: http://lxr.linux.no/#linux+v3.0/mm/swapfile.c#L2117 (and has since kernel 2.6.29).

Additionally, if you set the SWAP_FLAG_DISCARD flag to swapon(2), it'll issue TRIMs for swap space that is freed at runtime. You'll need a pretty recent /sbin/swapon to support that flag, though.

nordsieck · on Sept 19, 2011

Once upon a time, it was best practice to keep the swap file/partition (or log files, etc.) off of the SSD in order to keep the frequent writes from wearing out the SSD too quickly. Do you feel like this is outdated advice?

lloeki · on Sept 19, 2011

Certainly putting the swap not as a partition but as a file on the filesystem will help with TRIM and maybe even wear leveling.

caf · on Sept 20, 2011

It won't make a difference - in both cases, the swap code accesses the underlying block device directly, and is responsible for TRIM. (The swap code does issue TRIMs though, since kernel 2.6.29).

shanekenney · on Sept 19, 2011

What would you define as a lot of usage?

AngryParsley · on Sept 19, 2011

I'd say it got about 5GB/day of mostly-sequential writes for a year, and I base that on absolutely no quantifiable data. It was used for lots of development; compiling stuff, running an IDE and browser in addition to the full stack of stuff in production. It had MySQL with a prod snapshot, RabbitMQ, a Django web app and a bunch of twisted services. The biggest writing culprits were restoring MySQL snapshots and ridiculously verbose logging from services.

JoeAltmaier · on Sept 19, 2011

Most raid controllers don't support TRIM on arrays. Were you using a raid set perhaps?

jrockway · on Sept 19, 2011

I don't know how the data works out for this, but I imagine that SSDs in laptops fail a lot less often than spinners do. "Enterprise use" is the perfect place for rotating storage: the machines aren't dropped off desks, they aren't power-cycled very often, and they aren't stuffed into backpacks while turned on (nearly causing a fire). SSDs don't care much about these things, while spinning disks do. So I imagine that if you have an SSD in your laptop, you are less likely to experience drive failure.

(Another nice thing: those new micro-SATA SSDs are small enough that you can conceivably RAID-1 them in a tiny laptop!)

bengl3rt · on Sept 19, 2011

At the moment, the reliability problem with SSDs stems entirely from firmware bugs, rather than the underlying flash technology. All the issues you hear about with regards to drives causing blue screens or simply failing to be recognized by the system at all after a while are issues with the firmware on the controller chip - the actual flash chips themselves are pretty dumb and rarely fail catastrophically.

This will get better with time, as SSD firmware accumulates the kind of run time (in terms of number of hours x number of units in use) over years that HDD firmware has had.

kabdib · on Sept 19, 2011

It would be nice to see the SSD manufacturers get a clue about recovering data. Having the device fail catastrophically is nuts; it means that all of the carefully designed recovery schemes in the file system are basically worthless.

Yuck.

bengl3rt · on Sept 19, 2011

Agreed. Right now it is a performance arms race - IOPS trumps everything.

However, bigger and bigger enterprise vendors are doing rigorous qualification tests on SSDs for white-labeling purposes and demanding that the firmware be bulletproof. Admittedly the enterprise workload is a bit different from a laptop (servers never go to sleep) but they are the ones accumulating the most runtime given 24 hour uptimes.

cpswan · on Sept 19, 2011

Indeed. Perhaps I have been unlucky, but of 5 SSDs that I've ever touched 2 have had catastrophic controller failures.

HaloZero · on Sept 19, 2011

The short summary is that nobody had used SSDs in any large scale for more than 2 years and the rates of failure for SSDs in that period of time is similar to regular hard drives.

Also apparently OCZ SSDs aren't as good as Intel SSDs (based on customer return rates).

mrb · on Sept 19, 2011

I get annoyed when people make generic claims that SSDs from vendor X are not as reliable as vendor Y.

The truth is that each vendor manufactures multiple lines of SSDs, each running different major firmware versions, with different hardware revisions. So reliability variance between models from a single vendor is often much more significant that variance between different vendors.

In your experience with OCZ, which model do you have experience with? Vertex 2? 3? Agility? Enterprise-class Talos? What about the Intel ones? Were they 50nm NAND-based SSDs? 34nm? You need to share some of these details. Claiming that Intel > OCZ is devoid of significance.

biot · on Sept 19, 2011

The article makes those claims, based mostly on anecdotal evidence rather than a peer-reviewed scientifically rigorous study. It concludes with: "Giving credit where it is due, many of the IT managers we interviewed reiterated that Intel's SLC-based SSDs are the shining standard by which others are measured."

While an itemized breakdown by datacenter isn't available one datacenter (InterServer.net, with fewer than 100 SSDs) says: "Intel SSD's are night and day in failure rates when it comes to some other drives. For example the SuperTalent SSD drives have had an extremely high failure rate including model FTM32GL25H, FTM32G225H, and FTM32GX25H. I estimate about two-thirds of these drives have failed since being put into service. With these failures however, the drives were not recoverable at all. They generally disappeared completely, no longer being readable. Spinners die much more gracefully with an easier disk recovery. I cannot compare this to the Intel's SSDs yet since I have not experienced any failures.". They apparently use Intel's X25-E (SSDSA2SH032G1GN).

moe · on Sept 19, 2011

They generally disappeared completely, no longer being readable. Spinners die much more gracefully with an easier disk recovery.

That's a weird thing to say. In a datacenter you want your disks to be fail-fast. I.e. you want them to drop out of the array early and cleanly.

Nobody "recovers" harddrives in a datacenter, you just swap them.

The worst failure-mode (that is still way too commonly seen) is that of the "lame drive" that grows bad sectors, predictive failures and all that jazz over time, causing bus-resets, confused controllers and bringing the whole array to a crawl until a human steps in.

That said, my experience with SSDs in that regard isn't too good either. I've seen them fail cleanly but I've also seen a single failing Vertex lock up a LSI-controller hard...

Artagra · on Sept 19, 2011

I agree - you can't just look at brand, as there are often a lot of issues that are specific to individual models. Having said that, my experience selling OCZ drives (100s of drives) is that they all have a high failure rate :( The only drives I haven't had experience with are the Enterprise level drives, so I can't comment on them.

I think the reason for this is that OCZ largely uses Indilinx and Sandforce controllers, and while very quick, they have had more than their fair share of issues.

r00fus · on Sept 19, 2011

Consider this: the hard drive is the least reliable component of almost any modern PC/laptop. Add to this that it likely contains the most valuable non-commodity asset: your data.

Anyone not backing up their system is really asking for trouble... which will happen. Given the ease of use of modern backup systems, and their cost (free only costing a modest amount of your time), everyone should be doing it. OSX and Win7 make you feel guilty for not doing it (though OSX's version is better, both deliver on basic backups).

All this said, the difference between an SSD and an HD is about zero when it comes to real reliability. Both will fail at odd times, and you should have a backup, preferably bootable, to get you back to good. An external drive with a system-imaged startup disk (free for all major desktop OSs) is quite cheap to maintain.

rue · on Sept 19, 2011

> Consider this: the hard drive is the least reliable component of almost any modern PC/laptop.

Is it really? Completely anecdotally I've never lost a HDD, but keep going through power supplies… it'd be nice to see some data on component failure rates. I assume some exists, but I haven't really felt strongly enough about it to go look.

drzaiusapelord · on Sept 19, 2011

I've worked in IT for several years. My own fail rate is one or two drives per year in a 50 computer environment. So in a 150 computer environment its 3 to 6 a year. That's desktops/laptops.

In servers its a completely different game. Thanks to A/C and steady loads its much, much better. There is a chance of getting a run of bad disks and suddenly having multiple failures a year, but only on that specific model. Generally, I'd say that rate is closer to .25-.5 fails for every 50 drives per year, if that. So over 4 years I can expect one or two drive fails on 50 disks.

Regardless, drives fail all the time on desktops and laptops. The reliability is a huge, huge problem. Supposedly, SSDs were going to fix this, but their teething problems are probably making them worse than spinning disks.

jsnell · on Sept 20, 2011

It stands to reason that people would think that hard drives are the least reliable component. A bad hdd tends to be very obvious, and for anyone with bad backup hygiene also memorable. Something like a flaky ram stick just makes the machine crash a little more often.

That said, on my desktop I run a raid-1 setup. When a drive fails, I immediately replace the whole pair. Just lost the third one in 6 years this Saturday. And there isn't even any infant mortality distorting the stats.

mikeash · on Sept 19, 2011

Completely anecdotally, I've had a few hard drives fail (some spontaneous, some due to physical abuse that solid state hardware would tolerate) and never seen a power supply go. I suspect that power supplies are something where cheap ones will fail a lot and good, expensive ones will outlive the person buying it. Hard drives, on the other hand, have a decent risk of failure no matter how much money you spend.

kabdib · on Sept 19, 2011

Never lost a hard drive (in ~ 15 laptops over 20 years). Not once.

I keep losing mechanicals (lid hinges) and other mechanicals, though.

viraptor · on Sept 19, 2011

> and their cost (free only costing a modest amount of your time)

Unfortunately it is not. You pay for it either in bandwidth and service price (or in reliability of free service) or by buying a new harddrive.

Cost of a 500GB drive is rather far from "free".

Someone · on Sept 19, 2011

Given the ease of use of modern backup systems

They have improved, but I do not think we are there yet. For example, I do not know of any system that regularly verifies backups. Time Machine, for example, could lose a directory of photos from five years ago without noticing it. If, then, your main disk fails, The photos are, for all practical purposes, gone forever.

wglb · on Sept 19, 2011

I disagree. In my collection of always-on 30 machines in my basement, over the last three years I have had probably 10 power supplies die, and only two hard drives. The hard drives that died--two 500 gig drives--died in the same month after just two years. I don't keep spare hard drives, but i do keep spare power supplies.

r00fus · on Sept 19, 2011

Interesting anecdote. I would argue that your always-on machines are not similar to an average PC/laptop.

Perhaps your situation is more analogous to servers (most mid-high end servers come with redundant power supplies).

wglb · on Sept 21, 2011

In the last enterprise environment I worked at, users were asked to leave their machines always, largely for software delivery and other updates,

Probably what is different about my setup is that my cpu meters mostly run at 100%.

gnaffle · on Sept 19, 2011

I'm finding it hard to believe that the difference is about zero. I think it depends _a lot_ on the usage pattern, and while your statement might be true for the majority of users, some people _are_ experiencing reliability issues with SSDs: http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid...

This is only reasonable to expect when one technology has a write cycle limit and the other does not. A hard drive will give better reliability in write intensive applications, while an SSD will beat a hard drive in mobile/portable applications.

Of course, for most people buying an SSD, the increased performance easily outweighs any reliability "cost".

JoeAltmaier · on Sept 19, 2011

tl;dr SSD failure rates are no better than most hard drives. Intel failure rates only report validated errors(returned drive fails in Intel test), real rate probably 2-3X higher. SMART doesn't work for SLC SSD (doesn't detect failures early enough to recover). Update your firmware often as failures initially were mostly bugs and not write-failures.

kalleboo · on Sept 19, 2011

All these statistics come from server use, where drives are constantly spinning and kept at a relatively constant temperature. I'm curious if there's a bigger difference in laptop computers, where the drives see a lot more physical abuse, power cycles, temperature variations, etc.

mawelsh · on Sept 19, 2011

I haven't quite been able to grasp why Toms Hardware, which I view largely as a consumer/enthusiast review site would choose to analyze the failure rates of SSD's in a datacenter environment. How about going to IT departments that have "floater" laptops and seeing how long the drives hold up in those? I think the primary consumer reasons for purchasing SSD's are increased battery life and physical fault tolerance in laptops not the expectation that they'll outlast a standard HDD in a server setting.

dspillett · on Sept 19, 2011

> I haven't quite been able to grasp why Toms Hardware, which I view largely as a consumer/enthusiast review site would choose to analyze the failure rates of SSD's in a datacenter environment.

It could just be that there is not enough good data from the consumer market to make solid conclusions from. DCs use drives in large numbers, so you are going to get "concentrated" readings.

azulum · on Sept 19, 2011

considering how much randomness can clump failures and successes, a small set of consumer used stories with hardware from different vendors, or even disparate IT departments with their "floaters" will leave you with a result you can't be sure of. the first company they mention had maybe 100s of drives and hadn't had a failure yet. it was only the larger companies that had reliable data. what might be interesting is the data from apple's ipods etc.

mrb · on Sept 19, 2011

Why is the IT industry so cautious about SSD reliability? We have spent decades developing HDD fault-tolerance mechanisms or processes such as RAID and backups. We should trust them.

r00fus · on Sept 19, 2011

RAID doesn't play well with TRIM from most SSD chipsets... it's not a panacea.

As SSDs are maturing it looks like there are two distinct markets where hard-disks can't go:

1) In servers, vendors offer cards that tie directly into the PCIe bus, bypassing SATA (

2) For laptops, one can look at the Macbook Air and it's SSD chips. Right now they are connected via SATA, but if that becomes a limiting factor, I see how that would be bypassed as well.

For now HDs and SSDs are very similar, but SSDs (which are essentially flash memory) can and will go where HDs cannot. They are quite different.

viraptor · on Sept 19, 2011

We're currently trusting a drive that might fail in 5 or so years in one specific place. The rest is recoverable, the place of failure might be overwritten anyways, the data might not be needed, smart will warn you that the problem is close in many cases...

With SSD you have a chance of going completely blank. If it happens to be a firmware issue on a RAID mirror, there's also a chance of common fault in both drives. Even if you have backups in that case, do you really want to deal with such situation?

rbanffy · on Sept 19, 2011

> smart will warn you that the problem is close in many cases...

According to Google, about two thirds of the time, SMART will warn you.

> there's also a chance of common fault in both drives.

Never, ever trust a RAID array made from identical disks. Whenever possible use different manufacturers, different models and different batches. Whatever caused failure of one drive will eventually cause the failure of its twins. If all twins are in the same array, you won't be happy.

Artagra · on Sept 19, 2011

Because even if you have rock solid RAID and backup solutions (and unfortunately, many, many companies don't), there is a cost involved in replacing drives. Especially when you are talking about an end user. Yes, most servers have RAID and you can do a swap out with no down time, and at minimal cost. But if you have an SSD failing in a users laptop / desktop, the down time and frustration involved can be major.

iamelgringo · on Sept 19, 2011

I talked with Greg Lindahl, the CTO of Blekko about their infrastructure. He came into the office one day with 700 SSD's and said, "Here's our new storage back end." Their search index is stored entirely on SSD's.

They haven't had a single SSD failure since. Granted, their search index is read only.

ck2 · on Sept 19, 2011

I remember reading early on when SSD first came out where people claimed that when SSD fails, it fails into a read-only state so at least you do not lose your data.

But apparently this is not true and it's not how SSD fail, at all.

What's crazy is I have not had a hard drive fail since we passed the triple digit mark for GB capacity. Last one was a 20GB drive (ah the old days).

Symmetry · on Sept 19, 2011

That's what happens when the flash memory wears out, but really SSDs haven't been around long enough for anyone but the most intense users to wear out the flash memory of one. Instead the problems people have run into are in other parts of the SSD - mostly firmware.

dspillett · on Sept 19, 2011

I've had a few go bad in recent years, ranging from 160Gb to (most recently, unless you count a 320 that was on its way out according to the SMART readings and was pulled and replace before failing) a 500.

They've been from a range or manufacturers (and a range of ranges) with not enough of each to consider a pattern.

smackfu · on Sept 19, 2011

Reminds me of CFLs. In theory, they are much more reliable than incandescent. Practically, they often stop working just as soon as the old light bulbs would, and they cost significantly more.

dlsspy · on Sept 19, 2011

I wish people would compare these with IO operations instead of wall time. SSD is a huge win there, and that matters the most for a lot of people.

WettowelReactor · on Sept 19, 2011

IO has a tangential benefit as well when it comes to recovery. For some large disk drive raids (1tb+) the rebuild time has gotten so large that even with an available hot spare you have a considerable exposure window during a rebuild. I believe this is what helped drive the push of RAID 6 over RAID5. In theory you still had 1 disk of protection during a rebuild. Although SSD's are primarily champions ins random IO and not the linear read/write of a rebuild they still have the potential to reduce this window of risk.

abduhl · on Sept 19, 2011

This is the main conclusion at the end of the article: that SSDs should be picked for their performance rather than for their reliability, as SSD reliability hasn't been proven to be better whereas performance has. Reliability comes in here because you can replace 4+ HDDs with a single SSD and thus reduce energy/heating costs and also increase reliability as you only have 1 drive that can fail vs 4+.

rbanffy · on Sept 19, 2011

You should also consider the reliability of a drive decreases at the same rate the data it holds gets more important. The more important the data, the less you should trust your disks (and do better backups, better mirroring, better checksumming...

davidw · on Sept 19, 2011

Be careful about your answer, or the avenging angel Murphy will smite thy SSD or Hard Drive down.