I worked on SSD firmware for quite a long time and here is my perspective.
Early flash used to be fairly reliable and needed only minimal error correction. However, with increasing density, smaller process nodes, and multi-level cells, it has become progressively less reliable and slower. Here are some of the things that we need to worry about: https://www.flashmemorysummit.com/English/Collaterals/Procee...
To compensate for all these deficiencies, the SSD architecture, and hence the entire FTL, becomes very complicated, because any part of it can become damaged at any time. We always have to have backup algorithms to recover from any scenario. It's difficult to build algorithms that can recover from arbitrary failures in a reasonable time. I cannot have a drive sitting around for 20 minutes trying to fsck itself.
Another problem is that the job, while rewarding, is not very lucrative. The chance of a multi-million dollar payoff for an employee is low. I have a higher chance of becoming a millionaire working on a web-connected gadget. So it is really hard to recruit top-notch programmers who know how to figure out the algorithms, write the code, and debug the hardware. Most new grads these days are interested in Python, JavaScript and machine learning.
> Another problem is that the job, while rewarding, is not very lucrative. The chance of a multi-million dollar payoff for an employee is low. I have a higher chance of becoming a millionaire working on a web-connected gadget. So it is really hard to recruit top-notch programmers who know how to figure out the algorithms, write the code, and debug the hardware. Most new grads these days are interested in Python, JavaScript and machine learning.
That seems unfortunately true for most low-level infrastructure software. I am very good at those things, and have a medium amount of experience building web software. Yet I can find a lot more jobs in the latter domain, and they probably won't pay worse (most likely they'll pay better).
Besides pay, the unrewarding part of this domain is mostly that the high complexity is often not properly understood by management, not rewarded, and not taken into account in planning, which then leads to suboptimal products delivered under time pressure.
Infrastructure almost always becomes a commodity, or at least moves behind the scenes. For it to be a lucrative, interesting technical position, you need a large-scale enterprise (like Google) serving millions of people, and you have to promote it as a worthwhile career path (which Google also does, with the SRE role).
Let's be honest, the chance of a multi-million dollar payoff for an employee is low regardless. Young entry-level devs are optimistic and also vulnerable to believing a line of bullshit on how much their options might be worth some day. I do agree it is a "higher" chance in web/mobile technology, sort of like how your chance of winning the lottery is "higher" if you buy 10 tickets instead of 1.
You still get paid a lot more working at Google on generic backend protobuf shuffling than you will working on SSD firmware at a hardware company or on Intel's C++ compiler.
For those doubting you, the going rate for embedded engineers out here in the Denver area where a lot of these SSD controllers are designed is ~$90k. Embedded engineers get peanuts for some reason.
I've noticed the same thing, and suspect that it's related to the way that higher level software scales compared to embedded.
If a line of code is written to run in a customer's browser, then that line of code may be deployed to millions, maybe billions, of customers. But if an equivalent line of code goes into firmware for some widget, then you're doing pretty well to get that line into a million widgets at all, and it's going to take a lot longer too.
I think it's an issue related to the visibility of the quality of the work. If the product is even 20% more reliable (whatever that exactly means), hardly anyone will actually notice; nobody notices the absence of an error, even though it probably took a huge effort to achieve. Making a user flow just a little bit nicer, on the other hand, is very visible and gets attention.
I always felt that ISPs suffer a similar problem. Nobody cares if everything works as expected, but we'll get upset if it doesn't. Beyond what we already take for granted, there is hardly anything they can do that we will actively appreciate. What could an embedded engineer working on SSDs do that would be noticed, appreciated and not taken for granted by customers?
If a software product takes off, that can happen incredibly quickly, and the new product is primarily composed of code. If a hardware product takes off, the change can't be nearly as fast as it's bound by manufacturing, and the code is just one component in each thing.
I happen to be working on firmware for a VOIP phone today - we'll end up making N million of these things, over some number of years. If I were working on an Android app with similar functionality, that app could conceivably go to N million people tomorrow, or 10N, or 100N...
Anyway, I don't think I have a particularly clear or concise (or even correct) argument here, but it's the only way I've been able to rationalise what we've observed.
I agree with your comment about scaling, but I think you have underestimated the firmware deployment figures.
Ultimately, whatever we write that goes into firmware is hidden from the customer. The customer pays a price per unit, and other hardware vendors are competing against your product. This competition keeps the overall cost low. Except at the very top level like Intel or Samsung, semiconductor manufacturers seem to be fighting one financial crisis after another.
Competition does not work like that in software. The products (even in the same domain) are all too different from each other, so though they may be competitors, they are rarely in direct competition.
Nah, it's just management incentives, with nothing on the liability side of the product to counter that. They keep labor costs low. The company still makes piles of money off the product. There's no liability for devices, especially cheap ones, that fail randomly after such and such a period of time. So there's no downside to keeping firmware labor costs down, and they keep doing it.
In Xbox, many of the firmware hires the hardware folks made seemed to be paid poorly. Also, there were tons of contractors, and not much institutional knowledge was retained. (At one point they had to pay a consulting firm to decompile the firmware for a controller because they had lost the source code.)
"Why do we need source control? It's all there, right on my laptop. Source Depot is just a bunch of trouble." [rough quote from memory, maybe conflated from a couple of engineers]. I'm happy to report that things got better.
On the software side of Xbox the people were much better compensated, and we wrote lots of firmware, too. It was probably harder to be hired, though.
"Why do we need source control? It's all there, right on my laptop. Source Depot is just a bunch of trouble."
I've threatened to withhold paychecks from employees who have said this to me. The job isn't done till the code is checked in, building, and backed up.
There are A LOT fewer people. But the demand is also lower than for people who can build simple websites. It seems like the latter matters a lot more than the former for setting the market rate on salaries. Or maybe it's the fact that embedded work is often done by electrical engineers, who are seen as a different pay category in some countries.
I know what you mean--they could get paid a lot more elsewhere--but it still weirds me out that techies consider $90k/year "peanuts". I know people trying to raise children on one-third of that.
The hardware companies are not super profitable in the first place. Most just struggle to keep the lights on in the race to the bottom on margins. Also, a lot of these improvements don't usually translate to higher sales, because MTBF data is typically not even published for many consumer SSDs.
The solution for this would be to open source such projects so engineers from many smaller companies can collaborate. These companies need to understand that they will not be able to attract top talent, and that collaboration instead of competition is the way forward.
If you took the doubling of productivity since 1975 and inflation-adjusted the median income from then ($7,750), you would get about $68,000. In practice wages haven't risen with productivity and GDP, so the real figure is $45,000.
To make it look even worse, GDP tripled while the population only rose 50%, so if you adjusted wages against total economic growth instead of productivity, the median should be around $110,000. Median.
So yes, $90k is pretty much peanuts for how much money an embedded engineer would be on average making for their employer.
Users and administrators almost certainly prefer a 20 minute IO latency over data corruption. Host operating systems should probably flag an IO as failed long before 20 minutes, and then you know 1) nothing made it to disk, and 2) you have some chance of avoiding additional corruption, if, e.g., the OS is smart enough to kick out the drive when this happens.
> Another problem is that the job while rewarding is not very lucrative.
Do you mean it's lower paying than typical bigcorp software jobs outside of FAANG, or just that there aren't a lot of startups with astronomical valuations in the media FTL space?
Last I checked it was nearly twice as lucrative to be a Ruby-on-Rails developer as an embedded engineer.
Embedded also attracts a certain type of engineer, usually very smart and able to manage extreme complexity with attention to detail but at the cost of anything resembling readable, let alone maintainable, software. The fact that anything at all works in the modern world is amazing.
I left the embedded space and have never looked back.
So $SALT_MINE, a highly profitable, privately held Fortune 500 company, just decided to revamp their pay scales. They've now decided that they want to be at the 50th percentile, remuneration-wise, in the durable goods sector. That is, the white goods sector: washing machines and such.
I predict we will lose all of our engineers - embedded dudes included.
May I ask how you managed to leave the embedded world and where you went after? I'm asking since, after investing 6 years into this field, which I love, and jumping between a couple of companies, I realized the market (Europe) is really bad for this gig. Not only is our work highly challenging, it's also poorly paid, while at the same time our CEO is crying to the local press that they can't find devs (to work for peanuts) and is forced to look for them in Asia.
I just went to do something else. I'm definitely a generalist, and now work mostly with "big" data.
I had zero embedded experience when I started doing embedded dev, and then I had zero data experience -- but a surprising amount of general experience is applicable!
"Embedded also attracts a certain type of engineer, usually very smart and able to manage extreme complexity with attention to detail but at the cost of anything resembling readable, let alone maintainable, software. "
Heaven for generalists that always love doing new kinds of things. Once I learned about it, I knew I probably should've done embedded instead of security research. Of course, now there's significant interest in overlap. Might not be too late to learn all that stuff after all. :)
As someone who identified a bug in Drobo firmware once and was offered a job on the spot, I think the problem with attracting talent is twofold.
The first problem really has two parts: not only is it rare to find people who have a passion for storage-related technologies, but very few will ever gain the exposure to these technologies needed to develop that passion.
Kids don't routinely grow up with a SAN in the house. They do tend to grow up with lots of internet connected consumer caliber devices and can easily gain exposure to working with these technologies.
I was fortunately able to explore this type of technology in depth because a family owned business let me tinker with their server equipment in high school.
After college I then co-founded a startup back before the cloud became big. That meant we needed to make use of old hardware to provide service to our customers at a price point that made our service profitable. Old drives were not a reliable way to do that. New drives were extremely expensive for old servers back in the day when SCSI was the interface that you expected for a server. We had to get creative and play with JBOD devices. ZFS was an amazing tool for us in those days, and it still is for anyone who wants to tinker.
The other aspect is that while these skills are valuable for creating a "job" they do not have potential for creating "massive wealth". Why learn about storage if you aren't going to be part of the first 10 employees at a company that has a $10B exit? Let Amazon and the other cloud vendors worry about that stuff.
Knowledge is power though. I recently came across an AI startup that I'm now helping. They were spending significant money using GPU computational power to provide artificial intelligence training through a cloud provider. They blew through about $300k in credits within the first year to give you an idea of how much money that type of power can cost.
I am now helping them cut over to their own co-location facility. The first year alone they will save so much money it will pay for the next three years.
Reading that helps reinforce the idea that no matter what path you are on in this field, there is a chance that some random thing you learned about SSD firmware helps you optimize some growth-stage company's product, and ultimately that helps you build wealth.
>very few will gain exposure to these technologies to develop that passion.
This, exactly.
SSD firmware is opaque and hard to learn from the outside. A trending web framework, on the other hand, has all of its source code open, great documentation, and ready-to-use tools. No wonder young people today find their passion in other things rather than SSDs.
I'm mostly curious because I work in a storage-adjacent field (NAS) for a BigCorp and the pay is pretty good, if not quite FAANG level. It's not a startup by any means, but I will easily become a multi-millionaire in a handful of years. I was curious about the other side of the fence.
Huh? If you're going to "easily become a multi-millionaire in a handful of years", then your pay is more than pretty good, and certainly not worse than FAANG level.
Sorry, handful of years from today. I've been working for 7 years now. FAANG comp would probably be 20%-25% higher; I'm mostly good at keeping my expenses down and saving a high proportion of my income. I've also had the good fortune of the bull market working in my favor for the entire time I've been employed.
Even at that pedestrian level, if you pack them tightly enough I could imagine fitting a few billion years at least into the volume of an average-sized handful!
It's been about 5+ years at this point so I don't recall all of the details. I can recall that they offered me a job after I pointed out the bug. From what I recall, I politely declined but I helped them test a beta firmware for a while.
>The other aspect is that while these skills are valuable for creating a "job" they do not have potential for creating "massive wealth". Why learn about storage if you aren't going to be part of the first 10 employees at a company that has a $10B exit? Let Amazon and the other cloud vendors worry about that stuff.
That seems like a completely ridiculous way to try to organize your life. Almost no companies have $10B exits.
> Users and administrators almost certainly prefer a 20 minute IO latency over data corruption.
If the drive part of a RAID setup I would actually prefer it just reports itself failed and doesn't slow down access to the array by scanning itself for 20 minutes.
As far as I know, that is one of the main differences when buying enterprise or NAS drives compared to consumer drives. With NAS drives, the firmware gives up very quickly, since the drive is assumed to be part of an array with redundancy. Consumer drives will retry reads for a very long time before reporting an I/O error.
This is a bit of a misrepresentation. The only flash that never had ECC was NOR. Some embedded systems had NOR, but it would be incredibly rare to find a consumer SSD with NOR.
When designing a NAND memory product, you aim for some max allowed error rate. You choose the error correction algorithm based on that target. Error rates for NAND products are precisely what the designer intended.
Because SSDs are so large and there is such a large number of them, errors can still occur (at a known rate). FTLs should take that into account. Critical data structures can add checksum redundancy which can reduce the error rate for those to an even lower value, which is usually necessary anyway since power disturbances during erase or program can cause programming errors.
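To make the checksum-redundancy idea concrete, here is a minimal sketch of how a critical FTL structure might be protected against torn or corrupted writes. Everything here (names, sizes, the toy checksum) is invented for illustration; real firmware would use a proper CRC or ECC and its own vendor-specific layout.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-flash header for a critical FTL structure (say, part of
     * the logical-to-physical map). Two copies live in different blocks; on
     * boot, the copy with the highest sequence number that also passes its
     * checksum wins, so a write torn by power loss is simply ignored. */
    struct ftl_meta {
        uint32_t magic;        /* identifies the structure type */
        uint32_t seq;          /* incremented on every update */
        uint32_t payload_len;
        uint8_t  payload[4096];
        uint32_t checksum;     /* over everything above; real firmware would
                                  use a CRC or ECC, not this toy sum */
    };

    static uint32_t toy_checksum(const void *p, size_t n) {
        const uint8_t *b = p;
        uint32_t sum = 0;
        while (n--) sum = sum * 31u + *b++;
        return sum;
    }

    /* Return the newer of the two candidate copies that passes its checksum,
     * or NULL if both are damaged (the "fall back to a recovery scan" case). */
    static const struct ftl_meta *pick_valid(const struct ftl_meta *a,
                                             const struct ftl_meta *b) {
        size_t n = offsetof(struct ftl_meta, checksum);
        int a_ok = a && a->checksum == toy_checksum(a, n);
        int b_ok = b && b->checksum == toy_checksum(b, n);
        if (a_ok && b_ok) return a->seq >= b->seq ? a : b;
        return a_ok ? a : (b_ok ? b : NULL);
    }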
There are of course a number of patterns that increase error rates that FTLs have to be programmed to prevent. Encrypting is the first important step, since it makes the data unlikely to be uniform. The second is mitigating read disturb.
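And here is roughly what the "make the data non-uniform" step looks like when done with a simple randomizer rather than full encryption. The seed derivation and LFSR taps below are just placeholders; actual controllers use their own per-page randomizers (or AES), but the idea is the same: XOR the page with a pseudo-random stream keyed by its address, so repetitive host data doesn't become a repetitive charge pattern in the cells.

    #include <stddef.h>
    #include <stdint.h>

    /* 16-bit Fibonacci LFSR, taps 16/14/13/11 (a standard maximal-length set). */
    static uint8_t lfsr_next_byte(uint16_t *state) {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            uint16_t fb = ((*state >> 0) ^ (*state >> 2) ^
                           (*state >> 3) ^ (*state >> 5)) & 1u;
            out = (uint8_t)((out << 1) | (*state & 1u));
            *state = (uint16_t)((*state >> 1) | (fb << 15));
        }
        return out;
    }

    /* Scramble (or descramble -- XOR is its own inverse) one page in place.
     * Seeding from the page address means the same data written to two
     * different pages produces two different cell patterns. */
    void scramble_page(uint8_t *buf, size_t len, uint32_t page_addr) {
        uint16_t state = (uint16_t)(page_addr * 0x9E37u) | 1u;  /* never zero */
        for (size_t i = 0; i < len; i++)
            buf[i] ^= lfsr_next_byte(&state);
    }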
I don't mean to imply NAND flash never had ECC. Just that early flash needed maybe 1-3 bits of ECC, and I would never see a 1-bit error until maybe a few months of use. Things were a lot easier back then. Scrambling and read disturb were not an issue until later on. All the things you suggest can be done; there are just so many opportunities to screw up the implementation despite the best efforts.
I'm surprised no one has mentioned the real difference here. In most startups employees are granted equity. They are closer to worker collectives than they are to corporations. In corporations only the capitalists get any return; they are extracting value from their employees.
The point being, the solution to this sort of systemic problem is for more corporations to become worker-owned. If the guy writing the SSD algorithms has a say in governance and a cut of the profits, they will want to stick around. It's the stable version of startups.
Sorry to spin off topic, but perhaps having the chance to speak to someone that's worked on SSD firmware for the first time...
is there any feasible way to recover data after a TRIM command has been issued that you can think of? Is there any way to trick the firmware into not returning 0's when reading the blocks of a deleted file? Mostly interested in doing so for Apple
TRIM destroying the entire data recovery and forensics market seems like such a big deal, I still can't believe it although it started years ago
Are you saying that TLC SSDs are essentially unreliable and it is a miracle we don't see higher failure rates?
Regarding your point about languages: I would be interested and motivated, but I probably don't have the skill (probably; I've only done some x86 ASM/C++) nor the location (Europe). Usually someone doesn't start with C++ but with a managed language, and once they land a job they become demotivated or just don't have enough time.
It's not any harder to recruit for that role than any other bare metal, embedded position where new grads are expected to be top notch programmers who know how to figure out the algorithms, write the code, debug the hardware, and bring decades of experience. IOW you're recruiting for unicorns and the pickings are slim.
> Another problem is that the job while rewarding is not very lucrative.
Why is this a problem?
Let's not forget, we live in a capitalistic society. The job of the capitalists is to exploit labor. The cost of labor is a direct result of the market demand (or not) for a more complex or, by your point, a more robust product.
It need not simply be a market reaction, though. A vendor can create the market for a more complex/robust product. But still, their job is to exploit the available labor so if the people capable of such are available at a lower cost, so be it.
> I have a higher chance working on a web connected gadget to become a millionaire.
As others have noted, perhaps true but the numbers are so small it may as well also be zero. However, as you should now be pointedly aware, the perception doesn't match reality.
Question: do new grads (or let's say up to 3 years experience) working on SSD firmware, making $90k, have to clock in 60+ hours / week and live in insanely high cost regions where their $120k startup salary qualifies them for housing assistance programs and they have to live with at least one if not more roommates? Or is $90k (per another comment) quite good in relation to the total picture?
IOW, my guess is that you are not competing for talent on salary or total comp. My guess is you are competing on general industry attractiveness. The entire mindset around embedded vs web/consumer programming is different. So I think to frame it as a compensation problem is wrong, and by misframing it you will never "solve" it.
Not that spinning HDDs are really any different, but SSDs are a perfect example of an entire computer that you attach to yours and speak with through one of the (many) storage-oriented protocols. The device itself is a black box, and complex transformations take place between the physical persistence of the data and the logical structures that are exchanged on the wire. There are many layers of indirection, and many things that can go wrong, from a fault in the underlying physical storage, to a physical fault in the controller, to a logical (software) condition in the controller that puts it in an unrecoverable state.
Spinning platter drives have parts that form a more relatable metaphor for humans' notions of wear and tear: skates of magnetic readers flying on a cushion of air above a rapidly rotating disc, with a separating gap of a few dozen nanometers, often smaller than the process size in the controller's silicon. They have arms that can move the head over a particular disc radius, and a motor that spins the entire stack of platters. These mechanical components exhibit wear proportional to their use -- this makes intuitive sense, and is also recorded in the SMART attributes, so drives of advanced age and many park cycles can be replaced preemptively before they catastrophically fail.
SSDs are missing many of the usual mechanisms that would contribute to physical wear leading to sudden catastrophic failure in advanced age. This means that irrespective of their failure rate vs. HDDs, a higher proportion of their catastrophic failures are the fault of the controller. This is discouraging: essentially, the "storage layer" is now quite reliable, so the fallibility of the human-programmed controller is brought to light.
An SSD's flash has plenty of wear and tear. It just takes a different form from what happens to mechanical products. It's more like a piece of metal that gets scratched with use, and will rust faster or slower depending on the number of scratches.
What we have is that the software is currently less reliable than the memory. There is no fundamental reason for that; it's just that manufacturers put a huge amount of engineering work into reducing the wear, and not so much into programming practices.
> skates of magnetic readers flying on a cushion of air above a rapidly rotating disc, with the gap separating a few dozen nanometers, often smaller than the process size in the controller's silicon.
Complete aside, the fly-height of a magnetic head is actually fractions of a nanometer (i.e. hundreds of picometers).
EDIT: I got this from a talk by Bryan Cantrill[1]. The fly-height is allegedly 0.8 nanometers (800 picometers).
1. I can't find a source that says less than a few nanometers.
2. 300 picometers is roughly the diameter of a helium diatom. The head cannot possibly float through hydrodynamic means if an air molecule can barely even fit under it.
> The head cannot possibly float through hydrodynamic means if an air molecule can barely even fit under it.
It can. Since siblings liked airplane analogies, here is another one: Consider the head to be an airplane. It has somewhat wing-similar features which provide a lifting force, but the actual read/write head sits below those features (like, say, a landing gear is below wings).
You can at least fit several layers of iron atoms into that gap, assuming they're part of the crystal lattice. So the surface does not have to be perfectly smooth.
This seems implausible at first glance. Even making objects flat at the nanometer scale is quite difficult. When you also introduce movement tolerances (e.g. rotation disk is not perfectly aligned), this seems quite extraordinary.
(Sheepishly) My only reference is a talk by Bryan Cantrill[1] where he states that an exec at a hard-drive manufacturer stated it was ".8 nanometers". I will try to find a better source than that.
EDIT: There is a paper from 2016 which did an analysis in the difference in the flying height of the head during different operations, and it was measured in Angstroms[2]. I couldn't find one that actually gives a precise value of the flying height. There is a 2011 paper which states that some system they were testing allowed for 4-9 nm flying heights[3] which is about half-an-order-of-magnitude larger than the claim -- but that's already 7 years old.
Disk platters rotating at 7200 rpm are both perfectly smooth and perfectly aligned.
The read head is something like a jumbo jet flying a handful of feet above the (perfectly smooth) ground. It's really crazy how close these things are, moving very fast. And why accelerometers are a significant feature.[0]
> Can you provide some references?
Wikipedia claims[1]:
> In 2011, the flying height in modern drives was a few nanometers.
and
> The "flying height" is constantly decreasing to enable higher areal density.
> At 7,200 RPM, the edge of the platter is traveling at over 120 kilometres per hour (75 mph)
So you've got a read head flying at 120 km/h => 33 m/s => 33,000,000,000 nm/s at a height of 3nm or less. Picture that!
E.g, a 757 typically cruises at 858 km/h => 238 m/s. So picture your 757 flying at an altitude of 21 nm and that's the metaphor, kinda. The read head is a bit smaller than 1/7 of a 757 jet, obviously.
The claim was hundreds of picometers, which is an order of magnitude smaller than a few nanometers. Literally nothing can be perfectly flat, nor perfectly aligned.
It doesn't need to be. The basic idea is that the fly height is self-regulating; if the head goes away from the platter, its "lift" is reduced, so the springiness of the arm forces it back to the platter. If it moves closer to the platter, lift is increased, so it moves away. Similarly the tracks don't have to be perfectly round or concentric, because the lowest-level head control system in the drive actively tracks the head's current disk track on the surface; head movements aren't a "rotate 11.57821° to track 28139123", but rather "track 28139123 is around here somewhere, lets find it".
Yes, I understand how feedback control systems work. The question was whether they really have the fidelity to do it at hundred picometer resolution.
I am dubious because this is a pretty incredible feat. This is the length scale at which atom-atom interactions become important. That implies that the crystal lattice structure of both the read/write head and the underlying platter will affect the dynamics of the system!
Nope. For a long time now, tracks have been virtual (an array of raw track IDs as a function of sector ID when the arm is held "constant", averaged over X thousand RPMs as part of the per-drive firmware calibration at the factory).
How does the second sentence produce the conclusion "nope?" If the arm is held "constant" and the platter is spinning at 7200rpm, it's going to trace something very close to a perfect circle.
The 3nm figure is from 2011. 0.9nm, or 900 picometers, is maybe not an unreasonable progression from 3nm in 7 years. And — it's perfectly flat relative to everything else involved.
I don't really agree with most of this comment. HDDs are also inscrutable black boxes; many of their failures are controller, rather than media losses; and SSDs also report SMART attributes that are predictive of failure. It's certainly possible HDD vendors have done a more successful job of convincing buyers that failures are attributable to the media rather than the controller, but utilizing the media fully with shingled recording and HAMR and all that jazz really requires a similar degree of controller complexity as an SSD FTL controller.
> These mechanical components exhibit wear proportional to their use
Actually the spindle isn't touching anything any more, because the spindle/rotor (one part) is supported by a fluid bearing; it basically floats on a thin layer of oil. If the spindle touches the bearing at essentially any speed that isn't zero, the bearing surfaces are damaged instantly and the resulting burrs and debris will degrade and lock up the bearing very quickly.
I believe the only rolling-element/contact bearing used in modern disks is the pivot bearing of the arm assembly.
> but SSDs are a perfect example of an entire computer that you attach to yours, and speak with through one of the (many) storage-oriented protocols.
These days most of what we call a computer could be described this way. Even your compiled machine language is ultimately far more abstracted from what the processor actually does than it was on, say, a 6502.
> Even your compiled machine language is ultimately far more abstracted from what the processor actually does than it was on, say, a 6502
Funny that you bring up 6502; that reminds me of 1541 disk drive for C64, which had mostly same 6502 as the host computer (albeit running at slower speed).
Most disk drives of the time were like that. One of the reasons the Apple II disk drive was so affordable is that Woz just used the Apple II's own 6502 to handle the grunt work, with their disk drive being little more than a drive mechanism, a PROM, and some ICs. Since this is The Woz we're talking about, he went ahead and broke with conventional encoding while he was at it and instead implemented a scheme which allowed for a few extra sectors per track.
It might be an interesting exercise to see how many peripherals are connected to your PC right now that have much more computing power than a 1 MHz 6502.
The Apple ][ disk encoding was not done to allow "for a few extra sectors per track". Since the Apple ][ did not have a disk controller, the encoding from magnetic flux to bits was handled in software. A 1 MHz 6502 cannot handle the "standard" encodings (things like M2FM) in software while the disc is rotating. Woz's encoding made it feasible to do in software.
The 6502 in the 1541 runs at 1MHz, that is indeed a bit slower than the NTSC version C64 CPU (1.023MHz), but a bit faster than the PAL version CPU (0.985MHz)
This reminds me of a story that a lecturer told us at University. That at one point they distributed computations to the drive controllers of the connected (fridge sized) disk drives because they could do so much processing while waiting for the platters to spin into place.
So the whole story of a disk being a computer has been true for a long time.
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"
Why shouldn't it? Isn't it just hardware too?
"With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that"
Why can't you do the same with SSDs?
It feels like the author's main complaint is the frustration of not understanding SSD hardware as well.
Is this a valid complaint? Are SSDs magical in some way? I'm not an expert but... It's just hardware with pieces that do stuff. Why can't we come up with an understanding of why it fails?
"It feels like the author's main complaint is the frustration of not understanding SSD hardware as well."
What is so frustrating about SSDs is how very poorly they compare to previous incarnations of solid state storage.
Using Disk-On-Chip and/or IDE-pin-compatible CF cards, I had many, many devices in the field that lasted, mounted read-only, for decades. An entire segment of the computing industry came to rely on these parts as alternatives to spinning media that could not mechanically fail.
This is not the case with SSDs at all. They fail left and right, even mounted read-only, for all manner of complicated and interesting reasons. It's very frustrating that SSDs are not a step forward in reliability from spinning media and are a step downward compared to (for instance) a 16MB consumer CF card from Sandisk, circa 2000.
rsync.net filers, which need a boot mirror, are always constructed with two unrelated SSDs - usually one Intel part and one Samsung part - so that when the inevitable usage-related failure occurs, it does not occur simultaneously to both members of the mirror which have, being a mirror, been subjected to identical usage-lives.[1]
We shouldn't have to do that.
[1] I can't overstate this - if you need a RAID mirror, do not use identical SSDs for the two members of the mirror. There are many, many cases of SSDs failing not due to "wear" or end-of-life, but due to weird usage edge cases that cause them to puke ... and in a mirror, you give the two parts identical usage ... we either get two different generations of Intel part (current gen and just-previous gen) or we get current Intel and current Samsung ...
Every failure of an SSD feels like an exceptional event. Some harbinger of doom that needs to be shouted from the rooftops. The prions of storage.
But magnetic hard drives failed all the time. I have a giant stack in my office closet just from my dev machines over the years. But it wasn't new and scary -- it was just a hard drive failing -- so it was just normal. Some had controllers fail, suddenly blinking out of existence. Another had cache memory corrupt so it just gave ridiculous readings occasionally. Others had physical failures.
I don't know where to begin relative to prior flash memory (e.g. CF cards) which were absolutely notorious trash.
It is worth noting that every smartphone the world over has an "SSD" in it. We spend remarkably little of our mental power concerned about the flash storage. It is the cause of a negligible amount of device failures.
compared to (for instance) a 16MB consumer CF card from Sandisk, circa 2000.
That would almost certainly be small-block SLC flash rated for 100K program/erase cycles and 10 years of retention. The huge-block TLC now is ~1K program/erase cycles and 2-3 years of retention depending on where you look (the manufacturers are, not surprisingly, quite reluctant to release these specs...)
I've heard very similar advice for non-SSD mirrors too. Use different manufacturers or, at the very least, use different batches of disks from the same manufacturer.
Most people use different batches but the same manufacturer, mostly because they remember advice from back when you could get a hardware raid card that would synchronize your scsi drives.
Also, though, because most serious RAIDs contain more drives than you can find manufacturers of hard drives.
On modern MLC/TLC SSDs, read-only mode doesn't really exist. The NAND blocks must be re-programmed after a number of read accesses to mitigate read disturb. If anything, mounting read-only is probably a corner case, stressing the firmware's read disturb mitigation.
The tradeoff from worse read disturb characteristics is NAND that is 100x cheaper per GB than in 2000.
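For anyone curious what the mitigation looks like in practice, here is a bare-bones sketch of the bookkeeping involved. The numbers and the relocation hook are invented; real firmware tracks this per block (or block group) with thresholds taken from the NAND vendor's characterization data.

    #include <stdint.h>

    /* Illustrative only: per-block read counters for read-disturb handling. */
    #define NUM_BLOCKS        1024u
    #define READ_DISTURB_MAX  100000u  /* made-up threshold */

    static uint32_t read_count[NUM_BLOCKS];

    /* Assumed to exist elsewhere in the FTL: copy the still-valid pages of
     * `blk` to a freshly erased block and return `blk` to the free pool. */
    void relocate_block(uint32_t blk);

    void on_host_read(uint32_t blk)
    {
        if (++read_count[blk] >= READ_DISTURB_MAX) {
            /* The data itself is fine, but its neighbours have been disturbed
             * by enough reads that the block is rewritten as a precaution.
             * This is why even a purely read-only workload still causes
             * background program/erase activity. */
            relocate_block(blk);
            read_count[blk] = 0;
        }
    }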
Do you have any more details on the re-programming that would be occurring on non SLC flash cells, even if mounted in read only mode? This is something I was always concerned about too.
I don't know about that. I know most managers in embedded software go cheap on engineers and software assurance on purpose to get bonuses and such. The hardware side makes me think you haven't studied deep, sub-micron hardware much. I started looking into it a few years ago or so, just reading the slides on lots of stuff even though not understanding much of it. They helpfully put a lot in lay terms, though, with lots of comparisons. To say the stuff gets harder every time you shrink to a smaller node is an understatement, esp for solid state.
If anything, modern flash should probably be considered broken right as it ships out of the factory. If not, the process nodes after 90nm or so just keep adding more and more ways for individual components to screw up or change behavior across the same wafer. Some happen instantly by design. Some happen later with aging. The memory technologies are closer to the analog level of things than most, with harder verification. The high-density flash on newer nodes uses less-reliable tech than most just to operate at that low cost. So, they add all kinds of firmware and hardware tricks to try to make it work for a period of time like it's a whole, functional unit of storage despite pieces of it misbehaving all throughout.
It's a nice, man-made miracle these techs even work at all. Those that last longer like you mentioned still exist. I'll add a comment with links to one type so you can compare price/storage/performance to these broken-by-design SSD's you use. I'll throw in another two that mention shrinking challenges so you can see what they face every time they have to upgrade or just deploy new designs in mixed-signal.
Yeah agreed 100%, for my RAID mirror setup I use drives from distinct manufacturers for that exact reason -- I can presumably expect a different failure rate (hopefully). :)
Well, to be fair, there is an entire layer of abstraction at the SSD controller level that does tons of black-box magic. It allows the OS to treat the SSD like any other storage device without letting the OS know what is going on.
So the combination of non-moving parts (making it hard/impossible to debug via physical inspection) combined with tons of wear leveling and miscellaneous magic can definitely make it seem like SSDs are magical.
Is there a good reason why we use separate SSD controllers instead of letting the primary cpu handle it? The obvious reason is backwards compatibility, but as more of computing moves to SSDs, is this still relevant?
ZFS has shown that removing layers of abstraction with regard to storage can be beneficial.
That adds a round of latency, and makes it pretty much impossible to boot off the drive. The blocks aren't in the same order in the Flash as they are presented by its interface, and one of the main jobs of the controller is to re-order them.
It would be an interesting product to have, a raw block API to a Flash device with all the temporary state stored on the host - but a hard one to sell, as it's not differentiated in any way.
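As a rough idea of what such a product would push onto the host, here is a toy sketch of the logical-to-physical remapping the controller normally hides. All names, sizes and primitives are invented; it's closer in spirit to what "open-channel" style designs expose than to any shipping drive.

    #include <stdint.h>

    /* Toy host-managed flash translation: the host keeps the logical-to-
     * physical page map and the device only exposes raw program/read. */
    #define LOGICAL_PAGES  (1u << 20)
    #define INVALID_PPA    0xFFFFFFFFu

    static uint32_t l2p[LOGICAL_PAGES];  /* logical page -> physical page */
    static uint32_t next_free_ppa;       /* naive append-only allocator   */

    /* Raw primitives the hypothetical drive would expose. */
    int raw_program(uint32_t ppa, const void *data, uint32_t len);
    int raw_read(uint32_t ppa, void *data, uint32_t len);

    void host_mount(void)
    {
        for (uint32_t i = 0; i < LOGICAL_PAGES; i++)
            l2p[i] = INVALID_PPA;        /* nothing mapped yet */
    }

    int host_write(uint32_t lpa, const void *data, uint32_t len)
    {
        uint32_t ppa = next_free_ppa++;  /* NAND pages can't be overwritten
                                            in place, so every write lands
                                            somewhere new */
        int rc = raw_program(ppa, data, len);
        if (rc == 0)
            l2p[lpa] = ppa;              /* the old physical page is now
                                            garbage; a real FTL has to
                                            reclaim it later */
        return rc;
    }

    int host_read(uint32_t lpa, void *data, uint32_t len)
    {
        uint32_t ppa = l2p[lpa];
        return (ppa == INVALID_PPA) ? -1 : raw_read(ppa, data, len);
    }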
The controller makes it possible to get a standard bus and a pre-installed driver and use them to access any kind of memory from any manufacturer and any technology. It's the kind of convenience that makes people buy hardware - it's the kind of thing that made SATA and USB win. The alternative is that once in a while you plug a drive into your computer and it won't work.
Besides, I don't think manufacturers want to release the best practices for using their memory.
I wonder if you can read any extra state out of the T2 as a result, e.g. more information on wear leveling, temporary read failures and so on, more than the standard SMART counters?
Thanks for this YouTube link. In the case of recovering deleted files from SSDs that have the TRIM command enabled (a forensics write blocker was used), the following drives have a low probability of recoverability: Crucial, Intel, and Samsung (3-core controller). Whereas with Seagate, SuperTalent (parallel ATA to SATA bridge chip), OCZ, and Patriot, files could be recovered. If the drive is quick-formatted and TRIM is enabled, the data is completely gone on the following drives: Crucial, Intel and Samsung. The TRIM state has the biggest impact on whether or not the data can be recovered. You can check if TRIM is enabled with the command: fsutil behavior query DisableDeleteNotify - if your result is 0, it is enabled (default).
Because they don't die incrementally. With a hard disk you'll get bad sectors, growing slowly over time. Or a head crash, and then it's all dead.
What could cause an entire SSD to die at once? I would totally understand bad sectors, but the whole thing at once? Where it doesn't even try to read existing data?
One potential cause is poorly constructed drive firmware which fails to account for a minor failure of some kind and crashes. If that unaccounted minor failure is persistent, the firmware may crash constantly and then you'd be unable to access even theoretically good parts of the drive.
Firmware -- undebuggable, unobservable, unfixable software, jammed into your devices -- is the enemy.
Sounds like there is a business opportunity here for open firmware, or even an FTL running on the host CPU, especially for the enterprise? But maybe they won't bother either, despite valuable data and uptime, and would rather just throw more redundant SSDs at the problem.
A component failing? Electrical components fail. Sometimes it's a manufacturing defect, sometimes a design defect, sometimes something under or over-volted and it was enough to cause damage to any given component.
Could be an IC, could be a capacitor, could be a poorly laid trace. A poorly shielded RF source could even damage any number of components.
I mean, in theory a single high charge particle from that rare cosmic ray that reaches the surface of earth running full-steam-ahead through an IC could cause just the right amount of damage to make it fail although this would be an incredibly improbable scenario.
Same goes for HDDs, televisions, your clock radio, whatever.
There's a single wire in all computers, for some reason they always just have to include it, which if severed will completely disable the computer, as in, it won't even turn on, not even try to turn on.
That wire is the one delivering power. Point being, while things are so complicated that you often can have a lot going on and still try things, there are still single points of failure that can never be fully covered.
"Why shouldn't it? Isn't it just hardware too?"
In a mechanical hard drive, there are moving parts which can wear out due to friction, etc.
SSDs are solid-state, so it seems like at least theoretically, it should be possible to build one that keeps working for decades. e.g. I have solid-state hardware from the 70s and 80s which still functions.
I've always been a little mystified as to why SSDs' data areas wear out for that reason,[1] but that's a whole separate issue. The author of the article is writing about sudden failure of the device as a whole.
There are only two explanations I've ever heard for short lifetimes in electronics as a general industry.[2]
For devices manufactured after about 2000 there's tin whiskers,[3] which began to be a problem because of RoHS requirements. I'm not sure if that applies here, though.
If the device includes electrolytic capacitors, my understanding is that those generally have a finite lifetime as well, and it can be fairly short if they're poorly-made.
I'm not a hardware expert, though, so I'd be interested in hearing about other factors, and I'm sure the author of the article would too.
[1] I've seen lots of writeups of how wear-leveling works, etc., but never a good physical explanation of what is actually wearing out over time.
[2] Obviously there are other factors for specific devices, or specific designs. E.g. maybe parts of a specific device break over time due to thermal expansion and contraction if the device wasn't engineered to handle that properly.
> I've seen lots of writeups of how wear-leveling works, etc., but never a good physical explanation of what is actually wearing out over time.
SSD flash is basically EEPROM. In an EEPROM one bit is stored in a dual-gate MOSFET. One gate is a normal gate, the other is floating, i.e. it is just a small conductive island. The information is stored by (quite literally) shooting electrons through the insulation of the floating gate into it. They're then trapped on the gate; if you turn the second gate on, the transistor conducts iff the floating gate is also turned on. This shooting action happens to damage the insulation, which at some point is degraded enough that it can't keep the electrons trapped on the floating gate. Hence the electrons leave, together with your information.
> For devices manufactured after about 2000 there's tin whiskers,[3] which began to be a problem because of RoHS requirements.
The solder joints themselves usually don't form whiskers, which mostly grow on pure tin-plated surfaces, e.g. the pins of components (which previously used lead). Using lead-free solder is mostly co-incident with whisker risk, not the cause of the majority of problems.
The electrons aren't shot through the insulation. They end up in the floating gate by quantum tunneling. The insulation by its name is non-conductive, and must be so that the electrons stay put. The degradation happens through general thermal wear.
The exact wear mechanism doesn't really matter (you must be referring to "hot carrier injection") --- the point is that to record data, electrons are being forced through a material which gradually wears it out.
I just looked this up, and it turns out I was completely wrong about how the wear is caused. Apparently it’s due to a build-up of electrons remaining in the floating gate over time, and not general thermal wear. It turns out that heating can even reverse this wear. Explained in this link: https://arstechnica.com/science/2012/11/nand-flash-gets-bake...
On 1): It's because each cell is essentially a consumable with a limited number of state transfers. It's similar in that way to how a solid-state accelerometer still has a moving part inside it that can break or wear out over time.
I understand that there is something that makes the cells stop working over time. blattimwind's reply is the first actual explanation I've ever seen of what that something might be.
I've never seen or heard of an automated board rework system. Thinking about the steps and things I've had to do to manually rework boards, your machine would not only need to be able to apply force to pull parts off of boards without damaging the board, but also be prepared to restore pads / through-holes to a usable state after desoldering parts, before new ones could go back in.
There's a reason companies don't repair circuit boards in consumer electronics.
I've done it myself, save restoring pads -- to me it's just the sort of fiddly, finicky thing that it seems like robots should be good at. And, they're not going to burn their fingers or run out of hands to hold things!
Robots are good at repetitive tasks and economy of scale. Doing a tricky action for ten thousand boards is something that robots are good at; doing a different tricky action on each of a hundred boards is not.
It doesn't seem plausible to have a business case where you'd be able to get a large quantity of identical boards (that are otherwise good!), replace caps on them, and sell them for much more than you got them for - i.e. where the boards haven't become obsolete in that time. If there's no mass production, there's not much use for automation.
The post resonated with me because of a stupid bug I hit in a ca. 2011 SSD (Samsung?). After 100 power-on cycles, the drive would brick itself. It required a firmware update, applied in time, to avoid.
That defect doesn't strike me as being inherently related to SSD media, but really left a bad taste in my mouth with what might be going on in the development process to lead to such instability.
I think the difference is familiarity. We've been using hard drives for decades, so we kind of know what to expect. SSDs have only been widely used for 5-10 years, so we're all still getting used to how they work and how they die.
A major problem with SSDs seems to be “firmware death” - where the flash chips are physically fine (or mostly fine), but the firmware (or firmware memory) has gotten corrupted due to some programming error, electrical glitch, or cosmic ray. I’ve had scores of older SSDs die after things like power outages and sudden shutdown events. This is super frustrating because the data is physically OK but the controller just isn’t responding to any requests anymore.
I wonder if there’s an easy way to distinguish a controller failure from a flash failure from the behavior of the device over the last few seconds/minutes of operation. In theory a controller failure should cause a fairly abrupt loss of service, but I’m sure there are soft lockup failure modes too.
I have seen some weird issues with SSDs. I had an OCZ Vertex 2 die on me multiple times, but one thing that stood out most is that after a power cycle or complete system shutdown (note: reboots were just fine), everything I had done in the last session - install software, update Windows, create files - was gone. The state was reverted to what it was before that boot. It was like my computer contained some kind of Reborn chip, except it was the SandForce controller malfunctioning.
Modern SSDs have incredibly large caches. For example, the HP EX920, which is a TLC (triple-level cell) SSD, in its 1TB model contains a whopping 200GB of SLC (single-level cell) cache. It's entirely possible that your changes were simply in cache only, and hadn't been committed to the actual storage.
I don't know if I'm getting exactly what you're saying, but a sudden power off isn't going to wipe your SSD's SLC cache like it would do the DRAM cache on more expensive drives.
It's probably more like the superblock not being updated to point to the newest data. Instead it points to old data that hasn't been garbage collected.
The SLC cache is managed by the controller; it's not going to forget to write it to the flash cells being managed as TLC/QLC, since the controller is transparently storing your data in both. The controller knows it put some data in the SLC cache and some data in the QLC, that that data needs to move from SLC to QLC, and that that other data in the SLC was marked deleted, so now it can return that entire block to being addressed as QLC.
I don't know. I'm just trying to say that since flash isn't volatile the SLC cache isn't treated as volatile cache either.
Like many early adopters, I too had a bunch of failed Vertex 2 drives and sometimes observed similar things. I think this might be because the drive lost some updates to its FTL about where it wrote new data, which could plausibly lead to both new files vanishing and changes to existing ones being apparently undone.
Yes, the Sandforce controller in Vertex 2 was a real unstable beast. I lost maybe three or four Vertex 2 drives. All under warranty, except the last one which suddenly vanished and was not detected anymore.
OCZ does not exist anymore. Not sure if this was one of the causes, but either way I would never buy a drive from them again.
but the firmware (or firmware memory) has gotten corrupted due to some programming error, electrical glitch, or cosmic ray.
A lot of SSDs also store their main firmware in the same flash that is used to hold user data... this is something which was done with hard drives too (hence why dead/dying HDDs sometimes show up as a small drive with a weird name --- that's the "recovery mode").
A new controller is probably worthless to you, even if the old one wasn't storing the data encrypted. You don't have the map between logical sectors and locations in flash.
Your data is probably jumbled up one way or the other, but at least a custom board will let you read all of the underlying flash, instead of just the portions of it that a new controller would believe are in use. (Keep in mind that SSDs have more physical blocks than they advertise logically.)
If that's the case, I wonder if controller failure could be prevented by ECC controller memory? Or would software failure recovery be sufficient to make a highly reliable controller?
With drives you used to be able to rip off a controller from an identical model and wire it back on to read the data.
With SSD or NVMe the controller isn't really a separate component you can just replace. Maybe it's possible to saw off the broken part and bodge-wire it to a working surrogate, but that would be extreme.
A tear-down of a broken SSD might reveal more about what could be done.
FWIW I switched entirely to using Samsung’s SSDs and haven’t had any issues. They seem to be above average for firmware quality, but I don’t have statistics to back that feeling up.
Sometimes the dead SSDs will respond to a handful of commands anyway, meaning that you can attempt a firmware upgrade or reset. That’s the lucky case. But in the majority of cases I’ve seen, the firmware/controller goes into some hard lockup where it no longer processes SATA commands at all. I wish these things had JTAG ports...
This is not a technological problem, it's a cultural one. These problems are easily fixed ("easily" by the standards of technical problems that regularly get fixed in other regimes). The reason they don't get fixed is that the customer reaction to failures like this is to rant at the mysterious storage gods that are making their lives miserable.
Needless to say, there are no mysterious storage gods. These are artifacts made by humans, and somewhere out there, there is an engineer who either understands why these failures are happening, or knows how to engineer these devices in such a way that when these failures happen, the cause can be determined, and then a design iteration can be done to reduce the failure rate and make the failure modes more robust. The reason this doesn't happen is that customers aren't demanding it. If major purchasers started demanding, essentially, an SLA from their SSD manufacturers, with actual financial consequences for violating it, you would be amazed how fast all of these problems would get fixed. But instead we vent our frustrations in blog posts and HN comments :-(
Physical devices don't have SLAs, because they aren't services. SSDs have the physical equivalent of an SLA, a warranty. I haven't seen stats, but in my experience the odds of an SSD dying within the warranty period is very very low.
If you want your storage to have an SLA, storage service providers exist, and will be happy to give you an SLA if you're willing to pay. But it isn't cheap.
Yes, I know that. That's why I said "essentially".
> SSDs have the physical equivalent of an SLA, a warranty.
There are two orthogonal issues. The first is what happens when a device fails. A warranty addresses that. The second is how does it fail. Does it fail all at once with no warning, no way to perform post-mortem diagnostics, and no way to recover the data? Or does it fail with a gradual degradation of performance and capacity over time, and in a way that, if/when total failure occurs, the cause can be ascertained and the data still recovered somehow?
>Does it fail all at once with no warning, no way to perform post-mortem diagnostics, and no way to recover the data? Or does it fail with a gradual degradation of performance and capacity over time, and in a way that, if/when total failure occurs, the cause can be ascertained and the data still recovered somehow?
how does that have any relation to an SLA? An SLA is a promise of a certain amount of uptime, with financial penalty for the provider if not met. A warranty is a promise of a certain product lifespan, with a financial penalty for the provider if not met.
SLAs have nothing to do with providing diagnostic information.
A warranty is like an SLA that only refunds you the percentage of time the service was down. It's not nothing, but it's extremely weak. They're more worried about annoying you than the pennies of lost profit, and your payout isn't nearly enough to make up for the trouble caused.
I've experienced a few seriously strange issues with modern SSDs, even some of the better ones.
I had a 512GB Samsung drive that became very slow randomly at doing IO operations, the whole machine would die for 10-30 seconds at a time once or twice a day while any process that tried to use the disk became blocked on IO. Then it'd come right back like everything was perfectly fine.
Issues like this definitely worry me, we're basically completely blind as to what those controllers and flash chips are actually doing. Not that it wasn't a similar situation with HDD controllers before, but at least it didn't seem as unpredictable.
I've recently noticed similar issues with my older Crucial SSD in my 2012 mbp.
In the past 3 months, there have been maybe 5 times where I started my laptop and it took 5+ minutes to finish booting. Normally, it's 15 seconds. Once I'm logged in, doing anything takes forever, but it does eventually load. Powering it off and back on got it "working" again, but who knows for how long.
I've had this drive for 4-5 years now, so I'm impressed it's lasted this long.
If you're on Linux, try running fstrim from time to time. More known-free space makes life much easier for the SSD's garbage collector / defragmenter. Anecdotally, running fstrim on my Toshiba drive reduced freeze-ups under heavy load from 1-2s to almost nothing.
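In case anyone wants to script it themselves, here's a rough sketch of what a periodic trim could look like (Python; assumes Linux with util-linux's fstrim available and root privileges, and the mount points are just placeholders). On most systemd distros simply enabling the stock fstrim.timer gets you the same thing.

    # Rough sketch only: trim mounted filesystems on a schedule so the SSD's
    # garbage collector has more known-free space to work with.
    # Assumes Linux, util-linux's fstrim on PATH, and root privileges.
    import subprocess
    import time

    MOUNTPOINTS = ["/", "/home"]        # placeholders, adjust to your layout
    INTERVAL = 7 * 24 * 60 * 60         # weekly, roughly what fstrim.timer does

    while True:
        for mp in MOUNTPOINTS:
            # -v reports how many bytes were trimmed, handy for a log
            result = subprocess.run(["fstrim", "-v", mp],
                                    capture_output=True, text=True)
            print((result.stdout or result.stderr).strip())
        time.sleep(INTERVAL)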
I worked at a storage company, and they reinforced to us that not only does the OS lie to us, the hard drives also lie to the OS. So you can't take anything you get from a hard drive as reliable; you have to verify the data once you get it, e.g. via a CRC. Data can get corrupted at any time.
As densities of data get higher and higher, it doesn't take much to have a catastrophic data failure. The only way to protect against this is having multiple replicas of your data.
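As a toy illustration of that kind of end-to-end check (not any particular vendor's scheme; the block size and function names here are made up): keep a checksum per block at write time, and refuse to trust a read that doesn't match.

    # Toy end-to-end integrity check: store a CRC per block at write time and
    # verify it on every read. Illustrative only; real storage stacks use
    # stronger checksums and keep them away from the data they protect.
    import zlib

    BLOCK_SIZE = 4096  # made-up block size

    def write_block(f, checksums, index, data):
        assert len(data) == BLOCK_SIZE
        checksums[index] = zlib.crc32(data)
        f.seek(index * BLOCK_SIZE)
        f.write(data)

    def read_block(f, checksums, index):
        f.seek(index * BLOCK_SIZE)
        data = f.read(BLOCK_SIZE)
        if zlib.crc32(data) != checksums.get(index):
            raise IOError(f"block {index}: checksum mismatch, corrupted somewhere on the path")
        return data

The interesting design question is where the checksums live: if they sit on the same device as the data they protect, a lying drive can corrupt both at once, which is part of why replicas matter.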
"We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence."
We've experienced exactly the same thing. Our general course of action is to perform a hard power cycle of the server through IPMI - a warm cycle doesn't seem to work. I've always presumed it was down to dodgy SSD controller firmware given the way it suddenly stops appearing in the output of fdisk -l.
I have three SSDs in three different laptops/desktops. In their current host machines they've been working flawlessly for a couple of years. Prior to my figuring out which SSD paired best with which host machine, I experienced intermittent strange and catastrophic problems (unreadable sectors to complete data loss) with each one. These were different brands, different capacities, bought in different years.
It's sort of a devil's bargain - the performance of SSDs is so much better that I can't pass up using it over a spinning disk even if they occasionally lose everything. There was a great game for the original Nintendo called "Pinball Quest". As you advanced through the game you could get upgrades such as side stoppers, stronger flippers, etc. You bought these items from a demon in between levels. After the red "Strong Flippers", the next upgrade was the purple "Devil's Flippers". The trick was that occasionally they'd turn to stone when you needed them and possibly cause you to lose the pinball. But they were such an upgrade over the Strong Flippers (when they weren't turned to stone) that you bought them anyway.
It also doesn't help that Windows 10, at least, now seems "optimized for SSD" in the sense that performance is quite terrible on a traditional HDD. I imagine this will become more and more common as seek times and hard-drive thrashing become practically invisible to users as well as developers. It will only get harder to go back as time goes on.
I just don't keep anything important on my SSD. My desktop's SSD is for Windows and games. All documents and other stuff goes on my mechanical drives and my AppData folder is backed up every night too. Everything on my laptop's SSD is either in cloud storage or in an external git repo. I'm 100% prepared for the certain eventuality of any of these SSDs going tits up unexpectedly and catastrophically.
I'm more worried about everybody else who gets an SSD and doesn't take the right precautions, because everybody sells SSDs as being so much more reliable than mechanical drives.
I worked as a PC technician for a while recently. Of the handful of catastrophic mechanical-drive failures we saw, the majority were drives that had been physically dropped, resulting in a head crash. Otherwise, we almost always managed to save data from failing drives. Any failing SSD we encountered was simply dead, since there are really only two states: fine or failed. There was nothing we could do except refer people to a data recovery company that charges thousands of euros.
I've had a few SSDs give me random issues, and they're so hard to pin down: sometimes they just work, other times they abruptly stop, or they aren't detected until something like three reboots later and then work fine. Once you've had trouble, they make you feel like you're hanging onto the hope that the ground won't fall out from under you.
You also CAN'T HEAR when there's an issue, whereas with a hard drive the sound is another warning sign that something is going wrong or soon will. Loud ticking or clicking, or a drive that sounds like it's working overtime, is a sure sign to start backing up and get ready to buy a new drive!
Call me crazy, but I don't think that a Crucial MX300 is the best choice for an enterprise worthy ZFS drive. I get what the author is concerned about, but I wouldn't be that surprised that a consumer level SSD failed in what sounds like a heavily used fileserver.
Modern takes on RAID are built in to home OSs these days, and consumer grade NAS devices are pretty common. It really isn't a fancy enterprise-only technology at all.
Very true. The slack reserve space, RAM buffer (and/or battery), and NAND process are the main things that make an enterprise drive.
The author doesn't specify how much he writes, but based on the MX300 specs I can find, the drives are rated for up to 219 GB/day for the 2TB models he uses, or 87 GB/day for the couple of 525GB drives he still had.
I don't know why my hard drives died either. A physical motor breaking is more tangible, but a contact wearing out is also imaginable. I don't really care why SSDs or HDDs die; I care that they do, and therefore I have backups (well, ideally I would). I've had spinning rust fail on me while I was sitting right at the machine, and that didn't help me save it; it might as well have been dead in zero seconds.
I don't really care why hard drives die either, but I like that, more often than not, I get some warning. SMART logs, or weird kernel complaints, in my experience, are frequent precursors.
I'm a little scared about my new SSDs that have replaced a few rust-spinners in our data center.
That's the big difference for me. Drive's making tictictick sounds? Kernel log full of bigScaryErrorsLikeThis? It's time to ditch that disk before the disk ditches you. Make it happen. Panic-Backup if you need to. etc.
Every SSD failure I've had, the failure mode was "what SSD?"
Now, I realise most people should ponder their backup regime before the tictictick, not after. But as the phrase goes "The best time to plant a tree was 20 years ago. The second-best is now." The SSD equivalent is "The be- nope, too late."
They're just terribly unforgiving, which doesn't fit with a culture that values cure over prevention.
It may be irrational, but I remain very distrustful of SSDs, in part for reasons like this. I use them occasionally as temporary storage, but I don't use them for anything that would cause me a headache if the drive died without warning. So far, my observation is that their lifespan is considerably shorter than spinning platter drives, and spinning platter drives typically give plenty of warning before actually dying.
Perhaps I'll grow more comfortable after another decade or so, when there is enough real world experience to go by.
Maybe I am being overly simplistic, but shouldn't it not matter?
Who in the modern age doesn't back up everything all the time? Don't we all operate with the assumption these things are going to blow at any time? 90%+ of my data is on cloud storage now anyway. When a SSD goes out don't you just chunk it in the drawer of old drives that you promise to take to the disposal center this weekend (and never do) and then take a quick trip to your local computer store for a new one?
This reminds me of something an IT support staffer told me a long time ago: "The difference between an IT pro and a user is that to an IT pro, hard drives are a consumable resource".
Replacing an SSD is not free, and in most cases it's not easy. Maybe an IT pro can just roll down to the computer store for a new one, put it in their laptop (for free!), and throw a $100+ drive in a drawer without even thinking about warranty, but most people can't. A backup doesn't excuse excessive rates of failure and weird glitches.
Replacing an SSD in a modern Apple laptop is literally impossible. You need to replace the whole dang laptop (or motherboard, whatever they call it these days), which is not something a user can do.
Thank goodness for Backblaze, Time Machine, Carbon Copy Cloner, Drobo and Synology. Maybe I have gone overboard, but I have not lost any data in 12+ years.
The post is about predictability and warnings before a drive dies. The work and cost to replace it doesn't change anyway (since you do replace the drive at the first sign of warnings, right? otherwise what's the point of wanting them?). The only difference is if you get an extra chance to copy the data before you replace the drive - which is no difference at all if you have proper backups.
If we'd be arguing about mean time between failure or total cost of ownership, then it'd be relevant, but this post isn't even claiming that the rates of failure are excessive (compared to what?), just that they are too weird and unpredictable for the author's liking.
It matters. The existence of backups doesn't change that. If you have a failure (be it hardware, software, or PEBKAC) that requires you to restore from a backup, you're going to suffer system downtime and you'll have to spend your own personal time restoring things. Those are not cheap, and we haven't even talked about the cost of new hardware.
I've done a bit of ad-hoc reliability testing with SSDs.
Some years ago I got a great deal on several Pacer disks and wrote a program that writes a pseudo-random sequence of data (from a known initial seed) across the entire disk, reads it back, and compares. Part way through, the data didn't match: no ECC errors, nothing raised by the filesystem, just mismatched bits that came back in a way that tried to "trick" me into thinking they were good data. This happened on something like 5 of the 8 disks. Needless to say I sent those crappy SSDs back to the manufacturer (unfortunately only got a 2/3 refund) along with some harsh words for their engineers.
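For anyone curious, the test was along these lines. This is a rough reconstruction of the idea, not the original program; DEVICE is a placeholder, it needs Python 3.9+ for randbytes, and it is destructive since it overwrites the whole target.

    # Whole-device write/verify test with a seeded PRNG. Sketch only.
    # DESTRUCTIVE: overwrites DEVICE. Writes go through the page cache here;
    # a stricter test would use O_DIRECT or drop caches before verifying.
    import os
    import random

    DEVICE = "/dev/sdX"   # placeholder, never point this at a disk you care about
    CHUNK = 1 << 20       # 1 MiB per write
    SEED = 12345          # known seed so the data can be regenerated for verify

    def fill(dev, seed):
        rng = random.Random(seed)
        written = 0
        with open(dev, "wb", buffering=0) as f:
            while True:
                chunk = rng.randbytes(CHUNK)
                try:
                    n = f.write(chunk)
                except OSError:          # hit the end of the device
                    break
                written += n
                if n < len(chunk):       # short write: also end of device
                    break
            os.fsync(f.fileno())
        return written

    def verify(dev, seed, total):
        rng = random.Random(seed)        # same seed -> same expected bytes
        mismatches = 0
        with open(dev, "rb") as f:
            offset = 0
            while offset < total:
                expected = rng.randbytes(CHUNK)
                want = min(CHUNK, total - offset)
                if f.read(want) != expected[:want]:
                    mismatches += 1
                    print(f"mismatch in the {want} bytes at offset {offset}")
                offset += want
        return mismatches

    total = fill(DEVICE, SEED)
    print(f"wrote {total} bytes, found {verify(DEVICE, SEED, total)} bad chunks")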
I've had more name-brand SSDs fail too, in various manners (even well-reviewed Kingston drives). Sometimes in ways where the drive can't be accessed at all; other times (at best) in a way that no longer allows writes but still allows reads, albeit at a trickle of a data rate.
These days I use solely Intel top-line SSDs, and some (very limited) Samsungs. The choice isn't based on empirical data, but rather an impression that their bar is a little higher (or more conservative) in terms of reliability, and on simply not wanting to deal with the issues I seemed to keep hitting with other brands. The downtime lost to restoring / reconstructing just isn't worth it to me. Maybe I'm paying twice as much as I ought to, but since making the switch many years back it's worked out pretty well and I've been happy / fortunate.
I run my SSDs in RAID10 using high-end controllers (aside from a few in ZFS).
Just my own subjective experiences, again I'm not doing this at scale.
I recently had a similar SSD failure, although it wasn't in a "new fileserver" but my daily use 2013 desktop. It was working, then it was producing write errors corrupting my filesystem, then the whole system died, very quickly. Fortunately for me, some data was recoverable from the corrupted disk; I had a local backup from 12h prior, and a tarsnap backup from about the same time back.
(Um, here's where I have to be critical of tarsnap: their recovery performance is absolutely abysmal for small files. They're latency bound between you, their EC2 instance, and the backing S3 store. Think single or double digit kB/s and then think about how much data you back up with tarsnap. I can't recommend any other backup provider better, but this is an experience where tarsnap left me very disappointed.)
Looking at that SSD's SMART data and my other SSDs', they report the number of spare blocks remaining, and you can monitor that as it goes down. Ideally you replace the drive before it reaches zero.
My primary mistake was simply not monitoring that data in an effective way.
I don't think anyone who monitors HDDs has any real expectation that the high-level SMART yes/no is going to protect them from data loss. Instead they look at highly predictive factors like "Reallocated_Sector_Ct" or "Raw_Read_Error_Rate" (or even plain old "Power_On_Hours").
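If anyone wants to script that, this is roughly the level of effort involved. It assumes smartmontools is installed; /dev/sda is a placeholder, and the attribute table layout can differ between drives and smartctl versions.

    # Print a few of the predictive SMART attributes mentioned above via smartctl.
    # Assumes smartmontools is installed; the device path is a placeholder.
    import subprocess

    DEVICE = "/dev/sda"
    WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable", "Raw_Read_Error_Rate", "Power_On_Hours"}

    out = subprocess.run(["smartctl", "-A", DEVICE],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        fields = line.split()
        # attribute rows look like: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH:
            print(f"{fields[1]}: raw value {fields[-1]}")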
NVMe SSDs provide SMART-like data on log page 2 ("Available spare", "Percentage used", "Power on hours"). For some reason the NVMe spec does not require the device to accept host-initiated self-tests, so most NVMe drives have no equivalent of smartctl --test. :-(
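The NVMe side is scriptable too, assuming nvme-cli is installed; the field names below are what recent nvme-cli versions print for log page 2, but they may vary between versions.

    # Pull the wear-related fields from NVMe log page 2 via nvme-cli.
    # Assumes nvme-cli is installed; /dev/nvme0 is a placeholder.
    import subprocess

    out = subprocess.run(["nvme", "smart-log", "/dev/nvme0"],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        if any(key in line for key in ("available_spare", "percentage_used",
                                       "power_on_hours", "media_errors")):
            print(line.strip())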
For my home setup, at least, it's simple: put the OS on a dirt-cheap 120GB SSD, and all the user data on a multi-terabyte hard disk. You can always selectively migrate other performance-critical (but expendable) stuff onto the SSD later. If it breaks, I just buy another one and reinstall the OS. On laptops that can only take one drive, the SSD is it, but so is the awareness that the data on it has to be considered ephemeral. I've had assorted hard disks die over the years from old age, and so far without exception they've been "mostly" recoverable - might have to give up on a few files that got hit by bad sectors, that sort of thing. And I've been warned about impending failure by SMART diagnostics.
My first experience with drive failure was a ~40MB HDD expansion card in a 386. The bearings got "sticky", so the spindle wouldn't start rotating. But there was a hole covered with aluminium tape, and you could insert the eraser end of a pencil and give the spindle a nudge. So yes, very understandable.
Not too much later, I used Iomega ZIP drives, and experienced the "click of death". That was sudden, and irreversible, but also very understandable.
For the past couple decades, I've consistently used RAID arrays, mostly RAID1 or RAID10 (and RAID0 or RAID5-6 for ephemeral stuff). I've had several HDD failures, but they were usually progressive, and I just swapped out and rebuilt.
I recently had my first SSD failure. And it was also progressive. The first symptom was system freeze, requiring hard reboot, and then I'd see that one of the SSDs had dropped out of the array. But I could add it back. At first, I thought that there was some software problem, and that the RAID stuff was just caused by hard reboot.
But eventually, the box wouldn't boot, so I had to replace the bad SSD and rebuild the array. It was complicated by having sd1 RAID10 for /boot, and sd5 RAID10 for LVM2 and LUKS. So I also had to run fdisk before device mapper would work.
Reading that blog and its sister post about "flaky SMART data" on those same Crucial MX500 drives reminds me that not all SSDs are created equal.
Just like not all hard drives are created equal. My previous job involved a decade running 10 cabinets of servers an hour away with very little manpower: we eventually came to find that IBM/HGST drives were a lot more reliable than others.
We also evaluated some early SSDs, and they were terribly unreliable. We eventually settled on the Intel drives and they were superb. My new job we've been using mostly Intel and Samsung Pro drives, they work great. But Dell sent us a server with some "enterprise SSDs" in it, that we eventually found were Plextor drives. Those things were terrible. We replaced them immediately with Intel, but used some of the Plextor drives and had all of them fail within a year. I'd put the Intel 64GB SLC drives from our 7 year old database server in a system before I'd put one of those brand new "enterprise" Plextor drives in.
I love Crucial, I buy a lot of RAM from them, but I'm skeptical of switching to other brands of SSDs. The more experience I have, the more conservative I get with systems that matter.
I had a bunch of Crucial SSDs die a few years back: they'd work for an hour, then disappear from the bus. Reboot and they'd work again for an hour. It turned out Crucial used a small counter to track uptime in hours; it would increment until it overflowed, and then the firmware crashed. This failure could just as easily have occurred on a spinning HDD.
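A toy illustration of why that kind of bug hides for so long (made-up field width, nothing to do with Crucial's actual firmware): a counter kept in a small fixed-width field looks fine for months, then wraps, and anything that assumed it only ever grows breaks at exactly that moment.

    # Illustrative toy, not real firmware: an hourly uptime counter stored in a
    # small fixed-width field wraps to zero after enough hours, breaking any
    # code that assumes the value is monotonically increasing.
    FIELD_BITS = 16                      # made-up register width
    MASK = (1 << FIELD_BITS) - 1

    last_seen = 0
    for hour in range(70000):
        stored = hour & MASK             # what the register actually holds
        if stored < last_seen:
            print(f"wrapped at hour {hour}: stored value went {last_seen} -> {stored}")
            break
        last_seen = stored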
I actually much prefer this SSD failure mode: Unlike failing spinning rust which will happily linger around coughing up bad data (which will then be written to backups, mirrored drives, etc. potentially creating a huge mess) an SSD going out like a light is comfortingly binary.
It's the thorny problem of failing cleanly versus degrading gracefully. In a RAID system you want failures to be clean, but not synchronized across drives. In a single drive, some sort of graceful degradation is usually preferable.
Relevant: there is a project called LightNVM [0] which is pushing for a much lower level API to SSDs, that allows most of the complexity to be moved into the host OS (namely, Linux).
To add to the anecdata: My most recent SSD failure happened when I did the firmware upgrade. It worked before the upgrade, the upgrade binary said 'upgrade failed' and the disk vanished and never returned after the 'upgrade'.
This post, more than any other, just convinced me to pull out my old Unison file-sync configuration (which was really good, looking at it again) and get regular syncs to my NAS (which in turn uploads to cloud storage) working properly again.
Having recently swapped 100TB of spinning media to SSD, I am awaiting the first failures. Being a business environment, it is all mirrored capacity. So I guess my question from the article is: are they running on a single device? No RAID or mirror?
I am loath to keep even my personal data at home on one drive, and since I use an iMac that means relying on Time Machine, as mirroring the internal drive is not really possible (or at least I did not spend enough time researching it).
I think those drives dying quickly is actually a Good Thing™, because chances you're backing up corrupt data might become smaller…
With the older drives you sometimes would have a drive die, replace it, restore your backup only to find that in the process of dying the drive was actually corrupting some of the data which went into the backups, now you've got to hunt down the last uncorrupted versions of the data in the backup…
I don't know the technical side enough to give any real answer, but tossing them in the freezer (properly sealed of course) always seemed to help "loosen" them up enough to get data off.
I've also had some hard drives that you could bring back to life by giving them a firm knock with your knuckles too.
Doesn't really answer much, but it's a last ditch effort that has saved me more times than not.
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"
This is incorrect, and since much of the argument seems predicated on it, I don't see a real issue here.
Redundancy, replication (being able to recreate one failed drive from a certain number of other drives), reliability data, and a replacement budget. That's difficult for personal use.
TL;DR Lack of noises makes SSD drives bad at motivating users to do backups or use redundant storage: they don't seem to be on the verge of catastrophic failure.
I think operating systems should be programmed to wipe out a drive completely once early in the life of every user (around age 20-ish) to burn in their brain the need to back up!
My father had the same philosophy about me getting into a car crash early in my driving life: "Now that you have gotten that out of your system (and I'm glad you're fine), don't ever do it again."