I worked on SSD firmware for quite a long time and here is my perspective.
Early flash used to be fairly reliable and needed only minimal error correction. However, with increasing density, smaller process nodes, and multi-level cells, it has become progressively less reliable and slower. Here are some of the things that we need to worry about: https://www.flashmemorysummit.com/English/Collaterals/Procee...
To compensate for all these deficiencies, the SSD architecture, and hence the entire FTL, becomes very complicated, because any part of it can become damaged at any time. We always have to have backup algorithms to recover from any scenario. It's difficult to build algorithms that can recover from arbitrary failures in a reasonable time. I cannot have a drive sitting around for 20 minutes trying to fsck itself.
Another problem is that the job, while rewarding, is not very lucrative. The chance of a multi-million dollar payoff for an employee is low. I have a higher chance of becoming a millionaire working on a web-connected gadget. So it is really hard to recruit top-notch programmers who know how to figure out the algorithms, write the code, and debug the hardware. Most new grads these days are interested in Python, JavaScript and machine learning.
> Another problem is that the job, while rewarding, is not very lucrative. The chance of a multi-million dollar payoff for an employee is low. I have a higher chance of becoming a millionaire working on a web-connected gadget. So it is really hard to recruit top-notch programmers who know how to figure out the algorithms, write the code, and debug the hardware. Most new grads these days are interested in Python, JavaScript and machine learning.
That seems unfortunately true for most low-level infrastructure software. I am very good at those things, and have a medium amount of experience building web software. Yet I can find a lot more jobs in the latter domain, and they probably won't pay worse (most likely they'll pay better).
Besides pay, the unrewarding part of this domain is mostly that the high complexity is often not properly understood by management, not rewarded, and not taken into account in planning, which then leads to suboptimal products delivered under time pressure.
Infrastructure almost always becomes a commodity, or at least moves behind the scenes. For it to be a lucrative, interesting technical position, you need a large-scale enterprise (like Google) serving millions of people, and you have to promote it as a worthwhile career path (which Google also does, with the SRE role).
Let's be honest, the chance of a multi-million dollar payoff for an employee is low regardless. Young entry-level devs are optimistic and also vulnerable to believing a line of bullshit on how much their options might be worth some day. I do agree it is a "higher" chance in web/mobile technology, sort of like how your chance of winning the lottery is "higher" if you buy 10 tickets instead of 1.
You still get paid a lot more working at Google on generic backend protobuf shuffling than you will working on SSD firmware at a hardware company or on Intel's C++ compiler.
For those doubting you, the going rate for embedded engineers out here in the Denver area where a lot of these SSD controllers are designed is ~$90k. Embedded engineers get peanuts for some reason.
I've noticed the same thing, and suspect that it's related to the way that higher level software scales compared to embedded.
If a line of code is written to run in a customer's browser, then that line of code may be deployed to millions, maybe billions, of customers. But if an equivalent line of code goes into firmware for some widget, then you're doing pretty well to get that line into a million widgets at all, and it's going to take a lot longer too.
I think it's an issue related to the visibility of the quality of the work. If the product is even 20% more reliable (whatever that exactly means), hardly anyone will actually notice; nobody notices the absence of an error, even though it probably took a huge effort to achieve. Making a user flow just a little bit nicer, on the other hand, is very visible and gets attention.
I always felt that ISPs suffer a similar problem. Nobody cares if everything works as expected, but we'll get upset if it doesn't. Beyond what we already take for granted, there is hardly anything they can do that we will actively appreciate. What could an embedded engineer working on SSDs do that would be noticed, appreciated and not taken for granted by customers?
If a software product takes off, that can happen incredibly quickly, and the new product is primarily composed of code. If a hardware product takes off, the change can't be nearly as fast as it's bound by manufacturing, and the code is just one component in each thing.
I happen to be working on firmware for a VOIP phone today - we'll end up making N million of these things, over some number of years. If I were working on an Android app with similar functionality, that app could conceivably go to N million people tomorrow, or 10N, or 100N...
Anyway, I don't think I have a particularly clear or concise (or even correct) argument here, but it's the only way I've been able to rationalise what we've observed.
I agree with your comment about scaling, but I think you have underestimated the firmware deployment figures.
Ultimately, whatever we write that goes into firmware is hidden from the customer. The customer pays a price per unit, and other hardware vendors are competing against your product. This competition keeps the overall cost low. Except at the very top level like Intel or Samsung, semiconductor manufacturers seem to be fighting one financial crisis after another.
Competition does not work like that in software. The products (even in the same domain) are all too different from each other, so though they may be competitors, they are rarely in direct competition.
Nah, it's just management incentives, with nothing on the liability side of the product to counter that. They keep labor costs low. The company still makes piles of money off the product. There's no liability for devices, especially cheap ones, that fail randomly after such and such a period of time. So there's no downside to keeping firmware labor costs down, and they keep doing it.
In Xbox, many of the firmware hires the hardware folks made seemed to be paid poorly. Also, there were tons of contractors, and not much institutional knowledge was retained. (At one point they had to pay a consulting firm to decompile the firmware for a controller because they had lost the source code.)
"Why do we need source control? It's all there, right on my laptop. Source Depot is just a bunch of trouble." [rough quote from memory, maybe conflated from a couple of engineers]. I'm happy to report that things got better.
On the software side of Xbox the people were much better compensated, and we wrote lots of firmware, too. It was probably harder to be hired, though.
"Why do we need source control? It's all there, right on my laptop. Source Depot is just a bunch of trouble."
I've threatened to withhold paychecks from employees who have said this to me. The job isn't done till the code is checked in, building, and backed up.
There are A LOT fewer people. But the demand is also lower than for people who can build simple websites. It seems like the latter matters a lot more than the former for setting the market rate on salaries. Or maybe it's the fact that embedded work is often done by electrical engineers, who are seen as a different pay category in some countries.
I know what you mean--they could get paid a lot more elsewhere--but it still weirds me out that techies consider $90k/year "peanuts". I know people trying to raise children on one-third of that.
The hardware companies are not super profitable in the first place. Most just struggle to keep the lights on in the race to the bottom on margins. Also, a lot of these improvements don't usually translate to higher sales, because MTBF data is typically not even published for many consumer SSDs.
The solution for this would be to open source such projects so engineers from many smaller companies can collaborate. These companies need to understand that they will not be able to attract top talent, and that collaboration instead of competition is the way forward.
If you took the doubling of productivity since 1975 and inflation-adjusted the median income from then ($7,750), you would get about $68,000. In practice wages haven't risen with productivity and GDP, so the real figure is $45,000.
To make it look even worse, GDP tripled while the population only rose 50%, so if you adjusted wages against total economic growth instead of productivity, the median should be around $110,000. Median.
So yes, $90k is pretty much peanuts for how much money an embedded engineer would be on average making for their employer.
Users and administrators almost certainly prefer a 20 minute IO latency over data corruption. Host operating systems should probably flag an IO as failed long before 20 minutes, and then you know 1) nothing made it to disk, and 2) you have some chance of avoiding additional corruption, if, e.g., the OS is smart enough to kick out the drive when this happens.
> Another problem is that the job while rewarding is not very lucrative.
Do you mean it's lower paying than typical bigcorp software jobs outside of FAANG, or just that there aren't a lot of startups with astronomical valuations in the media FTL space?
Last I checked it was nearly twice as lucrative to be a Ruby-on-Rails developer as an embedded engineer.
Embedded also attracts a certain type of engineer, usually very smart and able to manage extreme complexity with attention to detail but at the cost of anything resembling readable, let alone maintainable, software. The fact that anything at all works in the modern world is amazing.
I left the embedded space and have never looked back.
So $SALT_MINE, a highly profitable, privately held Fortune 500 company, just decided to revamp their pay scales. They've now decided that they want to be at the 50th percentile, remuneration-wise, in the durable goods sector. That is, the white goods sector: washing machines and such.
I predict we will lose all of our engineers - embedded dudes included.
May I ask how you managed to leave the embedded world and where you went after? I'm asking since, after investing 6 years into this field, which I love, and jumping between a couple of companies, I realized the market (Europe) is really bad for this gig. Not only is our work highly challenging, it's also poorly paid, while at the same time our CEO is crying to the local press that they can't find devs (to work for peanuts) and is forced to look for them in Asia.
I just went to do something else. I'm definitely a generalist, and now work mostly with "big" data.
I had zero embedded experience when I started doing embedded dev, and then I had zero data experience -- but a surprising amount of general experience is applicable!
"Embedded also attracts a certain type of engineer, usually very smart and able to manage extreme complexity with attention to detail but at the cost of anything resembling readable, let alone maintainable, software. "
Heaven for generalists that always love doing new kinds of things. Once I learned about it, I knew I probably should've done embedded instead of security research. Of course, now there's significant interest in overlap. Might not be too late to learn all that stuff after all. :)
As someone who identified a bug in Drobo firmware once and was offered a job on the spot, I think the problem with attracting talent is twofold.
The first problem really has two parts: not only is it rare to find people who have a passion for storage-related technologies, but very few will ever gain the exposure to these technologies needed to develop that passion.
Kids don't routinely grow up with a SAN in the house. They do tend to grow up with lots of internet connected consumer caliber devices and can easily gain exposure to working with these technologies.
I was fortunately able to explore this type of technology in depth because a family owned business let me tinker with their server equipment in high school.
After college I then co-founded a startup back before the cloud became big. That meant we needed to make use of old hardware to provide service to our customers at a price point that made our service profitable. Old drives were not a reliable way to do that. New drives were extremely expensive for old servers back in the day when SCSI was the interface that you expected for a server. We had to get creative and play with JBOD devices. ZFS was an amazing tool for us in those days, and it still is for anyone who wants to tinker.
The other aspect is that while these skills are valuable for creating a "job" they do not have potential for creating "massive wealth". Why learn about storage if you aren't going to be part of the first 10 employees at a company that has a $10B exit? Let Amazon and the other cloud vendors worry about that stuff.
Knowledge is power though. I recently came across an AI startup that I'm now helping. They were spending significant money using GPU computational power to provide artificial intelligence training through a cloud provider. They blew through about $300k in credits within the first year to give you an idea of how much money that type of power can cost.
I am now helping them cut over to their own co-location facility. The first year alone they will save so much money it will pay for the next three years.
Reading that helps reinforce the idea that no matter what path you are on in this field, there is a chance that some random thing you learned about SSD firmware helps you optimize some growth-stage company's product, and ultimately that helps you build wealth.
>very few will gain exposure to these technologies to develop that passion.
This, exactly.
SSD firmware is opaque and hard to learn from the outside. A trending web framework, on the other hand, has all of its source code open, great documentation, and ready-to-use tools. No wonder young people today find their passion in other things rather than SSDs.
I'm mostly curious because I work in a storage-adjacent field (NAS) for a BigCorp and the pay is pretty good, if not quite FAANG level. It's not a startup by any means, but I will easily become a multi-millionaire in a handful of years. I was curious about the other side of the fence.
Huh? If you're going to "easily become a multi-millionaire in a handful of years", then your pay is more than pretty good, and certainly not worse than FAANG level.
Sorry, handful of years from today. I've been working for 7 years now. FAANG comp would probably be 20%-25% higher; I'm mostly good at keeping my expenses down and saving a high proportion of my income. I've also had the good fortune of the bull market working in my favor for the entire time I've been employed.
Even at that pedestrian level, if you pack them tightly enough I could imagine fitting a few billion years at least into the volume of an average-sized handful!
It's been about 5+ years at this point so I don't recall all of the details. I can recall that they offered me a job after I pointed out the bug. From what I recall, I politely declined but I helped them test a beta firmware for a while.
>The other aspect is that while these skills are valuable for creating a "job" they do not have potential for creating "massive wealth". Why learn about storage if you aren't going to be part of the first 10 employees at a company that has a $10B exit? Let Amazon and the other cloud vendors worry about that stuff.
That seems like a completely ridiculous way to try to organize your life. Almost no companies have $10B exits.
> Users and administrators almost certainly prefer a 20 minute IO latency over data corruption.
If the drive part of a RAID setup I would actually prefer it just reports itself failed and doesn't slow down access to the array by scanning itself for 20 minutes.
As far as I know, that is one of the main differences when buying enterprise or NAS drives compared to consumer drives. With NAS drives, the firmware gives up very quickly, since the drive is assumed to be part of an array with redundancy. Consumer drives will retry reads for a very long time before reporting an I/O error.
This is a bit of a misrepresentation. The only flash that never had ECC was NOR. Some embedded systems had NOR, but it would be incredibly rare to find a consumer SSD with NOR.
When designing a NAND memory product, you aim for some max allowed error rate. You choose the error correction algorithm based on that target. Error rates for NAND products are precisely what the designer intended.
Because SSDs are so large and there is such a large number of them, errors can still occur (at a known rate). FTLs should take that into account. Critical data structures can add checksum redundancy which can reduce the error rate for those to an even lower value, which is usually necessary anyway since power disturbances during erase or program can cause programming errors.
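To make the checksum-redundancy idea concrete, here is a minimal sketch of how a critical FTL structure might be protected against torn or corrupted writes. Everything here (names, sizes, the toy checksum) is invented for illustration; real firmware would use a proper CRC or ECC and its own vendor-specific layout.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-flash header for a critical FTL structure (say, part of
     * the logical-to-physical map). Two copies live in different blocks; on
     * boot, the copy with the highest sequence number that also passes its
     * checksum wins, so a write torn by power loss is simply ignored. */
    struct ftl_meta {
        uint32_t magic;        /* identifies the structure type */
        uint32_t seq;          /* incremented on every update */
        uint32_t payload_len;
        uint8_t  payload[4096];
        uint32_t checksum;     /* over everything above; real firmware would
                                  use a CRC or ECC, not this toy sum */
    };

    static uint32_t toy_checksum(const void *p, size_t n) {
        const uint8_t *b = p;
        uint32_t sum = 0;
        while (n--) sum = sum * 31u + *b++;
        return sum;
    }

    /* Return the newer of the two candidate copies that passes its checksum,
     * or NULL if both are damaged (the "fall back to a recovery scan" case). */
    static const struct ftl_meta *pick_valid(const struct ftl_meta *a,
                                             const struct ftl_meta *b) {
        size_t n = offsetof(struct ftl_meta, checksum);
        int a_ok = a && a->checksum == toy_checksum(a, n);
        int b_ok = b && b->checksum == toy_checksum(b, n);
        if (a_ok && b_ok) return a->seq >= b->seq ? a : b;
        return a_ok ? a : (b_ok ? b : NULL);
    }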
There are of course a number of patterns that increase error rates that FTLs have to be programmed to prevent. Encrypting is the first important step, since it makes the data unlikely to be uniform. The second is mitigating read disturb.
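And here is roughly what the "make the data non-uniform" step looks like when done with a simple randomizer rather than full encryption. The seed derivation and LFSR taps below are just placeholders; actual controllers use their own per-page randomizers (or AES), but the idea is the same: XOR the page with a pseudo-random stream keyed by its address, so repetitive host data doesn't become a repetitive charge pattern in the cells.

    #include <stddef.h>
    #include <stdint.h>

    /* 16-bit Fibonacci LFSR, taps 16/14/13/11 (a standard maximal-length set). */
    static uint8_t lfsr_next_byte(uint16_t *state) {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            uint16_t fb = ((*state >> 0) ^ (*state >> 2) ^
                           (*state >> 3) ^ (*state >> 5)) & 1u;
            out = (uint8_t)((out << 1) | (*state & 1u));
            *state = (uint16_t)((*state >> 1) | (fb << 15));
        }
        return out;
    }

    /* Scramble (or descramble -- XOR is its own inverse) one page in place.
     * Seeding from the page address means the same data written to two
     * different pages produces two different cell patterns. */
    void scramble_page(uint8_t *buf, size_t len, uint32_t page_addr) {
        uint16_t state = (uint16_t)(page_addr * 0x9E37u) | 1u;  /* never zero */
        for (size_t i = 0; i < len; i++)
            buf[i] ^= lfsr_next_byte(&state);
    }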
I don't mean to imply NAND flash never had ECC. Just that early flash needed maybe 1-3 bits of ECC, and I would never see a 1-bit error until maybe a few months of use. Things were a lot easier back then. Scrambling and read disturb were not an issue until later on. All the things you suggest can be done; there are just so many opportunities to screw up the implementation despite the best efforts.
I'm surprised no one has mentioned the real difference here. In most startups employees are granted equity. They are closer to worker collectives than they are to corporations. In corporations only the capitalists get any return; they are extracting value from their employees.
The point being, the solution to this sort of systemic problem is for more corporations to become worker-owned. If the guy writing the SSD algorithms has a say in governance and a cut of the profits, they will want to stick around. It's the stable version of startups.
Sorry to spin off topic, but perhaps having the chance to speak to someone that's worked on SSD firmware for the first time...
is there any feasible way to recover data after a TRIM command has been issued that you can think of? Is there any way to trick the firmware into not returning 0's when reading the blocks of a deleted file? Mostly interested in doing so for Apple
TRIM destroying the entire data recovery and forensics market seems like such a big deal, I still can't believe it although it started years ago
Are you saying that TLC SSDs are essentially unreliable and it is a miracle we don't see higher failure rates?
Regarding your point about languages: I would be interested and motivated, but I probably don't have the skill (probably; I've only done some x86 ASM/C++) nor the location (Europe). Usually someone doesn't start with C++ but with a managed language, and once they land a job they become demotivated or just don't have enough time.
It's not any harder to recruit for that role than any other bare metal, embedded position where new grads are expected to be top notch programmers who know how to figure out the algorithms, write the code, debug the hardware, and bring decades of experience. IOW you're recruiting for unicorns and the pickings are slim.
> Another problem is that the job while rewarding is not very lucrative.
Why is this a problem?
Let's not forget, we live in a capitalistic society. The job of the capitalists is to exploit labor. The cost of labor is a direct result of the market demand (or not) for a more complex or, by your point, a more robust product.
It need not simply be a market reaction, though. A vendor can create the market for a more complex/robust product. But still, their job is to exploit the available labor so if the people capable of such are available at a lower cost, so be it.
> I have a higher chance working on a web connected gadget to become a millionaire.
As others have noted, perhaps true but the numbers are so small it may as well also be zero. However, as you should now be pointedly aware, the perception doesn't match reality.
Question: do new grads (or let's say up to 3 years experience) working on SSD firmware, making $90k, have to clock in 60+ hours / week and live in insanely high cost regions where their $120k startup salary qualifies them for housing assistance programs and they have to live with at least one if not more roommates? Or is $90k (per another comment) quite good in relation to the total picture?
IOW, my guess is that you are not competing for talent on salary or total comp. My guess is you are competing on general industry attractiveness. The entire mindset around embedded vs web/consumer programming is different. So I think to frame it as a compensation problem is wrong, and by misframing it you will never "solve" it.
Not that spinning HDDs are really any different, but SSDs are a perfect example of an entire computer that you attach to yours and speak with through one of the (many) storage-oriented protocols. The device itself is a black box, and complex transformations take place between the physical persistence of the data and the logical structures that are exchanged on the wire. There are many layers of indirection, and many things that can go wrong, from a fault in the underlying physical storage, to a physical fault in the controller, to a logical (software) condition in the controller that puts it in an unrecoverable state.
Spinning platter drives have parts that form a more relatable metaphor for humans' notions of wear and tear: skates of magnetic readers flying on a cushion of air above a rapidly rotating disc, with a separating gap of a few dozen nanometers, often smaller than the process size in the controller's silicon. They have arms that can move the head over a particular disc radius, and a motor that spins the entire stack of platters. These mechanical components exhibit wear proportional to their use -- this makes intuitive sense, and is also recorded in the SMART attributes, so drives of advanced age and many park cycles can be replaced preemptively before they catastrophically fail.
SSDs are missing many of the usual mechanisms that would contribute to physical wear leading to sudden catastrophic failure in advanced age. This means that irrespective of their failure rate vs. HDDs, a higher proportion of their catastrophic failures are the fault of the controller. This is discouraging: essentially, the "storage layer" is now quite reliable, so the fallibility of the human-programmed controller is brought to light.
An SSD's flash has plenty of wear and tear. It just takes a different form from what happens to mechanical products. It's more like a piece of metal that gets scratched with use, and will rust faster or slower depending on the number of scratches.
What we have is that the software is currently less reliable than the memory. There is no fundamental reason for that; it's just that manufacturers put a huge amount of engineering work into reducing the wear, and not so much into programming practices.
> skates of magnetic readers flying on a cushion of air above a rapidly rotating disc, with the gap separating a few dozen nanometers, often smaller than the process size in the controller's silicon.
Complete aside, the fly-height of a magnetic head is actually fractions of a nanometer (i.e. hundreds of picometers).
EDIT: I got this from a talk by Bryan Cantrill[1]. The fly-height is allegedly 0.8 nanometers (800 picometers).
1. I can't find a source that says less than a few nanometers.
2. 300 picometers is roughly the diameter of a helium diatom. The head cannot possibly float through hydrodynamic means if an air molecule can barely even fit under it.
> The head cannot possibly float through hydrodynamic means if an air molecule can barely even fit under it.
It can. Since siblings liked airplane analogies, here is another one: Consider the head to be an airplane. It has somewhat wing-similar features which provide a lifting force, but the actual read/write head sits below those features (like, say, a landing gear is below wings).
You can at least fit several layers of iron atoms into that gap, assuming they're part of the crystal lattice. So the surface does not have to be perfectly smooth.
This seems implausible at first glance. Even making objects flat at the nanometer scale is quite difficult. When you also introduce movement tolerances (e.g. rotation disk is not perfectly aligned), this seems quite extraordinary.
(Sheepishly) My only reference is a talk by Bryan Cantrill[1] where he states that an exec at a hard-drive manufacturer stated it was ".8 nanometers". I will try to find a better source than that.
EDIT: There is a paper from 2016 which did an analysis in the difference in the flying height of the head during different operations, and it was measured in Angstroms[2]. I couldn't find one that actually gives a precise value of the flying height. There is a 2011 paper which states that some system they were testing allowed for 4-9 nm flying heights[3] which is about half-an-order-of-magnitude larger than the claim -- but that's already 7 years old.
Disk platters rotating at 7200 rpm are both perfectly smooth and perfectly aligned.
The read head is something like a jumbo jet flying a handful of feet above the (perfectly smooth) ground. It's really crazy how close these things are, moving very fast. And why accelerometers are a significant feature.[0]
> Can you provide some references?
Wikipedia claims[1]:
> In 2011, the flying height in modern drives was a few nanometers.
and
> The "flying height" is constantly decreasing to enable higher areal density.
> At 7,200 RPM, the edge of the platter is traveling at over 120 kilometres per hour (75 mph)
So you've got a read head flying at 120 km/h => 33 m/s => 33,000,000,000 nm/s at a height of 3nm or less. Picture that!
E.g, a 757 typically cruises at 858 km/h => 238 m/s. So picture your 757 flying at an altitude of 21 nm and that's the metaphor, kinda. The read head is a bit smaller than 1/7 of a 757 jet, obviously.
The claim was hundreds of picometers, which is an order of magnitude smaller than a few nanometers. Literally nothing can be perfectly flat, nor perfectly aligned.
It doesn't need to be. The basic idea is that the fly height is self-regulating; if the head goes away from the platter, its "lift" is reduced, so the springiness of the arm forces it back to the platter. If it moves closer to the platter, lift is increased, so it moves away. Similarly the tracks don't have to be perfectly round or concentric, because the lowest-level head control system in the drive actively tracks the head's current disk track on the surface; head movements aren't a "rotate 11.57821° to track 28139123", but rather "track 28139123 is around here somewhere, lets find it".
Yes, I understand how feedback control systems work. The question was whether they really have the fidelity to do it at hundred picometer resolution.
I am dubious because this is a pretty incredible feat. This is the length scale at which atom-atom interactions become important. That implies that the crystal lattice structure of both the read/write head and the underlying platter will affect the dynamics of the system!
Nope. For a long time now, tracks have been virtual (an array of raw track IDs as a function of sector ID when the arm is held "constant", averaged over X thousand RPMs as part of the per-drive firmware calibration at the factory).
How does the second sentence produce the conclusion "nope?" If the arm is held "constant" and the platter is spinning at 7200rpm, it's going to trace something very close to a perfect circle.
The 3nm figure is from 2011. 0.9nm, or 900 picometers, is maybe not an unreasonable progression from 3nm in 7 years. And — it's perfectly flat relative to everything else involved.
I don't really agree with most of this comment. HDDs are also inscrutable black boxes; many of their failures are controller, rather than media losses; and SSDs also report SMART attributes that are predictive of failure. It's certainly possible HDD vendors have done a more successful job of convincing buyers that failures are attributable to the media rather than the controller, but utilizing the media fully with shingled recording and HAMR and all that jazz really requires a similar degree of controller complexity as an SSD FTL controller.
> These mechanical components exhibit wear proportional to their use
Actually the spindle isn't touching anything any more, because the spindle/rotor (one part) is supported by a fluid bearing; it basically floats on a thin layer of oil. If the spindle touches the bearing at essentially any speed that isn't zero, the bearing surfaces are damaged instantly and the resulting burrs and debris will degrade and lock up the bearing very quickly.
I believe the only rolling-element/contact bearing used in modern disks is the pivot bearing of the arm assembly.
> but SSDs are a perfect example of an entire computer that you attach to yours, and speak with through one of the (many) storage-oriented protocols.
These days most of what we call a computer could be described this way. Even your compiled machine language is ultimately far more abstracted from what the processor actually does than it was on, say, a 6502.
> Even your compiled machine language is ultimately far more abstracted from what the processor actually does than it was on, say, a 6502
Funny that you bring up 6502; that reminds me of 1541 disk drive for C64, which had mostly same 6502 as the host computer (albeit running at slower speed).
Most disk drives of the time were like that. One of the reasons the Apple II disk drive was so affordable is that Woz just used the Apple II's own 6502 to handle the grunt work, with their disk drive being little more than a drive mechanism, a PROM, and some ICs. Since this is The Woz we're talking about, he went ahead and broke with conventional encoding while he was at it and instead implemented a scheme which allowed for a few extra sectors per track.
It might be an interesting exercise to see how many peripherals are connected to your PC right now that have much more computing power than a 1 MHz 6502.
The Apple ][ disk encoding was not done to allow "for a few extra sectors per track". Since the Apple ][ did not have a disk controller, the encoding from magnetic flux to bits was handled in software. A 1 MHz 6502 cannot handle the "standard" encodings (things like M2FM) in software while the disc is rotating. Woz's encoding made it feasible to do in software.
The 6502 in the 1541 runs at 1MHz, that is indeed a bit slower than the NTSC version C64 CPU (1.023MHz), but a bit faster than the PAL version CPU (0.985MHz)
This reminds me of a story that a lecturer told us at University. That at one point they distributed computations to the drive controllers of the connected (fridge sized) disk drives because they could do so much processing while waiting for the platters to spin into place.
So the whole story of a disk being a computer has been true for a long time.
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"
Why shouldn't it? Isn't it just hardware too?
"With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that"
Why can't you do the same with SSDs?
It feels like the author's main complaint is the frustration of not understanding SSD hardware as well.
Is this a valid complaint? Are SSDs magical in some way? I'm not an expert but... It's just hardware with pieces that do stuff. Why can't we come up with an understanding of why it fails?
"It feels like the author's main complaint is the frustration of not understanding SSD hardware as well."
What is so frustrating about SSDs is how very poorly they compare to previous incarnations of solid state storage.
Using Disk-On-Chip and/or IDE-pin-compatible CF cards, I had many, many devices in the field that lasted, mounted read-only, for decades. An entire segment of the computing industry came to rely on these parts as alternatives to spinning media that could not mechanically fail.
This is not the case with SSDs at all. They fail left and right, even mounted read-only, for all manner of complicated and interesting reasons. It's very frustrating that SSDs are not a step forward in reliability from spinning media and are a step downward compared to (for instance) a 16MB consumer CF card from Sandisk, circa 2000.
rsync.net filers, which need a boot mirror, are always constructed with two unrelated SSDs - usually one Intel part and one Samsung part - so that when the inevitable usage-related failure occurs, it does not occur simultaneously to both members of the mirror which have, being a mirror, been subjected to identical usage-lives.[1]
We shouldn't have to do that.
[1] I can't overstate this - if you need a RAID mirror, do not use identical SSDs for the two members of the mirror. There are many, many cases of SSDs failing not due to "wear" or end-of-life, but due to weird usage edge cases that cause them to puke ... and in a mirror, you give the two parts identical usage ... we either get two different generations of Intel part (current gen and just-previous gen) or we get current Intel and current Samsung ...
Every failure of an SSD feels like an exceptional event. Some harbinger of doom that needs to be shouted from the rooftops. The prions of storage.
But magnetic hard drives failed all the time. I have a giant stack in my office closet just from my dev machines over the years. But it wasn't new and scary -- it was just a hard drive failing -- so it was just normal. Some had controllers fail, suddenly blinking out of existence. Another had cache memory corrupt so it just gave ridiculous readings occasionally. Others had physical failures.
I don't know where to begin relative to prior flash memory (e.g. CF cards) which were absolutely notorious trash.
It is worth noting that every smartphone the world over has an "SSD" in it. We spend remarkably little of our mental power concerned about the flash storage. It is the cause of a negligible amount of device failures.
compared to (for instance) a 16MB consumer CF card from Sandisk, circa 2000.
That would almost certainly be small-block SLC flash rated for 100K program/erase cycles and 10 years of retention. The huge-block TLC now is ~1K program/erase cycles and 2-3 years of retention depending on where you look (the manufacturers are, not surprisingly, quite reluctant to release these specs...)
I've heard very similar advice for non-SSD mirrors too. Use different manufacturers or, at the very least, use different batches of disks from the same manufacturer.
Most people use different batches but the same manufacturer, mostly because they remember advice from back when you could get a hardware raid card that would synchronize your scsi drives.
Also, though, because most serious RAIDs contain more drives than you can find manufacturers of hard drives.
On modern MLC/TLC SSDs, read-only mode doesn't really exist. The NAND blocks must be re-programmed after a number of read accesses to mitigate read disturb. If anything, mounting read-only is probably a corner case, stressing the firmware's read disturb mitigation.
The tradeoff from worse read disturb characteristics is NAND that is 100x cheaper per GB than in 2000.
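For anyone curious what the mitigation looks like in practice, here is a bare-bones sketch of the bookkeeping involved. The numbers and the relocation hook are invented; real firmware tracks this per block (or block group) with thresholds taken from the NAND vendor's characterization data.

    #include <stdint.h>

    /* Illustrative only: per-block read counters for read-disturb handling. */
    #define NUM_BLOCKS        1024u
    #define READ_DISTURB_MAX  100000u  /* made-up threshold */

    static uint32_t read_count[NUM_BLOCKS];

    /* Assumed to exist elsewhere in the FTL: copy the still-valid pages of
     * `blk` to a freshly erased block and return `blk` to the free pool. */
    void relocate_block(uint32_t blk);

    void on_host_read(uint32_t blk)
    {
        if (++read_count[blk] >= READ_DISTURB_MAX) {
            /* The data itself is fine, but its neighbours have been disturbed
             * by enough reads that the block is rewritten as a precaution.
             * This is why even a purely read-only workload still causes
             * background program/erase activity. */
            relocate_block(blk);
            read_count[blk] = 0;
        }
    }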
Do you have any more details on the re-programming that would be occurring on non SLC flash cells, even if mounted in read only mode? This is something I was always concerned about too.
I don't know about that. I know most managers in embedded software go cheap on engineers and software assurance on purpose to get bonuses and such. The hardware side makes me think you haven't studied deep, sub-micron hardware much. I started looking into it a few years ago or so, just reading the slides on lots of stuff even though not understanding much of it. They helpfully put a lot in lay terms, though, with lots of comparisons. To say the stuff gets harder every time you shrink to a smaller node is an understatement, esp for solid state.
If anything, modern flash should probably be considered broken right as it ships out of the factory. If not, the process nodes after 90nm or so just keep adding more and more ways for individual components to screw up or change behavior across the same wafer. Some happen instantly by design. Some happen later with aging. The memory technologies are closer to the analog level of things than most, with harder verification. The high-density flash on newer nodes uses less-reliable tech than most just to operate at that low cost. So, they add all kinds of firmware and hardware tricks to try to make it work for a period of time like it's a whole, functional unit of storage despite pieces of it misbehaving all throughout.
It's a nice, man-made miracle these techs even work at all. Those that last longer like you mentioned still exist. I'll add a comment with links to one type so you can compare price/storage/performance to these broken-by-design SSD's you use. I'll throw in another two that mention shrinking challenges so you can see what they face every time they have to upgrade or just deploy new designs in mixed-signal.
Yeah agreed 100%, for my RAID mirror setup I use drives from distinct manufacturers for that exact reason -- I can presumably expect a different failure rate (hopefully). :)
Well, to be fair, there is an entire layer of abstraction at the SSD controller level that does tons of black-box magic. It allows the OS to treat the SSD like any other storage device without letting the OS know what is going on.
So the combination of non-moving parts (making it hard/impossible to debug via physical inspection) combined with tons of wear leveling and miscellaneous magic can definitely make it seem like SSDs are magical.
Is there a good reason why we use separate SSD controllers instead of letting the primary cpu handle it? The obvious reason is backwards compatibility, but as more of computing moves to SSDs, is this still relevant?
ZFS has shown that removing layers of abstraction with regard to storage can be beneficial.
That adds a round of latency, and makes it pretty much impossible to boot off the drive. The blocks aren't in the same order in the Flash as they are presented by its interface, and one of the main jobs of the controller is to re-order them.
It would be an interesting product to have, a raw block API to a Flash device with all the temporary state stored on the host - but a hard one to sell, as it's not differentiated in any way.
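As a rough idea of what such a product would push onto the host, here is a toy sketch of the logical-to-physical remapping the controller normally hides. All names, sizes and primitives are invented; it's closer in spirit to what "open-channel" style designs expose than to any shipping drive.

    #include <stdint.h>

    /* Toy host-managed flash translation: the host keeps the logical-to-
     * physical page map and the device only exposes raw program/read. */
    #define LOGICAL_PAGES  (1u << 20)
    #define INVALID_PPA    0xFFFFFFFFu

    static uint32_t l2p[LOGICAL_PAGES];  /* logical page -> physical page */
    static uint32_t next_free_ppa;       /* naive append-only allocator   */

    /* Raw primitives the hypothetical drive would expose. */
    int raw_program(uint32_t ppa, const void *data, uint32_t len);
    int raw_read(uint32_t ppa, void *data, uint32_t len);

    void host_mount(void)
    {
        for (uint32_t i = 0; i < LOGICAL_PAGES; i++)
            l2p[i] = INVALID_PPA;        /* nothing mapped yet */
    }

    int host_write(uint32_t lpa, const void *data, uint32_t len)
    {
        uint32_t ppa = next_free_ppa++;  /* NAND pages can't be overwritten
                                            in place, so every write lands
                                            somewhere new */
        int rc = raw_program(ppa, data, len);
        if (rc == 0)
            l2p[lpa] = ppa;              /* the old physical page is now
                                            garbage; a real FTL has to
                                            reclaim it later */
        return rc;
    }

    int host_read(uint32_t lpa, void *data, uint32_t len)
    {
        uint32_t ppa = l2p[lpa];
        return (ppa == INVALID_PPA) ? -1 : raw_read(ppa, data, len);
    }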
The controller makes it possible to get a standard bus and a pre-installed driver and use them to access any kind of memory from any manufacturer and any technology. It's the kind of convenience that makes people buy hardware - it's the kind of thing that made SATA and USB win. The alternative is that once in a while you plug a drive into your computer and it won't work.
Besides, I don't think manufacturers want to release the best practices for using their memory.
I wonder if you can read any extra state out of the T2 as a result, e.g. more information on wear leveling, temporary read failures and so on, more than the standard SMART counters?
Thanks for this YouTube link. In the case of recovering deleted files from SSDs that have the TRIM command enabled (a forensics write blocker was used), the following drives have a low probability of recoverability: Crucial, Intel, and Samsung (3-core controller). Whereas with Seagate, SuperTalent (parallel ATA to SATA bridge chip), OCZ, and Patriot, files could be recovered. If the drive is quick-formatted and TRIM is enabled, the data is completely gone on the following drives: Crucial, Intel and Samsung. The TRIM state has the biggest impact on whether or not the data can be recovered. You can check if TRIM is enabled with the command: fsutil behavior query DisableDeleteNotify - if your result is 0, it is enabled (default).
Because they don't die incrementally. With a hard disk you'll get bad sectors, growing slowly over time. Or a head crash, and then it's all dead.
What could cause an entire SSD to die at once? I would totally understand bad sectors, but the whole thing at once? Where it doesn't even try to read existing data?
One potential cause is poorly constructed drive firmware which fails to account for a minor failure of some kind and crashes. If that unaccounted minor failure is persistent, the firmware may crash constantly and then you'd be unable to access even theoretically good parts of the drive.
Firmware -- undebuggable, unobservable, unfixable software, jammed into your devices -- is the enemy.
Sounds like there is a business opportunity here for open firmware, or even an FTL running on the host CPU, especially for the enterprise? But maybe they won't bother either, despite valuable data and uptime, and would rather just throw more redundant SSDs at the problem.
A component failing? Electrical components fail. Sometimes it's a manufacturing defect, sometimes a design defect, sometimes something under or over-volted and it was enough to cause damage to any given component.
Could be an IC, could be a capacitor, could be a poorly laid trace. A poorly shielded RF source could even damage any number of components.
I mean, in theory a single high charge particle from that rare cosmic ray that reaches the surface of earth running full-steam-ahead through an IC could cause just the right amount of damage to make it fail although this would be an incredibly improbable scenario.
Same goes for HDDs, televisions, your clock radio, whatever.
There's a single wire in all computers, for some reason they always just have to include it, which if severed will completely disable the computer, as in, it won't even turn on, not even try to turn on.
That wire is the one delivering power. Point being, while things are so complicated that you often can have a lot going on and still try things, there are still single points of failure that can never be fully covered.
"Why shouldn't it? Isn't it just hardware too?"
In a mechanical hard drive, there are moving parts which can wear out due to friction, etc.
SSDs are solid-state, so it seems like at least theoretically, it should be possible to build one that keeps working for decades. e.g. I have solid-state hardware from the 70s and 80s which still functions.
I've always been a little mystified as to why SSDs' data areas wear out for that reason,[1] but that's a whole separate issue. The author of the article is writing about sudden failure of the device as a whole.
There are only two explanations I've ever heard for short lifetimes in electronics as a general industry.[2]
For devices manufactured after about 2000 there's tin whiskers,[3] which began to be a problem because of RoHS requirements. I'm not sure if that applies here, though.
If the device includes electrolytic capacitors, my understanding is that those generally have a finite lifetime as well, and it can be fairly short if they're poorly-made.
I'm not a hardware expert, though, so I'd be interested in hearing about other factors, and I'm sure the author of the article would too.
[1] I've seen lots of writeups of how wear-leveling works, etc., but never a good physical explanation of what is actually wearing out over time.
[2] Obviously there are other factors for specific devices, or specific designs. E.g. maybe parts of a specific device break over time due to thermal expansion and contraction if the device wasn't engineered to handle that properly.
> I've seen lots of writeups of how wear-leveling works, etc., but never a good physical explanation of what is actually wearing out over time.
SSD flash is basically EEPROM. In an EEPROM one bit is stored in a dual-gate MOSFET. One gate is a normal gate, the other is floating, i.e. it is just a small conductive island. The information is stored by (quite literally) shooting electrons through the insulation of the floating gate into it. They're then trapped on the gate; if you turn the second gate on, the transistor conducts iff the floating gate is also turned on. This shooting action happens to damage the insulation, which at some point is degraded enough that it can't keep the electrons trapped on the floating gate. Hence the electrons leave, together with your information.
> For devices manufactured after about 2000 there's tin whiskers,[3] which began to be a problem because of RoHS requirements.
The solder joints themselves usually don't form whiskers, which mostly grow on pure tin-plated surfaces, e.g. the pins of components (which previously used lead). Using lead-free solder is mostly co-incident with whisker risk, not the cause of the majority of problems.
The electrons aren't shot through the insulation. They end up in the floating gate by quantum tunneling. The insulation by its name is non-conductive, and must be so that the electrons stay put. The degradation happens through general thermal wear.
The exact wear mechanism doesn't really matter (you must be referring to "hot carrier injection") --- the point is that to record data, electrons are being forced through a material which gradually wears it out.
I just looked this up, and it turns out I was completely wrong about how the wear is caused. Apparently it’s due to a build-up of electrons remaining in the floating gate over time, and not general thermal wear. It turns out that heating can even reverse this wear. Explained in this link: https://arstechnica.com/science/2012/11/nand-flash-gets-bake...
On 1): It's because each cell is essentially a consumable with a limited number of state transfers. It's similar in that way to how a solid-state accelerometer still has a moving part inside it that can break or wear out over time.
I understand that there is something that makes the cells stop working over time. blattimwind's reply is the first actual explanation I've ever seen of what that something might be.
I've never seen or heard of an automated board rework system. Thinking about the steps and things I've had to do to manually rework boards, your machine would not only need to be able to apply force to pull parts off of boards without damaging the board, but also be prepared to restore pads / through-holes to a usable state after desoldering parts, before new ones could go back in.
There's a reason companies don't repair circuit boards in consumer electronics.
I've done it myself, save restoring pads -- to me it's just the sort of fiddly, finicky thing that it seems like robots should be good at. And, they're not going to burn their fingers or run out of hands to hold things!
Robots are good at repetitive tasks and economy of scale. Doing a tricky action for ten thousand boards is something that robots are good at; doing a different tricky action on each of a hundred boards is not.
It doesn't seem plausible to have a business case where you'd be able to get a large quantity of identical boards (that are otherwise good!), replace caps on them, and sell them for much more than you got them for - i.e. where the boards haven't become obsolete in that time. If there's no mass production, there's not much use for automation.
The post resonated with me because of a stupid bug I hit in a ca. 2011 SSD (Samsung?). After 100 power-on cycles, the drive would brick itself. It required a firmware update, applied in time, to avoid.
That defect doesn't strike me as being inherently related to SSD media, but really left a bad taste in my mouth with what might be going on in the development process to lead to such instability.
I think the difference is familiarity. We've been using hard drives for decades, so we kind of know what to expect. SSDs have only been widely used for 5-10 years, so we're all still getting used to how they work and how they die.
A major problem with SSDs seems to be “firmware death” - where the flash chips are physically fine (or mostly fine), but the firmware (or firmware memory) has gotten corrupted due to some programming error, electrical glitch, or cosmic ray. I’ve had scores of older SSDs die after things like power outages and sudden shutdown events. This is super frustrating because the data is physically OK but the controller just isn’t responding to any requests anymore.
I wonder if there’s an easy way to distinguish a controller failure from a flash failure from the behavior of the device over the last few seconds/minutes of operation. In theory a controller failure should cause a fairly abrupt loss of service, but I’m sure there are soft lockup failure modes too.
I have seen some weird issues with SSDs. I had an OCZ Vertex 2 die on me multiple times, but one thing that stood out most is that after a power cycle or complete system shutdown (note: reboots were just fine), everything I had done in the last session - install software, update Windows, create files - was gone. The state was reverted to what it was before that boot. It was like my computer contained some kind of Reborn chip, except it was the SandForce controller malfunctioning.
Modern SSDs have incredibly large caches. For example, the HP EX920, which is a TLC (triple-level cell) SSD, in its 1TB model contains a whopping 200GB of SLC (single-level cell) cache. It's entirely possible that your changes were simply in cache only, and hadn't been committed to the actual storage.
I don't know if I'm getting exactly what you're saying, but a sudden power off isn't going to wipe your SSD's SLC cache like it would do the DRAM cache on more expensive drives.
It's probably more like the superblock not being updated to point to the newest data. Instead it points to old data that hasn't been garbage collected.
The SLC cache is managed by the controller; it's not going to forget to write it to the flash cells being managed as TLC/QLC, since the controller is transparently storing your data in both. The controller knows it put some data in the SLC cache and some data in the QLC, that that data needs to move from SLC to QLC, and that that other data in the SLC was marked deleted, so now it can return that entire block to being addressed as QLC.
I don't know. I'm just trying to say that since flash isn't volatile the SLC cache isn't treated as volatile cache either.
Like many early adopters, I too had a bunch of failed Vertex 2 drives and sometimes observed similar things. I think this might be because the drive lost some updates to its FTL about where it wrote new data, which could plausibly lead to both new files vanishing and changes to existing ones being apparently undone.
Yes, the Sandforce controller in Vertex 2 was a real unstable beast. I lost maybe three or four Vertex 2 drives. All under warranty, except the last one which suddenly vanished and was not detected anymore.
OCZ does not exist anymore. Not sure if this was one of the causes, but either way I would never buy a drive from them again.
but the firmware (or firmware memory) has gotten corrupted due to some programming error, electrical glitch, or cosmic ray.
A lot of SSDs also store their main firmware in the same flash that is used to hold user data... this is something which was done with hard drives too (hence why dead/dying HDDs sometimes show up as a small drive with a weird name --- that's the "recovery mode").
A new controller is probably worthless to you, even if the old one wasn't storing the data encrypted. You don't have the map between logical sectors and locations in flash.
Your data is probably jumbled up one way or the other, but at least a custom board will let you read all of the underlying flash, instead of just the portions of it that a new controller would believe are in use. (Keep in mind that SSDs have more physical blocks than they advertise logically.)
If that's the case, I wonder if controller failure could be prevented by ECC controller memory? Or would software failure recovery be sufficient to make a highly reliable controller?
With drives you used to be able to rip off a controller from an identical model and wire it back on to read the data.
With SSD or NVMe the controller isn't really a separate component you can just replace. Maybe it's possible to saw off the broken part and bodge-wire it to a working surrogate, but that would be extreme.
A tear-down of a broken SSD might reveal more about what could be done.
FWIW I switched entirely to using Samsung’s SSDs and haven’t had any issues. They seem to be above average for firmware quality, but I don’t have statistics to back that feeling up.
Sometimes the dead SSDs will respond to a handful of commands anyway, meaning that you can attempt a firmware upgrade or reset. That’s the lucky case. But in the majority of cases I’ve seen, the firmware/controller goes into some hard lockup where it no longer processes SATA commands at all. I wish these things had JTAG ports...
This is not a technological problem, it's a cultural one. These problems are easily fixed ("easily" by the standards of technical problems that regularly get fixed in other regimes). The reason they don't get fixed is that the customer reaction to failures like this is to rant at the mysterious storage gods that are making their lives miserable.
Needless to say, there are no mysterious storage gods. These are artifacts made by humans, and somewhere out there, there is an engineer who either understands why these failures are happening, or knows how to engineer these devices in such a way that when these failures happen, the cause can be determined, and then a design iteration can be done to reduce the failure rate and make the failure modes more robust. The reason this doesn't happen is that customers aren't demanding it. If major purchasers started demanding, essentially, an SLA from their SSD manufacturers, with actual financial consequences for violating it, you would be amazed how fast all of these problems would get fixed. But instead we vent our frustrations in blog posts and HN comments :-(
Physical devices don't have SLAs, because they aren't services. SSDs have the physical equivalent of an SLA, a warranty. I haven't seen stats, but in my experience the odds of an SSD dying within the warranty period is very very low.
If you want your storage to have an SLA, storage service providers exist, and will be happy to give you an SLA if you're willing to pay. But it isn't cheap.
Yes, I know that. That's why I said "essentially".
> SSDs have the physical equivalent of an SLA, a warranty.
There are two orthogonal issues. The first is what happens when a device fails. A warranty addresses that. The second is how does it fail. Does it fail all at once with no warning, no way to perform post-mortem diagnostics, and no way to recover the data? Or does it fail with a gradual degradation of performance and capacity over time, and in a way that, if/when total failure occurs, the cause can be ascertained and the data still recovered somehow?
>Does it fail all at once with no warning, no way to perform post-mortem diagnostics, and no way to recover the data? Or does it fail with a gradual degradation of performance and capacity over time, and in a way that, if/when total failure occurs, the cause can be ascertained and the data still recovered somehow?
how does that have any relation to an SLA? An SLA is a promise of a certain amount of uptime, with financial penalty for the provider if not met. A warranty is a promise of a certain product lifespan, with a financial penalty for the provider if not met.
SLAs have nothing to do with providing diagnostic information.
A warranty is like an SLA that only refunds you the percentage of time the service was down. It's not nothing, but it's extremely weak. They're more worried about annoying you than the pennies of lost profit, and your payout isn't nearly enough to make up for the trouble caused.
I've experienced a few seriously strange issues with modern SSDs, even some of the better ones.
I had a 512GB Samsung drive that became very slow randomly at doing IO operations, the whole machine would die for 10-30 seconds at a time once or twice a day while any process that tried to use the disk became blocked on IO. Then it'd come right back like everything was perfectly fine.
Issues like this definitely worry me, we're basically completely blind as to what those controllers and flash chips are actually doing. Not that it wasn't a similar situation with HDD controllers before, but at least it didn't seem as unpredictable.
I've recently noticed similar issues with my older Crucial SSD in my 2012 mbp.
In the past 3 months, there have been maybe 5 times where I started my laptop and it took 5+ minutes to finish booting. Normally, it's 15 seconds. Once I'm logged in, doing anything takes forever, but it does eventually load. Powering it off and back on got it "working" again, but who knows for how long.
I've had this drive for 4-5 years now, so I'm impressed it's lasted this long.
If you're on Linux, try running fstrim from time to time. More known-free space makes life much easier for the SSD's garbage collector / defragmenter. Anecdotally, running fstrim on my Toshiba drive reduced freeze-ups under heavy load from 1-2s to almost nothing.
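In case anyone wants to script it themselves, here's a rough sketch of what a periodic trim could look like (Python; assumes Linux with util-linux's fstrim available and root privileges, and the mount points are just placeholders). On most systemd distros simply enabling the stock fstrim.timer gets you the same thing.

    # Rough sketch only: trim mounted filesystems on a schedule so the SSD's
    # garbage collector has more known-free space to work with.
    # Assumes Linux, util-linux's fstrim on PATH, and root privileges.
    import subprocess
    import time

    MOUNTPOINTS = ["/", "/home"]        # placeholders, adjust to your layout
    INTERVAL = 7 * 24 * 60 * 60         # weekly, roughly what fstrim.timer does

    while True:
        for mp in MOUNTPOINTS:
            # -v reports how many bytes were trimmed, handy for a log
            result = subprocess.run(["fstrim", "-v", mp],
                                    capture_output=True, text=True)
            print((result.stdout or result.stderr).strip())
        time.sleep(INTERVAL)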
I worked at a storage company, and they reinforced to us that not only does the OS lie to us, the hard drives also lie to the OS. So you can't take anything you get from a hard drive as reliable; you have to verify the data once you get it, e.g. via a CRC. Data can get corrupted at any time.
As densities of data get higher and higher, it doesn't take much to have a catastrophic data failure. The only way to protect against this is having multiple replicas of your data.
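As a toy illustration of that kind of end-to-end check (not any particular vendor's scheme; the block size and function names here are made up): keep a checksum per block at write time, and refuse to trust a read that doesn't match.

    # Toy end-to-end integrity check: store a CRC per block at write time and
    # verify it on every read. Illustrative only; real storage stacks use
    # stronger checksums and keep them away from the data they protect.
    import zlib

    BLOCK_SIZE = 4096  # made-up block size

    def write_block(f, checksums, index, data):
        assert len(data) == BLOCK_SIZE
        checksums[index] = zlib.crc32(data)
        f.seek(index * BLOCK_SIZE)
        f.write(data)

    def read_block(f, checksums, index):
        f.seek(index * BLOCK_SIZE)
        data = f.read(BLOCK_SIZE)
        if zlib.crc32(data) != checksums.get(index):
            raise IOError(f"block {index}: checksum mismatch, corrupted somewhere on the path")
        return data

The interesting design question is where the checksums live: if they sit on the same device as the data they protect, a lying drive can corrupt both at once, which is part of why replicas matter.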
"We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence."
We've experienced exactly the same thing. Our general course of action is to perform a hard power cycle of the server through IPMI - a warm cycle doesn't seem to work. I've always presumed it was down to dodgy SSD controller firmware given the way it suddenly stops appearing in the output of fdisk -l.
I have three SSDs in three different laptops/desktops. In their current host machines they've been working flawlessly for a couple of years. Prior to my figuring out which SSD paired best with which host machine, I experienced intermittent strange and catastrophic problems (unreadable sectors to complete data loss) with each one. These were different brands, different capacities, bought in different years.
It's sort of a devil's bargain - the performance of SSDs is so much better that I can't pass up using it over a spinning disk even if they occasionally lose everything. There was a great game for the original Nintendo called "Pinball Quest". As you advanced through the game you could get upgrades such as side stoppers, stronger flippers, etc. You bought these items from a demon in between levels. After the red "Strong Flippers", the next upgrade was the purple "Devil's Flippers". The trick was that occasionally they'd turn to stone when you needed them and possibly cause you to lose the pinball. But they were such an upgrade over the Strong Flippers (when they weren't turned to stone) that you bought them anyway.
It also doesn't help that Windows 10, at least, now seems "optimized for SSD" in the sense that performance is quite terrible on a traditional HDD. I imagine this will become more and more common as seek times and hard-drive thrashing become practically invisible to users as well as developers. It will only get harder to go back as time goes on.
I just don't keep anything important on my SSD. My desktop's SSD is for Windows and games. All documents and other stuff goes on my mechanical drives and my AppData folder is backed up every night too. Everything on my laptop's SSD is either in cloud storage or in an external git repo. I'm 100% prepared for the certain eventuality of any of these SSDs going tits up unexpectedly and catastrophically.
I'm more worried about everybody else who gets an SSD and doesn't take the right precautions, because everybody sells SSDs as being so much more reliable than mechanical drives.
I worked as a PC technician for a while recently. Of the handful of catastrophic mechanical-drive failures we saw, the majority were drives that had been physically dropped, resulting in a head crash. Otherwise, we almost always managed to save data from failing drives. Any failing SSD we encountered was simply dead, since there are really only two states: fine or failed. There was nothing we could do except refer people to a data recovery company that charges thousands of euros.
I've had a few SSDs give me random issues, and they're so hard to pin down: sometimes they just work, other times they abruptly stop, or they aren't detected until something like three reboots later and then work fine. Once you've had trouble, they make you feel like you're hanging onto the hope that the ground won't fall out from under you.
You also CAN'T HEAR when there's an issue, whereas with a hard drive the sound is another warning sign that something is going wrong or soon will. Loud ticking or clicking, or a drive that sounds like it's working overtime, is a sure sign to start backing up and get ready to buy a new drive!
Call me crazy, but I don't think that a Crucial MX300 is the best choice for an enterprise worthy ZFS drive. I get what the author is concerned about, but I wouldn't be that surprised that a consumer level SSD failed in what sounds like a heavily used fileserver.
Modern takes on RAID are built in to home OSs these days, and consumer grade NAS devices are pretty common. It really isn't a fancy enterprise-only technology at all.
Very true. The slack reserve space, RAM buffer (and/or battery), and NAND process are the main things that make an enterprise drive.
The author doesn't specify how much he writes, but based on the MX300 specs I can find, the drives are rated for up to 219 GB/day for the 2TB models he uses, or 87 GB/day for the couple of 525GB drives he still had.
I don't know why my hard drives died either. A physical motor breaking is more tangible, but a contact wearing out is also imaginable. I don't really care why SSDs or HDDs die; I care that they do, and therefore I have backups (well, ideally I would). I've had spinning rust fail on me while I was sitting right at the machine, and that didn't help me save it; it might as well have been dead in zero seconds.
I don't really care why hard drives die either, but I like that, more often than not, I get some warning. SMART logs, or weird kernel complaints, in my experience, are frequent precursors.
I'm a little scared about my new SSDs that have replaced a few rust-spinners in our data center.
That's the big difference for me. Drive's making tictictick sounds? Kernel log full of bigScaryErrorsLikeThis? It's time to ditch that disk before the disk ditches you. Make it happen. Panic-Backup if you need to. etc.
Every SSD failure I've had, the failure mode was "what SSD?"
Now, I realise most people should ponder their backup regime before the tictictick, not after. But as the phrase goes "The best time to plant a tree was 20 years ago. The second-best is now." The SSD equivalent is "The be- nope, too late."
They're just terribly unforgiving, which doesn't fit with a culture that values cure over prevention.
It may be irrational, but I remain very distrustful of SSDs, in part for reasons like this. I use them occasionally as temporary storage, but I don't use them for anything that would cause me a headache if the drive died without warning. So far, my observation is that their lifespan is considerably shorter than spinning platter drives, and spinning platter drives typically give plenty of warning before actually dying.
Perhaps I'll grow more comfortable after another decade or so, when there is enough real world experience to go by.
Maybe I am being overly simplistic, but shouldn't it not matter?
Who in the modern age doesn't back up everything all the time? Don't we all operate with the assumption these things are going to blow at any time? 90%+ of my data is on cloud storage now anyway. When a SSD goes out don't you just chunk it in the drawer of old drives that you promise to take to the disposal center this weekend (and never do) and then take a quick trip to your local computer store for a new one?
This reminds me of something an IT support staffer told me a long time ago: "The difference between an IT pro and a user is that to an IT pro, hard drives are a consumable resource".
Replacing an SSD is not free, and in most cases it's not easy. Maybe an IT pro can just roll down to the computer store for a new one, put it in their laptop (for free!), and throw a $100+ drive in a drawer without even thinking about warranty, but most people can't. A backup doesn't excuse excessive rates of failure and weird glitches.
Replacing an SSD in a modern Apple laptop is literally impossible. You need to replace the whole dang laptop (or motherboard, whatever they call it these days), which is not something a user can do.
Thank goodness for Backblaze, Time Machine, Carbon Copy Cloner, Drobo and Synology. Maybe I have gone overboard, but I have not lost any data in 12+ years.
The post is about predictability and warnings before a drive dies. The work and cost to replace it doesn't change anyway (since you do replace the drive at the first sign of warnings, right? otherwise what's the point of wanting them?). The only difference is if you get an extra chance to copy the data before you replace the drive - which is no difference at all if you have proper backups.
If we'd be arguing about mean time between failure or total cost of ownership, then it'd be relevant, but this post isn't even claiming that the rates of failure are excessive (compared to what?), just that they are too weird and unpredictable for the author's liking.
It matters. The existence of backups doesn't change that. If you have a failure (be it hardware, software, or PEBKAC) that requires you to restore from a backup, you're going to suffer system downtime and you'll have to spend your own personal time restoring things. Those are not cheap, and we haven't even talked about the cost of new hardware.
I've done a bit of ad-hoc reliability testing with SSDs.
Some years ago I got a great deal on several Pacer disks and wrote a program that writes a pseudo-random sequence of data (from a known initial seed) across the entire disk, reads it back, and compares. Part way through, the data didn't match: no ECC errors, nothing raised by the filesystem, just mismatched bits that came back in a way that tried to "trick" me into thinking they were good data. This happened on something like 5 of the 8 disks. Needless to say I sent those crappy SSDs back to the manufacturer (unfortunately only got a 2/3 refund) along with some harsh words for their engineers.
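For anyone curious, the test was along these lines. This is a rough reconstruction of the idea, not the original program; DEVICE is a placeholder, it needs Python 3.9+ for randbytes, and it is destructive since it overwrites the whole target.

    # Whole-device write/verify test with a seeded PRNG. Sketch only.
    # DESTRUCTIVE: overwrites DEVICE. Writes go through the page cache here;
    # a stricter test would use O_DIRECT or drop caches before verifying.
    import os
    import random

    DEVICE = "/dev/sdX"   # placeholder, never point this at a disk you care about
    CHUNK = 1 << 20       # 1 MiB per write
    SEED = 12345          # known seed so the data can be regenerated for verify

    def fill(dev, seed):
        rng = random.Random(seed)
        written = 0
        with open(dev, "wb", buffering=0) as f:
            while True:
                chunk = rng.randbytes(CHUNK)
                try:
                    n = f.write(chunk)
                except OSError:          # hit the end of the device
                    break
                written += n
                if n < len(chunk):       # short write: also end of device
                    break
            os.fsync(f.fileno())
        return written

    def verify(dev, seed, total):
        rng = random.Random(seed)        # same seed -> same expected bytes
        mismatches = 0
        with open(dev, "rb") as f:
            offset = 0
            while offset < total:
                expected = rng.randbytes(CHUNK)
                want = min(CHUNK, total - offset)
                if f.read(want) != expected[:want]:
                    mismatches += 1
                    print(f"mismatch in the {want} bytes at offset {offset}")
                offset += want
        return mismatches

    total = fill(DEVICE, SEED)
    print(f"wrote {total} bytes, found {verify(DEVICE, SEED, total)} bad chunks")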
I've had more name-brand SSDs fail too, in various manners (even well-reviewed Kingston drives). Sometimes in ways where the drive can't be accessed at all; other times (at best) in a way that no longer allows writes but still allows reads, albeit at a trickle of a data rate.
These days I use solely Intel top-line SSDs, and some (very limited) Samsungs. The choice isn't based on empirical data, but rather an impression that their bar is a little higher (or more conservative) in terms of reliability, and on simply not wanting to deal with the issues I seemed to keep hitting with other brands. The downtime lost to restoring / reconstructing just isn't worth it to me. Maybe I'm paying twice as much as I ought to, but since making the switch many years back it's worked out pretty well and I've been happy / fortunate.
I run my SSDs in RAID10 using high-end controllers (aside from a few in ZFS).
Just my own subjective experiences, again I'm not doing this at scale.
I recently had a similar SSD failure, although it wasn't in a "new fileserver" but my daily use 2013 desktop. It was working, then it was producing write errors corrupting my filesystem, then the whole system died, very quickly. Fortunately for me, some data was recoverable from the corrupted disk; I had a local backup from 12h prior, and a tarsnap backup from about the same time back.
(Um, here's where I have to be critical of tarsnap: their recovery performance is absolutely abysmal for small files. They're latency bound between you, their EC2 instance, and the backing S3 store. Think single or double digit kB/s and then think about how much data you back up with tarsnap. I can't recommend any other backup provider better, but this is an experience where tarsnap left me very disappointed.)
Looking at that SSD's SMART data and my other SSDs', they report the number of spare blocks remaining, and you can monitor that as it goes down. Ideally you replace the drive before it reaches zero.
My primary mistake was simply not monitoring that data in an effective way.
I don't think anyone who monitors HDDs has any real expectation that the high-level SMART yes/no is going to protect them from data loss. Instead they look at highly predictive factors like "Reallocated_Sector_Ct" or "Raw_Read_Error_Rate" (or even plain old "Power_On_Hours").
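If anyone wants to script that, this is roughly the level of effort involved. It assumes smartmontools is installed; /dev/sda is a placeholder, and the attribute table layout can differ between drives and smartctl versions.

    # Print a few of the predictive SMART attributes mentioned above via smartctl.
    # Assumes smartmontools is installed; the device path is a placeholder.
    import subprocess

    DEVICE = "/dev/sda"
    WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable", "Raw_Read_Error_Rate", "Power_On_Hours"}

    out = subprocess.run(["smartctl", "-A", DEVICE],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        fields = line.split()
        # attribute rows look like: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH:
            print(f"{fields[1]}: raw value {fields[-1]}")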
NVMe SSDs provide SMART-like data on log page 2 ("Available spare", "Percentage used", "Power on hours"). For some reason the NVMe spec does not require the device to accept host-initiated self-tests, so most NVMe drives have no equivalent of smartctl --test. :-(
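The NVMe side is scriptable too, assuming nvme-cli is installed; the field names below are what recent nvme-cli versions print for log page 2, but they may vary between versions.

    # Pull the wear-related fields from NVMe log page 2 via nvme-cli.
    # Assumes nvme-cli is installed; /dev/nvme0 is a placeholder.
    import subprocess

    out = subprocess.run(["nvme", "smart-log", "/dev/nvme0"],
                         capture_output=True, text=True).stdout

    for line in out.splitlines():
        if any(key in line for key in ("available_spare", "percentage_used",
                                       "power_on_hours", "media_errors")):
            print(line.strip())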
For my home setup, at least, it's simple: put the OS on a dirt-cheap 120GB SSD, and all the user data on a multi-terabyte hard disk. You can always selectively migrate other performance-critical (but expendable) stuff onto the SSD later. If it breaks, I just buy another one and reinstall the OS. On laptops that can only take one drive, the SSD is it, but so is the awareness that the data on it has to be considered ephemeral. I've had assorted hard disks die over the years from old age, and so far without exception they've been "mostly" recoverable - might have to give up on a few files that got hit by bad sectors, that sort of thing. And I've been warned about impending failure by SMART diagnostics.
My first experience with drive failure was a ~40MB HDD expansion card in a 386. The bearings got "sticky", so the spindle wouldn't start rotating. But there was a hole covered with aluminium tape, and you could insert the eraser end of a pencil and give the spindle a nudge. So yes, very understandable.
Not too much later, I used Iomega ZIP drives, and experienced the "click of death". That was sudden, and irreversible, but also very understandable.
For the past couple decades, I've consistently used RAID arrays, mostly RAID1 or RAID10 (and RAID0 or RAID5-6 for ephemeral stuff). I've had several HDD failures, but they were usually progressive, and I just swapped out and rebuilt.
I recently had my first SSD failure. And it was also progressive. The first symptom was system freeze, requiring hard reboot, and then I'd see that one of the SSDs had dropped out of the array. But I could add it back. At first, I thought that there was some software problem, and that the RAID stuff was just caused by hard reboot.
But eventually, the box wouldn't boot, so I had to replace the bad SSD and rebuild the array. It was complicated by having sd1 RAID10 for /boot, and sd5 RAID10 for LVM2 and LUKS. So I also had to run fdisk before device mapper would work.
Reading that blog and its sister post about "flaky SMART data" on those same Crucial MX500 drives reminds me that not all SSDs are created equal.
Just like not all hard drives are created equal. My previous job involved a decade running 10 cabinets of servers an hour away with very little manpower: we eventually came to find that IBM/HGST drives were a lot more reliable than others.
We also evaluated some early SSDs, and they were terribly unreliable. We eventually settled on the Intel drives and they were superb. My new job we've been using mostly Intel and Samsung Pro drives, they work great. But Dell sent us a server with some "enterprise SSDs" in it, that we eventually found were Plextor drives. Those things were terrible. We replaced them immediately with Intel, but used some of the Plextor drives and had all of them fail within a year. I'd put the Intel 64GB SLC drives from our 7 year old database server in a system before I'd put one of those brand new "enterprise" Plextor drives in.
I love Crucial, I buy a lot of RAM from them, but I'm skeptical of switching to other brands of SSDs. The more experience I have, the more conservative I get with systems that matter.
I had a bunch of Crucial SSDs die a few years back: they'd work for an hour, then disappear from the bus. Reboot and they'd work again for an hour. It turned out Crucial used a small counter to track uptime in hours; it would increment until it overflowed, and then the firmware crashed. This failure could just as easily have occurred on a spinning HDD.
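A toy illustration of why that kind of bug hides for so long (made-up field width, nothing to do with Crucial's actual firmware): a counter kept in a small fixed-width field looks fine for months, then wraps, and anything that assumed it only ever grows breaks at exactly that moment.

    # Illustrative toy, not real firmware: an hourly uptime counter stored in a
    # small fixed-width field wraps to zero after enough hours, breaking any
    # code that assumes the value is monotonically increasing.
    FIELD_BITS = 16                      # made-up register width
    MASK = (1 << FIELD_BITS) - 1

    last_seen = 0
    for hour in range(70000):
        stored = hour & MASK             # what the register actually holds
        if stored < last_seen:
            print(f"wrapped at hour {hour}: stored value went {last_seen} -> {stored}")
            break
        last_seen = stored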
I actually much prefer this SSD failure mode: Unlike failing spinning rust which will happily linger around coughing up bad data (which will then be written to backups, mirrored drives, etc. potentially creating a huge mess) an SSD going out like a light is comfortingly binary.
It's the thorny problem of failing cleanly versus degrading gracefully. In a RAID system you want failures to be clean, but not synchronized across drives. In a single drive, some sort of graceful degradation is usually preferable.
Relevant: there is a project called LightNVM [0] which is pushing for a much lower level API to SSDs, that allows most of the complexity to be moved into the host OS (namely, Linux).
To add to the anecdata: My most recent SSD failure happened when I did the firmware upgrade. It worked before the upgrade, the upgrade binary said 'upgrade failed' and the disk vanished and never returned after the 'upgrade'.
This post, more than any other, just convinced me to pull out my old Unison file-sync configuration (which was really good, looking at it again) and get regular syncs to my NAS (which in turn uploads to cloud storage) working properly again.
Having recently swapped 100TB of spinning media to SSD, I am awaiting the first failures. Being a business environment, it is all mirrored capacity. So I guess my question from the article is: are they running on a single device? No RAID or mirror?
I am loath to keep even my personal data at home on one drive, and since I use an iMac that means relying on Time Machine, as mirroring the internal drive is not really possible (or at least I did not spend enough time researching it).
I think those drives dying quickly is actually a Good Thing™, because chances you're backing up corrupt data might become smaller…
With the older drives you sometimes would have a drive die, replace it, restore your backup only to find that in the process of dying the drive was actually corrupting some of the data which went into the backups, now you've got to hunt down the last uncorrupted versions of the data in the backup…
I don't know the technical side enough to give any real answer, but tossing them in the freezer (properly sealed of course) always seemed to help "loosen" them up enough to get data off.
I've also had some hard drives that you could bring back to life by giving them a firm knock with your knuckles too.
Doesn't really answer much, but it's a last ditch effort that has saved me more times than not.
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"
This is incorrect, and since much of the argument seems predicated on it, I don't see a real issue here.
Redundancy, replication (being able to recreate one failed drive from a certain number of other drives), reliability data, and a replacement budget. That's difficult for personal use.
TL;DR Lack of noises makes SSD drives bad at motivating users to do backups or use redundant storage: they don't seem to be on the verge of catastrophic failure.
I think operating systems should be programmed to wipe out a drive completely once early in the life of every user (around age 20-ish) to burn in their brain the need to back up!
My father had the same philosophy about me getting into a car crash early in my driving life: "Now that you have gotten that out of your system (and I'm glad you're fine), don't ever do it again."