Originally TRIM was an un-queued command; all writes had to be flushed, then TRIM executed, then writes could continue. This was bad for performance with automatic on-file-delete trim, so everyone wanted a trim command that could be put in the command queue along with writes. Many new drives have this.
It turns out that Samsung 8XX SSDs advertise they support queued trim but it's buggy. The old TRIM command works fine.
https://lkml.org/lkml/2015/6/10/642
There are in fact lots of "quirks lists" and "blacklists" in the kernel and virtually all computers require some workarounds in the linux kernel for some buggy hardware they have. Pretty amazing when you think about it.
EDIT: another closely related example is MacBook Pro SSDs and NCQ, aka native command queuing. They claim they support it, but on many it's buggy. It gets better though; the linux kernel only started trying to use such functionality by default relatively recently.
https://bugzilla.kernel.org/show_bug.cgi?id=60731
These sorts of things are, as you can see, very confusing and frustrating to track down, identify, and find a general fix for.
EDIT2: now that I've actually read the kernel bugzilla entry further, it's more recently come to light that the actual problem with recent MacBook Pro SSDs is MSI (message-signaled interrupts, a more efficient type of interrupt).
The thing is, almost all hardware accessed through drivers has tons of bugs; at the very least, it's nowhere near as "bug-free" as things like CPUs or DRAM, which cannot hide their bugs behind drivers. What one can hope will work reasonably is a piece of hardware plus an accompanying driver that knows how to hide that hardware's issues.
So another way of putting what you said would be "on Linux there's no working driver for that piece of hardware, unlike on Windows where the 'proprietor' went to the trouble of supplying such a driver."
I didn't see him thinking that. Just that CPUs do not have as many bugs as other hardware - which I think is quite true. With CPUs a larger portion of bugs are found, and smaller bugs matter because they are not hidden by proprietary drivers.
FWIW, memory has plenty of bugs too. With respect to the original point, these are usually not visible to drivers (unless you count EDAC) because they're handled at the chipset level. However, for certain kinds of systems - especially embedded - that don't have chipsets these issues can become painfully visible. My own exposure to this was at SiCortex, where the memory logic was directly on the same single die as everything else that comprised a node.
I assumed that these drives had the same controller chip and the same firmware base as the consumer samsung SSDs, but with higher quality nand and some firmware tweaks. It's very hard to find technical details about these enterprise drives on the internet (compared to the consumer drives).
I guess the smartctl output proves it: these enterprise Samsung SSDs do not have queued trim enabled.
It would make sense for enterprise drives to be more conservative and lag on feature set. But it's very surprising that enterprise drives are corrupted by the original un-queued trim; they're supposed to have more validation, and that's a very common feature.
It sounds to me like even when it's the fstrim utility, which uses an ioctl() to tell the kernel to trim free regions in a range on a filesystem, the kernel ends up issuing the queued trim command if it's available.
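(For reference, the ioctl in question is FITRIM from <linux/fs.h>; here's a minimal sketch of roughly what fstrim does under the hood, assuming the mount point comes in as argv[1] — whether the resulting discards go out on the wire as queued or un-queued TRIM is then up to the kernel/driver:)

/* Minimal sketch of what fstrim does: ask the kernel to discard
 * free space on a mounted filesystem via the FITRIM ioctl. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,   /* whole filesystem */
        .minlen = 0,
    };
    if (ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");
        close(fd);
        return 1;
    }
    /* the kernel updates range.len to the number of bytes trimmed */
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    close(fd);
    return 0;
}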
The "blacklist" does not appear to have any constant to blacklist old-style trim, only NCQ_TRIM (and other odd stuff, most notably all NCQ usage).
This makes sense, because if some SSD advertised old-style trim but was corrupted by it, then it would be found and fixed sooner by these vendors, because Windows 7 would exhibit the corruption.
I see the addendum to your post; touche, I guess these drives do indeed lack queued trim, and have some issue with plain old trim. That's rather surprising, to me... I was going to say "especially for an enterprise-grade drive" but I'm not so sure...
Nice debugging story. When I was at NetApp there were lots of times when drive firmware for the 'less used' options would fail. On the Fibre Channel drives, the 'write zeros' command, which was supposed to zero a drive, was notorious for its inability to achieve something that simple. When Google looked at disk encryption technology (I don't know if they finally deployed it), it worked differently from disk to disk and from firmware rev to firmware rev. I think it was Brian Pawlowski at NetApp who said "You can count on two things working right in a hard drive: read, write, and seek." The joke being that you needed all three of them to work for reliable disk operation.
Linux 4.0.5 includes a patch that blacklists queued TRIM for the buggy drives. Windows and OS X apparently don't support queued TRIM at all, so they're unaffected.
The drives on which we detected the issue still had un-queued TRIM. I reached out to one of the kernel I/O developers for help and he confirmed that it is not related.
But isn't the blacklist you link to in the article specifically for queued TRIM? E.g. https://github.com/torvalds/linux/commit/9a9324d. So either that blacklist has nothing to do with this issue (in which case it probably shouldn't be linked from the article), or it does, and we're talking about issues with queued TRIM.
To me, this sort of thing brings home the value of not running your own machines. Sure, Amazon's/Google's clouds have quirks, but it's far less likely that you're going to have to debug faulty hardware in this way. It sounds like a team of more than one person worked on this at least part-time for weeks -- how much is that worth? It's not just the cost of hiring extra people to do the work; often small companies simply can't hire enough good people -- when you do find them, do you want to squander them twiddling servers?
If something similar happens to you on "cloud" infrastructure, you're very limited in what you can do to diagnose or work-around the problem.
At a place I used to work, we had a reasonably large cluster of Windows boxes on Amazon. Randomly, Windows machines on Amazon would suddenly stop accepting new TCP connections.
This means that machines would be running fine, and then half your cluster starts dropping offline. At the time when this happened to us, there were no other reports we could find of this happening.
Turns out, it's some bug in the Xen Virtual NIC driver that wasn't running the offloaded TCP cleanup, and so eventually the system couldn't accept any new connections.
Once we figured out what was happening we could pre-emptively reboot boxes, but that was a problem for us for about 6 months, IIRC.
There's probably dozens of these bugs affecting someone on these cloud platforms at any one time. But because you have no access to the hardware, you don't even have the option of saying "Screw it, let's just get different hardware". You're at the mercy of your cloud provider.
There is no cloud - just other people's computers.
Many use cases just require the job to be done on your own computers for security and privacy reasons. Yes, Amazon's and Google's services are in some ways less secure than your own computer, because they are hosted by companies subject to a government that doesn't value privacy, not even that of its own citizens. That means said government can, just to give a concrete example, NSL those companies into giving up everything they have about you, and you wouldn't even know.
When the government puts national security above fundamental human rights there is something dangerously wrong.
Thinking about individual computers will lead you astray. There are, rather, sets of machines (from single boxes to entire data-centers) that are managed by a given sysadmin staff. The more machines they manage, the more likely it is that problems will have institutionalized and operationalized solutions.
A cloud is just a sysadmin staff with a Sufficiently Large Deployment to have ironed out all the kinks in their hardware.
Or the more likely they are to avoid doing advanced stuff in order to increase profit, as long as they stay a microscopic delta better than running it yourself, for most customers, most of the time, on average. That microscopic delta may not be measurable or noticeable by the end users, of course.
That's assuming their business model doesn't rely on an infinite supply of future customers, so that in the short term, as long as revenue per customer exceeds cost of sales per customer, we're all good, etc. Support costs that exceed the average cost of sales must be beaten down or ignored; otherwise it's cheaper to let those customers go and have sales "earn" a replacement.
Finally, their sysadmins work for them, to meet corporate objectives expressed in various meaningless metrics that need not align in any way with your own corporate objectives.
True, by the literal definition. I continue to interpret "cloud" as "that mysterious part in the middle of the diagram which is a clean encapsulation of Somebody Else's Problem that never bothers you"; obviously, there are no true "clouds" (and there cannot be) by that definition.
But people can try, and they can get close; and one can say that something is a cloud to the degree that it manages to fulfill the "amorphous shape in your diagram you don't have to worry about" promise. So there are some 80%-clouds, some 95%-clouds, some 99.995%-clouds, and so on.
The point I was trying to make is that the degree to which a cloud achieves that promise is correlated to the size (and longevity, and homogeneity) of the deployment. The more man-years have gone into taking care of a given server type at a given DC, the more institutional knowledge is ready-at-hand to solve a problem on your machine of that type, and so the fewer issues become emergencies that break out of the "cloud" abstraction to require your attention.
And it was a reply to the parent precisely because a security problem is just such an "emergency" that represents a failure of institutional knowledge: I would much sooner trust AWS's KMS to not leak my private keys than I would trust a machine I was running myself to not leak my private keys. I'm a much worse sysadmin than AWS!
Let's do some maths on that claim:
AWS: c3.8xlarge with 32 "CPUs" and 60 gigs of RAM.
For the machine alone it's $1,200 a month. Bear in mind it's on shared infrastructure, with noisy neighbours; you'll see about 10-30% CPU steal. In practice you'll see performance about half that of a real machine (from my comparisons).
Then you'll need to factor in disks as well. First things first: EBS is dogshit slow. Yes, ephemeral disks are fast, but then they die, so you're in the same situation. However, you need 10-gig networking to get low latency, avoid puncturing the cache, etc.
For EBS, the maximum IOPS you can guarantee is 20,000, and you need 1 TB for that.
For the IOPS, that's $1,300 a month, plus $125 for the 1 TB of storage.
So per machine it'll be $2,625 a month, or $31,500 per year.
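(Just to spell the arithmetic out in a trivial, self-contained form, using the list prices quoted above:)

/* Spelling out the arithmetic above, using those list prices. */
#include <stdio.h>

int main(void) {
    const double instance_per_month = 1200.0; /* c3.8xlarge              */
    const double piops_per_month    = 1300.0; /* 20,000 provisioned IOPS */
    const double storage_per_month  =  125.0; /* 1 TB of EBS             */

    const double monthly = instance_per_month + piops_per_month + storage_per_month;
    printf("per machine: $%.0f/month, $%.0f/year\n", monthly, 12.0 * monthly);
    return 0;
}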
Every 6 months, you could buy a new machine, which is faster than the fastest EC2 instance + EBS.
Now, the OP stated that they have more than one machine. Obviously one could use reserved instances. However similarly one could negotiate volume discounts.
There is of course the cost of internet and cooling; you're looking at around $500 a month for half a rack, depending on power consumption (if you're colo'ing).
From a valuation point of view, having hardware counts towards your value, as it's an asset you actually own. More importantly, you can use it to lower your tax bill and reduce your run rate, in exchange for an up-front cost.
Now, if you have a lot of bursty traffic that doesn't require much DB activity, then AWS is perfect, as the elastic IP load balancer allows you to spin up machines on demand. However, that's not that helpful for databases. Sure, you can warm-migrate from an EBS snapshot, but you'd best do it quick, otherwise you'll overload an already overloaded DB.
With our architecture, HW requirements, the price of HW, and the price of the cloud VMs, even working on this for a week or two saves us a significant amount of money, both short-term and long-term. The side effect is that we now have tools to recover servers way faster, and it allows us to do things we had not thought about before.
Agreed. Additionally, some business models simply don't mesh with cloud infrastructure pricing no matter the volume. There are definitely advantages to using cloud services, but most of the time bare metal gets you more hardware/performance at a lower cost in the long run, even when you factor in everything else that it entails.
The thing people forget is that the cloud providers have the same issues and expenses. That cost is passed on to the clients. Now, they may be more efficient etc., but once you reach a certain scale, and it's less than people think, you might as well do it in house if you can find qualified people.
How can people forget, when that cost is right there in the price tag? If anything, it's easier to overlook the costs of running your own hardware, since they aren't immediately apparent.
My experience is that people don't understand cloud pricing at all.
First of all, they tend not to look at monthly prices, and are seduced into thinking their instances are cheap. Secondly, they are seduced into thinking they are spending less ops time, though in my experience it's the reverse. Thirdly, people "forget" about extras like bandwidth costs (which are extortionate at all the big cloud providers), extra storage volumes, etc.
Then when people get the bill, it often gets back-rationalised as being ok because it's cloud so it must be cheap.
The greatest innovation AWS did was finding a way to get people to pay absolutely insane rates for hosting.
They simply underestimate the ops cost, and often focus on the monthly cost. The thing that cloud providers like AWS are good at, and IMO the only reason you should choose them, is when you have highly variable loads. Dynamic scaling is something only they can do, because they have such massive scale. Even if you're relatively small and cannot justify hiring a sysadmin, there are plenty of consultants out there you can hire.
It feels like Samsung used the Linux community here as a free testbed.
Samsung knew that only Linux supported queued trim, so releasing it without proper testing is just externalizing the disproportionately increased cost of testing to the Linux community.
In this case it was un-queued TRIM (I forgot to mention it in the blog post). We have reached out to Samsung and, although it looked good at the beginning, they have now been silent for more than a month with no progress.
With Samsung's prewritten support responses, the company already tells Linux users not to expect any support at all. So that is consistent with the testbed theory.
I am sorry. I was too dumb to phrase it understandably, so the shame is on me. The sentence smelled problematic to me, but after rereading it several times I concluded it was understandable. I should have gone with my gut…
Here is another try: Samsung's support stonewalls with prewritten answers saying that Linux is open and thus Linux is unsupported, and this behaviour from Samsung is consistent with the testbed theory.
Only because, up until recently, getting into the HD game was prohibitively expensive due to the engineering and capital requirements for designing spinning disks. Now anyone can buy some flash media and a pre-cooked controller firmware, combine the two, and sell at competitive rates. There are something like five or six competitive SSD makers right now and many more bottom feeders. There are two competitive spinning-disk makers, and it's been that way for decades, ignoring the occasional smaller third-party player like Hitachi.
But they're still in business making hard drives, under the DeskStar name. Seagate had a round of failures at one point. I'm sure there are people out there who've sworn off WD as well.
Who? If you're referring to IBM, then they're not. IBM sold off their entire hard-disk division to Hitachi (which a few years ago sold it off to WD).
If you're referring to Hitachi, then they did continue it, yes, but they bought it on a fire-sale, and their name was not attached to the original affair, so they presumably did not see it as particularly risky.
Is Seagate back to being good? We had 100 drive failures in a batch of 120 HP netbooks, and we had a large number of our server drives go down the tube. I switched to WD at that point.
The EVO was well covered, but these being from the PRO line, it's even worse... Intel SSDs have been praised for a long time; they seem to be the only stable brand around.
Yes, it seems so. Still, I don't like them much since they changed to compressing controllers (SandForce, as far as I know), which have a mixed reputation, as far as I remember. But it looks like Samsung is no better.
Strange, the Samsung 840/850 EVO/PRO are considered [1][2] among the best consumer SSDs. The issues the article mentions do not exist on Windows; the SSDs are very reliable there. I suspect it's not only Samsung's fault. Are we sure Linux's handling of TRIM operations is absolutely correct?
The problem is that "absolutely correct" is a slippery concept. Even the most tightly written standard is likely to have some areas of ambiguity through which bugs can creep. If the way that a particular device deals with that ambiguity is known only to those under NDA, then you can have two drivers that are both "absolutely correct" per the standard but only one actually works in all the edge conditions.
Personally, I find Samsung has an "it boots? fine, then ship" mentality for pretty much all things: their buggy phones, buggy SSDs, buggy TVs, etc. I wouldn't recommend them, even though they do well on SSD speed tests (which are often gamed by on-board RAM caching).
I have this running on my Ubuntu ThinkPad with a Samsung 840 Pro as a weekly cron job. Should I turn it off?
#!/bin/sh
# call fstrim-all to trim all mounted file systems which support it
set -e
# This only runs on Intel and Samsung SSDs by default, as some SSDs with faulty
# firmware may encounter data loss problems when running fstrim under high I/O
# load (e. g. https://launchpad.net/bugs/1259829). You can append the
# --no-model-check option here to disable the vendor check and run fstrim on
# all SSD drives.
exec fstrim-all
Pretty disappointing to see some of those Samsung drives on the list, because in some of the other tests/surveys I've seen they seemed to be among the better choices. Sigh I guess Sturgeon's Law applies to SSDs too.
Using SAS SSD drives on a server is a bad idea for many reasons. One should use PCIe cards that sit directly on the PCIe bus, such as FusionIO or SanDisk. They have been tested and retested (e.g. by Facebook), without the unnecessary added complexity of the SAS/SATA protocols. The I/O performance is also about 20x.
I don't think that testing by Facebook is going to help you unless you are using the exact same model as they are and are assured of using their exact firmware. At work we use SAS SSDs in large quantities and the firmware we use is customized for us (based on the mainline one). Do not assume that a bug that was fixed in our firmware was necessarily fixed in the normal one. One would think it would be, but it is possible that it wasn't ported to the mainline firmware.
Sometime around the end of 2013 I started getting frequently lost data and corrupted filesystems upon reboot.
After much searching and about 4-6 months into the issue, I found out that the culprit was the queued TRIM commands issued by the Linux kernel to my Crucial M500 mSATA disk. The Linux kernel already had a quirks list with many drives, including some of the M500 variants, just not mine.
I added my model, compiled the kernel, and the nightmare ended. I proceeded to submit a bug report and a patch. The patch got accepted (yay!) and the bug report turned out to be very useful for other people with the same problem but a different disk, as I had included the dmesg output specific to the issue. This meant that they could now google the errors and get a helpful result.
Such is the nature of free software; you are allowed to fix your computer yourself. :)
I've worked a lot on some interesting SSD deployments and experiments over the past 12 months. Quite honestly, I wouldn't go anywhere near Samsung products, regardless of their 'PRO' labelling or otherwise.
We have had great success with both Sandisk Extreme Pro SATA and Intel DC NVMe series drives, we've also recently deployed a number of Crucial 'Micron' M600 1TB SATA drives that are performing very well and so far haven't given us any issues.
I've done similar over the last three years and had good luck with the Crucial drives. However if you take a look at the Linux Kernel patch they link to (search for "don't properly handle queued TRIM"):
https://github.com/torvalds/linux/blob/e64f638483a21105c7ce3...
There are Crucial SSDs on the list. I'm going to be keeping a closer eye on them now.
Yeah I saw that - although that's the older, now discontinued series that has a different controller and doesn't show the same consistent performance as the newer M600 drives.
In theory, yes. Unfortunately, every time my Btrfs filesystems have encountered a hardware glitch, it has happily trashed the filesystem beyond recovery (including both drives in a RAID1 mirror, one of which was perfectly OK). I use ZFS now, and while some features are comparable with Btrfs, the implementation quality, documentation, feature completeness, and tool quality set it well above where Btrfs is at.
I fully second that: I'm using Btrfs for / and ZFS for /srv. So many filesystems trashed beyond recovery on Btrfs; so much joy, stability, and easy tooling with ZFS.
I'm really considering migrating / to ZFS now.
I've had issues with these Samsung 8xx drives; unfortunately they all happened at once. I gave up on their RMA/warranty process because I was bounced back and forth between the same two numbers a few times. Each side said that the other was in charge of the process (Samsung bought the SSD division from Seagate... or was it Seagate that bought the HDD division from Samsung? To this day I have no clue).
Can someone clarify the article's claim that these Samsung drives are really "broken" as such? We have a few of these on 3.13 and 3.16 kernels and ext4 with no problems. It seems that there must be something unique to their application in order to expose these trim failures.
Do you have the "discard" mount option enabled? Do you have a cron job that runs the "fstrim" command? It's possible your systems are not running trim. Or maybe your ext4 filesystems have little activity and you haven't had enough corruption to notice yet :)
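(If you want a quick way to check, here's a rough sketch that just scans /proc/mounts for the "discard" option — this assumes the filesystems of interest expose the option there, as ext4 does; absence here doesn't rule out a periodic fstrim cron job, which is the other way trim gets issued:)

/* Rough sketch: list mounted filesystems that currently have the
 * "discard" (online trim) mount option enabled, by scanning
 * /proc/mounts. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/mounts", "r");
    if (!f) { perror("/proc/mounts"); return 1; }

    char line[1024];
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "discard"))
            fputs(line, stdout);   /* mount entry with online discard */
    }
    fclose(f);
    return 0;
}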
Also, some Samsung 800 series drives only gained this bug in a recent firmware update (840 EVO specifically).
The 840 EVO joined the club with firmware EXT0DB6Q, which itself is a nasty little hack around a fundamental design problem with the tightly packed NAND cells.
Linux 4.0.5 ships with the patch linked above, but for a while you had to roll with a kernel built from source.
EDIT: The blatant file corruption issues only manifested after updating to firmware EXT0DB6Q.
I'm not sure exactly when the first 840 EVO firmware which advertised queued trim support (along with SATA 3.1/3.2 support) was released, but I think that if you last updated firmware (or acquired the drive) before October 2014, you're safe.
I currently cannot notice any performance degradation, and I bought the drive in May 2014, with no further updates. (Unless Arch Linux automatically applies firmware updates, but I doubt that.)
I'm so sick of this TRIM stuff. Constant configuration needed because of it, constant care like "this is a thing you'd better not do on SSDs". And then problems like this.
Do you think there'll ever be SSDs that don't need it?
I remember when Apple started incorporating SSDs into their computers and didn't support TRIM. Windows users were telling Mac users their Macs were practically obsolete because they couldn't do this one thing that was enabled for Windows. Of course Mac users sent that back to Apple, and Apple replied, for years, "you don't need it."
Eventually, they relented and enabled it on their SSDs. I'm pretty sure the marketing and engineering butted heads over this one stupid bullet point.
Except that without TRIM you'll fill all your blocks and kill the performance of your fancy $1500 Apple, as the SSD performs a dozen operations to create space for a write instead of one operation on a properly TRIM'd drive.
Apple didn't do this because of "windows users whining" but because they knew they didn't want an angry mob of customers wondering why their drive is 10x slower than it was on day one.
Arguably, idle GC was "good enough" for some use cases, but probably not for drives that aren't sitting idle all the time and are on many hours a day. Even then, Apple probably didn't want to tell its customers to "let it sit overnight" to regain performance when supporting plain-jane TRIM was a trivial addition.
On-board GC + OS-driven TRIM are considered the optimal solution for SSDs.
Because it's hard to make an automatic monitoring system that reliably distinguishes between "a failure occurred but everything is fine" and "a failure occurred and now everything is on fire".
We have multiple different pages. In our cluster we have 3 machines, and if one of them is unavailable because of a broken network, we do not page. In this case the page came as an application error that the application was not able to cope with. When we have an issue that we have seen before and the server can handle on its own, we do not page.
I have one of the affected drives mentioned in the article in my development laptop - the Samsung SSD 850 PRO 512GB.
As one of the most expensive SSD drives available on the market, it was disconcerting to find dmesg -T showing trim errors when the drive was mounted with the discard option. Research on mailing lists indicated that the driver devs believe it's a Samsung firmware issue.
Disabling trim in fstab stopped the error messages. However, it's difficult to get good information about whether drive performance or longevity may be impacted without trim support.
Trim really is only a helpful hint when the drive is near full, so the GC can preemptively erase blocks and retain good write speed. Without trim, the firmware must wait until it gets a write for a particular block before it knows the old block can be erased.
If your drive has a reasonable amount of unprovisioned space, it can simply work around the missing trim commands. This is theory, however; I do not know if the firmware actually does this. This is exactly the thing that makes some drives better than others when working without trim.
Thanks. I'll probably end up creating an unprovisioned partition. It's frustrating, exactly because of the uncertainty re future performance. Especially given the price premium for pro/enterprise level hardware.
You can research whether the firmware understands MBR and GPT; if it only understands one, then you have to use that. Alternatively, use Samsung's own software (I think it's called Magician, can't remember exactly); it will make sure you have the unprovisioned space set up correctly.
Interesting! I sometimes work with SSDs as storage media for cameras (where Sandisk is the most popular brand by a mile) and I seriously doubt any camera firmware is doing drive maintenance. From what I know of digital imaging technicians, neither are they - if a drive starts acting up in any way, the usual policy is to just take it out of service immediately, recover anything that was on it, dump it, and buy a replacement.
How do you disable TRIM on common distros? Under Ubuntu, is it just preventing /etc/cron.weekly/fstrim from running, or is there more to it? What about CentOS, etc?
Undoubtedly the same issue happened to me on a 500GB 840 EVO with NTFS.
The SSD zeroed out a part of the disk during runtime; as I watched this happen, music was playing from that very drive. It was mounted from Ubuntu MATE 15.04, playing a music library through Audacious. Suddenly the music glitched and IO errors began appearing. Rebooted to a DISK READ ERROR (the MBR was on the EVO). Ran chkdsk from USB and it showed a ridiculous number of orphaned files for about an hour. Once it finished, the most frequently accessed files had disappeared: the Downloads folder, the Documents folder, some system files. Of course, some of the files could've been recovered had I not run chkdsk right off the bat, but nonetheless it's an approximate measure of the failure's impact.
I began being suspicious of 840 EVO when sorting old files by date became fantastically slow. If you have a feeling this has happened to you recently - buckle up for a shitstorm.