1000% this. HN loves to talk about Dropbox. I spent most of my (short, praise God) career at Dropbox diagnosing a fleet of dodgy database servers we bought from HPE. Turns out they were full of little flecks of metal inside, thousands of 'em, iron filings. You think that kind of thing happens when you are an AWS customer?

If you are sophisticated enough to engage an ODM, build your own facilities, and put HVAC and electricians on 24-hour payroll, go on-prem. Otherwise, cloud all the way.



That's not quite where I would draw the line, I don't think. I used to work for an ISP and we were kind of split between AWS and on-prem. Obviously, things like terminating our customers' fiber feeds had to be on-prem, so there was no way to not have a data center (fortunately in the same building as our office). Moving our website to some server in there wouldn't have been much of a stretch to me; at the end of the day, it's just a backend for Cloudflare anyway.

Like most startups, our management of the data center was pretty scrappy. Our CEO liked that kind of stuff, and we had a couple of network engineers who could be on call to fix overnight issues. It definitely wasn't a burden at our size of 50 employees (and that includes the field techs who actually installed fiber, dragged cable under the street, etc.).

We actually had some Linux servers in the datacenter. I don't know why, to be completely honest.

So overall my thought is: maybe use the cloud for your 1-person startup, but sometimes you just need a datacenter, and it's not really rocket science. You're going to have downtime while someone drives to the datacenter. You're going to have downtime when us-east-1 explodes, too. To me, it's a wash.


I mean, you did want to manage bare metal servers, right?

AWS almost certainly gets batches of bad hardware too. And if your services are running on that bad hardware, you can't have a peek inside and find the iron filings. For servers, this is probably not too bad: there have been articles about dealing with underperforming EC2 VMs for a long time, and if you run into that, you'd find a way. AWS has enough capacity that you can probably get VMs running on a different batch of hardware somehow (a rough sketch of that follows at the end of this comment). With owned hardware, if it was your first order of important database servers and they're all dodgy, that's a pickle; HPE probably has quick support once you realize it's their hardware.

If your cloud provider's network is dodgy, though, you get to diagnose it as a black box, which is lots of fun. I would have loved to have access to the router statistics.

There's a lot of stuff in between AWS and on-prem/owned datacenters, too.
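
To be clear, this is just a sketch of the blunt workaround I mean, not anything the parent described: a stop/start cycle through the EC2 API (as opposed to a reboot) releases the underlying physical host, so the instance usually comes back on different hardware. The instance ID and region below are hypothetical placeholders.

    # Sketch: move a suspect EC2 instance onto different underlying hardware.
    # A stop/start (unlike a reboot) releases the physical host, so the VM
    # usually lands somewhere else. Instance ID and region are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical suspect instance

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(instance_id, "restarted, most likely on a new host")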


> If you are sophisticated enough to engage an ODM, build your own facilities, and put HVAC and electricians on 24-hour payroll, go on-prem. Otherwise, cloud all the way.

I imagine the entire sentiment of the comments is that FedEx is one company that really should be sophisticated enough.


Not really a meaningful dichotomy.

There is a smooth curve between cloud and dedicated DCs, which has various levels of managed servers, co-location, and managed DCs. (A managed DC can be a secure room in a DC "complex" that shares all the heavy infrastructure of DCs.)

Primarily, the FedEx managers are committing the company long-term to Oracle/Microsoft platforms. Probably mostly to benefit their own careers.

Outsourcing hosting and management of DCs would have been something different, and probably healthier for FedEx and the industry.


> You think that kind of thing happens when you are an AWS customer?

You bet it does! But as the AWS customer you'd never notice because some poor ops dude in AWS-land gets to call up the vendor and bitch at them instead of you. It ain't your problem!


Why do you buy servers with metal flakes in them? No quality control on your side?


Are you saying that part of the expected savings from going on-prem is that you will have to disassemble equipment bought from major OEMs and examine it for microscopic metal dust?

That doesn't sound like it will save much money, honestly.


They're saying it's a surprise to hear that Dropbox doesn't know what QC and order acceptance mean. And it is, I agree. That you spent the time investigating it, implying those servers were in production, is a shibboleth to those of us who know what we're doing when designing hardware usage that Dropbox doesn't. It is, however, your self-sourced report, and we don't have an idea of scale, so maybe they do and you're just unlucky.

And no, operators don’t disassemble to perform QC. And no, I could hire an entire division of people buying servers at Best Buy, and disassembling them, and stress testing them, and all of that overhead including the fuel to drive to the store would still clock in under cloud’s profit margin depending on what you’re doing.

You’re of course entitled to develop your cloud opinion from that experience. That’s like finding a stain in a new car and swearing off internal combustion as a useful technology, though, without any awareness of how often new cars are defective.


Many hardware problems do not surface at burn-in. Even at Google, the infamous "Platform A" from the paper "DRAM Errors in the Wild" was in large-scale production before they realized it was garbage.


Filings from the chassis stamper, which yours certainly were given the combination of circumstances and vendor, are present when the machine is installed. If you’re buying racks, your integrator inspects them. If you’re buying U, you do. It’s a five minute job to catch your thank-God-my-career-was-short story before the machine is even energized, which I know because I’ve caught the same thing from the same vendor twice. (It’s common; notice several comments point to it.) Why do you think QC benches have magnifiers and loupes? It’s a capital expenditure and an asset, so of course it’s rigorously inspected before the company accepts it, right? That’s not strange, is it?

You can point at Google and speak in the abstract, but it doesn't address the point being made, nor the fact that your rationale for your extreme position on cloud isn't as firm as you thought it was. Is Dropbox the only time you've worked with hardware? I'm genuinely asking, because manufacturing defects can top 5% of incoming kit depending on who you're dealing with. Google knew that when they built Platform A. The lie of cloud is that dismissing those problems is worth the margin (it ain't; you send it back, make them refire the omelette, and eat the toast you baked into your capacity plan while you wait).


Are you saying you just buy some servers, unpack them, and throw them into production? Oh man, the lost art of the sysadmin. If your system is not stable (in testing), you for sure disassemble it or send it back. How much money have you lost playing around with your unstable database? Was it more than testing your servers for a few weeks would have cost? Do you buy/build software and throw it into production without testing?

You can test your stuff and still be profitable; Hetzner, AWS, etc. would make no money otherwise. You know they test their servers much more thoroughly (sometimes for weeks or months).


Did they pass typical memory/reliability tests and so on?


Maybe they survive it in the first days, but the flakes are 99% from the fans/bearings; that's why you test servers at max load for at least 1 week and HDDs for 2-4 weeks (rough sketch of such a burn-in below).

But I don't think they even ran an initial load/stress test.

Unpack it, throw it into the rack, no checking of internal plugs, just nothing. Pretty sure about that.
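
For what it's worth, here is a minimal sketch of that kind of burn-in, assuming a Linux box with stress-ng and smartctl installed; the durations and drive paths are illustrative placeholders, not a prescription.

    #!/usr/bin/env python3
    # Burn-in sketch: hammer CPU and RAM with stress-ng for a week, then
    # start extended SMART self-tests on each drive. Tool names are real;
    # durations and device paths are placeholders.
    import subprocess

    DISKS = ["/dev/sda", "/dev/sdb"]  # hypothetical drive list

    def run(cmd):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raise if the tool reports failure

    # Load all CPUs and a big chunk of RAM for 7 days (168h).
    run(["stress-ng", "--cpu", "0", "--vm", "2", "--vm-bytes", "80%",
         "--timeout", "168h", "--metrics-brief"])

    # Kick off an extended SMART self-test on every drive (it runs in the
    # background on the drive itself), then print current overall health.
    # Check the finished self-test results later with `smartctl -l selftest`.
    for disk in DISKS:
        run(["smartctl", "-t", "long", disk])
    for disk in DISKS:
        run(["smartctl", "-H", disk])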


Metal chips are squarely in the long tail of failure modes that you can't really anticipate (but of course it's really easy to be smug about in hindsight). It is also extremely unlikely to be the bearings; most likely these are from chassis frame assemblies that weren't cleaned up properly.


I had some metal dust once and it was from bearings, but OP said flakes and then microscopic particles. Particles = bearings, flakes = chassis or even stickers. But anyway, if only because of transport, you don't throw a server into production without testing and inspection.

I am being smug about not testing your hardware the way you test your software. Shitty testing is shitty testing; that goes for software, hardware, firmware, and everything in between. Even for your diesel generator ;-)


I heard tale of a banking centre that had a diesel generator installed by a local company.

Load and simulated power failure tests all passed.

Then some time later there was a total power cut and that's when they realised the generator had an electric start wired to the mains supply.


And there is also the true story where "someone" forgot to refill the tank after 5 years of regular monthly tests, and then the real thing happened.

> had an electric start wired to the mains supply.

But that's a good one, humans being humans...but it worked every time before today ;))


Wait, are you saying that an org needs expertise to QC all of the hardware they procure? How expensive is that? How easy is it to hire that type of QC?

Do you see how these costs all start to add up?


Well, are you saying that an org needs expertise to inspect faulty cars, like, by calling a mechanic?

Is that, like, too much these days for companies that own fleets of cars? Is opening a server harder than checking what's wrong with a car? Like, a cable comes loose and that's game over?


If I procure a fleet of cars I expect none of them to be faulty...how about you?


>I expect none of them to be faulty

So you don't even test the cars, you just expect that the tire pressure is correct and the tank is full?

Expecting that something "just" works is exactly why pilots have checklists.

Expectations are the main source of disappointment. You would never do that with software, right?


The point, which you seem so dedicated to avoiding, is that "in the cloud" these steps are not my problem. Inspecting a literal shipload of computers for subtle defects is a pain in the ass. Amazon does it for me. When I get on an airplane I do not personally have to run the checklists. The airline does it for me.


>The point, which you seem so dedicated to avoiding

Not true; the point was you pay for it (cloud), or you do it yourself (but then do it right, and not like an amateur who builds his first gaming PC).

And if you do it yourself, you can still be very competitive vs. the cloud.


> (but then do it right, and not like an amateur who builds his first gaming PC).

Again, still avoiding the point, but oddly enough proving the point. You assume everyone isn't an amateur and knows how to build and maintain server hardware. Furthermore, because the market doesn't have enough talent to support all of the companies that exist, consolidating this to a few vendors who do have the expertise is what makes sense (economies of scale) and is what the market already decided.


>Again, still avoiding the point, but oddly enough proving the point.

Please read, that was my comment:

>>Not true the point was you pay for it (cloud), or you do it yourself

>You assume everyone isn't an amateur and knows how to build and maintain server hardware.

Yes, that I assume, correct. Otherwise I would not call it "maintaining". Is an amateur maintaining your car? Your software? If you have only amateurs handling your hardware, it's probably better to pay a cloud provider or an integrator to do that.


> you would never do that with software right?

Hilarious you used this as an analogy since software development shops are notorious for cutting corners when it comes to QA.


And that's why you have to test the software before production, right? ...Hilarious indeed.


> you would never do that with software right?

You facetiously implied that every company fully tests software before it gets to production. Oh boy, do I have news for you...

Note the word "fully": the variation in what gets tested is so broad, I don't even know where to start explaining it to you.


I never wrote "fully", but you do test your software (I hope). You're just trying to justify a bad work ethic.

>Oh boy, do I have news for you...

Nah, it's OK; I'm just happy that I have colleagues with a much better mindset and understanding of risk management.

And I'll stop here, since you're trying to change what I really wrote.



