That seems to be a common take on HN - cloud is too expensive.
I'm curious whether the folks claiming that have any data center ops experience.
Because, personally, I'd rather retire than deal with Dell, HP, Cisco, fibers, cooling issues, physical security, hardware failing... And that's just the hardware. Then you still need to pay VMware for a decent virtualization platform, monitoring tools, etc. Seriously, no amount of money would make me work in a DC again.
I believe companies selling bare metal as a service are a happy compromise of cost and convenience, though.
ML workloads definitely cost a lot of money. Even for a preemptible VM, A100 GPUs cost $0.88/hr/GPU. That's over $630 a month for a single GPU, and that's only the 40GB model. Want a dedicated 8-GPU machine in the cloud to do training with? That'll run you around 16 grand a month. Do that for 2 years and you may as well have bought the device. Want to do 16/24/40-GPU training? Good luck getting dedicated cloud machines with networking fast enough between them that MPI works correctly, and prepare to open your wallet.
Also, that's just compute. What about data? Sure, the cloud takes your data in cheaply, but they also charge you for egress of that data. Yes, you should have your data in more than one location, but if you depend on just the cloud then you need it in different AZs, which costs even more money to keep in sync and available for training runs.
I think for simple workloads and renting compute for a startup, cloud definitely makes sense. But the moment you try to do some serious compute for ML workloads, good luck and hope you have deep pockets.
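The 2-year break-even claim above is easy to sanity-check with a back-of-the-envelope script. All prices here are illustrative assumptions, not vendor quotes:

```python
# Back-of-the-envelope cloud-vs-buy break-even for a GPU training box.
# Every number below is an illustrative assumption, not a vendor quote.

def months_to_break_even(purchase_price: float, cloud_monthly: float) -> float:
    """Months of cloud rental after which buying outright would have been
    cheaper (ignoring power, space, staff, and depreciation on the owned box)."""
    return purchase_price / cloud_monthly

# Assume ~$2.75/hr/GPU for a dedicated A100 and a ~$150k purchase price
# for a comparable 8-GPU server.
cloud_monthly = 8 * 2.75 * 24 * 30   # ~= $15,840/month, in line with "16 grand"
purchase_price = 150_000

print(f"{months_to_break_even(purchase_price, cloud_monthly):.1f} months")  # ≈ 9.5
```

With these made-up numbers the break-even is well under a year, which is why the "do that for 2 years" framing above is, if anything, conservative.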
The other thing is Nvidia sells GPUs with similar performance at two very different prices: one price for data centers and a quite different price for gamers. If you do the job yourself you can often get away with using the much cheaper gamer-grade cards for AI work (unless you need a lot of VRAM), whereas providers such as AWS can't do that and are required by Nvidia to use the considerably more expensive cards. If your workload fits on a gamer-grade card, there's no contest on price between an on-prem system and the cloud.
That is a really good point, and the 3090s have a surprising amount of VRAM on them. For many smaller models this is sufficient. However, where I work without going into a lot of specifics, because of the size of the models, the amount of VRAM is crucial, as well as the infrastructure of the PCI lanes connected to it, the speed of the local storage, and the networking between both cards on the same node as well as between nodes.
The moment the model gets bigger than any one GPU's VRAM, the difficulty of training it goes up by orders of magnitude.
Re data, I think egress rates are going to start disappearing over the next few years.
The part that's always missing from these rent-vs-buy analyses on HN, for some reason, is the opex cost of operating your own hardware, which is going to be non-zero. Sure, it won't be quite as expensive (no profit margin), but it's not an order of magnitude cheaper. Additionally, most companies don't run the HW 24/7, and if they do, it's not the kind of operation they want to hire people to support. It's not just running it: you have to invest in and grow something that's not a core competency to get the economies of multiple teams loading up the HW.
If the next revolution in cloud comes in to cause companies to onsite the HW again, it’ll look like making it super easy to take spare compute and spare storage from existing companies and resell it on an open market in an easy way. Even still, I think the operational challenges of keeping all that up and running and being utilized at as close to 100% as possible and not focusing on your core business problem will be difficult because you won’t be able to compete with engineering companies that have a core competency in that space.
> The part that's always missing from these rent-vs-buy analyses on HN is the opex cost of operating your own hardware, which is going to be non-zero.
Effectively hiring, retaining, evaluating and rewarding competent staff is hard. Even at a big company the datacenter can be a really small world, which makes it hard for your best employees to grow. Things are especially hard when you don't have a tech brand to rely on for your recruiting, and the staff's expertise is far outside the company's core business, making it harder to evaluate who's good at anything.
> Re data, I think egress rates are going to start disappearing over the next few years.
I'm not sure why you think that. AWS hasn't budged on their egress pricing for a decade (except the recent free tier expansion), despite the underlying costs dropping dramatically. GCP and Azure have similar prices.
Fact is, egress pricing is a moat. Cloud providers want to incentivize bringing data in (ingress is always free) and incentivize using it (intra-DC networking is free), but disincentivize bringing it out. If your data is stuck in AWS, that means your computation is stuck in AWS too.
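To make the moat concrete, here's a rough sketch of what a one-time migration of a training dataset costs at list-style egress rates. The per-GB rate is an assumption in the ballpark of published prices, not an exact quote:

```python
# Rough egress cost for moving data out of a cloud provider.
# The per-GB rate is an illustrative assumption, not an exact quote.

def egress_cost_dollars(terabytes: float, dollars_per_gb: float = 0.09) -> float:
    return terabytes * 1024 * dollars_per_gb

# Moving a 100 TB dataset out once:
print(round(egress_cost_dollars(100)))  # → 9216
```

Nearly ten grand to move your own data once, which is exactly the switching cost the parent comment is describing.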
Disclosure: I work on Cloudflare on R2 so I’m a bit biased on this.
I think we're going to put real pressure on traditional object storage rates to come down, since Cloudflare's entire M.O. is running the network with zero egress. As we expand our cloud platform it seems inevitable that you will at least have a strong zero-egress choice, and if we do a good job Amazon et al. will inevitably be forced to get rid of egress. Matthew Prince laid out a strong case for why either scenario is good for us in a recent investor day presentation (either we cannibalize S3's business and R2 becomes a massive profit machine for us because they refuse to budge on egress, or Amazon drops egress, which is an even larger opportunity for us).
Products like Cache Reserve help you migrate your data out of AWS transparently from any service (not just S3) - you just pay the egress penalty once per file.
Anyway. I’m not saying it’s going to disappear tomorrow but I find it hard to believe it’ll last another ten years.
> totally ignoring the opex cost of operating your own hardware which is going to be non 0
Early in my career I worked at a company with a DC onsite. I remember the months-long project to spec, purchase, and migrate to a new, more powerful DB server. How much that cost in people-hours, I have no idea. I upgraded to a better DB a couple months ago by clicking a button...
Don't even get me started with ordering more SANs when we ran out of storage or the time a hurricane was coming and we had to prepare to fail over to another DC.
Cloudflare Bandwidth Alliance and R2. S3 felt some pressure just because of our pre launch announcement. It’ll be interesting to see how they adjust over the next couple of years.
They're actually probably paying more than the average listed prices, as mainframe users (a mainframe being basically an on-premise PaaS, where IBM rents you a high-performance, distributed, redundant hardware cluster on a pay-as-you-use basis) are often dependent on very high reliability, high uptime, and low latency.
The ability to scale up experiments is really nice in cloud. In my experience you need to be quite large before you’re using your own GPUs at a utilization percentage that saves money while still having capacity for large one off experiments.
There are a few different ways to run a data center, a subset of which are much less expensive than the cloud but require a level of competency that some organizations will never have. It can also be relatively pain-free when done well. Some workloads are inherently inefficient in the cloud because of the architecture.
Data center ops is ultimately a supply chain management problem, but most people don't treat it as such. That was my primary learning from doing data center ops at a few different companies. If you get the supply chain management right, and are technically competent, there can be a lot to recommend running your own data centers.
From a company's perspective, you pay those costs no matter what.
If the AC breaks at 3am it needs to be fixed. It doesn't matter if you have your own HVAC people on site 24x7, your own people on call to service it, a local HVAC service to come in, or you outsource the entire operation and have no idea how it's handled. In the end the important part of this story is that whatever you are doing with the AC continues to work. Different operations demand different levels of service (I doubt anyone keeps HVAC techs on staff 24x7, but if the AC is that critical, it's mandatory). The only case where the CEO is up at 3am is if the CEO is the owner of the local HVAC service company, not the CEO of the building with the problem.
Once you realize that to management the cost is outsourced no matter what, the only question is whether to do it with your own people and HR, or outsource it. There are pros and cons to both approaches, but for most companies it isn't their business, and so the only reason to do it in house is that they can't trust any company they hire.
The thing is, the cost for the HVAC 24x7x365 support for a datacenter will be roughly the same for a given location... but it makes a difference if it is you paying the whole bill (=you're self-hosting in your own datacenter), you are splitting the bill with a bunch of other customers indirectly (=you're self-hosting in a colo DC), or if you're splitting the bill with a shitload of other customers (=you're using some service on one of the big public cloud providers).
The downside of saving those costs is that you lose control with every step away: as soon as you go into a datacenter of any kind, you simply cannot call up an HVAC company and offer them 100k in hard cash to show up in the next 60 minutes and fix the issue. With a colo DC you can usually show up in person to see whether the HVAC, UPS, and other systems are appropriate to your needs, but with one of the big cloud providers you have to take their word that they're doing things correctly.
> so the only reason to do it in house is they can't trust any company they hire
Now I'm deeply confused. Any company you hire either has a profit margin (plus enough to fund an "oh shit" fund in case times turn bad) or won't stick around longer than a few years. In which case, why not just hire people directly and cut out the other company's profit margin? Assuming you hire similar people at the same rate, using your own existing and already-paid-for HR, how is that not cheaper?
You need to account for overhead. Nobody does their own HVAC in house, because you rarely need it, and you'd have to pay to train people on it despite rarely using them.
In some cases you can even get a discount. Utilities are a big customer of tree trimming; the companies doing that work can give them a great deal because the utility doesn't care that they take a week off after a storm to do high-profit-margin consumer trimming.
Lots of places have their own HVAC techs in house, if they have enough HVAC work to justify it. Even if it's not their core line of business. They will do whatever costs less, +/- some amount of subjective "hassle factor."
Especially when it's "line critical" to their business, or if the person can do other things as well.
Larger hotels often have dedicated staff for things like HVAC, etc, because the importance of getting things fixed quick if possible is worth the cost of having someone onsite/available.
And you see similar things with colleges, etc.; they often have a maintenance department that can be pretty large (though no doubt they've spun it off and brought it back in-house for the same "change is progress" reasons).
I have dealt with a large number of retail colo providers, wholesale data center providers and corporate owned data centers across the US over the last 20 years and all of them used contractors for HVAC and electrical. I'm not saying dedicated staff never happens but it is definitely not the norm.
Doesn't this logic apply to pretty much everything? Why hire external anything then? Why not do your own deliveries, hire your own trucks to transport goods etc?
There is a cost to taking on things that aren't part of your core business too.
All I can come up with is "Because economies of scale". I work for a transportation company, but we employ plumbers, carpenters, electricians, elevator repairmen, and many more that I'm not aware of, because we have enough locations / work to justify them. The pizza place has enough work to justify hiring a fleet of drivers, Amazon ships enough crap to justify having their own trucks (when they can't sucker another company into taking the unprofitable routes).
Similarly, Google doesn't ship enough stuff worldwide to justify drivers, insurance, trucks, jets, etc. - Fedex has the size and scale to make every package a couple cents cheaper, so it's just not worth it for Google.
The only other argument I can think of is the challenge of keeping every plate spinning, in good times and in bad. This is where your point of having a cost to take on something outside your core business comes in, but we seem to be in an era of mega-corporations - I'd expect lots of companies to snake tendrils into whatever will save them a fraction of a cent every time they have to do something.
Not really. Time to service depends on SLA and redundancy. If you have no redundancy your time to service must be less or equal than your SLA. If you have redundancy it can be longer.
I've got some experience with a big academic data center - >1 acre floor space, >10MW, ~$100M construction cost. I've also worked for commercial companies of various sizes.
If your compute installation is big enough that payroll is a small fraction of the operating cost, then it's way cheaper than cloud. (that payroll has to include people who actually know how to build and run a huge compute installation)
The problem is that people come in integer units, you need a bunch of them to cover a bunch of different areas of expertise, and the particular ones you need are expensive. If you've got $1M worth of computers, you're almost certainly better off scrapping them and going to cloud, although the folks you're currently paying to run them might disagree. If you have $100M+ worth of machines it's a whole different ballgame; I'm not sure where the exact crossover is.
Note - that's assuming a single data center, and that you're big enough to build your own data center instead of renting colo space. If you need your machines to be geographically dispersed, you'll need to be even bigger before it's cheaper than cloud, and I'm not sure whether you'll ever hit crossover if you're renting colo space.
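The payroll-as-a-fraction argument above can be sketched as a toy model. Every figure below is a made-up assumption; the point is the shape of the curve, not the exact numbers or crossover point:

```python
# Toy crossover model: a fixed ops payroll amortized over the hardware fleet.
# All numbers are made-up assumptions for illustration only.

def on_prem_annual_cost_per_hw_dollar(fleet_value: float,
                                      annual_payroll: float = 2_000_000,
                                      hw_amortization: float = 0.33) -> float:
    """Annual cost per dollar of hardware owned: ~3-year amortization plus
    a fixed team of DC/ops staff spread across the whole fleet."""
    return hw_amortization + annual_payroll / fleet_value

CLOUD_EQUIVALENT = 1.0  # assume cloud charges ~3x the amortized hardware cost

for fleet in (1e6, 10e6, 100e6):
    cost = on_prem_annual_cost_per_hw_dollar(fleet)
    verdict = "cloud wins" if cost > CLOUD_EQUIVALENT else "on-prem wins"
    print(f"${fleet/1e6:>5.0f}M fleet: {cost:.2f}/hw-dollar/yr -> {verdict}")
```

With these assumptions the $1M fleet loses badly to cloud while the $100M fleet wins easily, with the crossover somewhere in between, which matches the intuition above.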
1000% this. HN loves to talk about Dropbox. I spent most of my (short, praise God) career at Dropbox diagnosing a fleet of dodgy database servers we bought from HPE. Turns out they were full of little flecks of metal inside - thousands of them, iron filings everywhere. You think that kind of thing happens when you are an AWS customer?
If you are sophisticated enough to engage an ODM, build your own facilities, and put hvac and electricians on 24-hour payroll, go on-prem. Otherwise, cloud all the way.
That's not quite where I would draw the line, I don't think. I used to work for an ISP and we were kind of split between AWS and on-prem. Obviously, things like terminating our customers' fiber feeds had to be on-prem, so there was no way to not have a data center (fortunately in the same building as our office). Moving our website to some server in there wouldn't have been much of a stretch to me, at the end of the day, it's just a backend for cloudflare anyway.
Like most startups, our management of the data center was pretty scrappy. Our CEO liked that kind of stuff, and we had a couple of network engineers that could be on call to fix overnight issues. It definitely wasn't a burden at the 50 employees size of company (and that includes field techs that actually installed fiber, dragged cable under the street, etc.)
We actually had some Linux servers in the datacenter. I don't know why, to be completely honest.
So overall my thought is: maybe use the cloud for your 1-person startup, but sometimes you just need a datacenter, and it's not really rocket science. You're going to have downtime while someone drives to the datacenter. You're going to have downtime when us-east-1 explodes, too. To me, it's a wash.
I mean, you did want to manage bare metal servers, right?
AWS almost certainly gets batches of bad hardware too. And if your services are running on the bad hardware, you can't take a peek inside and find the iron filings. For servers this is probably not too bad; there have been articles about dealing with underperforming EC2 VMs going back a long time, and if you experienced that, you'd find a way around it. AWS has enough capacity that you can probably get VMs running on a different batch of hardware somehow. With owned hardware, if your first order of important database servers is all dodgy, that's a pickle; HPE probably has quick support, once you realize it's their hardware.
If your cloud provider's network is dodgy though, you get to diagnose that as a blackbox which is lots of fun. Would have loved to have access to router statistics.
There's a lot of stuff in between AWS and an on-prem/owned datacenter, too.
> If you are sophisticated enough to engage an ODM, build your own facilities, and put hvac and electricians on 24-hour payroll, go on-prem. Otherwise, cloud all the way.
I imagine the entire sentiment of the comments is because FedEx is one that really should be sophisticated enough.
There is a smooth curve between cloud and dedicated DCs, which has various levels of managed servers, co-location, and managed DCs. (A managed DC can be a secure room in a DC "complex" that shares all the heavy infrastructure of DCs.)
Primarily, the FedEx managers are committing the company long-term to Oracle/Microsoft platforms. Probably mostly to benefit their own careers.
Outsourcing hosting and management of DCs would have been something different, and probably healthier for FedEx and the industry.
> You think that kind of thing happens when you are an AWS customer?
You bet it does! But as the AWS customer you'd never notice because some poor ops dude in AWS-land gets to call up the vendor and bitch at them instead of you. It ain't your problem!
Are you saying that part of the expected savings from going on-prem is that you will have to disassemble equipment bought from major OEMs and examine it for microscopic metal dust?
That doesn't sound like it will save much money, honestly.
They're saying it's a surprise to hear that Dropbox doesn't know what QC and order acceptance mean. And it is; I agree. That you spent the time investigating it, implying those servers were in production, is a shibboleth to those of us who know what we're doing when designing hardware usage that Dropbox doesn't. It is, however, your self-sourced report, and we don't have an idea of scale, so maybe they do and you were just unlucky.
And no, operators don’t disassemble to perform QC. And no, I could hire an entire division of people buying servers at Best Buy, and disassembling them, and stress testing them, and all of that overhead including the fuel to drive to the store would still clock in under cloud’s profit margin depending on what you’re doing.
You’re of course entitled to develop your cloud opinion from that experience. That’s like finding a stain in a new car and swearing off internal combustion as a useful technology, though, without any awareness of how often new cars are defective.
Many hardware problems do not surface at burn-in. Even at Google, the infamous "Platform A" from the paper "DRAM Errors in the Wild" was in large-scale production before they realized it was garbage.
Filings from the chassis stamper, which yours certainly were given the combination of circumstances and vendor, are present when the machine is installed. If you’re buying racks, your integrator inspects them. If you’re buying U, you do. It’s a five minute job to catch your thank-God-my-career-was-short story before the machine is even energized, which I know because I’ve caught the same thing from the same vendor twice. (It’s common; notice several comments point to it.) Why do you think QC benches have magnifiers and loupes? It’s a capital expenditure and an asset, so of course it’s rigorously inspected before the company accepts it, right? That’s not strange, is it?
You can point at Google and speak in abstracts but it doesn’t address the point being made, nor that your rationale for your extreme position on cloud isn’t as firm as you thought it was. Is Dropbox the only time you’ve worked with hardware? I’m genuinely asking because manufacturing defects can top 5% of incoming kit depending on who you’re dealing with. Google knew that when they built Platform A. The lie of cloud is that dismissing those problems is worth the margin (it ain’t; you send it back, make them refire the omelette, and eat the toast you baked into your capacity plan while you wait).
Are you saying you just buy some servers, unpack them, and throw them into production? Oh man, the lost art of the sysadmin. If your system is not stable (in testing) you for sure disassemble it, or send it back. How much money have you lost playing around with your unstable database? Was it more than testing your servers for a few weeks? Do you buy/build software and throw it into production without testing?
You can test your stuff and still be profitable - Hetzner, AWS, etc. would make no money otherwise. You know they test their servers much more (sometimes weeks/months).
Maybe in the first days they survive it, but the flakes are 99% from the fans/bearings; that's why you test servers at max load for at least 1 week and HDDs for 2-4 weeks.
But I don't think they did even an initial load/stress test.
Unpack it, throw it into the rack, no checking of internal plugs, just nothing... pretty sure about that.
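The kind of burn-in being described doesn't have to be a big process; it can be a short script. This is a sketch assuming `stress-ng` and `smartctl` are installed, with durations matching the week-plus soak suggested above:

```shell
#!/bin/sh
# Burn-in sketch for a newly racked server. Tool choice and durations are
# assumptions; adapt them to your hardware and risk tolerance.

# Load CPU, RAM, and disks at max for a week; stress-ng exits non-zero
# if any stressor detects a failure.
stress-ng --cpu 0 --vm 2 --vm-bytes 80% --hdd 2 --timeout 168h --metrics-brief

# Kick off long SMART self-tests on every drive.
for disk in /dev/sd?; do
    smartctl -t long "$disk"
done

# After the self-tests finish, look for early-failure indicators, e.g.:
#   smartctl -a /dev/sda | grep -Ei 'reallocated|pending|uncorrect'
```

A machine that survives this plus a visual inspection of the chassis interior has a much better chance of being production-worthy than one that went straight from the pallet into the rack.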
Metal chips are squarely in the long tail of failure modes that you can't really anticipate (but of course it's really easy to be smug about in hindsight). It is also extremely unlikely to be the bearings; most likely these are from a chassis frame assembly that wasn't cleaned up properly.
I had some metal dust once and it was from bearings, but OP said flakes and then microscopic particles. Particles = bearings, flakes = chassis or even stickers. But anyway, if only because of transport, you don't throw a server into production without testing and inspection.
I am being smug about not testing your hardware the way you would test software... shoddy testing is shoddy testing, and that goes for software, hardware, firmware, and everything in between. Even for your diesel generator ;-)
Wait, are you saying that an org needs expertise to QC all of the hardware it procures? How expensive is that? How easy is it to hire that type of QC?
Well, are you saying that an org needs expertise to inspect faulty cars, like, by calling a mechanic?
Is that too much these days for companies that own fleets of cars? Is opening a server harder than checking what's wrong with a car? Like, a cable comes loose and that's game over?
The point, which you seem so dedicated to avoiding, is that "in the cloud" these steps are not my problem. Inspecting a literal shipload of computers for subtle defects is a pain in the ass. Amazon does it for me. When I get on an airplane I do not personally have to run the checklists. The airline does it for me.
> (but then do it right, and not like a amateur who build's his first "gaming-pc").
Again, still avoiding the point, but oddly enough proving the point. You assume everyone isn't an amateur and knows how to build and maintain server hardware. Furthermore, because the market doesn't have enough talent to support all of the companies that exist, consolidating this to a few vendors who do have the expertise is what makes sense (economies of scale) and is what the market already decided.
>Again, still avoiding the point, but oddly enough proving the point.
Please read, that was my comment:
>>Not true the point was you pay for it (cloud), or you do it yourself
>You assume everyone isn't an amateur and knows how to build and maintain server hardware.
Yes, that I assume, correct. Otherwise I would not call it "maintaining". Is an amateur maintaining your car? Your software? If you have just amateurs handling your hardware, it's probably better to pay a cloud provider or an integrator to do that.
>"I believe companies selling bare metal as a service are a happy compromise of cost and convenience, though."
This is what I do. I rent bare metal from Hetzner and OVH. I also have some hosting hardware right at my place. It saves me a ton of money, and no, I do not spend any meaningful time administering it. It's all done by a couple of shell scripts. I can re-create a fresh service from backup on a clean rented machine in no time.
As for cloud - if I need to run some simulation once a month on some bazillion-core computer, then sure, cloud makes much sense in that particular case. I am sure there are other cases where it can be cost-effective. But for the average business I believe cloud is a waste of resources and money.
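As a sketch of what "a couple of shell scripts" can look like for this workflow (the host, backup path, and `setup.sh` are placeholders I've assumed, not the commenter's actual setup):

```shell
#!/bin/sh
# Re-create a service on a freshly rented bare-metal machine from a backup.
# HOST, the backup path, and setup.sh are placeholder assumptions.
set -eu

HOST=${1:?usage: restore.sh <new-host>}   # e.g. a fresh Hetzner/OVH box
BACKUP=/backups/service-latest.tar.gz     # assumed local backup archive

# Copy the backup and a machine-setup script to the new host, then run it.
scp "$BACKUP" setup.sh "root@$HOST:/root/"
ssh "root@$HOST" "sh /root/setup.sh /root/$(basename "$BACKUP")"

echo "Service restored on $HOST"
```

The point isn't the specific commands; it's that when your state lives in one backup archive and your configuration in one idempotent setup script, a rented machine becomes disposable.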
If you don't enjoy it then what were you doing working at a datacenter?
I enjoyed server admin, "back in the day", when your servers were pets and not cattle. But of course we have to make tech just as expendable as our workers, business school demands it! What if your pet server gets hit by a digital bus?!
Pet server classes are a much nicer concept anyway. I never liked instance-based personalization. Creating a machine, defining its class, and seeing it become a machine of its class is magical.
Of course, the newest idea is creating and destroying machines automatically... which outside of the cloud is quite pointless, but people want it anyway. I imagine seeing all that orchestration working must be even nicer than a machine autoconfiguring itself, but I have yet to see a place where it just works.
One argument for "cattle" servers on bare-metal is security. Being able to reset the machine to a clean, known-good state would clear any leftovers including potential malware. Having machines provisioned from images that include everything they need to run also means you don't even need to grant anyone root access (which you'd otherwise need to be able to audit so they don't leave anything malicious in there).
> Because, personally, I'd rather retire than deal with Dell, HP, Cisco, fibers, cooling issues, physical security, hardware failling...
This isn’t really a meaningful analysis though. It’s just “when you do things in house there are things you have to do”.
It’s like saying, “I would rather retire than clean the toilets, restock the toilet paper, etc” in a discussion about whether to outsource your bathroom maintenance. Doesn’t tell you what’s cost effective.
I'll be really curious how much change Oxide will bring to the status quo.
The promise is to be able to pay up front for a rack that will function as a highly capable VM, storage, and/or compute host, without any of the overhead that Dell, HP, and IBM bring. Just plug it in and start giving it workloads to do. All config can be done through the web-based management console or via the API, just like AWS.
All of that can be handled by your colocation facility. In most cases you won't ever reach the scale where building your own DC makes sense.
> Then you still need to pay VMWare for a decent virtualization platform
Should still be cheaper than paying the AWS premium including for bandwidth, not to mention that you don't always need virtualization. If all you need the bare-metal for is a handful of machines to do a very specific task that's too expensive on AWS then running directly on the metal is an option (and leave on AWS the stuff that does require the convenience of virtualization).
> I believe companies selling bare metal as a service are a happy compromise of cost and convenience, though.
Agreed. Most companies shouldn't ever deal with hardware directly - just rent it from a provider and let them do the maintenance.
I'm more or the less the sole decider for all tech decisions in my org (I don't have full budget authority, but I tell the budget holders what things cost). I'm 100% on board with cloud and even going further up the value chain to PaaS and SaaS. Cloud is expensive, but predictable. DevOps is very expensive and unpredictable. I can't even keep staff retained these days. Having a fixed dollar cost, even if it's high, saves not only the operations cost, but also the accounting cost and recruiting cost. And not just cost, but risk! Managed services are generally lower risk, and even if they aren't you can buy some indemnity that they'll cover some of the cost of failures.