
It's worth pointing out that every cloud is the same when it comes to capacity and capacity risk. They all put a lot of time and effort into figuring out the optimal amount of capacity to order, based on the track record of both customer demand and supply chain fulfillment.

Too much capacity is money spent with no return: up-front capex, ongoing opex, physical space in facilities, etc.

At cloud scale (averaged out over all the customers), demand tends to follow pretty stable and predictable patterns, and the customers who actually tend to put capacity at risk (the large ones) have contracts where they'll give the providers plenty of heads-up.

What has been very problematic over the past few years is the supply chains. Intel's issues for a few years in getting CPUs out really hurt; all of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.

The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even predictable which supply chain is going to be affected: some are running far smoother than expected and capacity lands far faster than you'd plan for, while others are completely messed up, and then next month it's all flipped around. They're being paranoid, assuming the worst, and still not getting it right.

This is an area where buying physical hardware directly doesn't provide any particular advantage; the hardware vendors' supply chains are just as messed up.

The best thing you can do is be as hardware agnostic as is technically possible, so you can use whatever is available... which sucks.
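For example (a minimal sketch using boto3 against AWS; the launch template name and the instance types here are made up), an Auto Scaling group with a mixed instances policy lets the group use whichever of several acceptable types actually has capacity:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Hypothetical worker pool: give the group several acceptable instance
    # types (and several subnets/AZs) so it can land capacity wherever it
    # happens to be available.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="worker-pool",
        MinSize=0,
        MaxSize=50,
        VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "worker-template",
                    "Version": "$Latest",
                },
                "Overrides": [
                    {"InstanceType": "m5.2xlarge"},
                    {"InstanceType": "m5a.2xlarge"},
                    {"InstanceType": "m6i.2xlarge"},
                ],
            },
        },
    )

The same idea applies higher up the stack: the fewer assumptions your software makes about the exact CPU, GPU, or instance family underneath it, the more of the provider's inventory you can actually use.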




In my experience there are differences between clouds, so while they all have the same basic problem, in practice some handle it better than others. I've never had issues getting GPUs on AWS, but GCP constantly has issues with GPU/TPU capacity.


Is this region dependent? In us-east I can’t get them to approve a quota for GPU instance families (G,P) for anything more than 4 CPUs. At one point they rejected my request citing “unprecedented demand”. Of course this is small time, just my personal account.

It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.


We've been having the same problem getting GPU instances in us-east. Multiple week-long delays to escalate and talk to yet another person up the chain who can actually make a decision. It's a mess.


There probably are different occurrence rates. We had to modify how our test suite provisions instances, since we used to regularly run into instance availability constraints on EC2 during the holidays.


I’ve occasionally seen some of the internal AWS capacity management dashboards, and they can frequently be operating very close to 100% on some resource types.


I worked on a project about a year ago where we would have a colleague in a different time zone start instances with 4 GPUs, because they were almost always unavailable during regular working hours for us-east.


It may be a risk borne by every cloud provider, but why does this only really happen to Microsoft among large providers?

As far as chip shortages go, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.

Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.


There are all sorts of examples of AWS failing to provide capacity too. Just do a search for "aws InsufficientInstanceCapacity" or similar. I remember Fortnite talking about capacity limits in relation to an incident, but I'm struggling to find the post-mortem I saw it in.
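For what it's worth, that error is straightforward to handle if you've built in alternatives. A rough boto3 sketch (the instance types, AZs, and AMI are placeholders) that falls back to another pool when one is out of capacity:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    # Hypothetical fallback list: try each (instance type, AZ) pair in turn.
    CANDIDATES = [
        ("p3.2xlarge", "us-east-1a"),
        ("p3.2xlarge", "us-east-1c"),
        ("g4dn.xlarge", "us-east-1a"),
    ]

    def launch_one(ami_id):
        for instance_type, zone in CANDIDATES:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,
                    InstanceType=instance_type,
                    Placement={"AvailabilityZone": zone},
                    MinCount=1,
                    MaxCount=1,
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                    continue  # that pool is out of capacity, try the next one
                raise
        raise RuntimeError("no capacity in any candidate pool")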

Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP, etc. were in the same position and just not really talking about it. From my time at AWS, there were other occasions when services with specialised hardware came really, really close to running out of capacity and had to scramble around with major internal "fire drills" against services to recoup capacity.

Most people won't run into these issues, since the clouds all tend to be good at this, but they still happen.

There are also advantages to economy of scale and brand recognition: the more customers you have, the more the capacity trends smooth out and the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.
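The smoothing really is just the law of large numbers. A toy simulation (all numbers made up) shows aggregate demand getting relatively steadier as independent customers are added:

    import numpy as np

    rng = np.random.default_rng(0)

    # Each customer's daily demand is noisy; the aggregate is much smoother.
    for n_customers in (10, 1_000, 100_000):
        demand = rng.poisson(lam=20, size=(365, n_customers)).sum(axis=1)
        cv = demand.std() / demand.mean()  # relative variability
        print(f"{n_customers:>7} customers: coefficient of variation {cv:.4f}")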


It’s certainly true I run into these things with AWS as well, but it’s generally limited to a specific instance type/availability zone combination. I’ve never had all instance types unavailable.

If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.


AWS has capacity issues, but you can generally mitigate them. Azure, however, will just lock you out of a solution completely and tell you to switch regions as if that were some trivial thing.


They have a lot of technical debt. They have like 6 different clouds (at least 4 gov clouds alone) and SLA commitments to things like O365 that silo their infrastructure.

MS also makes all sorts of crazy deals and commitments, and I wouldn't be surprised if being co-located with a strategic customer leads to local shortages of resources.


AWS has at least 3 publicly-discussed 'clouds' (or partitions, as they're called at AWS). There may or may not be other partitions that cannot be discussed publicly.


There's a pretty clean demarc between the AWS clouds. With Microsoft, because they have O365 and Azure AD dependencies sprinkled everywhere with varying features, it's a real mess. So you can do government contract work with devices managed by Windows Autopilot & Intune in a commercial cloud, have email in a Gov Community Cloud, and deliver apps in a US Gov cloud, all with different identities etc.


> As far as chip shortages, it probably helps that Amazon makes its own chips.

IDK what chips you are talking about; all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own, they are only making the ARM instances.


AWS has three processors: Graviton, Inferentia, and Trainium. They're made in-house.

https://aws.amazon.com/silicon-innovation/


And none of the above are x86. Even if they're making their own silicon, it is for specialized use (ML) and not general server provisioning.


Amazon's own chips are ARM. ARM requires somewhat specialized builds of software that are likely different than development instances, CI/CD, and/or local dev machines. It's not insurmountable but does certainly complicate usage.
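The kind of thing that creeps in everywhere is per-architecture artifact selection. A hypothetical sketch (the artifact names are invented) of what build and deploy scripts end up doing:

    import platform

    # Hypothetical mapping of machine architecture to a separately built
    # artifact: x86 and ARM hosts can't share the same binaries/images.
    ARTIFACTS = {
        "x86_64": "myapp-linux-amd64.tar.gz",
        "aarch64": "myapp-linux-arm64.tar.gz",
        "arm64": "myapp-linux-arm64.tar.gz",  # Apple silicon Macs report arm64
    }

    def artifact_for_this_host():
        machine = platform.machine()
        if machine not in ARTIFACTS:
            raise RuntimeError(f"no build published for architecture {machine!r}")
        return ARTIFACTS[machine]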


Your local dev machines might be Macs though, in which case it might be easier for you to go with ARM servers than x86.


They might be. My local dev machine is a Mac. I've found Intel or Intel+ARM container images, but never an ARM-only one. Again, not insurmountable, but certainly more resistance than the straight Intel route.


> This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.

Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.


All cloud providers are NOT equal here. Amazon over-provisions and sells the excess capacity as spot instances.


So does Google, so does Azure, etc.: https://cloud.google.com/spot-vms, https://azure.microsoft.com/en-us/products/virtual-machines/...

Spot instances exist just to turn over-provisioning into something other than a complete loss. You're at least making some money from your mistake.

edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.
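From the customer side, that excess shows up as interruptible capacity you ask for explicitly. A minimal boto3 sketch of requesting it on AWS (the AMI ID and instance type are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Spot capacity: heavily discounted, but the provider can reclaim it
    # (AWS currently gives a two-minute interruption notice).
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    )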


Cloud providers aren't guessing at demand to plan capacity; they're literally building new data centers and then wheeling new racks into them as fast as they physically can (short-term decisions are more likely made at the other end, e.g. when to retire old systems, not add new ones).

AWS was born out of the fact that Amazon's own compute needs are inherently variable, so to meet peak demand they had to "over-provision" compared to average demand--this in turn meant they had a lot of excess compute power most of the time. At the point when Amazon was still a dominant consumer of AWS, spot instances were actually a deliberate convenience to Amazon, since it meant AWS could monetize resources while still ensuring Amazon could claim them instantly when needed (later they added a two-minute warning, but early on they could literally disappear at any moment, and regularly did).


You're talking to someone who has spent the last decade working for major cloud providers, including AWS, on both the infrastructure and services sides of things, including work around data feeds for the capacity management teams. I have more than a passing familiarity with how things actually work at a cloud provider.

They are constantly guessing at capacity. Short, medium, and long term models with forecasting galore, all under constant recalculation based on customer actions (they literally take live feeds of creation/termination actions), and yes, they also take into account hardware failure and repair rates. Consolidating racks of equipment is a pain in the neck and tends to be avoided unless you can safely live-migrate away all the instances.

They all build up various models, using all sorts of forecasting techniques. The longer range forecasts are involved in data center provisioning, along with other business analysis, market research, legal analysis etc. that helps define where future regions should be.
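Just to give a flavour of the short-term side (a toy illustration only, nothing like the real models), even simple exponential smoothing over a usage series is a forecast of sorts:

    # Toy capacity forecast: single exponential smoothing over weekly demand.
    # Real capacity-planning models are vastly more involved than this.
    def smoothed_forecast(history, alpha=0.3):
        level = history[0]
        for observed in history[1:]:
            level = alpha * observed + (1 - alpha) * level
        return level  # forecast for the next period

    weekly_cores_requested = [1200, 1350, 1280, 1500, 1620, 1580]  # made-up numbers
    print(smoothed_forecast(weekly_cores_requested))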

It's still a guess. They can't tell what the actual demand will be, and they can't tell what is going to happen with the supply chain (supply chain issues are the biggest nightmare for capacity planning teams). Sometimes they get it wrong.

The capacity management teams spend a lot of time and expertise to keep the company just sufficiently ahead of demand. It's a crucial part of keeping costs under control.


It's logistics, no more and no less. Logistics has been a thing forever (satisfying a resource requirement). My old man (is not a dustman) but he was Commander Supply for quite a lot of people. At one job, he and his staff would worry about everything from Austrian plain chocolate covered mint-centred frogs (I'm not joking) to Gurkha rice-and-not-much-else (some very concentrated protein etc.) water-proofed combat rations. This was in Cyprus in the '80s. Logistics on the green line in Cyprus is probably still as mad now due to the number of countries in the UN.

Anyway, capacity planning is very well understood in general but of course the devil is in the details.

At the moment the IT supply chain is pretty spotty, and that affects everyone from my little IT firm up to the big boys.

When you buy Cisco + HPE + Dell or whatevs, you go to your reseller (me). I go to my distributor, they suck hardware out of Dell etc. and take their cut, and I install the gear and take my cut. Sometimes a disty thinks they can do the reseller job too; the thinking is that they can roll up two lots of margin and shave a bit. That's fine if you can actually do logistics and the "teeth arm" job too.

Clouds think they can go even further and sometimes they can and sometimes not. Now we have a sodding complicated resource on offer with a supply chain that is a bit random.

The whole hyperscale cloud premise is based on infinite availability of raw resources and that is complete bollocks. You can't hyperscale if you can't source stuff indefinitely.

Those Austrian mint-filled choccy frogs became a thing for a while. I have no idea of the exact numbers, but presumably Austria supplied quite a lot of them for the UN forces and families in Cyprus in the '80s - they became a bargaining chip for a while. They came in a cardboard package with a lid coloured light blue with outlines of frogs, and I think the main box was dark brown or black.


So does Azure.


Never happened to me in AWS.

Wasn't the whole point of "the cloud" that these things shouldn't happen?



