Kobe Bryant basically commuted by helicopter, when it was convenient. It may have even taken off and landed at his house, but probably not exactly at all of his destinations. Is a “flying car” fundamentally that much different?
I think the difference is that a helicopter is extremely technical to fly, requiring complex and expensive training, while an eVTOL is supposed to be extremely simple to fly. Also, an eVTOL is in principle really cheap to make if you just consider the materials and construction costs, probably eventually much cheaper than a car.
I was curious, so I looked up how much the cheapest new helicopters go for, and they are cheaper than an eVTOL right now: the XE composite is $68k new, and things like that can be ~$25k used. I'm shocked that one can, in principle, own a working helicopter for less than the price of a 7-year-old Toyota Camry.
Nothing that flies in the air is that safe for its passengers or its surroundings - not without restrictions placed on it and having a maintenance schedule that most people would not be comfortable following.
Most components are safety critical, in the sense that their failure can lead to an outright crash or feed the pilot false information that leads him to make a fatal mistake. Most cars can be run relatively safely even with major mechanical issues, but something as 'simple' as a broken heater on a pitot tube (or any other component) can lead to a crash.
Then there's the issue of weather: altitude, temperature, humidity, and wind speed can create an environment that makes flying either impossible, unsafe, or extremely unpleasant. Imagine flying into an eddy that stalls the aircraft, making your ass drop a few feet.
Flying's a nice hobby, and I have great respect for people who can make a career out of it, but I'd definitely not get into one of these auto-piloted eVTOLs, nor should people who don't know what they're doing.
Edit: Also, unlike helicopters, which can autorotate, and fixed-wing aircraft, which can glide, eVTOLs just drop out of the sky.
I would expect eVTOLs to be capable of greater redundancy than a helicopter or fixed-wing aircraft, with no single point of failure that could make them drop from the sky. It would add little weight to have two or more independent electrical and motor systems, each capable of making a semi-controlled landing on its own but needing to coordinate to provide the full rated lift. Marketing materials claim the Blackfly has triple redundancy. I suppose one could have software logic glitches that cause all the modular systems to respond inappropriately to conditions in unison.
eVTOLs are going to be much more expensive to build than helicopters because they have far more stringent weight/strength requirements due to low battery energy density (relative to aviation fuel).
The idea is to have far cheaper operating costs. Electric motors are far more efficient than ICE, so you should have much cheaper energy costs. Electric motors are also simpler than ICE so you should have cheaper maintenance with less required downtime compared to helicopters.
Of course, most of this is still being tested and worked on. But we are getting closer to having these get certified (FAA just released the SFAR for eVTOL, the first one since the 1940s).
But I'm sure the running costs (aviation fuel), hangar costs, maintenance costs, and the cost of maintaining a pilot's license are far more expensive compared to driving a car.
I'm talking about buying the absolute cheapest possible used experimental helicopter, homemade by a stranger from a cheap kit. I would posit that if I were willing to take that risk, probably buying a model with known design and reliability issues to save money, I'd also just park it in the backyard, skip the maintenance, and run it on the cheapest pump gas I could find!
The ones I'm seeing in the 20k range are mostly the "Mini 500." Wikipedia suggests that maybe as few as 100 were built, with 16 fatalities thus far (or is it 9, as it says in a different part of the article?). But some people argue all of those involved "pilot error."
I suppose choosing to fly the absolute cheapest homemade experimental aircraft kit notorious for a high fatality rate is technically a type of pilot error?
Can you imagine thousands of flying cars flying low over urban areas?
Skill level needed for "driving" would increase by a lot, noise levels would be abysmal, security implications would be severe (be they intentional or mechanical in nature), privacy implications would result in nobody wanting to have windows.
This is all more-or-less true for drones as well, but their weight is comparable to a toddler, not to a polar bear. I firmly believe they'll never reach mass usage, but not because they're impossible to make.
I have a friend who used to (and still does) fly RC helicopters; that requires quite a bit of skill. Meanwhile, I think anybody can fly a DJI drone. I think that's what will transform "flying": when anybody, not just a highly skilled pilot, can "drive" a flying car (assuming it can be as safe as a normal car... which somehow I doubt).
Waymo is the best driver I’ve ridden with. Yes it has limited coverage. Maybe humans are intervening, but unless someone can prove that humans are intervening multiple times per ride, “self driving” is here, IMO, as of 2024.
In what sense is self-driving “here” if the economics alone prove that it can’t get “here”? It’s not just limited coverage, it’s practically non-existent coverage, both nationally and globally, with no evidence that the system can generalize, profitably, outside the limited areas it’s currently in.
It's covering significant areas of 3 major metros, and the core of one minor, with testing deployments in several other major metros. Considering the top 10 metros are >70% of the US ridehail market, that seems like a long way beyond "non-existent" coverage nationally.
You’re narrowing the market for self-driving to the ridehail market in the top 10 US metros. That’s kinda moving the goal posts, my friend, and completely ignoring the promises made by self-driving companies.
The promise has been that self-driving would replace driving in general because it’d be safer, more economical, etc. The promise has been that you’d be able to send your autonomous car from city to city without a driver present, possibly to pick up your child from school, and bring them back home.
In that sense, yes, Waymo is nonexistent. As the article author points out, lifetime miles for “self-driving” vehicles (70M) accounts for less than 1% of daily driving miles in the US (9B).
Even if we suspend that perspective, and look at the ride-hailing market, in 2018 Uber/Lyft accounted for ~1-2% of miles driven in the top 10 US metros. [1] So, Waymo is a tiny part of a tiny market in a single nation in the world.
Self-driving isn’t “here” in any meaningful sense and it won’t be in the near-term. If it were, we’d see Alphabet pouring much more of its war chest into Waymo to capture what stands to be a multi-trillion dollar market. But they’re not, so clearly they see the same risks that Brooks is highlighting.
There are, optimistically, significantly less than 10k Waymos operating today. There are a bit less than 300M registered vehicles in the US.
If the entire US automotive production were devoted solely to Waymos, it'd still take years to produce enough vehicles to drive any meaningful percentage of the daily road miles in the US.
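For a rough sense of scale, here's the back-of-the-envelope; the ~10M vehicles/year production figure is my own ballpark assumption:

    /* Back-of-the-envelope for the claim above. The ~300M registered
       vehicles is from the comment; ~10M vehicles/year of US production
       is an assumption, just to get an order of magnitude. */
    #include <stdio.h>

    int main(void) {
        double registered = 300e6;   /* registered vehicles in the US */
        double production = 10e6;    /* assumed annual US vehicle production */

        printf("10%% of the fleet: ~%.0f years at full production\n",
               0.10 * registered / production);
        printf("Whole fleet:      ~%.0f years at full production\n",
               registered / production);
        return 0;
    }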
I think that's a bit of a silly standard to set for hopefully obvious reasons.
> ..is a tiny part of a tiny market in a single nation in the world.
The calculator was a small device made in one tiny market in one nation in the world. Now we've all got a couple of hardware ones in our desk drawers and a couple of software ones on each smartphone.
If a self-driving car can perform 'well' (Your Definition May Vary - YDMV) in NY/Chicago/etc., then it can perform equally 'well' in London, Paris, Berlin, Brussels, etc. It's just that the EU has stricter rules/regulations while the US is more relaxed (thus innovation happens 'there' and not 'here' in the EU).
When 'you guys' (US) nail self-driving, it will only be a matter of time til we (EU) allow it to cross the pond. I see this as a hockey-stick graph. We are still on the eraser/blade phase.
if you had read the F-ing article, which you clearly did not, you would see that you are committing the sin of exponentiation: assuming that all tech advances exponentially because microprocessor development did (for a while).
Development of this technology appears to be logarithmic, not exponential.
He's committing the "sin" of monotonicity, not exponentiation. You could quibble about whether progress is currently exponential, but Waymo has started limited deployments in 2-3 cities in 2024 and wide deployments in at least SF (its second city after Phoenix). I don't think you can reasonably say its progress is logarithmic at this point - maybe linear or quadratic.
Speaking for one of those metro areas I'm familiar with: maybe in SF city limits specifically (where their share is still about half of Uber's), but that's 10% of the population of the Bay Area metro. I'm very much looking forward to the day when I can take a robocab from where I live near Google to the airport - preferably, much cheaper than today's absurd Uber rates - but today it's just not present in the lives of 95+% of Bay Area residents.
> preferably, much cheaper than today's absurd Uber rates
I just want to highlight that the only mechanism by which this eventually produces cheaper rates is by removing having to pay a human driver.
I’m not one to forestall technological progress, but there are a huge number of people already living on the margins who will lose one of their few remaining options for income as this expands. AI will inevitably create jobs, but it’s hard to see how it will—in the short term at least—do anything to help the enormous numbers of people who are going to be put out of work.
I’m not saying we should stop the inevitable forward march of technology. But at the same time it’s hard for me to “very much look forward to” the flip side of being able to take robocabs everywhere.
People living on the margins is fundamentally a social problem, and we all know how amenable those are to technical solutions.
Let's say AV development stops tomorrow though. Is continuing to grind workers down under the boot of the gig economy really a preferred solution here or just a way to avoid the difficult political discussion we need to have either way?
I'm not sure how I could have been more clear that I'm not suggesting we stop development on robotaxis or anything related to AI.
All I'm asking is that we take a moment to reflect on the people who won't be winners. Which is going to be a hell of a lot of people. And right now there is absolutely zero plan for what to do when these folks have one of the few remaining opportunities taken away from them.
As awful as the gig economy has been it's better than the "no economy" we're about to drive them to.
This is orthogonal. You're living in a society with no social safety net, one which leaves people with minimal options, and you're arguing for keeping at least those minimal options. Yes, that's better than nothing, but there are much better solutions.
The US is one of the richest countries in the world, with all that wealth going to a few people. "Give everyone else a few scraps too!" is better than having nothing, but redistributing the wealth is better.
But this is the society we live in now. We don’t live in one where we take care of those whose jobs have been displaced.
I wish we did. But we don’t. So it’s hard for me to feel quite as excited these days for the next thing that will make the world worse for so many people, even if it is a technological marvel.
Just between trucking and rideshare drivers we’re talking over 10 million people. Maybe this will be the straw that breaks the camel’s back and finally gets us to take better care of our neighbors.
Yeah, but it doesn't work to campaign on an online forum against taking rideshare jobs away from people on the one hand, and on the other to say "that's the society we live in now." If you're going to be defeatist, just accept that those jobs might go away. If not, campaign for wealth redistribution and social safety nets.
Public transit has a fundamentally local impact. It takes away some jobs but also provides a lot of jobs for a wide variety of skills and skill levels. It simultaneously provides an enormous number of benefits to nearby populations, including increased safety and reduced traffic.
Self-driving cars will be disruptive globally. So far they primarily drive employment in a small slice of the technology industry. Yes, there are manufacturing jobs involved, but those are overwhelmingly jobs that were already building human-operated vehicles. Self-driving cars will save many lives, but not as many as public transit does (proportionally per user). And it is blindingly obvious they will make traffic worse.
Waymo's current operational area in the bay runs from Sunnyvale to fisherman's wharf. I don't know how many people that is, but I'm pretty comfortable calling it a big chunk of the bay.
They don't run to SFO because SF hasn't approved them for airport service.
I just opened the Waymo app and its service certainly doesn't extend to Sunnyvale. I just recently had an experience where I got a Waymo to drive me to a Caltrain station so I can actually get to Sunnyvale.
The public area is SF to Daly City. The employee-only area runs down the rest of the peninsula. Both of them together are the operational area.
Waymo's app only shows the areas accessible to you. Different users can have different accessible areas, though in the Bay area it's currently just the two divisions I'm aware of.
Why would you count the employee-only area? For that categorization to exist, it must mean it's either unreliable for customers or too expensive because there are too many human drivers in the loop. Either way, it would not be considered an area served by self-driving, imo.
There are alternative possibilities, like "we don't have enough vehicles to serve this area appropriately", "we don't have the statistical power to ensure this area meets safety standards even though it looks fine", "there are missing features (like freeways) that would make public service uncompetitive in this area", or simply "the CPUC hasn't approved a fare area expansion".
It's an area they're operating legally, so it's part of their operational area. It's not part of their public service area, which I'd call that instead.
I wish! In Palo Alto the cars have been driving around for more than a decade and you still can't hail one. Lately I see them much less often than I used to, actually. I don't think occasional internal-only testing qualifies as "operational".
Where's the economic proof of impossibility? As far as I know Waymo has not published any official numbers, and any third party unit profitability analysis is going to be so sensitive to assumptions about e.g. exact depreciation schedules and utilization percentages that the error bars would inevitably be straddling both sides of the break-even line.
> with no evidence that the system can generalize, profitably, outside the limited areas it’s currently in
That argument doesn't seem horribly compelling given the regular expansions to new areas.
Analyzing Alphabet’s capital allocation decisions gives you all the evidence necessary.
It’s safe to assume that a company’s ownership makes the decisions they believe will maximize the value of the company. Therefore, we can look at Alphabet’s capital allocation decisions with respect to Waymo to see what they think about Waymo’s opportunity.
In the past five years, Alphabet has spent >$100B to buy back their stock and retained ~$100B in cash. In 2024, they issued their first dividend to investors and authorized up to $70B more in stock buybacks.
Over that same time period they’ve invested <$5B in Waymo, and committed to investing $5B more over the next few years (no timeline was given).
This tells us that Alphabet believes their money is better spent buying back their stock, paying back their investors, or sitting in the bank, when compared to investing more in Waymo.
Either they believe Waymo’s opportunity is too small (unlikely) to warrant further investment, or when adjusted for the remaining risk/uncertainty (research, technology, product, market, execution, etc) they feel the venture needs to be de-risked further before investing more.
Isn’t there a point of diminishing returns? Let’s assume they hand over $70B to Waymo today. Can Waymo even allocate that?
I view the bottlenecks as two things. Producing the vehicles and establishing new markets.
My understanding of the process with the vehicles is they acquire them then begin a lengthy process of retrofitting them. It seems the only way to improve (read: speed up) this process is to have a tightly integrated manufacturing partner. Does $70B buy that? I’m not sure.
Next, to establish new markets… you need to secure people and real estate. Money is essential but this isn’t a problem you can simply wave money at. You need to get boots on the ground, scout out locations meeting requirements, and begin the fuzzy process of hiring.
I think Alphabet will allocate money as the operation scales. If they can prove viability in a few more markets the levers to open faster production of vehicles will be pulled.
This is just a quirk of the modern stock-market capitalist system. Yes, stock buybacks are more lucrative than almost anything other than a blitz-scaling B2B SaaS. But for the good of society, I would prefer that Alphabet spent their money developing new technologies, not on stock buybacks / dividends. If they think every tech is a waste of money, then give it to charity, not stock buybacks. That said, Alphabet does develop new technologies regularly. Their track record before 2012 is stellar, their track record now is good (AlphaFold, Waymo, TensorFlow, TPU, etc.), and they are nowhere close to being the worst offender on stock buybacks (I'm looking at you, Apple), but we should move away from stock price over everything as a mentality and push companies to use their profits for the common good.
> Alphabet has to buy back their stock because of the massive amount of stock comp they award.
Wait, really? They're a publicly traded company; don't they just need to issue new stock (the opposite of buying it back) to employees, who can then choose to sell it in the public market?
That's a very hand wavy argument. How about starting here:
> Mario Herger: Waymo is using around four NVIDIA H100 GPUs at a unit price of $10,000 per vehicle to cover the necessary computing requirements. The five lidars, 29 cameras, 4 radars – adds another $40,000 - $50,000. This would put the cost of a current Waymo robotaxi at around $150,000
There are definitely some numbers out there that allow us to estimate within some standard deviations how unprofitable Waymo is
(That quote doesn't seem credible. It seems quite unlikely that Waymo would use H100s -- for one, they operate cars that predate the H100 release. And H100s sure as hell don't cost just $10k either.)
You're not even making a handwavy argument. Sure, it might sound like a lot of money, but in terms of unit profitability it could mean anything at all depending on the other parameters. What really matters is a) how long a period that investment is depreciated over; b) what utilization the car gets (or alternatively, how much revenue it generates); c) how much lower the operating costs are due to not needing to pay a driver.
Like, if the car is depreciated over 5 years, it's basically guaranteed to be unit profitable. While if it has to be depreciated over just a year, it probably isn't.
Do you know what those numbers actually are? I don't.
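To make the sensitivity concrete, here's a toy sketch. The $150k vehicle cost is the estimate quoted upthread; every other number is a made-up assumption for illustration:

    /* Toy unit-economics model: daily profit per vehicle as a function of
       the depreciation window. All inputs besides the quoted ~$150k vehicle
       cost are illustrative assumptions. */
    #include <stdio.h>

    static double daily_profit(double vehicle_cost, double dep_years,
                               double rides_per_day, double rev_per_ride,
                               double opex_per_day) {
        double dep_per_day = vehicle_cost / (dep_years * 365.0);
        return rides_per_day * rev_per_ride - opex_per_day - dep_per_day;
    }

    int main(void) {
        double cost  = 150000.0;  /* per-vehicle cost estimate quoted above */
        double rides = 20.0;      /* assumed rides per day (utilization) */
        double fare  = 15.0;      /* assumed revenue per ride */
        double opex  = 120.0;     /* assumed daily energy/cleaning/remote-ops cost */

        printf("5-year depreciation: $%+.0f/day\n",
               daily_profit(cost, 5.0, rides, fare, opex));
        printf("1-year depreciation: $%+.0f/day\n",
               daily_profit(cost, 1.0, rides, fare, opex));
        return 0;
    }

Same car, same utilization; the sign flips on the depreciation assumption alone, which is the point.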
It's here in the product/research sense, which is the hardest bar to cross. Making it cheaper takes time, but we have generally reduced the cost of everything by orders of magnitude once manufacturing ramps up, and I don't think self-driving hardware (sensors etc.) will be any different.
It’s not even here in the product/research sense. First, as the author points out, it’s better characterized as operator-assisted semi-autonomous driving in limited locations. That’s great but far from autonomous driving.
Secondly, if we throw a dart on a map: 1) what are the chances Waymo can deploy there, 2) how much money would they have to invest to deploy, and 3) how long would it take?
Waymo is nowhere near a turn-key system where they can setup in any city without investing in the infrastructure underlying Waymo’s system. See [1] which details the amount of manual work and coordination with local officials that Waymo has to do per city.
And that’s just to deploy an operator-assisted semi-autonomous vehicle in the US. EU, China, and India aren’t even on the roadmap yet. These locations will take many more billions worth of investment.
Not to mention Waymo hasn’t even addressed long-haul trucking, an industry ripe for automation that makes cold, calculated, rational business decisions based on economics. Waymo had a brief foray in the industry and then gave up. Because they haven’t solved autonomous driving yet and it’s not even on the horizon.
Whereas we can drop most humans in any of these locations and they’ll mostly figure it out within the week.
Far more than lowering the cost, there are fundamental technological problems that remain unsolved.
No, there is outright porn. They do this "trick" where they flash a nude photo for like 100ms on a 10s video and caption it with something like "pause it for the good part".
The sad thing is that I've reported these posts and they always say it doesn't violate their terms.
>The sad thing is that I've reported these posts and they always say it doesn't violate their terms.
I'll just quote an experiment conducted by my colleague, where they tracked the outright malicious (porn, malware or fraud) ads they reported from a regular user account: "from January to November 2024, we tested (...) 122 such malicious ads (...) in 106 cases (86.8%), the reports were closed with the status “We did not remove the ad”, in 10 cases the ad was removed, and in 6 cases, we did not receive any response". That's not very encouraging.
> The sad thing is that I've reported these posts and they always say it doesn't violate their terms.
Nothing I reported on FB was ever removed, even obvious spam (e.g. comments in different language than the rest of the thread, posted as a reply to every top-level comment in the thread). I think this message is most likely just generated automatically. And maybe, if hundred people report the same thing, someone will review it. Or it will be automatically deleted.
I started thinking today, with Nvidia seemingly just magically increasing performance every two years, that they eventually have to "Intel" themselves, right? That is, go ~10 years without any real architectural improvements until suddenly power and thermals don't scale anymore and you get six generations of turds that all perform essentially the same.
it's possible, but idk why you would expect that. just to pick an arbitrary example since Steve ran some recent tests, a 1080 Ti is more or less equal to a 4060 in raster performance, but needs more than double the power and much more die area to do it.
we do see power requirements creep up on the high end parts every generation, but that may be to maintain the desired SKU price points. there's clearly some major perf/watt improvement if you zoom out. idk how much is arch vs node, but they have plenty of room to dissipate more power over bigger dies if needed for the high end.
I can’t exactly compare ray tracing performance when it didn’t exist at that time. or is this a joke about rendering games no longer being the primary use case for an nvidia gpu?
Nvidia is a very innovative company. They reinvent solutions to problems while others are trying to match their old solutions. As long as they can keep doing that, they will keep improving performance. They are not solely reliant on process node shrinks for performance uplifts like Intel was.
>They are not solely reliant on process node shrinks for performance uplifts like Intel was.
People who keep giving Intel endless shit are probably very young and don't remember how innovative Intel was in the 90s and 00s. USB, PCI Express, Thunderbolt, etc. were all Intel inventions, plus involvement in WiFi and wireless telecom standards. They are guilty of anti-competitive practices and complacency in recent years, but their innovations weren't just node shrinks.
Those standards are plumbing to connect things to the CPU. The last major innovations that Intel had in the CPU itself were implementing CISC in RISC with programmable microcode in the Pentium and SMT in the Pentium 4. Everything else has been fairly incremental and they were reliant on their process node advantage to stay on top. There was Itanium too, but that effort was a disaster. It likely caused Intel to stop innovating and just rely on its now defunct process node advantage.
Intel’s strategy after it adopted EM64T (Intel’s NIH syndrome name for amd64) from AMD could be summarized as “increase realizable parallelism through more transistors and add more CISC instructions to do key work loads faster”. AVX512 was that strategy’s zenith and it was a disaster for them since they had to cut clock speeds when AVX-512 operations ran while AMD was able to implement them without any apparent loss in clock speed.
You might consider the more recent introduction of E cores to be an innovation, but that was a copy of ARM’s big.little concept. The motivation was not so much to save power as it was for ARM but to try to get more parallelism out of fewer transistors since their process advantage was gone and the AVX-512 fiasco had showed that they needed a new strategy to stay competitive. Unfortunately for Intel, it was not enough to keep them competitive.
Interestingly, leaks from Intel indicate that Intel had a new innovation in development called Royal Core, but Pat Gelsinger cancelled it last year before he “resigned”. The cancellation reportedly led to Intel’s Oregon design team resigning.
> AVX512 was that strategy’s zenith and it was a disaster for them since they had to cut clock speeds when AVX-512 operations ran while AMD was able to implement them without any apparent loss in clock speed.
AMD didn't have full AVX-512 support until Zen 5, so it's not exactly a fair comparison. Intel designs haven't suffered from that issue, AFAIU, for a couple of iterations already.
But I agree with you. I always thought, and still do, that Intel has a very strong CPU core design, but where AMD changed the name of the game IMHO is the LLC design. Hitting roughly half the LLC latency is insane. To hide that big a difference in latency, Intel has to pack larger L2+LLC cache sizes.
Since the LLC+CCX design scales so well, AMD is also able to pack ~50% more cores per die, something Intel can't achieve even with the latest Granite Rapids design.
These two things alone are big for data center workloads, so I really wonder how Intel is going to battle that.
AVX-512 is around a dozen different ISA extensions. AMD implemented the base AVX-512 and more with Zen 4. This was far more than Intel had implemented in skylake-X where their problems started. AMD added even more extensions with Zen 5, but they still do not have the full AVX-512 set of extensions implemented in a single CPU and neither does Intel. Intel never implemented every single AVX-512 extension in a single CPU:
It also took either 4 or 6 years for Intel to fix its downclocking issues, depending on whether you count Rocket Lake as fixing a problem that started in enterprise CPUs, or require Sapphire Rapids to have been released to consider the problem fixed:
The latter makes a big impact wrt available memory BW per core, at least for workloads whose data is readily available in L1 cache. Intel in these experiments crushes AMD by a large factor simply because their memory controller design is able to sustain 2x64B loads + 1x64B store in the same clock, e.g. 642 GB/s (Golden Cove) vs 334 GB/s (Zen 4). This is a big difference, and it is something Intel has had for ~10 years, whereas AMD only solved it with Zen 5, basically at the end of 2024.
The former limits the theoretical FLOPS/core, since a single AVX-512 FMA operation on Zen 4 is implemented as two AVX2 uops occupying both FMA slots per clock. This is also big, and again, it is something where Intel had a lead up until Zen 5.
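For reference, the per-clock arithmetic behind those figures (the clock speeds are approximate assumptions):

    /* Peak L1 load bandwidth is just loads/cycle x load width x clock.
       The clock speeds below are approximate assumptions. */
    #include <stdio.h>

    int main(void) {
        /* Golden Cove: 2 x 64 B (512-bit) loads per cycle at ~5.2 GHz */
        printf("Golden Cove theoretical: ~%.0f GB/s (642 GB/s measured)\n",
               2 * 64 * 5.2);

        /* Zen 4: working backwards from the measured number at ~5.7 GHz */
        printf("Zen 4 implied:           ~%.0f B/cycle (334 GB/s measured)\n",
               334.0 / 5.7);
        return 0;
    }

So Golden Cove sits near its 2x64 B/cycle limit, while Zen 4 comes out at roughly one 64 B load per cycle, consistent with the 512-bit ops being handled as two 256-bit halves.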
Wrt the downclocking issues, they had a substantial impact with the Skylake implementation, but with Ice Lake this was a solved issue, and that was in 2019. I'm cool with having ~97% of the max frequency budget available under heavy AVX-512 workloads.
OTOH AMD is also very thin with this sort of information and some experiments show that turbo boost clock frequency on zen4 lowers from one CCD to another CCD [1]. It seems like zen5 exhibits similar behavior [2].
So, although AMD has displayed continuous innovation for the past several years, this is only because they had a lot to improve. Their pre-Zen (2017) designs were basically crap and could not compete with Intel, which OTOH had a very strong CPU design for decades.
I think that the biggest difference in CPU core design really is in the memory controller - this is something Intel will need to find an answer to, since with Zen 5 AMD has matched all the Intel strengths it previously lacked.
System memory is not able to sustain such memory bandwidth so it seems like a moot point to me. Intel’s CPUs reportedly cannot sustain such memory bandwidth even when it is available:
Not sure I understood you. You think that AVX-512 workload and store-load BW are irrelevant because main system memory (RAM) cannot keep up with the speed of CPU caches?
I think the benefit of more AVX-512 stores and loads per cycle is limited because the CPU is bottlenecked internally, as shown in the slides from TACC I linked:
Your 642 GB/s figure should be for a single Golden Cove core, and it should only take 3 Golden Cove cores to saturate the 1.6 TB/sec HBM2e in Xeon Max, yet internal bottlenecks prevented 56 Golden Cove cores from reaching the 642 GB/s read bandwidth you predicted a single core could reach when measured. Peak read bandwidth was 590 GB/sec when all 56 cores were reading.
According to the slides, peak read bandwidth for a single Golden Cove core in the sapphire rapids CPU that they tested is theoretically 23.6GB/sec and was measured at 22GB/sec.
Chips and Cheese did read bandwidth measurements on a non-HBM2e version of sapphire rapids:
They do not give an exact figure for multithreaded L3 cache bandwidth, but looking at their chart, it is around what TACC measured for HBM2e. For single threaded reads, it is about 32 GB/sec from L3 cache, which is not much better than it was for reads from HBM2e and is presumably the effect of lower latencies for L3 cache. The Chips and Cheese chart also shows that Sapphire Rapids reaches around 450 GB/sec single threaded read bandwidth for L1 cache. That is also significantly below your 642 GB/sec prediction.
The 450 GB/sec bandwidth out of L1 cache is likely a side effect of the low latency L1 accesses, which is the real purpose of L1 cache. Reaching that level of bandwidth out of L1 cache is not likely to be very useful, since bandwidth limited operations will operate on far bigger amounts of memory than fit in cache, especially L1 cache. When L1 cache bandwidth does count, the speed boost will last a maximum of about 180ns, which is negligible.
What bandwidth CPU cores should be able to get based on loads/stores per clock and what bandwidth they actually get are rarely ever in agreement. The difference is often called the Von Neumann bottleneck.
> Your 642 GB/s figure should be for a single Golden Cove core
Correct.
> That is also significantly below your 642 GB/sec prediction.
Not exactly the prediction. It's an extract from one of the Chips and Cheese articles. In particular, the one that covers the architectural details of Golden Cove core and not Sapphire Rapids core. See https://chipsandcheese.com/p/popping-the-hood-on-golden-cove
From that article, their experiment shows that Golden Cove core was able to sustain 642 GB/s in L1 cache with AVX-512.
> They do not give an exact figure for multithreaded L3 cache bandwidth,
They quite literally do - it's in the graph in "Multi-threaded Bandwidth" section. 32-core Xeon Platinum 8480 instance was able to sustain 534 GB/s from L3 cache.
> The Chips and Cheese chart also shows that Sapphire Rapids reaches around 450 GB/sec single threaded read bandwidth for L1 cache.
If you look closely into my comment you're referring to you will see that I explicitly referred to Golden Cove core and not to the Sapphire Rapids core. I am not being pedantic here but they're actually different things.
And yes, Sapphire Rapids reaches 450 GB/s in L1 for AVX-512 workloads. But the SPR core is also clocked at ~3.8 GHz, which is much lower than the ~5.2 GHz the (client) Golden Cove core is clocked at. And this is where the difference of ~200 GB/s comes from.
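Quick check, normalizing both measurements by those (approximate) clocks:

    /* Normalizing the two measured L1 figures to bytes per cycle, using the
       approximate clocks from the discussion above. */
    #include <stdio.h>

    int main(void) {
        printf("Golden Cove (client): 642 GB/s / 5.2 GHz = ~%.0f B/cycle\n",
               642.0 / 5.2);
        printf("Sapphire Rapids:      450 GB/s / 3.8 GHz = ~%.0f B/cycle\n",
               450.0 / 3.8);
        /* both land in the same ballpark, near the core's 2 x 64 B/cycle limit */
        return 0;
    }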
> Reaching that level of bandwidth out of L1 cache is not likely to be very useful, since bandwidth limited operations will operate on far bigger amounts of memory than fit in cache, especially L1 cache
With that said, both Intel and AMD are limited by the system memory bandwidth and both are somewhere in the range of ~100ns per memory access. The actual BW value will depend on the number of cores per chip but the BW is roughly the same since it heavily depends on the DDR interface and speed.
Does that mean that both Intel and AMD are basically of the same compute capabilities for workloads that do not fit into CPU cache?
And AMD just spent 7 years of engineering effort to implement what now looks like a superior CPU cache design and vectorized (SIMD) execution capabilities, only for them to be applicable to the very few (mostly unimportant in the grand scheme of things) workloads that actually fit into the CPU cache?
I'm not sure I follow this reasoning but if true then AMD and Intel have nothing to compete against each other since by the logic of CPU caches being limited in applicability, their designs are equally good for the most $$$ workloads.
It is not that the entire working set has to fit within SRAM. Kernels that reuse portions of their inputs several times, such as matmul, can be compute bound and there AMD's AVX-512 shines.
The parent comment I am responding to is arguing that CPU caches are not that relevant because for bigger workloads the CPU is bottlenecked by system memory BW anyway. And thus, that AVX-512 is irrelevant because it can only provide a compute boost for the small fraction of the time when data fits in the L1 cache.
Your description of what I told you is nothing like what I wrote at all. Also, the guy here is telling you that AVX-512 shines on compute bound workloads, which is effectively what I have been saying. Try going back and rereading everything.
Sorry, that's exactly what you said and the reason why we are having this discussion in the first place. I am guilty of being too patient with trolls such as yourself. If you're not a troll, then you're clueless or detached from reality. You're just spitting a bunch of incoherent nonsense and moving goalposts when lacking an argument.
I am a well known OSS developer with hundreds of commits in OpenZFS and many commits in other projects like Gentoo and the Linux kernel. You keep misreading what I wrote and insist that I said something I did not. The issue is your lack of understanding, not mine.
I said that supporting 2 AVX-512 reads per cycle instead of 1 AVX-512 read per cycle does not actually matter very much for performance. You decided that means I said that AVX-512 does not matter. These are very different things.
If you try to use 2 AVX-512 reads per cycle for some workload (e.g. checksumming, GEMV, memcpy, etcetera), then you are going to be memory bandwidth bound such that the code will run no faster than if it did 1 AVX-512 read per cycle. I have written SIMD accelerated code for CPUs and the CPU being able to issue 2 SIMD reads per cycle would make zero difference for performance in all cases where I would want to use it. The only way 2 AVX-512 reads per cycle would be useful would be if system memory could keep up, but it cannot.
I agree server CPUs are underprovisioned for memBW. Each core's share is 2-4 GB/s, whereas each could easily drive 10 GB/s (Intel) or 20+ (AMD).
I also agree "some" (for example low-arithmetic-intensity) workloads will not benefit from a second L1 read port.
But surely there are other workloads, right? If I want to issue one FMA per cycle, streaming from two arrays, doesn't that require maintaining two loads per cycle?
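A minimal sketch of the kind of loop I mean (dot-product style, AVX-512F intrinsics; assumes n is a multiple of 16, and it's not a tuned kernel):

    /* One FMA per iteration, fed by two 512-bit loads: sustaining 1 FMA/cycle
       here would need 2 loads/cycle. (A real kernel would also use several
       independent accumulators to cover FMA latency.) */
    #include <immintrin.h>
    #include <stddef.h>

    float dot(const float *x, const float *y, size_t n) {
        __m512 acc = _mm512_setzero_ps();
        for (size_t i = 0; i < n; i += 16) {
            __m512 a = _mm512_loadu_ps(&x[i]);   /* load #1 */
            __m512 b = _mm512_loadu_ps(&y[i]);   /* load #2 */
            acc = _mm512_fmadd_ps(a, b, acc);    /* one FMA */
        }
        return _mm512_reduce_add_ps(acc);        /* horizontal sum */
    }

(Compiles with e.g. gcc -O2 -mavx512f.)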
In an ideal situation where your arrays both fit in L1 cache and are in L1 cache, yes. However, in typical real world situations, you will not have them fit in L1 cache and then what will happen after the reads are issued will look like this:
* Some time passes
* Load 1 finishes
* Some time passes
* Load 2 finishes
* FMA executes
As we are doing FMA on arrays, this is presumably part of a tight loop. During the first few loop iterations, the CPU core’s memory prefetcher will figure out that you have two linear access patterns and that your code is likely to request the next parts of both arrays. The memory prefetcher will then begin issuing loads before your code does and when the CPU issues a load that has already been issued by the prefetcher, it will begin waiting on the result as if it had issued the load. Internally, the CPU is pipelined, so if it can only issue 1 load per cycle, and there are two loads to be issued, it does not wait for the first load to finish and instead issues the second load on the next cycle. The second load will also begin waiting on a load that was done early by the prefetcher. It does not really matter whether you are issuing the AVX-512 loads in 1 cycle or 2 cycles, because the issue of the loads will occur in the time while we are already waiting for the loads to finish thanks to the prefetcher beginning the loads early.
There is an inherent assumption in this that the loads will finish serially rather than in parallel, and it would seem reasonable to think that the loads will finish in parallel. However, in reality, the loads will finish serially. This is because the hardware is serial. On the 9800X3D, the physical lines connecting the memory to the CPU can only send 128-bits at a time (well, 128-bits that matter for this reasoning; we are ignoring things like transparent ECC that are not relevant for our reasoning). An AVX-512 load needs to wait for 4x 128-bits to be sent over those lines. The result is that even if you issue two AVX-512 reads in a single cycle, one will always finish first and you will still need to wait for the second one.
I realize I did not address L2 cache and L3 cache, but much like system RAM, neither of those will keep up with 2 AVX-512 loads per cycle (or 1 for that matter), so what will happen when things are in L2 or L3 cache will be similar to what happens when loads come from system memory although with less time spent waiting.
It could be that you will end up with the loop finishing a few cycles faster with the 2 AVX-512 read per cycle version (because it could make the memory prefetcher realize the linear access pattern a few cycles faster), but if your loop takes 1 billion cycles to execute, you are not going to notice a savings of a few cycles, which is why I think being able to issue 2 AVX-512 loads instead of 1 in a single cycle does not matter very much.
OK, we agree that L1-resident workloads see a benefit.
I also agree with your analysis if the loads actually come from memory.
Let's look at a more interesting case. We have a dataset bigger than L3. We touch a small part of it with one kernel. That is now in L1. Next we do a second kernel where each of the loads of this part are L1 hits. With two L1 ports, the latter is now twice as fast.
Even better, we can work on larger parts of the data such that it still fits in L2. Now, we're going to do the above for each L1-sized piece of the L2. Sure, the initial load from L2 isn't happening as fast as 2x64 bytes per cycle. But still, there are many L1 hits and I'm measuring effective FMA throughput that is _50 times_ as high as the memory bandwidth would allow when only streaming from memory. It's simply a matter of arranging for reuse to be possible, which admittedly does not work with single-pass algorithms like a checksum.
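In code, the pattern is something like this (placeholder kernels, and the tile size is a guess rather than a tuned value):

    /* The reuse pattern described above: two passes over each L1-sized tile,
       so the second pass's loads are (mostly) L1 hits instead of trips to
       L2/L3/RAM. Kernels are trivial placeholders. */
    #include <stddef.h>

    #define TILE 4096  /* 4096 floats = 16 KiB, well inside a typical 32-48 KiB L1D */

    static void scale(float *p, size_t n, float s) {   /* pass 1: streams the tile in */
        for (size_t i = 0; i < n; i++) p[i] *= s;
    }

    static float sum_sq(const float *p, size_t n) {    /* pass 2: re-reads it from L1 */
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++) acc += p[i] * p[i];
        return acc;
    }

    float process(float *data, size_t n) {
        float total = 0.0f;
        for (size_t i = 0; i < n; i += TILE) {
            size_t len = (n - i < TILE) ? (n - i) : TILE;
            scale(&data[i], len, 2.0f);
            total += sum_sq(&data[i], len);            /* L1 hits, not memory traffic */
        }
        return total;
    }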
> They quite literally do - it's in the graph in "Multi-threaded Bandwidth" section. 32-core Xeon Platinum 8480 instance was able to sustain 534 GB/s from L3 cache.
They do not. The chip has 105MB L3 cache and they tested on 128MB of memory. This exceeds the size of L3 cache and thus, it is not a proper test of L3 cache.
> If you look closely into my comment you're referring to you will see that I explicitly referred to Golden Cove core and not to the Sapphire Rapids core. I am not being pedantic here but they're actually different things.
Sapphire Rapids uses Golden Cove cores.
> And yes, Sapphire Rapids reach 450 GB/s in L1 for AVX-512 workloads. But SPR core is also clocked @3.8Ghz which is much much less than what the Golden Cove core is clocked at - @5.2GHz. And this is where the difference of ~200 GB/s comes from.
This would explain the discrepancy between your calculation and the L1 cache performance, although being able to get that level of bandwidth only out of L1 cache is not very useful for the reasons I stated.
> I'm not sure I follow this reasoning but if true then AMD and Intel have nothing to compete against each other since by the logic of CPU caches being limited in applicability, their designs are equally good for the most $$$ workloads.
You seem to view CPU performance as being determined by memory bandwidth rather than computational ability. Upon being correctly told L1 cache memory bandwidth does not matter since the bottleneck is system memory, you assume that only system memory performance matters. That would be true if the primary workload of CPUs were memory bandwidth bound workloads, but it is not since the primary workloads of CPUs is compute bound workloads. Thus, how fast CPUs read from memory does not really matter for CPU workloads.
The purpose of a CPU’s cache is to reduce the von Neumann bottleneck by cutting memory access latency. That way the CPU core spends less time waiting before it can use the data and it can move on to a subsequent calculation. How much memory throughput CPUs get from L1 cache is irrelevant to CPU performance outside of exceptional circumstances. There are exceptional circumstances where cache memory bandwidth matters, but they are truly exceptional since any important workload where memory bandwidth matters is offloaded to a GPU because a GPU often has 1 to 2 orders of magnitude more memory bandwidth than a CPU.
That said, it would be awesome if the performance of a part could be determined by a simple synthetic benchmark such as memory bandwidth, but that is almost never the case in practice.
> They do not. The chip has 105MB L3 cache and they tested on 128MB of memory. This exceeds the size of L3 cache and thus, it is not a proper test of L3 cache.
First, you claimed that there was no L3 BW test. Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?
Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?
I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.
> Sapphire Rapids uses Golden Cove cores.
Right, but you missed the part where the former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to the other directly without caveats.
Chips and Cheese are actually guilty of doing that but it's because they're lacking more HW to compare against. So some figures that you find in their articles can be misleading if you are not aware of it.
> You seem to view CPU performance as being determined by memory bandwidth rather than computational ability.
But that's what you said when trying to refute the reasons Intel was in the lead over AMD up until Zen 5. You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are bottlenecked by system memory bandwidth anyway.
> That would be true if the primary workload of CPUs were memory bandwidth bound workloads, but it is not since the primary workloads of CPUs is compute bound workloads. Thus, how fast CPUs read from memory does not really matter for CPU workloads.
I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?
Any workload besides the simplest ones is at some point bound by memory BW.
> The purpose of a CPU’s cache is to reduce the von Neumann bottleneck by cutting memory access latency.
> That way the CPU core spends less time waiting before it can use the data and it can move on to a subsequent calculation.
> How much memory throughput CPUs get from L1 cache is irrelevant to CPU performance outside of exceptional circumstances.
You're contradicting your own claims by saying that the cache is there to hide (cut) latency, but then continuing to say that this is irrelevant. Not sure what else to say here.
> but they are truly exceptional since any important workload where memory bandwidth matters is offloaded to a GPU because a GPU often has 1 to 2 orders of magnitude more memory bandwidth than a CPU
99% of datacenter machines are not attached to a GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" (for whatever the definition of that formulation) and are therefore mostly CPU bound?
Or do you think they might be memory-bound but are missing out for not being offloaded to the GPU?
> First, you claimed that there was no L3 BW test.
I claimed that they did not provide figures for L3 cache bandwidth. They did not.
> Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?
You should be grateful that a professional is taking time out of his day to explain things that you do not understand.
> Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?
You cannot measure L3 cache performance by measuring the bandwidth on a region of memory larger than the L3 cache. What they did is a partially cached test and it does not necessarily reflect the true L3 cache performance.
> I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.
You just described a generic memory bandwidth test that does not test L3 cache bandwidth at all. Chips and Cheese’s graphs show performance at different amounts of memory to show the performance of the memory hierarchy. When they exceed the amount of cache at a certain level, the performance transitions to different level. They did benchmarks on different amounts of memory to get the points in their graph and connected them to get a curve.
> Right, but you missed the part that former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to each other directly without caveats.
The Xeon Max chips, with their HBM2e memory, are the one place where 2 AVX-512 loads per cycle could be expected to be useful, but due to internal bottlenecks they are not.
Also, for what it is worth, Intel treats AVX-512 as a server only feature these days, so if you are talking about Intel CPUs and AVX-512, you are talking about servers.
> But that's what you said trying to refute the fact why Intel was in a lead over AMD up until zen5? You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are anyway bottlenecked by the system memory bandwidth.
I never claimed AVX-512 workloads were irrelevant. I claimed doing more than 1 load per cycle on AVX-512 was not very useful for performance.
Intel losing its lead in the desktop space to AMD is due to entirely different reasons than how many AVX-512 loads per cycle AMD hardware can do. This is obvious when you consider that most desktop workloads do not touch AVX-512. Certainly, no desktop workloads on Intel CPUs touch AVX-512 these days because Intel no longer ships AVX-512 support on desktop CPUs.
To be clear, when you can use AVX-512, it is useful, but the ability to do 2 loads per cycle does not add to the usefulness very much.
> I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?
This is not a well formed question. See my remarks further down in this reply where I address your fabricated 99% figure for the reason why.
> Any workload besides the most simplest one is at some point bound by the memory BW.
Simple workloads are bottlenecked by memory bandwidth (e.g. BLAS levels 1 and 2). Complex workloads are bottlenecked by compute (e.g. BLAS level 3). A compiler for example is compute bound, not memory bound.
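The rough arithmetic-intensity numbers behind that split (order-of-magnitude illustration only):

    /* Rough arithmetic intensity (flops per byte of data touched) for a
       BLAS-1 dot product vs. a BLAS-3 matmul, single precision. */
    #include <stdio.h>

    int main(void) {
        double n = 4096.0;  /* example problem size */

        /* dot product: 2n flops over 2n floats read = 8n bytes */
        double dot_ai  = (2.0 * n) / (8.0 * n);              /* 0.25, independent of n */

        /* matmul: 2n^3 flops over ~3n^2 floats touched = 12n^2 bytes */
        double gemm_ai = (2.0 * n * n * n) / (12.0 * n * n); /* ~n/6, grows with n */

        printf("dot product: %.2f flop/byte -> memory bound\n", dot_ai);
        printf("matmul:      %.0f flop/byte -> compute bound\n", gemm_ai);
        return 0;
    }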
> You're contradicting your own claims by saying that cache is there to hide (cut) the latency but then you continue to say that this is irrelevant. Not sure what else to say here.
There is no contradiction. The cache is there to hide latency. The TACC explanation of how queuing theory applies to CPUs makes it very obvious that memory bandwidth is inversely proportional to memory access times, which is why the cache has more memory bandwidth than system RAM. It is a side effect of the actual purpose, which is to reduce memory latency. That is an attempt to reduce the von Neumann bottleneck.
To give a concrete example, consider linked lists. Traversing a linked list requires walking random memory locations. You have a pointer to the first item on the list; you cannot go to the second item without reading the first. This is really slow. If the list is accessed frequently enough to stay in cache, the cache will hide the access times and make this faster.
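The pattern looks like this; each iteration's load address depends on the previous load, so latency, not bandwidth, dominates:

    /* Pointer chasing: the next address isn't known until the previous load
       returns, so an uncached traversal pays roughly one full memory latency
       (~100 ns) per node. A cached list pays a few cycles instead. */
    #include <stddef.h>

    struct node {
        struct node *next;
        int value;
    };

    long sum_list(const struct node *head) {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next) {
            sum += n->value;   /* the load of n->next serializes the loop */
        }
        return sum;
    }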
> 99% of the datacenter machines are not attached to the GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" for whatever the definition of that formulation and they are therefore mostly CPU bound?
99% is a number you fabricated. Asking if something is CPU bound only makes sense when you have a GPU or some other accelerator attached to the CPU that needs to wait on commands from the CPU. When there is no such thing, asking if it is CPU bound is nonsensical. People instead discuss being compute bound, memory bandwidth bound or IO bound. Technically, there are three ways to be IO bound, which are memory, storage and network. Since I was already discussing memory bandwidth bound work loads, my inclusion of IO bound as a category refers to the other two subcategories.
By the way, while memory bandwidth bound workloads are better run on GPUs than CPUs, that does not mean all workloads on GPUs are memory bandwidth bound. Compute bound workloads with minimal branching are better done on GPUs than CPUs too.
Intel's E cores are literally derived from the Atom product line. But the practice of including a heterogeneous mix of CPU core types was developed and proven and made mainstream within the ARM ecosystem before being hastily adopted by Intel as an act of desperation (dragging Microsoft along for the ride).
As long as TSMC keeps delivering process improvements, Nvidia will keep getting incremental improvements. These power/thermal improvements are not really that much up to Nvidia.
Intel's problem was that their foundries couldn't keep shrinking the process while the other foundries kept improving theirs. But technically Nvidia can switch foundries if another one proves better than TSMC, even though that doesn't seem likely (at least without a major breakthrough not capitalized on by ASML).
I mean, it's like 1/6 of their revenue now and will probably keep sliding in importance relative to the datacenter. No real competition, no matter how much we might wish for it. AMD seems to have given up on the high end, and Intel is focusing on the low end (for now, unless they cancel it in the next year or so).
From what I've seen they've targeted the low end in price, but solid mid-range in performance. It's hard to know if that's a strategy to get started (likely) with price increases down the road or they're really that competitive.
Intel's iGPUs were low end. Battlemage looks firmly mid-range at the moment with between 4060/4070 performance in a lot of cases.
There is one major four-letter difference: TSMC. Nvidia will get process improvements until TSMC can't deliver, and if that happens we have way bigger problems... because Apple will get mad they can't reinvent the iPhone again... and will have to make it fun and relatable instead by making it cheaper and plastic again.
Huh? Nvidia does three things well:
- They support the software ecosystem - Cuda isn't a moat, but it's certainly an attractive target.
- They closely follow fab leaders (and tend not to screw up much on logistics).
- They do introduce moderate improvements in hardware design/features, not a lot of silly ones, and tending to buttress their effort to make Cuda a moat.
None of this is magic. None of it is even particularly hard. There's no reason for any of it to get stuck. (Intel's problem was letting the beancounters delay EUV - no reason to expect there to be a similar mis-step from Nvidia.)
I can't remember where or how I pieced this together, but I'm guessing that each plane will have two dishes (which will be bonded together), and Starlink is expecting bandwidth to improve to Gbit speed, so it will probably be 2 Gb/s down for the whole plane. Still not great if every passenger is streaming HD video, but I imagine with some "traffic shaping" (aka throttling) it will be pretty snappy for web browsing and small file downloads.
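Rough per-seat math, assuming something like 200 passengers (my guess) and the ~2 Gb/s aggregate above:

    /* Per-passenger share of the aggregate link, with an assumed seat count. */
    #include <stdio.h>

    int main(void) {
        double plane_mbps = 2000.0;  /* ~2 Gb/s for the whole plane (from above) */
        double passengers = 200.0;   /* assumed seat count */

        printf("~%.0f Mb/s per passenger if everyone is online at once\n",
               plane_mbps / passengers);
        return 0;
    }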
I could also see this being some executive trying to justify the word "AI" in their title with an initiative that should "make number go up" wrt to engagement or something.
"When we have more users engagement goes up, let's just _make more users_".
How is this not a bigger deal? Especially the story of Svetlana Dali, who successfully flew to France unticketed, was later caught, was then somehow only placed under house arrest, and practically escaped again??