Apple's M1 Ultra comes with a 32MB TLB bottleneck (twitter.com/vadimyuryev)
287 points by retskrad on April 13, 2022 | 165 comments



This sequence of posts is written very strangely and imprecisely?

You'd almost never say "32MB of TLB". You'd say maybe, 8192 entries of TLB, each with 4kB pages, for instance. (Others here in Hacker News suggest that the M1 uses a 16kB page, so maybe 2048 entries of 16kB each?)
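To make that arithmetic concrete, here is a minimal sketch of TLB "reach" (entries times page size). The entry counts are just the guesses from the paragraph above, not confirmed M1 numbers, and `tlb_reach_bytes` is a name made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries, page_size):
    """Bytes of address space a TLB can translate without a miss."""
    return entries * page_size

if __name__ == "__main__":
    # Both (entries, page size) pairs are guesses, not measured values.
    for entries, page in ((8192, 4 * KIB), (2048, 16 * KIB)):
        reach = tlb_reach_bytes(entries, page)
        print(f"{entries} entries x {page // KIB} KiB pages = {reach // MIB} MiB reach")
```

Either combination gives the same 32 MiB of reach, which is why quoting the TLB "size" in megabytes hides the two numbers that actually matter.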

The Twitter posts are talking about temperature at first, then tile-memory. I admit that I'm ignorant of tile-memory but... The traditional answer to TLB bottlenecks is to enable large-page or huge-page support. So I'm already feeling a bit weird that large pages / huge pages haven't been discussed in the Twitter thread. (Maybe it's not the answer on the GPU side of things? But at least acknowledging how we'd solve it in CPU-land would help show how features like huge pages could help.)
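A rough sketch of why huge pages are the usual CPU-side answer: the number of translations needed to cover a working set drops with the page size. The 1 GiB working set is an arbitrary illustration, and 2 MiB is the common x86-64 huge-page size, not something known about the M1 GPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def pages_needed(working_set_bytes, page_size):
    """Number of pages (and so distinct translations) covering the working set."""
    return -(-working_set_bytes // page_size)  # ceiling division

if __name__ == "__main__":
    working_set = 1 * GIB  # hypothetical buffer size, for illustration only
    for name, page in (("4 KiB", 4 * KIB), ("16 KiB", 16 * KIB), ("2 MiB", 2 * MIB)):
        print(f"{name} pages: {pages_needed(working_set, page)} translations")
```

Fewer translations means the same number of TLB entries covers far more memory, so page walks become rarer without touching the access pattern.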

Like, none of the logic here makes sense to me at all. Maybe something weird is going on in the Apple / M1 world, but I have a suspicious feeling that maybe the person posting this Twitter thread is at best... misstating things in a very confusing manner and using imprecise language.

At worst, they might be fundamentally incorrect on some of these discussion points?


I don’t code for GPUs but it seemed pretty straightforward to me:

1. General purpose GPU code assumes optimization characteristics where memory constraints are less restrictive for this use case

2. Without optimizing for this platform, typical code will not exercise the full potential of the chip

The thread is discussing it as a design flaw; I don’t know if it is. But if normal “recompile for ____ target” in Xcode or whatever isn’t enough for common workloads to be relatively close to optimal, and you have to do (and know to do) special work for the memory architecture… yeah that’s probably not ideal. And the other takeaway from the thread, that they’re likely to throw more memory hardware at the problem in the future, sounds right to me.


> I don’t code for GPUs but it seemed pretty straightforward to me:

I don't know all the words that are being used in this tweet. But the words I _DO_ understand are being used incorrectly, which is a major red-flag.

1. The temperature issues are a non sequitur, but that's where the discussion starts. There are a whole slew of power-configuration items on modern chips, none of them discussed in this set of Twitter posts.

2. TLB is a CPU-issue, traditionally. I understand that Apple has an iGPU that shares the memory controller, but GPU-workloads are quite often linear and regular. It is difficult for me to imagine any step in the shader-pipeline (vertex shader, geometry shader, pixel / fragment shader) that would go through memory so randomly as to mess up the TLB. If you're saying the TLB is the problem, you need to come forth with a plausible use-case why your code is hopping around RAM so much to cause page-walks.

3. In fact, everything they say about GPUs in the thread is blatantly wrong on its face, or possibly some kind of misstatement about the peculiarity of the M1 Apple iGPU.

4. I realize that mobile GPU programmers are always talking about this "tile-based rendering" business (popular on Apple / Android phones), and I've never bothered to learn the details of it. But I can almost assure you, even in my ignorance, that this has nothing to do with TLBs at all.
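Point 2 above (linear access rarely stresses a TLB, random access does) can be illustrated with a toy simulation. This is only a sketch: it models a fully associative, LRU-replaced TLB, and the buffer size, entry count and stride are made-up numbers, not anything known about the M1.

```python
import random
from collections import OrderedDict

PAGE_SIZE = 16 * 1024  # 16 KiB pages, as suggested elsewhere in the thread

def count_tlb_misses(addresses, entries, page_size=PAGE_SIZE):
    """Count misses in a toy fully associative TLB with LRU replacement."""
    tlb = OrderedDict()  # page number -> None, ordered oldest to newest
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                  # miss: real hardware would walk the page table
            tlb[page] = None
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses

def linear_addresses(buffer_bytes, stride):
    """Addresses of a front-to-back sweep over the buffer."""
    return range(0, buffer_bytes, stride)

def random_addresses(buffer_bytes, count, seed=0):
    """Uniformly random addresses inside the buffer."""
    rng = random.Random(seed)
    return [rng.randrange(buffer_bytes) for _ in range(count)]

if __name__ == "__main__":
    buffer_bytes = 16 * 1024 * 1024  # hypothetical 16 MiB buffer
    entries = 64                     # hypothetical TLB size
    stride = 64                      # hypothetical 64-byte accesses
    n = buffer_bytes // stride
    lin = count_tlb_misses(linear_addresses(buffer_bytes, stride), entries)
    rnd = count_tlb_misses(random_addresses(buffer_bytes, n), entries)
    print(f"linear sweep:  {lin} misses in {n} accesses")
    print(f"random access: {rnd} misses in {n} accesses")
```

The linear sweep misses once per page and then hits for every other access to that page, while the random pattern misses on most accesses once the buffer spans many more pages than the TLB has entries.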

-----

The argument simply does not track with reality, or my understanding of GPUs in general.

The vague advice I've heard for GPUs on mobile systems is to learn about the peculiarities of tile-based rendering, and to optimize for it specifically, because that's the main difference in GPU-architecture between Desktop GPUs and Phone GPUs. But nothing discussed in this set of twitter posts seems to match what I've often heard about iPhone/Android GPU programming.

I've seen the experts' advice on Android/Apple phone architecture / tile-based rendering. I don't know it, but... I think I can recognize it when I see them talking. This set of Twitter posts is not an expert talking.


The only problem: Apple GPUs are general purpose GPUs and other GPUs have bigger memory constraints than Apple G13. It is kind of funny that they complain about the small cache on M1 Ultra when 3090 Ti has 10x less cache


Not just that, but the claim that GPUs don’t mask memory latency by switching to other work is patently false as well. That’s exactly how GPUs do it. Memory bandwidth is a huge issue on GPUs and to maximize throughput, they absolutely do context switches to other work.
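A toy model of that latency hiding, as a sketch only: each "warp" (NVIDIA's term; Apple calls them SIMD-groups) does a few cycles of ALU work and then waits on memory, and the scheduler issues from any warp whose data has arrived. The cycle counts are invented and the fixed-priority scheduler is a simplification.

```python
def alu_utilization(num_warps, compute_cycles, memory_latency, total_cycles=10_000):
    """Fraction of cycles in which some warp issues ALU work in a toy model."""
    ready_at = [0] * num_warps                # first cycle each warp can issue again
    remaining = [compute_cycles] * num_warps  # ALU cycles left before the next load
    busy = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):            # pick the first warp that is ready
            if ready_at[w] <= cycle:
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp issues a load and sleeps until the data arrives.
                    ready_at[w] = cycle + 1 + memory_latency
                    remaining[w] = compute_cycles
                break                         # one issue slot per cycle
    return busy / total_cycles

if __name__ == "__main__":
    # Hypothetical numbers: 4 ALU cycles per load, 100 cycles of memory latency.
    for warps in (1, 8, 32):
        print(f"{warps:2d} warps: {alu_utilization(warps, 4, 100):.0%} ALU utilization")
```

With one warp the ALU sits idle for almost the whole memory latency; with enough warps in flight there is always one ready to run, which is the "switch to other work" behaviour the tweets claim GPUs lack.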


The key is Apple's own code, especially the OS and their tools. That is what the M1 and future Apple CPUs and GPUs are for. Hence optimisation should not be the issue. The whole thing is all about optimisation.

The question is whether it affects other or 3rd-party products (that cannot compete, or not shown to the public, e.g. the private API). And does Apple care?

Unless it shows, is that limitation important?


The AGX GPU MMU (variously called "GART" and "UAT", no relation to AGP GART) uses the ARM64 page table format (the GPU coprocessor literally uses it as an ARM page table among other things), so I'd expect it to support huge pages just fine since ARM64 does too. I don't know if macOS supports them, though.

The page size is indeed 16K. I don't know if 4K is supported at all in the GPU.
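For reference, a sketch of what "huge pages" would mean in the ARM64 page table format. It assumes the standard ARMv8-A geometry (8-byte descriptors, one table per granule-sized page, level-2 block entries); whether the GPU or macOS actually uses block mappings is exactly the open question above.

```python
KIB = 1024
MIB = 1024 * KIB
DESCRIPTOR_BYTES = 8  # ARMv8-A translation table entries are 64 bits wide

def entries_per_table(granule_bytes):
    """A translation table fills exactly one granule-sized page."""
    return granule_bytes // DESCRIPTOR_BYTES

def level2_block_bytes(granule_bytes):
    """Memory mapped by one level-2 block entry (a whole level-3 table's worth)."""
    return entries_per_table(granule_bytes) * granule_bytes

if __name__ == "__main__":
    for granule in (4 * KIB, 16 * KIB, 64 * KIB):
        block = level2_block_bytes(granule)
        print(f"{granule // KIB} KiB granule: {entries_per_table(granule)} entries per table, "
              f"{block // MIB} MiB level-2 block")
```

So with a 16K granule, the next mapping size up from a 16 KiB page is a 32 MiB block, compared with 2 MiB for the familiar 4K granule.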

I agree though, the article reads like a rambly piece that doesn't follow and really just boils down to "my code runs slow on this machine and I'm going to blame the architecture" without going into proper details.


> This sequence of posts is written very strangely and imprecisely?

Likely because the person doesn't have any background in understanding the subject.

Twitter profile:

> Jesus Christ is King! Co-host/writer for the Max Tech YouTube channel.

They're a youtuber, that's the limit of their qualifications. And it's not like they're technical either. They review consumer electronics. They probably don't even know anything about CPU architecture.


Putting "Christ is King!" in your Twitter bio has the same problem as putting a Christian bumper sticker on your car: the person most likely to read it is the person you just flipped off.


The tweet storm coincided with a video about the “issue” being uploaded to YT. This is promoting their content.


And

>transaction lookaside buffer

not

>translation lookaside buffer

I think hishnash made this discovery and this chain of tweets is echoing it.


Some of the main assertions about how common GPUs work are totally backwards:

"when a GPU is waiting for data, it can't just switch to work on something else" - GPU designs specifically use the idea of rotating between threads to not block on memory access.

"the concept that there is a local on-die memory pool you can read/write from with very very low perf impact is unthinkable in the current desktop GPU space" - Optimizing for the characteristics of the specific device's memory hierarchy including local/shared memory is a big part of GPU coding.

If Apple follows the pattern that companies like NVIDIA have used they will keep the overall shape of the GPU similar between generations (in terms of how many threads, memory hierarchy, etc). This will give app devs relatively similar + stable tuning targets, we'll see high dollar apps get aggressively optimized for specific chips, and as that happens power/heat will go up. For an extreme example of this maturation curve look at how console games get more and more out of the same hardware.

All that said the Metal tile shading API is a real thing Apple has been pushing for years on the mobile side, presumably because it fits the hardware and performs better, and now Macs use similar chips. No surprises there. And if an app is built around OpenGL and now needs to port to Metal to improve performance, sure, that could be a lot of work.


Quick comment on the tile stuff: Apple GPUs basically rasterize primitives into small tiles (16x16 or 32x32 pixels), tracking primitive coverage and barycentric coordinates per pixel. Once all primitives are rasterized, the entire tile is shaded at once. That’s where the “deferred” in tile-based deferred shading comes from. This helps to utilize both the compute hardware and memory subsystem more efficiently. Tile data is stored in the very fast on-chip shared memory (essentially a large programmable register file). Apple also exposes this abstraction to the user, allowing them to freely manipulate the contents of this memory and persist it through multiple kernel invocations. Using tile memory explicitly can radically simplify some rendering algorithms, improve performance and efficiency, and reduce RAM bandwidth needs. Since the concept of tiles is so important to Apple GPUs, they also support bulk memory moves which allow one to push the contents of the entire tile (multiple KBs) in one operation, which is obviously more efficient than doing scattered writes.
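As a rough illustration of the binning step, here is a sketch that assigns triangles to screen tiles by bounding box. Real hardware tracks exact per-pixel coverage, so this conservative version is only an assumption-laden stand-in, and `bin_triangles` is a made-up name.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels; the comment above mentions 16x16 or 32x32

def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, in tile coordinates.
        tx0 = max(0, int(min(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins

if __name__ == "__main__":
    # Two hypothetical triangles on a 64x64 target.
    tris = [((0, 0), (10, 0), (0, 10)), ((30, 30), (40, 30), (30, 40))]
    for tile_xy, indices in sorted(bin_triangles(tris, 64, 64).items()):
        print(f"tile {tile_xy}: triangles {indices}")
```

Once every primitive is binned, each tile can be shaded in one go from on-chip memory, touching only the primitives in its own list.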

So it’s exactly as you say: of course one can improve performance by using these features, but it’s not that one has to do it to get good performance in the first place. If there are performance issues with the Ultra, memory hierarchy is an unlikely culprit IMO.


Don't know anything about GPUs and am a CS undergrad. How does this apply to compute workloads? What's wrong with treating it like an IMR GPU when it comes to compute? Is the tile buffer used for compute? ELI5 please.


There is nothing wrong with treating G13 as an IMR GPU. But you might be able to get better performance and efficiency when doing, for example, image processing (or a task that maps to it conceptually) via the rendering pipeline with tile shaders. This way you can process your data in small chunks, using very fast on-GPU memory. In the end, tile memory is the same thing as shared memory in CUDA, but you can persist its contents between multiple different compute and fragment shaders. It's a very powerful programming model that makes local data relationships explicit.
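A plain-Python model of why persisting tile contents between passes helps. This is a sketch, not Metal code: it only counts main-memory element accesses for two pointwise passes, and the tile size and pass functions are made up.

```python
def two_pass_unfused(image, pass_a, pass_b):
    """Run each pass over the whole image; the intermediate goes through main memory."""
    traffic = 0
    intermediate = [pass_a(px) for px in image]
    traffic += 2 * len(image)  # read input, write intermediate
    result = [pass_b(px) for px in intermediate]
    traffic += 2 * len(image)  # read intermediate, write result
    return result, traffic

def two_pass_tiled(image, pass_a, pass_b, tile_pixels=1024):
    """Run both passes on one tile at a time, keeping the intermediate in a local buffer."""
    traffic = 0
    result = []
    for start in range(0, len(image), tile_pixels):
        tile = image[start:start + tile_pixels]  # load the tile: one read per pixel
        traffic += len(tile)
        tile = [pass_a(px) for px in tile]       # first pass, result stays local
        tile = [pass_b(px) for px in tile]       # second pass reuses the local data
        result.extend(tile)                      # store the tile: one write per pixel
        traffic += len(tile)
    return result, traffic

if __name__ == "__main__":
    image = list(range(5000))  # hypothetical 5000-pixel image
    brighten = lambda px: px + 1
    scale = lambda px: px * 2
    _, unfused = two_pass_unfused(image, brighten, scale)
    _, tiled = two_pass_tiled(image, brighten, scale)
    print(f"unfused: {unfused} memory accesses, tiled: {tiled} memory accesses")
```

Both versions compute the same result, but the tiled one never writes the intermediate image out to RAM, which is the bandwidth saving being described. Passes that read neighbouring pixels would need extra care at tile edges, which this sketch ignores.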


I'm not familiar with the tile-specific stuff Apple is doing but I think the grain of truth in that thread is that some graphics apps on Apple silicon will perform better if they use it. The rest of the thread doesn't make sense to me.

If you want an idea of what GPU programming is like you could look at some CUDA intros:

- https://developer.nvidia.com/blog/cuda-refresher-cuda-progra...

- http://cuda.ce.rit.edu/cuda_overview/cuda_overview.htm

If you don't have CUDA hardware you can get various versions of OpenCL that will at least work on your CPU. Lots of interesting compiler stuff happening targeting GPUs.


> Don't know anything about GPUs and am a CS undergrad.

You are the computer in "Computer Science," and the science is really just a lot of math. Maybe you wanted to study CE instead? I made the same damn mistake.


It seems like it would be good if people who join to ask a question do not receive answers such as this.


I unironically enjoy CS. I like programming and I like being close to the metal. I have an interest in the engineering part but I don't know if I'd pick it over CS. Life is hard.

Also like compilers, and most compiler engineers are CS grads. That's my dream job!


If you like compilers, I suggest you check out functional programming. Especially the languages in the ML family were basically tailor made to make writing compilers fun.

ML even stands for meta language.

See eg http://dev.stephendiehl.com/fun/ or https://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_...


Nice! I cloned the repo so I can read through it. I already took some basic haskell so this looks like it's going to be fun to go through!


And yet, ironically, Computer Science has absolutely nothing to do with programming or computers. CS is a subset of Mathematics. That's all it is. I was being literal when I claimed you were the computer, as in the one who computes. This is confusing, and I am not the only one that would like to change its name to Reckoning Science, as no one would misunderstand what that means.

A computer scientist can solve very hard problems. But a computer engineer can also solve hard problems, yet they can also solve really easy ones, giving them the edge.


Your definition of Computer Science seems to differ from standard accepted definitions?

Not that it matters for SwordOfMyBone: what matters is what you learn when you enroll in a specific university's Computer Science course.

Btw, why couldn't programming or compilers be a subset of math? Look at eg 'type theory' https://en.wikipedia.org/wiki/Type_theory

Or optimization. It's a thing we do very much in math and in computing.

Applied mathematics is a wide field.


Some of the claims the parent is making don't hold up to scrutiny, but the core of their argument does.

It's sort of like how geometry literally means 'the measurement of land' since it was originally an applied field developed for surveying farm plots for the bronze age super powers after each flood along the Nile and the Levant wiped the property boundaries away. I think it's safe to say it has expanded beyond its original intention, and CS has the same properties.


That is the way language works; if enough people insist on using the word "cabbage" to mean computer science, eventually it does. But no matter how many people believe that computer science is programming, it will not make it so, because a discipline does not evolve like spoken language does. Those that believe falsehoods simply remain mistaken and do not affect what they are mistaken about to somehow become correct over time. Things are what they are and false expectations won't change that. Computers have evolved, certainly, the tools have gotten much better. But don't confuse the tool with the tool user or toolmaker. Computer Science is the same as it ever was.


No, Computer Science is a subset of Mathematics, and originally lived in the Math department for about 2000 years. Only recently, in the last 30 years or so, has computer science been mistakenly equated with programming or mistakenly been believed to have something to do with fixing desktop computers or building networks.

Programming is not mathematics, nor is it computer science. Programming is programming, and it is most like the recipes in cooking, which no one would ever confuse with math. If one wants to be a programmer, the thing to do is learn programming languages, no degree required. OTOH, if one wants to model the weather, or model traffic, or model society, or model the galaxy, or work in informatics, or hunt for weaknesses in the genetic code of some virus, or really, generally, solve complex problems of any ilk, then study computer science, because that is what computer scientists do.

Do not bother with CS if you want to work with computers, or be a computer tech, or systems administrator. Do not bother with CS if you just want to be a programmer, really, it'd be a waste of time, sort of like getting an MD because you want to be an RN, but this metaphor is not apt because there is no hierarchy like that in the technology space. The confusion of programming with computer science is such a sensitive one, and a lot of schools now have Software Engineering departments. IMO, calling it engineering is being a little too kind to programming, even if we call a program an engine, it's really not. Also, there is no license available to engineer software, yet all other engineers need a license to work.

That being said, a Computer Science curriculum will entail learning a lot of programming, but that still does not mean that computer science is programming. But before learning programming languages, in CS, one will study algorithms, and I believe those do fall under the scope of CS.

Here is another bad metaphor: You'll have to learn a good amount of Trigonometry in order to be able to do much Calculus, yet Calculus is not Trigonometry. Now, one can do good computer science without ever learning any programming language, and without ever touching a computer. But a computer is a tool, and the good analogy is that the computer is to the computer scientist what the telescope is to the astronomer. To be clear, though it was once so, astronomers do not build telescopes, nor are they necessarily expert in the construction of or even function of a telescope. The important part is they look through it, and that is how it is used.


I agree with the core of what you're saying, but CS has not existed for two millennia. It's fairly new math starting in the late 1800s at the earliest.


Ancient civilizations observed astronomical bodies, often the Sun and Moon, to determine time. In a very real sense, the Sun and Moon are together a computer that one uses to calculate, or compute, what time it is. Sundials are analog computers, no telling when they first appeared, but at least as early as 1500BC in Egypt. The abacus is a computation tool that was developed as early as 2700BC in Sumer. Its development required CS, and it is used to calculate, or to compute. The rules of grammar of Sanskrit were formulated ~500BC, highly systematized and technical, using metarules, transformations and recursions, iow, the work itself is computer science. The Antikythera mechanism is an analog computer designed to calculate astronomical positions and created sometime between 205BC and 87BC, and without computer science (and a number of other disciplines) it could not have been made. 1000 years later, medieval Muslim astronomers created analog computers, such as the torquetum, which converted measurements between three sets of coordinates: horizon, equatorial, and ecliptic. Developing this would have required computer science. There is also the astrolabe. When John Napier discovered logarithms for computational purposes in the early 17th century it opened a period of considerable progress by inventors and scientists in making calculating tools, and doing good computer science as they went.

No, I'm afraid most of the computer science that has ever been performed occurred prior to the 19th Century.


Just like computer science is greater than programming computers, programming computers is greater than computer science. Neither is a complete subset of the other.

It's sort of like how geometry (literally 'the measure of land', originally an applied mathematical field used for surveying) is newer than literally drawing property boundaries. It was the formalization and abstraction that led to computer science as a true mathematical field, which happened after actual programming occurred. We can reproject CS concepts on these earlier programs looking back, but the field of CS is newer than they are.


I have a lot of respect for coders. I also massively appreciate their humor. I don't think there is such a relation between a computer scientist and a programmer where one is greater or better than the other; they are merely different professions. I expect a programmer that has worked using Java, ObjC and C++ for, say, a decade, will have a better grasp of programming those languages than a CS grad 10 years after graduation, even if they learned those languages and used them, they probably weren't only programming, but also what appears to the untrained eye as sitting around like a lump. Often that's what contemplation looks like. A programmer codes a lot, so they'll be better at it.

It has been pointed out to me before that mostly what computer scientists do is reckon. This is why I am onboard with renaming Computer Science to Reckoning Science. It is a subtle thing, but I think it would prevent countless students from wasting the best years of their lives due to incorrect expectations initiated by the confusing name of the discipline. It really is important to understand that the "computer" in Computer Science is not a digital machine, a mainframe, server, a Dell, an Acer, or a Mac, instead it is a living breathing person. I have to envy the Math majors and Math grads. They knew exactly what they were getting into, not a one of them ever asked, "what do you mean, 'it's all math,' ?"


By greater, I didn't mean in softer, moral terms, but instead that the sets of problems and cognitive tools for approaching problems for both have some overlap, but one isn't a pure subset of the other. I was trying (but failing, apparently) to drive home my point that CS wasn't a prerequisite for programming for human civilization, therefore historical examples of programming are ultimately orthogonal to how old of a field CS itself is.


Well, ok, and I misread your post, thank you for clarification. But you've employed a straw man, because no one has claimed anything about historical examples of programming to make claims about how old computer science is. What I did instead (in the GGGGP comment?) was give ancient examples of computer science to show how old it must be, at a minimum. Now, of course, there was no CS department in a university 5000 years ago. Nevertheless, computer science was performed in ancient Sumer, and we know this because we found their abacuses.


This person's goals seem to be different from yours. Why not let him or her study what they're interested in without acting as if it's a mistake? What you see as a mistake in degree choice is based on your experience coming from your school living your life.


If the individual's goals were to be a computer scientist, then all is well. But if by their own statements of what they want out of computer science it is clear that they will not find it there, or will only find it as a very small part of CS alongside the overhead of everything else, like the minor in mathematics that usually automatically comes with a CS degree, which they did not expect, then maybe it's ok that I offer a little insight into what CS really is. Because it's math, and anyone interested in CS must understand this, that CS is math, and programming is not computer science, and that studying computer science does not mean becoming intimately familiar with how a computer works. It's pretty important.


I actually graduated around 5 months ago. The only thing that makes me sad is that I don't have a job yet and am finding it difficult to find one because there are no opportunities where I live (like seriously not a single C++ job exists in my country only .net - HELP ME).

The math itself maps to a higher abstraction when it comes to compilers, focusing more on parsing, grammars and automata theory. CE courses tend to go to a lower level. I enjoy both sides and can see what you're trying to explain. What I got out of university was mostly the feeling that the courses could never go in depth due to the amount of detail and the abstractions that exist. I don't think a CE course would have gone into enough depth to cover everything anyway (each course is still just in the realm of 100 hours), because the layers on top (algorithms and computation theory) and at the bottom (hardware) both have so much information that it can be daunting to try to learn all of it.

I do agree that it might be a pain point for some people who don't enjoy that part of CS. I'm curious how you came to regret it this much?


> I'm curious how you came to regret it this much?

I'm not sure regret is the right word, but I am cynical because I was actively recruited by my university's CS dept. and it took me 2 years of study before I realized I was actually studying math, not computers, which was what I thought I was supposed to be studying. But I was mistaken, and it was too late to switch to CE and hope to graduate before the end of the decade. If I had been wise at 17yo, I would have known CE was what I thought CS was. Maybe I would not have worked in the field, but I sure would know a lot more about electronics and engineering, which was really where my young interests lay, I just didn't know any better.

Programming jobs pay pretty well, but if you have your CS degree, you can earn about 20% more starting out by landing a job as a computer scientist, or at least one that advertises for one. Programming is really a part of the IT field, but computer science isn't necessarily. For instance, the FAA uses computer scientists, so does the National Weather Service, so do major automobile manufacturers, the aerospace industry uses computer scientists. I mean, if programming is your thing, carry on, but you're actually limiting yourself if only seeking C++ programmer positions.


They didn't say anything about not liking maths, and they are specifically on this forum inquiring about a point of how computation can theoretically be done (and one which in this case is extremely relevant to mathematics, if you understood).

I can see that you needed to rant at someone about your revelation that humans can do computation too, but could you not have found a victim whose comment was at least superficially, tangentially related to what you're banging on about?


That's not what I meant. OP states in reply to my comment, "I unironically enjoy CS," followed immediately by, which could be interpreted as qualifying or laying the scope of the previous phrase, "I like programming and I like being close to the metal." That qualifier reveals some misunderstanding about what CS is, because it is not programming (though programming is a tool computer scientists may use) and it has little to nothing to do with "metal" aka hardware (yet another tool). I would have the same reaction to an English major that claimed, "I enjoy studying English. I really love typing and I love word processors." Because that is making the same mistakes as OP. Typing is a tool, word processors are tools, and they are not the study of English, though students of English invariably will utilize typing and word processors.


I believe you are being unnecessarily pedantic. Real world computer science programs cover software and hardware architecture as a matter of course. If they didn't, 90% of them could be shut down for lack of interest.


You are mistaken about the curriculum, not that what you've said is entirely false, but that you are incorrectly attributing too much weight to these things in CS. Countless centuries before software or hardware existed, there were computer scientists. How could that be possible? Because software development is not computer science, nor is study of hardware architecture.

A decent CS program will include some amount of history as well, who first did what and when, why the thing is called what it is, etc. Does that mean historians are computer scientists?! No, a university will round out a curriculum to include germane information that isn't really considered an essential element of that curriculum. In CS, the math is critical. So we can remove the computers and remove programming from the CS curriculum and still have a computer science program. But without the math, there is no CS.


I suppose there might be backward regions out there that can afford neither computers nor electricity adequate to teach computer science students anything they can't do with a pencil and paper, but they certainly are few and far between in the United States.

In general, it seems you are making an argument about semantics that doesn't describe the world as it is but rather either the way you want it to be, or the way you want it to be described. Either where all those poisonous influences were purged from CS programs or where "actual" CS programs were rare to non-existent.

You can certainly have a preference that the term be used like that, but that is far from reality on the ground.


I think, possibly, you are still stubbornly mistaking the meaning of the C in CS, as the two words in question are homonyms, so it is easy to do and extraordinarily common. Nevertheless it is incorrect, and no matter if every single person on the planet, and every single alien of every alien species in the whole of space-time from the Big Bang to the Big Dark Freeze or whatever thinks what you do, they'd still be wrong, as you are.

First of all, it isn't a pen and paper, or stylus on clay, or finger in the sand, just tools, that the digital computer replaces, it is a mind-boggling amount of time thinking, or figuring, or computing, too. You want to eliminate the thinking, but it is critical that you leave it where it is, because only humans, and possibly other mammals, and perhaps other animals, can think. Computers, like your PS5, can't think, and will not ever be able to. What you want to do, and others that believe falsely as you do, it seems, is dispatch with the person. Computer Science can't do that, and not ever, not until we can make people, and we already can so there's no incentive to find a much much harder and far far more expensive and much much much less fun way.

And people use computers for all kinds of things, but mostly, almost none of it is or is for computer science. And computer science is cheap, it doesn't cost anything but time (but I guess time is money, or the sqrt of evil etc.) So while it is luxurious to have a budget with a lot of cool tools available, there really are plenty of American and other Western universities that teach a decent CS curriculum without 4 computers per student, and their graduates are fully prepared for a CS career. I don't think you meant to insult them by calling them the Third World, but you kind of did that. A big budget helps any department. That doesn't mean money is computer science, although... computer scientists often work in economics positions and finance positions and secretly or officially do really neat computer science sometimes.

Secondly, fundamentally my argument is semantic, and your dismissal of semantics reveals some grave misjudgments. Semantics are of vital importance, and in every single exchange of understood communication, fully half the weight of everything there is in that is semantic in nature. If I don't understand what you're saying, or worse, if you don't understand what you're saying, then you can see there's a problem worth correcting.

The ideal candidate for computer science does not say, "I like computers, I want to work with computers, computers fascinate me. I want to know how they work. I love programming computers. I want to learn every language in use today, and be a programmer." The ideal candidate instead would say, "I love to compute, to figure. I want to know how to solve any problem. Puzzles fascinate me. I like figuring things out. I had exhausted my school district's math classes by the time I got to High School, and got a 4 on the Calculus AP exam in 10th grade. I audited some math classes at the local community college over the summers that I'm transferring in." Maybe that's a little too ideal. I was not, but math whizzes seem kind of common, I mean rare really, but always present. They make great computer scientists.

A computer science grad that wants to be a programmer will do well for themselves, programmers earn, and it is a highly respectable career. If nothing else will convince everyone, the difference in salaries between computer scientists and programmers should, except that some IT positions will inexplicably require a computer science undergraduate degree for $25/hr, so ignore the outliers when researching salaries.


Someone still needs to know the math too.


Click bait. It is not a "design limitation," it is a different design choice. Of course highly performant code needs to be aware of the selected architecture. Nothing to see here.


Ok, I've replaced that interpretation with the underlying factual description (32MB TLB) and left the OP's word 'bottleneck' in there.

Perhaps there's an interesting discussion to be had about it?


Thank you. Fair enough, there may be an interesting discussion lurking here somewhere about GPU architectures and rendering modes. However, Apple has been using this approach since the A11, so it is not new information. The OP assertion that Apple helped the WOW devs is probably also specious, since the app may have already been optimized (or close to it) for the architecture if they had been following Apple's GPU/Metal programming advice for the last few years. Caches always have limits and abusing those limits often has punitive results.


Those are synonymous terms with different spins. The bottleneck in getting software to run fast on GPUs is usually the performance engineering on the code, and additional hoops and pitfalls of the platform subtract from the real-world performance users will see in practice. Yes, of course you can always say it's not "highly performant code" then... but it's more important to run apps well than benchmarks.


right, I'm sure that people who write low-level libs will be thrilled to optimize them for 0.1% of the iGPU market


Curious, where does the 0.1% come from? Apple Mac US market share is over 14% and iPhone market share is over 50%. I get that you are just being a bit hyperbolic, but if a low-level lib ignores the Apple Silicon GPU hardware they are ignoring a sizable market segment.

https://www.statista.com/statistics/576473/united-states-qua...

https://www.counterpointresearch.com/us-market-smartphone-sh...


I know this is pretty common in hn, but you need to remember that the US is just one country, and not even the most populous country, in this world.


Yes, I agree. That's why I pointed out that it was specifically US market share. I just found those quickly. The percentages are lower in other regions, but the total numbers are still huge.


0.1% isn't close to the truth worldwide either, so it's inaccurate regardless.


It's probably pretty close to the overall percentage of M1 Ultra owners, all things considered.


The question is what is the percentage of potential market for a specific product. If you are optimizing GPU code, that is a very different market from “all computers” or “all phones”. If the market depends on high performance GPUs, then the Ultra has potential to be a noticeable percentage of that market.


Well up until the M1 Ultra, almost no one was doing GPU intensive stuff on mac, since they didn't have a line with a decent gpu. As such, Apple is creating an issue for developers that only manifests on the $6000 version of their machine that requires rewriting your whole algorithm to work around.


Exactly. On top of that, Apple is still in the minority and refuses to support the Vulkan API they helped design. So, from a development standpoint, rewriting your software to support ~15% of your potential market isn't a very lucrative idea. Especially when the vast majority of those users will be running relatively incapable hardware, at least on the GPU side of things.


Well, this is already an improvement, the way the parent stated it: from "no Mac line with a decent gpu" to "only the $6000 version".


Well, over next releases, lesser Mac models will upgrade to something like M1 Ultra specs, as it happens, so that 0.1% is not exactly static.


Sure but what if the US Engineering population represents a large portion of your paid-user demographics. For a lot of HN's startups this is very much the case.


That's not very relevant for the purpose of this discussion though. For example, a common stat floating around is how iPhone has 20% market share around the world but provides 80% of in-app revenue, which is what businesses care about.


Based on the twitter thread this bottleneck only seems to apply to the M1 Ultra that is currently only in the $2k+ Mac Studio and no other Macs or phones so 0.1% doesn't sound too far off.


Nope. The same issues can be observed across all Apple Silicon GPUs but are most noticeable in M1s. Apple is currently the 3rd largest global PC vendor, if you add iPads they’ve been the 1st for quite some time. With wallet-share/spending power the target market is even larger. The question is whether developers should adopt Apple’s Accelerate/Metal APIs & optimise code for TBDR. It’s an obvious ‘yes’.


If you’re writing a low-level library, don’t you want it to target specific architectures? Wouldn’t you want the abstraction happening higher up?


That's just your opinion. This leaves a ton of performance on the table and saves Apple basically nothing in cost. Many of us would call that a design limitation, even a design flaw.


Where do you find that information? I couldn't find anything on apple.com where they detailed the various design tradeoffs they made working on their GPU


I think we're missing some info.

32MB of coverage is the same size for the TLB on a SM for a 1080ti. https://dl.acm.org/doi/10.1145/3491218 I don't think that's gone up a lot since then.

And GPUs in general are totally fine with stalling on memory accesses. That's core to how they work, and why you run into issues treating them like CPUs. They approach the problem of the memory hierarchy at a base level differently by accepting the fact that thread groups will stall on pretty much every memory access, and just barrel scheduling enough thread groups that you can keep ALUs saturated despite the memory dependencies due to massive parallelism. Cache and TLB misses aren't that big of a deal if you have enough work to cycle through. If you don't have massively parallel amounts of work, and can't just keep cycling through contexts, this is the main reason why some algorithms don't work on GPUs, but that's a global problem to how all major GPUs approach the problem of memory that's hundreds of cycles away.
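
To put rough numbers on the latency-hiding argument, here is a toy model. It is my own simplification, not a description of any specific GPU: each thread group alternates between `compute_cycles` of ALU work and a `stall_cycles` memory wait, and the scheduler switches to any ready group for free. The function names and the example numbers are made up for illustration.

```python
def alu_utilization(groups: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the ALUs are busy in an idealized barrel scheduler.

    Each group repeats: compute_cycles of ALU work, then stall_cycles waiting
    on memory. With free switching, the ALUs are busy whenever any group is
    ready, capped at 100%.
    """
    if groups <= 0:
        return 0.0
    period = compute_cycles + stall_cycles
    return min(1.0, groups * compute_cycles / period)


def groups_to_saturate(compute_cycles: int, stall_cycles: int) -> int:
    """Smallest number of resident groups that keeps the ALUs fully busy."""
    period = compute_cycles + stall_cycles
    return -(-period // compute_cycles)  # ceiling division


if __name__ == "__main__":
    # Hypothetical numbers: 20 cycles of math per 400-cycle memory stall.
    for n in (1, 4, 16, 21, 32):
        print(n, round(alu_utilization(n, 20, 400), 3))
    print("groups needed:", groups_to_saturate(20, 400))
```

In this model a longer stall (a TLB miss on top of a cache miss, say) only hurts if there are not enough resident groups to cover it, which is the point being made above.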

> Hishnash: "What apps should be doing is loading as much data as possible into the tile mem and flushing it out in large chunks when needed. I bet a lot of the writes (over 95%) are for temporary values that could've been stored in tile mem and never needed to be written at all."

I mean, you shouldn't be performing large amounts of writes that aren't a completion of a program on just about any GPU architecture. If you care about the results, heavy random writes screw you just as hard in a traditional GPU architecture too. You really want to be able to look at the data flow and describe it in almost pure functional terms. If you can't look at a program and say "it writes here at this offset or offsets", you're going to have a bad time.

And finally, there's all sorts of ways you can crank up the GPU without doing writes at all, so the inability to turn it into a space heater just because of TLB pressure doesn't make a lot of sense. The xor chains that something like the bogomips calculator uses will eat up every cycle you give it and make their way through the compiler and hardware schedulers all without blocking on memory at all.


What does he mean by a 32MB TLB? Certainly there isn't an actual 32MB of on-chip memory dedicated to a TLB. So does it refer to the coverable amount of virtual memory? But that can't be true either, given huge pages. All in all, very disappointing thread. Does anybody have real references to what he's talking about?


I think it's a confusion of similar terms - there's the TLB (translation lookaside buffer) of the MMU, which is an extremely small fast buffer of recent MMU translations - there's no chance that is anywhere near 32MB. This is often designed to be hit within a single cycle, so has tighter restrictions than even the L1 caches of modern CPUs.

Then there's also the GPU tile buffer (which they may also be calling a "TLB" here?) - which I think is what he's referring to here - on a TBDR system there's an advantage if it can calculate multiple passes on the image while keeping all needed info in the on-chip tile buffer (Think stuff like depth and color data when rendering a large scene), as if the intermediate results aren't actually needed (such as a partially drawn scene where more objects will be drawn on top of some of the pixels, or an intermediate output before some post-processing pass) it may never actually touch the main memory so has the ability to save a lot of bandwidth.

I'm pretty sure it's the second that the OP is referring to here, as if you blow past this buffer size you can hit a performance cliff, and methods to optimise its use (such as rendering tiles of multiple passes instead of doing each pass on the full screen view separately) don't really make any difference to immediate renderers, so it is something "new" developers need to take into account for deferred tiler architectures.


I'm an undergrad and know nothing about GPUs so forgive me for these questions (I tried reading a bit just now). Where does the tile buffer come into play for GPGPU workloads?

From my simple understanding of it, TBDR GPUs extract performance by tiling and binning primitives and handing that out for rasterization. The multiple passes allow it to work on both at the same time? Kinda like how a CPU pipelines stuff? I thought GPGPU workloads skip the rasterization stage? So what's the problem with treating it like an IMR?


A modern renderer does rasterization followed by a sequence of post-processing steps that you can think of as compute dispatches (kernels) running one thread per pixel. The reality is often somewhat more complex, but that's a good first approximation.

Those steps tend to be local, so instead of running step 1 on the whole screen, then step 2 on the whole screen, and so on, you could run all steps on the top-left tile (of e.g. 128x128 pixels), then all steps on the next tile, and so on. The downside is that you'll likely have to compute some data in the boundary regions between tiles multiple times. The upside is that the bulk of intermediate data between post-processing steps never has to be written out to and read back from memory (modern render targets are too large to fit into traditional caches, though that may be different with the huge cache AMD has built for their latest GPUs).

The same principle can be applied to GPGPU algorithms that have similar locality. This tends to be discussed under the label of "kernel fusion".
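
A minimal CPU-side sketch of that trade-off, in plain Python rather than GPU code. The two passes are hypothetical stand-ins of my own (`step1` doubles a value, pass 2 is a 3-tap sum with clamped edges), and the image is 1D to keep it short. The fused version only ever holds a tile-sized scratch buffer, at the cost of recomputing pass 1 for the one-pixel halo at tile boundaries.

```python
def step1(x: int) -> int:
    """Hypothetical per-pixel pass 1."""
    return 2 * x


def unfused(img):
    """Run pass 1 over the whole image, then pass 2 over the whole image."""
    n = len(img)
    inter = [step1(x) for x in img]  # full-size intermediate buffer
    out = [inter[max(i - 1, 0)] + inter[i] + inter[min(i + 1, n - 1)]
           for i in range(n)]
    return out, len(inter)


def fused(img, tile):
    """Run both passes tile by tile, keeping only a tile-sized scratch buffer."""
    n = len(img)
    out = [0] * n
    step1_evals = 0
    peak_scratch = 0
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Pass 2 reads one neighbour on each side, so include a 1-pixel halo.
        lo, hi = max(start - 1, 0), min(end + 1, n)
        scratch = [step1(img[j]) for j in range(lo, hi)]
        step1_evals += len(scratch)
        peak_scratch = max(peak_scratch, len(scratch))
        for i in range(start, end):
            left, right = max(i - 1, 0), min(i + 1, n - 1)
            out[i] = scratch[left - lo] + scratch[i - lo] + scratch[right - lo]
    return out, peak_scratch, step1_evals


if __name__ == "__main__":
    img = list(range(16))
    ref, inter_size = unfused(img)
    out, peak, evals = fused(img, tile=4)
    print("same result:", out == ref)
    print("intermediate size:", inter_size, "vs peak scratch:", peak)
    print("pass-1 evaluations:", len(img), "vs", evals)
```

On a real tiler the scratch buffer would live in on-chip tile memory, so the intermediate never goes out to DRAM; the extra pass-1 evaluations are the boundary recomputation mentioned above.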


My understanding is that the rasterization process also happens per tile.

One of the cases where this can cause performance problems is when you want to read the output of a previous render pass. If you want to be able to read arbitrary parts of the output of the previous render pass, the output buffer probably needs to be copied from the tile memory into a memory that can hold the whole buffer at once. Furthermore, this also means that all tiles of the previous render pass need to execute before the next one can run. This limits how much work can be done in parallel.


Modern machines can do the TLB and L1 lookups in parallel. Here is how it works on a traditional CPU (it is different on M1).

The page size is 4 KB. This means the lower 12 bits of an address are the same between logical and physical addresses. The cache line is 64 bytes, so the lower 6 bits of the address index within a cache line. The L1 is 8-way associative, so the other 6 bits select a set of 8 cache lines in the L1. This makes 64*8 cache lines of 64 bytes => 32k of L1 cache.

The CPU does a lookup in TLB and a lookup in L1 in parallel, and gets 8 cache lines from the L1, which are filtered by the results in the TLB to hopefully get a hit.

Now you'll note that, while most CPUs have 32k of L1, the M1 has 128k, which means it needs 2 extra bits to match between physical and logical addresses to pull the same trick. And what do you know, M1 has 16k pages! What a coincidence (not!).
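
The arithmetic above can be checked with a few lines. This is just a sketch of the constraint as described in this comment; the 8-way figure for the 128k case is an assumption carried over from the 32k example, and the helper names are mine.

```python
def max_vipt_l1_bytes(page_bytes: int, ways: int) -> int:
    """Largest L1 that can be indexed using only page-offset bits.

    Set index plus line offset must fit inside the page offset, so each
    way holds at most one page; capacity is page size times ways.
    """
    return page_bytes * ways


def untranslated_bits_needed(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Address bits (set index + line offset) used before translation."""
    sets = cache_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1       # log2 for powers of two
    offset_bits = line_bytes.bit_length() - 1
    return index_bits + offset_bits


if __name__ == "__main__":
    KIB = 1024
    print(max_vipt_l1_bytes(4 * KIB, 8) // KIB, "KiB with 4 KiB pages")
    print(max_vipt_l1_bytes(16 * KIB, 8) // KIB, "KiB with 16 KiB pages")
    print(untranslated_bits_needed(32 * KIB, 8, 64), "bits for a 32 KiB L1")
    print(untranslated_bits_needed(128 * KIB, 8, 64), "bits for a 128 KiB L1")
```

The 12 and 14 bits it reports are exactly the page-offset widths of 4 KiB and 16 KiB pages, which is the "2 extra bits" above.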


Fourth tweet states that he does mean translation lookaside buffer. If it’s true, it’s indeed insanely big.


They also refer to it as a "Transaction Lookaside Buffer", which reinforces the impression given off by the overall twitter thread that this was secondhand information relayed by somebody who has a very weak grasp of the technical details involved.


Vadim is a smart guy but not a programmer.

So you’re right he may not mean Transaction Lookaside Buffer.


I'm pretty sure that's false, as translation lookaside buffers simply cannot be of that size while hitting the performance requirements.

I suspect they've seen someone refer to a TLB (Translation Lookaside Buffer) and TLB (TiLe Buffer) and conflated the two.


32MB may mean the total amount of memory that can be mapped at once in the graphics TLB .... Main cpu usage of TLBs is likely very different from graphics usage (which is pulling all the data for a frame 60 times a second) ... For some workloads 32MB might be tiny and result in the TLB thrashing


I suspect it’s 32MB of coverage, which would be about 2,000 entries, which is a typical TLB size these days. Although then why is he talking about the tile buffer?


2000 is probably the L2 TLB size, L1 TLBs are usually in the order of 10s of entries (but as I pointed out the decisions you make for graphics engines are very different from those you make for CPUs)


It must be the coverable amount. At 16kB pages, it would be 2048 entries, a reasonable TLB size.

The CPU is said to have 3072 L2TLB entries, so that's not quite a match. Perhaps this "32MB TLB" is a GPU-only TLB; perhaps it does not support variable page sizes.
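
For reference, the coverage arithmetic being thrown around in this subthread, as a quick sketch. Entry counts and page sizes are the ones quoted in the comments here, not anything confirmed by Apple, and the helper names are mine.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_bytes


def entries_for_reach(reach_bytes: int, page_bytes: int) -> int:
    """Entries needed to cover reach_bytes with pages of the given size."""
    return reach_bytes // page_bytes


if __name__ == "__main__":
    print(entries_for_reach(32 * MIB, 16 * KIB), "entries at 16 KiB pages")
    print(entries_for_reach(32 * MIB, 4 * KIB), "entries at 4 KiB pages")
    # The 3072-entry CPU L2 TLB figure mentioned above, at 16 KiB pages:
    print(tlb_reach_bytes(3072, 16 * KIB) // MIB, "MiB of reach")
```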


But the same limits are true on any GPU that does virtual memory - there's nothing special about that, and it doesn't really interact with TBDR vs. immediate rendering the way the thread seems to suggest. If anything, a TBDR implementation may put less pressure on the MMU TLB, as the tile buffer contents are statically allocated based on the pipeline so may not even have a backing mapping, while an immediate mode renderer would have to have an intermediate buffer somewhere for stuff like the color and depth values of an ongoing render, which is likely translated through the MMU.

Which is why I believe they've confused the TLB (Translation Lookaside buffer of the MMU) with the Tile Buffer of a TBDR renderer (IE the on-chip working buffer for in-tile intermediate results of shading).


I don't know anything about TLBs but the M1 is a system on a chip, you can get up to 64GB of RAM shared between CPU and GPU all on the same die.


More specifically, it's a system on a _package_. A TLB lives on the _die_, though, with one in each core.


TLB cache isn't the same thing as the unified memory on the M1.


Good thing that we have Twitter since Apple doesn't know what they're doing, right? ;-)

And it's too bad that apparently the page size can't be changed for larger memory systems to reduce TLB misses (at the expense of fragmentation) which is one traditional solution that doesn't require software changes. 16KB seems like a very small page size on a 64GB system. If we scaled an 8MB VAX with 512B pages up to 64GB it would have a page size of something like 4MB.
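
The VAX comparison is simple proportional scaling; a quick check, assuming page size grows by the same factor as RAM (the function name is mine):

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def scaled_page_bytes(old_ram: int, old_page: int, new_ram: int) -> int:
    """Page size that keeps the number of pages constant as RAM grows."""
    return old_page * (new_ram // old_ram)


if __name__ == "__main__":
    page = scaled_page_bytes(8 * MIB, 512, 64 * GIB)
    print(page // MIB, "MiB pages")
    print("pages on the VAX:", (8 * MIB) // 512)
    print("pages at 16 KiB on 64 GiB:", (64 * GIB) // (16 * KIB))
```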


It sounds like it might be possible, at least at the hardware level. https://github.com/xmrig/xmrig/issues/2060#issuecomment-7702... (bug on some random crypto miner benchmark software, I guess) mentions a 32 MiB page size, although that might be with Linux on M1 rather than macOS on M1.

This "32MB TLB bottleneck" is a weird thing to say. The TLB size is typically stated in terms of pages, right? With huge pages (or "superpages", as macOS may call them?), that should be a lot more than 32MB of total memory.


It absolutely can! On ARM you can just enter a block entry in a non-leaf page table to allocate an entire block of physically contiguous memory in a single TLB entry. From the ARM docs [1], any CPU supporting 16K translation granules will support 32MB L2 blocks.

I suspect, however, that apple can't really allocate that many (if any at all) 32MB regions of memory at runtime due to fragmentation unless they've substantially changed their contiguous memory allocator since I last looked.

[1] https://developer.arm.com/documentation/den0024/a/The-Memory...
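
Where the 32MB figure comes from, as a sketch: with 8-byte descriptors, a table that fills one granule holds granule/8 entries, so each level up multiplies the mapped span by that count. The helper below (my own naming) only computes the span; it does not say whether the architecture permits a block descriptor at that level for a given granule.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def span_bytes(granule_bytes: int, level: int) -> int:
    """Bytes mapped by one descriptor at `level` (3 = page, lower = larger)."""
    entries_per_table = granule_bytes // 8  # 8-byte descriptors
    return granule_bytes * entries_per_table ** (3 - level)


if __name__ == "__main__":
    print("16K granule, L2:", span_bytes(16 * KIB, 2) // MIB, "MiB")
    print(" 4K granule, L2:", span_bytes(4 * KIB, 2) // MIB, "MiB")
    print(" 4K granule, L1:", span_bytes(4 * KIB, 1) // GIB, "GiB")
    print("64K granule, L2:", span_bytes(64 * KIB, 2) // MIB, "MiB")
```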


Interesting, thanks!

> I suspect, however, that apple can't really allocate that many (if any at all) 32MB regions of memory at runtime due to fragmentation unless they've substantially changed their contiguous memory allocator since I last looked.

That's unfortunate but at least is something they could fix in upcoming software versions. Some folks have put a lot of effort into huge pages on Linux, and it's still ongoing. [1] Not too surprising that macOS could have some room for improvement...

[1] https://lwn.net/Articles/887753/


why do you say the page size apparently can't be changed for larger memory setups?


"Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else." https://twitter.com/VadimYuryev/status/1514295693581586434 Is this meant to be qualified with (Apple) GPU? Otherwise it sounds like the literal definition of latency hiding that has been the norm on desktop GPUs (and one of the first things taught to newcomers of the OpenCL/CUDA programming model) for a while.


I guess the point here is that GPUs don't have general purpose tasks doing independent work and an OS scheduler. You might end up masking part of the cost of some stalls if you are able to swap in other ready-to-run tasks. The bus latency you're describing doesn't exist when you don't need to copy to a GPU-dedicated GDDR/HBM bank. But while that problem goes away, it sounds like this new one of TLB pressure shows up.

I'm no expert but I suspect the mobile SoCs like Bionic and Snapdragon have the same concept as the M1 Ultra with respect to integrated GPU sharing memory with the apps cores. M1 probably inherited it from Bionic? So some of this work of porting compute software to reflect this environment may have already started. I guess the challenge is that the bar is higher for expectation of GPU performance in a desktop system like the Mac Studio.


I don't know much about the M1.

But I can say for sure that Intel iGPUs, AMD GCN, AMD RDNA, AMD CDNA, and multiple NVidia-generations of GPUs all have hyperthread-like rescheduling of independent workgroups.

In fact, something like 8x wavefronts / warps run in parallel on modern GPUs. When one wavefront / warp stalls due to a memory read/write (or a PCIe read/write), the GPUs universally "hyperthread-out" and hide the latency.

It's "different" from how CPUs do it, but the fundamental principles are the same. (CPUs have a redundant set of registers tracked in a register file. GPUs on the other hand, have a set of registers and the kernel-scheduler (or whatever handles CUDAstreams) carefully assigns those registers to not conflict with any running wavefronts).

-------

The statement so listed is blatantly false, at least for Intel, AMD, and NVidia GPUs. Maybe Apple iGPUs are built different, but I find that unlikely.


> The statement so listed is blatantly false, at least for Intel, AMD, and NVidia GPUs. Maybe Apple iGPUs are built different, but I find that unlikely.

The statement is for Apple GPUs only, that’s the whole point. Software can be easily ported to Metal (in a weekend according to Roblox devs) but until it’s optimised for TBDR it will underperform.


> You might end up masking part of the cost of some stalls if you are able to swap in other ready-to-run tasks.

You'd need SMT to do this for memory stalls, and Apple M1 doesn't use SMT - they have the same amount of logical cores (hardware threads) and physical cores.


Source? Every unified programmable GPU I've seen uses SMT, including the PowerVR GPUs going back to the SGX days. It's core to how they approach modern memory hierarchies.


Looking into it more, AGX2 (like pretty much every fairly high perf modern GPU) is heavily SMT, allowing up to 1024 simultaneous threads per core depending on how many registers each shader invocation needs.

https://rosenzweig.io/blog/asahi-gpu-part-3.html


He says the entire thread group needs to be stalled. The thread group is (on M1) 32 SIMD lanes that have to execute basically the same control flow. Presumably other thread groups can continue executing.


Early on he focuses on the cooling being overkill. Could it be that Apple knew this was a tradeoff, knew it would tend to not push the power envelope, but wanted a cooling system that would let them leave the general design unchanged for years to come, with lots of headroom for future cpus? Why call it an error when there is an actual reason that makes sense as an explanation?


Unlikely. Historically Apple has always designed cooling with very narrow margins around suppliers' TDP due to internal pressure to create smaller and smaller machines, which has often resulted in underclocking due to GPU vendors downplaying their TDP requirements.

From the explanation it sounds like an artificial limitation. The chip is technically capable of being saturated with the current design given an optimal program, and fulfilling its TDP, at which point the current cooling system design would be _entirely necessary_ for the current chip design. This is clear from the second tweet:

> Mac Studio cooling system is OVERKILL in most apps

My emphasis added (not all apps in theory - but practically)


The Mac Studio is huge compared to the Mac mini. I don't think they cared too much about slimming it down.


> The Mac Studio is huge compared to the Mac mini

True, but Apple generally does care about either size or quietness. Starting with the G5 cooling system they started to exchange a lot of the former for the latter in their large desktops. I expect that is the case here (I've not actually seen a picture of this thing). However, this doesn't mean they designed the cooling system with extra-large margins; that's just not their attitude. Unlike PC vendors, they control all the components, so they can be more exact (saying this as quite the opposite of an Apple fanboy).

Point is, the cooling system overcapacity is due to expected (but generally not fulfilled) TDP, rather than adding arbitrary large margins.


Yes, although it's worth noting that they seem to have rowed back on this a little bit in their recent designs. The new MacBook Pros are cooler due to the processor being cooler, but they also have feet to allow better airflow under the machine, instead of the clean flat bottom of the previous models (IMO they still look great, but I can't see Apple of a few years ago making that decision).


Their new CPU architecture certainly is more power efficient compared to the x86s they replaced in many ways, which means they aren't up against such a wall to compete on performance; perhaps a combination of circumstances has caused them to produce more thermally comfortable machines in recent years. I'm not sure how long this will last though... in that sense I'm sure they do have relaxed TDP margins, so I'm basically arguing about the subtlety of this not being an intentionally large margin... now I look like I'm splitting hairs :P but it's clearly a mistake from the analysis of the GPU.


I don't think it is overkill. Hypothetically, there are apps that will load the M1 Ultra beyond 120W; they just haven't been written yet. If such apps exist in the future the cooling will be needed.


It is a strange callout.

Apple for a few decades now has been keeping the same enclosure for at least three product iterations whilst upgrading internals.

So the cooling solution would need to support M2, M3 and maybe even M4.


An overkill cooling system could also be nice in a couple years when it is dusty and working at reduced efficiency.


I also figured it had been designed so that it could be quiet - if you can keep the volume of the cooler large enough then you can use slower fans right?


The cooling system has to be designed for when users attach devices to those Thunderbolt 4 ports _and_power_them from it.

That shouldn’t produce much heat inside the unit (they claim the power unit can continuously deliver 370W; guessing 95% efficiency and 200W delivered over those buses, it can’t lose much more than 10W), but may just push it over some edge.


On one hand, I agree with you on the headroom since cooling has become far more complicated than just putting in a fan, so you really don't want to redesign over and over again. But on the other hand, from what I heard, the cooling on the Ultra edition is super heavy. So in that sense, it's indeed slightly overkill.


https://threadreaderapp.com/thread/1514295682777059329.html

The tweets refer to Tile-Based Deferred Rendering (TBDR) [1] as compared to Tile-Based Immediate Rendering (TBIR).

Here's another take on it: https://docs.imgtec.com/Architecture_Guides/PowerVR_Architec...

Note that TBIR appears to be a term the author just made up to refer to apps that have not been specifically optimized for the Apple M1 Ultra GPU architecture.

[1] https://developer.apple.com/documentation/metal/resource_fun...


>Note that TBIR appears to be a term the author...

Tile-Based Immediate Rendering, or Tile-based immediate mode rendering (TBIM) is quite well known and not an invented term. [1]

Given this is from MaxTech, I nearly stopped reading when he listed "they started working on the chips 5-7 years ago". But since he quoted hishnash, it may be worth a read anyway.

I also doubt the so-called GPU limitation is due to this. The GPU TDP usage depends on a lot of other factors.

This TBDR limitation, if you can call it that, has been well known since PowerVR / Dreamcast's days (someone please make a low cost Super-H chip). Most of the explanation seems very weird from a cache and memory access perspective. And as @lamchester pointed out later [2], a lot of it doesn't make any sense. And remember, M1 Ultra is two M1 Max, which means there are many small details missing in how it works together.

[1] https://en.wikipedia.org/wiki/Tiled_rendering

[2] https://twitter.com/lamchester/status/1514346083819786240


I've never heard it called TBIR, always TBDR or just tile based rendering.

I honestly can't make heads or tails of the Twitter thread. It talks about having too many small reads/writes but that's the whole point of tile based rendering. You use the tile as scratch space to do operations that are expensive in system memory.

This approach is well over a decade old at this point and pretty ubiquitous in power constrained rendering scenarios[1]. Best I can tell you need to take architecture into consideration when trying to get the peak GPU usage which has always been true of any platform regardless of tiled or immediate mode rendering.

[1] https://developer.qualcomm.com/sites/default/files/docs/adre...


Accessing tile memory doesn't need an address translation; it's not part of virtual memory. Accessing global memory does.

I think the claim is that the typical working set can exceed what the TLB can map once the GPU scales to 32 or more cores, so the TLB starts thrashing. Optimizing for TBDR would alleviate this because all the tile memory accesses would bypass the TLB, and also likely reduce the working set because the intermediate buffers don't need VM mapping.
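
To illustrate why crossing the TLB's reach can be a cliff rather than a gentle slope, here is a toy simulation. It assumes a fully associative TLB with LRU replacement and a working set that is swept cyclically; real TLBs are usually set-associative, multi-level and not pure LRU, so treat this as the worst case, not a model of the M1.

```python
from collections import OrderedDict


def tlb_hit_rate(page_trace, entries: int) -> float:
    """Hit rate of a fully associative LRU TLB over a trace of page numbers."""
    tlb = OrderedDict()
    hits = 0
    for page in page_trace:
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # mark as most recently used
        else:
            if len(tlb) >= entries:
                tlb.popitem(last=False)  # evict the least recently used
            tlb[page] = True
    return hits / len(page_trace) if page_trace else 0.0


def cyclic_trace(pages: int, sweeps: int):
    """Touch pages 0..pages-1 in order, repeated `sweeps` times."""
    return [p for _ in range(sweeps) for p in range(pages)]


if __name__ == "__main__":
    entries = 2048  # the hypothetical 32MB / 16kB figure from this thread
    for pages in (1024, 2048, 2049, 4096):
        rate = tlb_hit_rate(cyclic_trace(pages, sweeps=10), entries)
        print(f"{pages} pages: hit rate {rate:.2f}")
```

With this access pattern, one page more than the TLB holds is enough for every lookup to miss, which is the thrashing behaviour described above. Keeping intermediates in tile memory shrinks the number of distinct pages touched.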


Yeah but a TLB is for user space virtual memory mapping. When you're talking about GPU operations those usually go through opaque handles or allocated buffers and the DMA/read/write is done outside of user space (aside from initial data uploads or the case where you have unsynchronized user space memory buffers, but that's not usually the norm). Some of this changes a bit with Vulkan but it's not really clear from the thread how a TLB fits in.

Optimizing for TBR involves reducing drawcalls, buffer target switches and operations that are orthogonal to what a TLB historically does.


Pretty much all modern GPUs, discrete or integrated, use MMUs/TLBs to access out of core memory. I can confirm that the PowerVR has had an MMU since at least the SGX days (the page table management can be seen in the (barely) Linux kernel portion of their driver that they open sourced eons ago). Most of the other integrated GPUs got on board a few years later. Discrete GPUs have had MMUs for over a decade, since that gets them better perf, as there's less validation of the command buffers needed if worst case you can only crash your own user context. that_was_always_allowed.jpeg Leaning into those ubiquitous MMUs is half the point of Mantle/Vulkan/DX12/Metal, since you can rely on hardware protection and get to remove the older standards' required validation checks if you can only crash your own process's code (ideally, barring bugs that for sure exist).


...there are modern integrated GPUs that don't use virtual memory for all global resources? I expected that was a given to even just have an MMU enforce security when accessing RAM.


I'm not privy to the security mechanisms of the integrated GPUs I've worked with however I've got a decent amount of experience with mostly tile based GPUs and I've never hit TLB misses as a reason the GPU is going slow. On the CPU? Sure, but unless they want to get a lot more specific on what these optimizations are it's a bit hard to piece together what's going on here.

GPUs don't like to read around in memory (neither do CPUs for that matter, but they tend to branch and be a lot less predictable) so optimizing reads/writes is something you will do regardless. The painful parts of TBRs usually (but not universally) have to do with drawcall setup and render target switching. I remember some of the early iPhones had pretty painfully low drawcall limits associated with the overhead of dispatching the actual calls as opposed to triangle count or fillrate. It's fairly easy to put together the synthetic benchmarks to shake that out.


Well the claim is also that it isn't generally an issue for the M1's architecture until you scale up to 32 cores or more [1], which is beefier than any other current GPU outside of AMD and NVIDIA's offerings. So it's completely believable that the GPUs you worked on could have used virtual memory for everything without it ever becoming a performance issue.

(alternatively: it looks like the M1 Pro only has 24 MB L3, so it's likely that actual cache misses become an issue before TLB misses, but the Max/Ultra have more cache so that could invert)

[1] https://twitter.com/VadimYuryev/status/1514295707481501700


The claim is that if you don't "optimize" for TBDR then you are gated by the TLB and that software written for immediate mode rendering isn't optimized this way.

However drawcall batching, minimizing render target(and state!) switching are all bread-and-butter performance optimizations you'd make for an immediate mode renderer as well. You can get away with more drawcalls and there are specific things you can do on a per-architecture basis but they don't involve anything having to do with TLBs to the best of my knowledge. If someone wants to spill the beans on how the M1 is somewhat different here I'm all ears but it just doesn't line up with most GPU architectures I've seen.

The big wins between TBR and IMR usually involve getting larger tiles through less usage of rendertargets[1] as it increases the total size of the tiles which drives efficiencies in dispatching the tiles.

[1] https://developer.qualcomm.com/sites/default/files/docs/adre...


I don't know how modern GPUs work exactly, but you could enforce security on physical addresses with a significantly less demanding TLB circuit, and you could use big swaths of memory so the bounds are guaranteed to fit in your memory logic, no buffers or caches needed.


M1 is a unified memory architecture, so everything goes through the TLB. Although these days most GPUs can access some unified memory and have TLBs.


While the AGX does have MMUs and TLBs, unified memory and TLBs are orthogonal. For instance the Raspberry Pi's SoC has unified memory, but no MMU/TLB on the GPU. That IP block can only access physical memory.


You can make your own low-cost SuperH, as the SuperH patents (but not the brand name) have expired. If you are interested, write to the j-core mailing list with what kind of chip you are interested in (is it more a microcontroller or an older mobile-class chip?) and see if they already have a chip that suits that need. If the people behind J-Core ever print silicon they surely won't mind printing and packaging in larger numbers and selling some as part of the batch.


> MaxTech

Is he popular? I’ve only run across his videos once or twice but I get the feeling our estimations of his technical knowledge match.


Yes, he is popular. Yes, he loves click bait. No, he does not have a deep understanding of technical subject matter. All that said, as a journalist he seems to genuinely try hard to get to the bottom of technical issues relating to Apple and report on them.


"Journalist" might be a bit strong? He is clearly pretending to have a much deeper understanding of the subject than he does, meanwhile throwing around accusations of design flaws...


A bigger cache would help GPU apps which are not optimized for a tile-based approach perform better. That this becomes a constraint on the Ultra, and that some apps seem to suffer worse scaling on the Ultra than on the other products in the line, seems reproducible though.


What is TBIM? Is it a tile based render that doesn't bother rejecting fully occluded geometry? If not, what is it?


So the "design limitation" in the chip can be overcome by updating apps to support Apple's TBDR tile memory system that was presumably created to maximize processor performance? Am I understanding the design limitation correctly?


It's bait for bad journalists.


Many people would call it a design flaw because it leaves a ton of the chip's performance on the table for common workloads in exchange for essentially nothing.

TLB size is not where you're cutting costs. If you need to make it bigger so that common workloads can max out the other parts of the chip, it's a no-brainer decision. This isn't a sensible tradeoff.


This was an architecture change. The hoops one had to jump through going from intel to ARM were crazy in lots of other places. Or on gaming platforms (in the past).

What I'm confused with is how many silicon engineers are so easily able to make the absolute determination that "This isn't a sensible tradeoff."

The tweets' terminology frankly seems weak; I've never even heard of a 32MB TLB - that's insanely huge.


How do you know it was for nothing? Do you have some insider info on the decision?


Do you have insight into the design trade-offs made by the Apple silicon teams? Because it sounds like Apple's silicon team disagrees and took a different design approach to what you would do in your designs?


But those common workloads might just have to tweak their allocator to request larger pages? That's not a very big demand for an architecture change.


I note the only metric they offer is power usage. A poor substitute for actual performance numbers. In particular in any given CPU/GPU you could have all cores busily computing, nothing stalled and still be a good way below maximum power. To hit maximum power you need to carefully construct software that will fully utilize all functional units as well as causing maximum toggling within computation paths (1 + 2 will toggle fewer bits than 0xffffffff + 0xffffffff).

The thread doesn't give any clear indication if their explanation that the GPU sees significant stalls due to waiting on TLB miss has actual data behind it or is pure conjecture based upon observed power usage.
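
The "1 + 2 vs 0xffffffff + 0xffffffff" remark can be made concrete with a toy model. This is only a crude proxy: it counts how many bit positions produce a carry in a simulated 32-bit ripple-carry adder. Real switching power depends on the actual circuit and on the previous state of every node, and the function name is made up for illustration.

```python
def carries_in_add(a: int, b: int, width: int = 32) -> int:
    """Count bit positions that produce a carry-out in a ripple-carry addition."""
    carry = 0
    count = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        # Full-adder carry-out: generated by x AND y, or propagated by x XOR y.
        carry = (x & y) | (carry & (x ^ y))
        count += carry
    return count


print(carries_in_add(1, 2))                    # no carry chain at all
print(carries_in_add(0xFFFFFFFF, 0xFFFFFFFF))  # a carry at every bit position
```

So two workloads can both keep the adders "busy" while exercising very different amounts of logic, which is why power draw alone says little about utilization.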


> "The M1 family of chips, including the M1 Ultra, has a major limitation that can't be fixed unless apps are properly optimized."

So am I understanding it correctly - if an app is properly optimized - then it's not even an issue?


Any time there's a massive change in architecture, it should be expected improvements will be found in tweaking.


I may have misunderstood things, but isn't the gist that software written correctly for this unique hardware isn't subject to the bottleneck? Is it really a hardware problem then, if it can't accommodate suboptimal software with maximum performance? This is after all not an Nvidia or AMD/ATI GPU architecture, and I wonder if anyone ever brought a similar argument forward in the many cases of this or that game having distinctly poor 3D performance on any of their hardware.


<laughs in mobile developer>

Seems a bit apocalyptic. Tile based architectures are everywhere in the mobile space. Is it really so inconceivable that macos devs can adjust to it?


Apple uses a radically different architecture and apps need to be redesigned in order to take full advantage of the hardware.

It's nothing new.


This guy knows nothing about computers. He is a Youtuber with zero expertise in anything but creating Youtube videos.


The author of those tweets doesn't seem to understand what a TLB is (their terminology is off), doesn't seem to understand how GPUs work (they are explicitly designed to hide memory stalls by switching to a different kernel, which is why they have huge register files), doesn't seem to understand the memory hierarchies (why would a GPU have its own TLB in the first place?). Besides, we know that M1 comes with a 3072-entry TLB that matches its maximal 48MB cache size, so where did they come up with the "32MB" figure anyway?
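
For reference, the arithmetic behind a "TLB covers N MB" statement can be sketched as below. The 3072-entry figure is the one quoted above. The 16kB page size is an assumption (it is suggested elsewhere in this discussion), and the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory that can be mapped at once without a TLB miss: entries * page size."""
    return entries * page_size_bytes


# 3072 entries with (assumed) 16 KiB pages
print(tlb_reach_bytes(3072, 16 * KIB) // MIB, "MiB")

# A "32MB" reach would correspond to, for example, 2048 entries of 16 KiB
print(tlb_reach_bytes(2048, 16 * KIB) // MIB, "MiB")
```

The point is that a TLB is sized in entries, and the megabyte figure only falls out once a page size is fixed.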

It is entirely possible that M1 Ultra is bandwidth starved for some workloads, but that can be demonstrated using GPU profiling tools (which Apple readily provides) and not by musing about technical concepts that one barely understands.


I agree with most of what you say but

> why would a GPU have its own TLB in the first place?

is because modern GPUs have their own MMUs and have for quite a while. More or less ubiquitous MMUs on GPUs are half the reason why Mantle/Vulkan/DX12/Metal came into being.


Thanks for this! Can you elaborate a bit more and maybe give me some pointers where I can read about these things in more depth? I of course know about the MMU on traditional dGPUs but I was not aware that “integrated” GPUs also have a separate MMU. How does it work and why is it necessary given that the GPU shares the last level cache and the memory controller with the rest of the system?


Because the TLBs and other MMU hardware aren't in the last level cache or memory controller on the vast majority of systems, they normally sit about at L1 on each core. This is because you want at least L2 to be speaking completely in physical addresses so the coherency protocol isn't confused by the same page mapped in different ways. L1 is more complex, typically being virtually indexed but physically tagged. So when an access is issued with the virtual address, in parallel to the cache set lookup, the TLB translates to a physical address, and then the physical address is compared against the tags of the cache line ways in the set that matched the virtual address in order to select the actual cache line for the op.
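
A minimal software sketch of that virtually indexed, physically tagged lookup follows. The sizes are assumptions chosen for illustration (16 KiB pages, 64-byte lines, 256 sets), the names are made up, and the two steps that hardware performs in parallel are simply run one after the other here.

```python
PAGE_BITS = 14  # assumed 16 KiB pages
LINE_BITS = 6   # assumed 64-byte cache lines
SET_BITS = 8    # assumed 256 sets; index bits stay inside the page offset


def set_index(addr: int) -> int:
    """Set index taken from address bits that are identical in virtual and physical form."""
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)


def phys_tag(paddr: int) -> int:
    return paddr >> (LINE_BITS + SET_BITS)


def translate(tlb: dict, vaddr: int) -> int:
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    return (tlb[vpn] << PAGE_BITS) | offset  # a KeyError would model a TLB miss


def fill(cache: dict, paddr: int, data) -> None:
    cache.setdefault(set_index(paddr), []).append((phys_tag(paddr), data))


def lookup(cache: dict, tlb: dict, vaddr: int):
    ways = cache.get(set_index(vaddr), [])  # set selected with virtual address bits
    tag = phys_tag(translate(tlb, vaddr))   # TLB supplies the physical tag
    for way_tag, data in ways:              # compare against each way's physical tag
        if way_tag == tag:
            return data
    return None                             # cache miss


tlb = {0x5: 0x9, 0x7: 0x9}  # two virtual pages aliasing one physical page
cache = {}
fill(cache, (0x9 << PAGE_BITS) | 0x40, "line A")
print(lookup(cache, tlb, (0x5 << PAGE_BITS) | 0x40))
print(lookup(cache, tlb, (0x7 << PAGE_BITS) | 0x40))
```

Both virtual aliases find the same line because the tag comparison is done on the physical address, which is the property the coherency argument relies on.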


I'm curious how they claimed that the cooling solution is over-designed because they were only able to achieve 80% of the spec'd power target (most likely TDP since people stopped using max power in the 00's). There are a number of factors that go into spec'ing this value, typically quite a few test vectors to determine typical workloads and high-power targeted workloads. A mistake in power budget of 20% is gigantic and would have significant impact on sorting die. Now maybe Apple doesn't care about this because they aren't a silicon vendor, but that budget comes out of battery life and cooling solution space. My point is, I find it highly unlikely Apple overstated their power target by 20% given how much money it would have saved them. But maybe they did?


Well, 20% headroom on a badly optimised workload that can't make full use of the GPU seems decent. Run some better code and you'll need that 20%.

As covered in other comments, the thread author has no idea what they're writing about.


Everyone in this comment section is saying "it's nothing new" as if that excuses the issue. From the way it's described this is a software optimization problem. So the question is, how hard is it to optimize an existing application?

If it's a few quick changes and a rebuild, then it's nothing. If it's anything more you'll have an Itanium-like product flop. Maybe. It's also possible that node size and computational headroom relative to the software of the day will make it hard for consumers to notice or care.

Either way this will be interesting to watch.


> So the question is, how hard is it to optimize an existing application?

TBD how hard it is to optimize an existing application to remove this bottleneck. The tweets seem to suggest it's pretty involved. Alternatively, there might be a simple way to use a larger page size (called "huge pages" on Linux, possibly "superpages" on macOS), which would greatly relieve the TLB as a bottleneck, if TLB = translation lookaside buffer. The tweets seem confused enough that I'm not sure.
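
To show why a larger page size would help, here is a back-of-the-envelope sketch. The working set (1 GiB), the TLB size (2048 entries) and the 2 MiB large-page size are all hypothetical numbers picked for illustration, not measured properties of the M1, and the function name is made up.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def entries_needed(working_set_bytes: int, page_size_bytes: int) -> int:
    """TLB entries required to map the whole working set at once."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division


TLB_ENTRIES = 2048  # hypothetical TLB size
WORKING_SET = GIB   # hypothetical working set

for page_size in (16 * KIB, 2 * MIB):
    needed = entries_needed(WORKING_SET, page_size)
    print(page_size // KIB, "KiB pages:", needed, "entries needed,",
          "fits" if needed <= TLB_ENTRIES else "does not fit")
```

With small pages the working set needs far more entries than the TLB holds, so misses are unavoidable. With large pages the same working set fits comfortably.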

> If it's a few quick changes and a rebuild, then it's nothing. If it's anything more you'll have an Itanium-like product flop.

Meh, if it means many applications only reach 86W rather than 105W, that's probably not world-ending. Most applications on any platform have some bottleneck that they could avoid with more optimization effort, whether it's the TLB or something else. Users usually don't complain in spite of these bottlenecks. And:

> It's also possible that node size and computational headroom relative to the software of the day will make it hard for consumers to notice or care.

Yeah, people seem pretty happy with these processors.


In general this account (Vadim Yuryev/MaxTech) churns out a lot of specious and empty content on YouTube at a ridiculous pace. One of the worst highly-subscribed "tech review" channels out there. Not that surprised to see this thread being subject to heavy discussion and inaccuracies being pointed out.


I'm confused. A 32 megabyte TLB is huge. TLB entries are tiny objects; and each one caches the mapping for an entire VM page. The larger the page size, the more "leverage" the TLB has.

Maybe the blogger means that the TLB controls a VM working set or footprint of 32 MB?


So it can only use 68w instead of 105w? That's the crux of the complaint here?

This honestly makes me want to throw out all my other computers and GPUs that do whatever they do at any performance metric simply because they use so much more energy to do so


If you just use regular vertex and fragment shaders then all of that is automatically handled. A game like World of Warcraft might not use advanced GPU features that would slow down a TBDR.


"If an application has not been optimized for the M1 GPU architecture's tile memory"

Huh. What if it's written in crayon? Is the TLB a problem then?


can this be fixed by good toolchains or do people have to #ifdef their code around it?


They probably have to #ifdef.


Y’all just got nerd sniped.


god why can’t he write a blog post? 5000 tweets in your face


At least he could just take best practices from the Twitter CEO and post text as a picture.


Everyone calm down. Not every observation on Apple's stuff is earth-shattering. In fact... none of them are, but Apple is of course partly or mostly to blame for revving up everyone's engines.

The M1 Ultra Studio is not what you think it is. It is not Apple's flagship high-performance machine. It isn't. That is still the Intel Mac Pro. The Studio is the Mac Mini update that was rumored 6 months before. The Studio is a Mac Mini, they just gave it a new name. IOW the Studio is the low-end and entry level machine for professionals.

Also, the current newest generation Apple hardware is not a revolution. Certainly, Apple deserves praise for seamless platform switching, nice work, truly. But these new Macs are only a little bit more performant than the previous generation. Check the benchmarks. Apple's best years were 2010-2012, where each year they doubled the performance of their machines of the previous year. It took another 6 years for performance to double again. The M1, M1 Pro, M1 Max, and M1 Ultra are all marginally faster than the previous generation, the Intel Macs. Except for 2010-2012, which was amazing, this slight increase in performance is entirely typical of new Macs compared to whatever model's previous iteration.

This twitter guy is talking about GPU and games. I realize GPU is important to the computing industry, but not to me. I've been a Mac user since 1989, and even if it was and is included in the machine, I've never once used the GPU. It just sits idle there while I saturate the CPU.

Macs are tools for professionals. Professionals don't play games with their tools. If Apple has somehow offended the gaming community by not catering to their needs, and also boxed the GPU performance in the M1 Ultra, it's really a win-win. Ok, it's a little embarrassing, but regardless, doesn't deserve the expected outrage from the sensationalist reporting. How dare Apple! Right? It's more like, ok, well, that sucks a little, but who gives a shit?


More and more (professional) apps use the gpu for compute. So does macOS. The days of using a gpu only for games or even graphical apps are long over. I’m pretty sure yours isn’t as idle as you think it is.


> More and more (professional) apps use the gpu for compute.

I expect 3D rendering applications, video editing applications and image manipulation applications to use GPU, as they always have since they first could. Machine Learning uses GPU, but the bigger M1 chips have the Neural Engine for that, so maybe not so much anymore on macOS. I doubt anyone would bother mining crypto on commodity hardware. But when you refer to, "more and more (professional) apps," I don't know what you're referring to. DAWs will only use GPU for displaying the interface, iow, barely perceptibly, not for applying filters or effects on audio. Spreadsheets, accounting software, financial software, municipal software, inventory software won't need GPU. Desktop publishing applications don't need GPU. CAD software doesn't need GPU. Beyond rendering, video editing, accelerated image manipulation, Machine Learning, cryptomining and games, I don't know for what else a GPU can be used.

macOS uses GPU to drive displays and for the GUI, but hardly, and otherwise, not so much. Why would it? How could a GPU make system operating more efficient? Open up your Activity Monitor, under the Window menu open the GPU History window, and you can watch how macOS doesn't really utilize the GPU. It must use it for Quartz screen rendering, but it is such a tiny amount it doesn't even register.


> Spreadsheets, accounting software, financial software, municipal software, inventory software won't need GPU. Desktop publishing applications don't need GPU.

All of those use the GPU since modern UI toolkits are designed to use GPU offload.

> CAD software doesn't need GPU.

Wat.


> Everyone calm down.

I'd say the overall tone of comments is pretty skeptical.



