Hacker News

It's a contributing factor. If things like retain/release are fast and you have significantly more memory bandwidth and low latency to throw at the problem, you can get away without preloading and caching nearly as much. Take something simple like images on web pages: don't bother keeping hundreds (thousands?) of decompressed images in memory for all of the various open tabs. You can just decompress them on the fly as needed when a tab becomes active and then release them when it goes inactive and/or when the browser/system determines it needs to free up some memory.
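The decompress-on-demand pattern described above can be sketched in a few lines. This is a minimal illustration, not browser code: the class and method names are hypothetical, and zlib stands in for a real image codec (PNG/JPEG). The compressed bytes are always kept; the decompressed copy exists only while a "tab" is active.

```python
import zlib

class TabImageCache:
    """Hypothetical sketch: keep images compressed, inflate only for
    the active tab, and drop the inflated copy when the tab goes
    inactive -- trading a little CPU for a lot of memory."""

    def __init__(self):
        self._compressed = {}    # tab_id -> compressed bytes (always kept)
        self._decompressed = {}  # tab_id -> raw bytes (active tabs only)

    def store(self, tab_id, raw):
        self._compressed[tab_id] = zlib.compress(raw)

    def activate(self, tab_id):
        # Decompress on the fly when the tab becomes active.
        self._decompressed[tab_id] = zlib.decompress(self._compressed[tab_id])
        return self._decompressed[tab_id]

    def deactivate(self, tab_id):
        # Release the decompressed copy; it can always be re-inflated.
        self._decompressed.pop(tab_id, None)

cache = TabImageCache()
cache.store("tab1", b"pixel data " * 1000)
raw = cache.activate("tab1")   # inflated while the tab is visible
cache.deactivate("tab1")       # decompressed copy freed when inactive
```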



You've completely changed the scope of what's being discussed, though. Retain/release being faster would just surface as regular performance improvements. It won't change anything at all about how an existing application manages memory.

It's possible that apps have been completely overhauled for a baseline M1 experience. It's extremely, extraordinarily unlikely that anything remotely of the sort has happened, though. And since M1-equipped Macs don't have any faster IO than what they replaced (disk, network, and RAM speeds are all more or less the same), there wouldn't be any reason for apps to have done anything substantially different.


From the original article:

Third, Marcel Weiher explains Apple's obsession with keeping memory consumption under control, from his time at Apple, as well as the benefits of reference counting:

>where Apple might have been “focused” on performance for the last 15 years or so, they have been completely anal about memory consumption. When I was there, we were fixing 32 byte memory leaks. Leaks that happened once. So not an ongoing consumption of 32 bytes again and again, but a one-time leak of 32 bytes.

>The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound). While I haven’t seen a study comparing RC, my personal experience is that the overhead is much lower, much more predictable, and can usually be driven down with little additional effort if needed.
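The immediacy that makes RC's footprint smaller can be observed in any reference-counted runtime. CPython, for instance, also uses reference counting: an object is reclaimed the moment its last strong reference disappears, rather than sitting in the heap until a collector runs. A small illustration (this is CPython behavior, not Apple's ARC):

```python
import weakref

class Buffer:
    """Stand-in for some allocation we want to watch."""
    pass

buf = Buffer()
ref = weakref.ref(buf)    # observe the object without owning it

assert ref() is not None  # still alive: refcount > 0
del buf                   # last strong reference dropped
assert ref() is None      # reclaimed immediately -- no GC pause, no
                          # garbage lingering until the next collection
```

This immediate reclamation is why an RC heap tends to track live data closely, whereas a tracing GC deliberately lets garbage accumulate between collections.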


But again, that didn't change with M1. We're talking macOS vs. macOS here. Your quote is fully irrelevant to what's being discussed, which is the outgoing 32 GB MacBook vs. the new 16 GB-max ones. They are running the same software, using the same ObjC & Swift reference-counting systems.


We've come full circle here.

ARC is not specific to M1, BUT it has been widely used in ObjC & Swift for years, AND it is thus heavily optimized on M1, which performs "retain and release" way faster (even when emulating x86).

A perfect illustration of Apple's long-term software+hardware strategy.


That still doesn't mean that M1 Macs use less memory. If retain/release is faster, then the M1 Macs have higher performance than Intel Macs. That is easily understood. The claim under contention here is that M1 Macs use less memory, which is not explained by hardware-optimized atomic operations.
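For context, the retain/release being discussed is essentially an atomic increment/decrement of a per-object counter, with deallocation when it reaches zero. A toy sketch of that mechanism (not Apple's actual ObjC runtime; names are illustrative, and a lock stands in for the single hardware atomic instruction that M1 reportedly speeds up):

```python
import threading

class RefCounted:
    """Toy retain/release. Real runtimes do the count update with one
    atomic CPU instruction; that instruction's speed affects runtime
    performance, not how much memory the program holds."""

    def __init__(self):
        self._count = 1                # creator holds the first reference
        self._lock = threading.Lock()  # stand-in for an atomic inc/dec
        self.deallocated = False

    def retain(self):
        with self._lock:
            self._count += 1
        return self

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._dealloc()

    def _dealloc(self):
        self.deallocated = True  # real code would free the memory here

obj = RefCounted()
obj.retain()             # a second owner appears
obj.release()            # first owner is done
assert not obj.deallocated
obj.release()            # last owner is done -> freed immediately
assert obj.deallocated
```

Making `retain`/`release` cheaper speeds every one of these calls up, but the object graph (and hence the memory footprint) is identical either way, which is the point being made above.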


And I never stated that. It's just more optimized.


Ok. However the posts in this thread were asking how the M1 Macs could use less RAM than Intel Macs, not if they were more optimized. The GP started with:

>This quote doesn’t really cover why M1 macs are more efficient with less ram than intel macs? You’ve got a memory budget, it’s likely broadly the same on both platforms


Well, if less memory is used to store garbage thanks to RC, less memory is needed. But that was largely discussed in other sub-comments, which is why we focused more on the optimisation aspect in this thread.


>Well, if less memory is used to store garbage thanks to RC, less memory is needed

But both Intel Macs and ARM Macs use RC. Both chips are running the same software.


Aren't most big desktop apps, like Office on PC, still written in C++? Same with almost all AAA games, and the operating system itself.

Browsers are written in C++, and JavaScript has a full-blown GC.

I don't see how refcounting gives you an advantage over manual memory management for most users.


Decompression is generally bound by CPU speed, not memory bandwidth or latency.


CPU speed is often bound by memory bandwidth and latency... it's all related. If you can't keep the CPU fed, it doesn't matter how fast it is theoretically.


What I mean is that (to my understanding) memory bandwidth in modern devices is already high enough to keep a CPU fed during decompression. Bandwidth isn't a bottleneck in this scenario, so raising it doesn't make decompression any faster.


RAM bandwidth limitations (latency and throughput) are generally hidden by the multiple layers of cache between the RAM and the CPU, which prefetch more data than is immediately needed. Having memory on-chip could reduce latency, but as ATI showed with HBM memory on a previous generation of its GPUs, it's not a silver-bullet solution.

I am going to speculate now, but maybe, just maybe, if some of the silicon that Apple has used on the M1 is used for compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPUs and allows a compressed stream of data from memory, they achieve greater RAM bandwidth, lower latency, and less usage for a given amount of memory. If this is the case I hope that the memory has ECC and/or the compression has parity checking...
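For what it's worth, macOS already does a software version of this idea: the memory compressor introduced in OS X Mavericks compresses inactive pages instead of swapping them out. A rough sketch of the principle, with zlib standing in for whatever compressor the hardware or OS actually uses:

```python
import zlib

# A repetitive 16 KiB "page" -- idle app heaps often compress well.
page = ("x" * 512 + "y" * 512).encode() * 16

# Inactive page: keep only the compressed form.
compressed = zlib.compress(page)
print(len(page), "->", len(compressed))  # much smaller resident size

# Page is touched again: inflate it back, bit-for-bit identical.
restored = zlib.decompress(compressed)
assert restored == page
```

The trade is the same whether done in software or dedicated silicon: CPU (or die area) spent compressing in exchange for more effective capacity and bandwidth per byte moved.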


> I am going to speculate now, but maybe, just maybe, if some of the silicon that Apple has used on the M1 is used for compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPUs and allows a compressed stream of data from memory, they achieve greater RAM bandwidth, lower latency, and less usage for a given amount of memory.

Are you aware of any x86 chips that utilize this method?


Not that I'm aware of. I remember seeing Apple doing something like it in software on the Intel Macs, which is why I speculated about it being hardware on the M1.

Cheers


> Blosc [...] has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations (which is typical in vector-vector operations).

https://blosc.org/pages/blosc-in-depth/



