Hacker News

The "pipeline bubbles" remark refers to a processor's decode/issue logic inserting no-ops (bubbles) into an execution unit's pipeline while it waits for a value that another execution unit is still using. For example, to release some memory in a GC language, you just drop the reference while the pipeline runs at full speed and leave it for the garbage collector to figure out. In a refcount situation, you also need to decrement the refcount. Since more than one processing unit might be incrementing and decrementing this refcount at the same time, it can become a hot spot in memory: one processing unit has to bubble for a number of clock cycles until the other has finished updating it. If each refcount update takes 8 clock cycles, then the same count can never be updated more than once per 8 cycles. In extreme situations, the decoder might bubble all processing units except the one updating that refcount.

For the last few decades the industry has generally believed that GC lets code run faster, although it has drawbacks in being wasteful with memory and unsuitable for hard-realtime code. Refcounting has been thought inferior, although that hasn't stopped the Python folks and others from being successful with it. It sounds like Apple uses refcounting as well and has found a way to improve its speed, which usually implies dedicated silicon support.

I'd speculate that moving system memory on-chip wasn't just about using fewer chips, but also about decreasing memory latency. A CPU cache reduces memory latency, but lowering the latency of all of RAM is arguably better. They may have mitigated refcounting hot spots by lowering latency across the board.

From Apple's site:

"M1 also features our unified memory architecture, or UMA. M1 unifies its high-bandwidth, low-latency memory into a single pool within a custom package. As a result, all of the technologies in the SoC can access the same data without copying it between multiple pools of memory." That is paired with a diagram that shows the cache hanging off the fabric, not the CPU.

That says to me that, much as the CPU and graphics card have traditionally shared main memory, they have now turned the cache from a CPU-only resource into a shared resource as well. I wonder if the GPU can now update refcounts directly in the cache? Is that a thing that would be useful?


