I don't think that's quite right. Apple believes strongly in retain-and-release / ARC. It has designed its software that way; it has designed its M1 memory architecture that way. The harmony between those design considerations leads to efficiency: the software does things in the best way possible, given the memory architecture.
I'm not an EE expert and I haven't torn apart an M1, but Occam's Razor would suggest it's unlikely they made specialized hardware for NSObjects specifically. Other ARC systems on the same hardware would likely see similar benefits.
I suspect that Apple didn't do anything special to improve the performance of reference counting apart from not using x86. Simply put, the x86 ISA and memory model are built on the assumption that atomic operations are mostly used as part of some higher-level synchronization primitive, not for their direct result.
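To make the "direct result" point concrete, here's a rough Kotlin sketch of what a retain/release pair boils down to. The names (RefCounted, Resource) are made up and the real Objective-C runtime is far more elaborate, but the shape is the same: the value returned by the atomic decrement *is* the point of the operation, it isn't just guarding a critical section.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical resource with an explicit cleanup step.
class Resource(val name: String) {
    fun close() = println("freed $name")
}

// Manual reference counting: every retain/release is an atomic read-modify-write,
// and the returned count decides whether to free. The atomic op is used for its
// direct result, not as part of a lock.
class RefCounted(private val resource: Resource) {
    private val count = AtomicInteger(1)

    fun retain(): RefCounted {
        count.incrementAndGet()
        return this
    }

    fun release() {
        if (count.decrementAndGet() == 0) {
            resource.close()
        }
    }
}

fun main() {
    val handle = RefCounted(Resource("buffer"))
    val alias = handle.retain()   // handed to another owner: count = 2
    alias.release()               // count = 1
    handle.release()              // count = 0 -> freed
}
```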
One thing is that the M1 has incredible memory bandwidth and is implemented on a single piece of silicon (which certainly helps with low-overhead cache coherence). Another is that Rosetta certainly does not have to preserve the exact behavior of x86 (and in fact it cannot, because doing so would negate any benefit of dynamic translation); it only has to care about what can be observed by the user code running under it.
The hardware makes uncontended atomics very fast, and Objective-C is a heavy user of those. But it would equally help any other application that uses them.
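A crude way to get a feel for that cost on whatever machine you're on. This is not a real benchmark (no JMH, no warmup, the JIT can do whatever it likes to the plain loop), so only the rough ratio between the two loops means anything:

```kotlin
import java.util.concurrent.atomic.AtomicLong
import kotlin.system.measureNanoTime

// Single-threaded, so the atomic increments are always uncontended.
fun main() {
    val iterations = 100_000_000
    var plain = 0L
    val atomic = AtomicLong(0)

    val plainNs = measureNanoTime {
        for (i in 0 until iterations) plain++
    }
    val atomicNs = measureNanoTime {
        for (i in 0 until iterations) atomic.incrementAndGet()
    }

    // Results are printed so the loops can't be eliminated outright.
    println("plain increment:  ${plainNs / iterations.toDouble()} ns/op ($plain)")
    println("atomic increment: ${atomicNs / iterations.toDouble()} ns/op (${atomic.get()})")
}
```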
But GC'd languages don't need to hit atomic ops constantly the way ref-counted Objective-C does, so making them faster (though still not as fast as regular non-atomic ops) only reduces the perf bleeding from the decision to use RC in the first place. Indeed, GC is generally your best choice for anything where performance matters a lot and RAM isn't super tight, like on servers.
Kotlin/Native lets us do this comparison somewhat directly. The initial and still-current versions used reference counting for memory management. K/N binaries were far, far slower than the equivalent Kotlin programs running on the JVM, and the developer had to deal with the hassle of RC (e.g. manually breaking cycles). They're now switching to GC.
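For anyone who hasn't hit it, the "manually breaking cycles" hassle looks roughly like this (Parent/Child are made-up names, and a weak back-reference is the other usual fix):

```kotlin
// Under pure reference counting, parent -> child -> parent keeps both counts
// above zero forever, so the programmer has to break the cycle by hand.
class Child {
    var parent: Parent? = null
}

class Parent {
    val child = Child()
}

fun main() {
    val parent = Parent()
    parent.child.parent = parent   // cycle created

    // A tracing GC would reclaim both objects once they become unreachable;
    // with refcounting you must null the back-reference (or make it weak)
    // before dropping the last external reference.
    parent.child.parent = null
}
```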
The notion that GC is less memory efficient than RC is also a canard. In both schemes your objects carry a word of header overhead (a mark word in one case, a reference count in the other). What GC does let you do, though, is delay the work of deallocation until you actually need the memory back.

A lot of people find this quite confusing. They run an app on a machine with plenty of free RAM, observe that it uses way more memory than it "should", and assume the language or runtime is really inefficient. What actually happened is that the runtime either didn't collect at all, or collected but didn't bother giving the RAM back to the OS, on the assumption that it's going to need it again soon and hey, the OS doesn't seem to be under memory pressure.
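You can watch this happen on a stock JVM with a few lines of Kotlin. Committed heap stays put after the objects become garbage, and whether a collection hands pages back to the OS depends on which collector you're running:

```kotlin
// totalMemory() reports heap committed by the JVM, not live objects.
fun mb(bytes: Long) = bytes / (1024 * 1024)

fun main() {
    val rt = Runtime.getRuntime()
    println("committed at start:        ${mb(rt.totalMemory())} MB")

    var garbage: List<ByteArray>? = List(200) { ByteArray(1_000_000) }  // ~200 MB
    println("after allocating ~${garbage?.size} MB: ${mb(rt.totalMemory())} MB committed")

    garbage = null   // everything above is now unreachable...
    println("after dropping references: ${mb(rt.totalMemory())} MB committed")

    System.gc()          // ...but nothing is reclaimed until a collection runs,
    Thread.sleep(1000)   // and even then the heap may stay committed, GC-dependent
    println("after a GC hint:           ${mb(rt.totalMemory())} MB committed")
}
```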
These days you can fix that on the JVM just by using a recent release: the runtime will collect when the app is idle and hand unused memory back to the OS.
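If I remember right, the G1 knob for this arrived in JDK 12 with JEP 346 (periodic collections that return unused committed heap when the process is idle), and I believe ZGC and Shenandoah uncommit unused memory on their own. Roughly (app.jar is just a placeholder, and flag names vary by collector and JDK version):

```
# JDK 12+, G1: run a periodic collection when idle (interval in milliseconds)
# and return unused committed heap to the OS (JEP 346).
java -XX:+UseG1GC -XX:G1PeriodicGCInterval=60000 -jar app.jar
```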