This just shows how much of a leg up garbage-collected languages have when it comes to standalone lock-free structures.

Sure, this is technically a wait-free hashmap implemented in C++, which is pretty cool... but it performs two atomic operations on every read, so pure reader threads accessing the same element will contend over the same cache line. One of the most desirable properties of a lock-free concurrent hashmap is cheap reads, since read-only scenarios are so common. The proof is in the pudding: as noted at the end, this map is 200% slower than a lock-based implementation under a read-heavy load, which is exactly the type of load where a lock-free map should be able to exploit full concurrency.
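
The article doesn't spell out what those two atomics are (I'm guessing a pin/unpin-style reference count, which is the usual culprit - this is my assumption, not the article's code), but the shape of the problem looks like this: every reader does a read-modify-write on a shared line, so even read-only traffic serializes on it:

    // Hypothetical illustration, not this map's actual code: if every
    // lookup bumps a shared reference count to pin the node, read-only
    // threads all do read-modify-writes on the same cache line.
    #include <atomic>

    struct Node {
        std::atomic<int> refcount{1};
        int key;
        int value;
    };

    int lookup(Node* n) {
        n->refcount.fetch_add(1, std::memory_order_acquire);  // atomic #1: pin
        int v = n->value;                                     // the actual read is trivial
        n->refcount.fetch_sub(1, std::memory_order_release);  // atomic #2: unpin
        return v;
    }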

Java has had zero-atomic ConcurrentHashMap reads since inception, and they are fast (a couple of nanos or so). I'm sure other garbage-collected languages are the same. The big thing garbage collection gives you is automatically solving the memory reclamation problem: it's no coincidence that this problem is highlighted in the article and that the atomic-ops-on-read cost comes directly from it. Garbage collection avoids it entirely, because you never explicitly delete the removed nodes; you leave it up to the garbage collector, which can use magic beyond the reach of mere mortals to free them safely at a later time.

To get a concurrent[1] hashmap with reasonable read performance, i.e., "solving" the memory reclamation problem without atomics on the read path, I am only aware of the following techniques:

1) Something like quiescent-state based reclamation: basically each thread occasionally signals when it is definitely not referencing any map nodes, at which point (or some later point) the implementation can determine it is safe to clean them up. Effectively it allows many fast read calls by batching up the "ok to reclaim" notification into a single occasional quiescent checkpoint call. It introduces some delay in reclamation, and you have to have a suitable place to put the quiescent checkpoint. If it works, I think this is perhaps the fastest. See for example https://github.com/rmind/libqsbr .
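
Here's a minimal hand-rolled sketch of the QSBR idea (this is not libqsbr's actual API; the names and structure are mine):

    // Minimal QSBR sketch, not libqsbr's API. Readers do zero atomics
    // per lookup; each reader thread just bumps a per-thread counter at
    // a point where it is guaranteed to hold no pointers into the map.
    #include <atomic>
    #include <cstdint>

    constexpr int kThreads = 4;

    struct alignas(64) QsbrState {          // padded to avoid false sharing
        std::atomic<uint64_t> count{0};
    };
    QsbrState qsbr[kThreads];

    // Reader: call at a natural quiescent point, e.g. once per
    // event-loop iteration, when no map nodes are referenced.
    void quiescent_checkpoint(int tid) {
        qsbr[tid].count.fetch_add(1, std::memory_order_release);
    }

    // Reclaimer: after unlinking a node, wait until every thread has
    // passed a checkpoint; the node is then unreachable and freeable.
    // (Real implementations batch retired nodes and handle threads
    // that go idle or exit; this sketch would spin on an idle thread.)
    void synchronize() {
        uint64_t snap[kThreads];
        for (int i = 0; i < kThreads; ++i)
            snap[i] = qsbr[i].count.load(std::memory_order_acquire);
        for (int i = 0; i < kThreads; ++i)
            while (qsbr[i].count.load(std::memory_order_acquire) == snap[i])
                ;  // spin/yield until thread i checkpoints
    }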

2) Use some kind of Dekker-like flagging to coordinate between readers and deleters. E.g., a reader sets a going_to_read flag and a deleter sets a going_to_delete flag, and then each side checks the other's flag and proceeds only if it is not set. This doesn't work with "plain" writes and reads on almost any arch, because nearly all (including x86) allow store-load reordering, so you need a barrier. The good news is that you can move all the cost of the barrier to the writers with an asymmetric membarrier like http://man7.org/linux/man-pages/man2/membarrier.2.html . The cost for the deleters is very high though (a syscall at least), so you need a read-heavy load to make this work. It's not as fast as QSBR, since you still need to do some extra (non-atomic) ops on the read side, and you have to figure out how to handle multiple readers.
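
Roughly like this (my own simplified sketch, with a single reader flag; a real version needs per-thread flags and has lifecycle subtleties I'm glossing over):

    // Asymmetric-barrier sketch: readers use plain stores plus
    // compiler-only barriers; the deleter pays for the real fence via
    // membarrier(2), which makes every running thread execute a full
    // memory barrier. Single reader flag for brevity.
    #include <atomic>
    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct Node { int value; };

    std::atomic<bool> reader_active{false};  // per reader thread in reality
    std::atomic<Node*> slot{nullptr};

    int read_value() {
        reader_active.store(true, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier, no CPU fence
        Node* n = slot.load(std::memory_order_relaxed);
        int v = n ? n->value : -1;
        std::atomic_signal_fence(std::memory_order_seq_cst);
        reader_active.store(false, std::memory_order_relaxed);
        return v;
    }

    void delete_node(Node* old) {
        slot.store(nullptr, std::memory_order_relaxed);       // unlink first
        // The expensive part (a syscall): forces a full barrier on all
        // cores, upgrading the readers' compiler barriers after the fact.
        syscall(SYS_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
        while (reader_active.load(std::memory_order_relaxed))
            ;  // wait out any in-flight reader that saw the old node
        delete old;  // no reader can still reach it
    }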

3) Never actually free the memory. Rather than using plain malloc/free for nodes, just keep recycling the same nodes over and over, so once a reader gets a node, it is always "safe" to access it. Of course, this means that nodes aren't immutable, so a reader may see a different value when a node is recycled out from under it. That's probably fine, since the implementation can just make a copy (return "by value") - but you have to detect reads that overlap such a transition, to avoid returning a torn value. If your data is small (8 aligned bytes or less) that's easy, since plain loads are atomic at that size. For larger values you could use a seqlock.
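
A seqlock sketch for the recycled-node case (field names are made up; also note that, strictly speaking, the plain payload reads below are a data race under the C++ memory model, so production code uses relaxed atomics for the fields):

    // Seqlock: the sequence counter lets a reader detect that the node
    // was recycled underneath it and retry, returning by value.
    #include <atomic>
    #include <cstdint>

    struct Node {
        std::atomic<uint32_t> seq{0};  // even = stable, odd = write in progress
        uint64_t k, v;                 // payload too big for one atomic load
    };

    // Writer, recycling the node for a new key/value:
    void publish(Node* n, uint64_t k, uint64_t v) {
        n->seq.fetch_add(1, std::memory_order_acquire);  // odd: writer active
        n->k = k;
        n->v = v;
        n->seq.fetch_add(1, std::memory_order_release);  // even: stable again
    }

    // Reader: retries until it sees the same even sequence on both
    // sides of the payload read, i.e., a consistent snapshot.
    void read(Node* n, uint64_t& k, uint64_t& v) {
        for (;;) {
            uint32_t s1 = n->seq.load(std::memory_order_acquire);
            if (s1 & 1) continue;                        // writer in progress
            k = n->k;
            v = n->v;
            std::atomic_thread_fence(std::memory_order_acquire);
            if (n->seq.load(std::memory_order_relaxed) == s1)
                return;                                  // untorn snapshot
        }
    }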

4) Transactional memory, e.g., Intel HTM (TSX): wrap the read in a hardware transaction, which commits only if no concurrent writer touched the same cache lines.
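
Something like this with the RTM intrinsics (compile with -mrtm; needs TSX-capable hardware, which is getting rarer since TSX has been disabled by microcode on many parts):

    // HTM read sketch using Intel RTM. The reader writes nothing
    // shared, so pure readers don't bounce cache lines; a fallback
    // path is mandatory because transactions can always abort.
    #include <atomic>
    #include <immintrin.h>

    struct Node { long key; long value; };

    std::atomic<bool> fallback_locked{false};  // crude fallback spinlock

    long locked_read(Node* n) {
        while (fallback_locked.exchange(true, std::memory_order_acquire))
            ;
        long v = n->value;
        fallback_locked.store(false, std::memory_order_release);
        return v;
    }

    long htm_read(Node* n) {
        for (int attempt = 0; attempt < 3; ++attempt) {
            if (_xbegin() == _XBEGIN_STARTED) {
                // Subscribe to the fallback lock: if a slow-path holder
                // is (or becomes) active, this transaction aborts.
                if (fallback_locked.load(std::memory_order_relaxed))
                    _xabort(0xff);
                long v = n->value;  // read inside the transaction
                _xend();            // commit: no conflicting write occurred
                return v;
            }
            // Aborted (conflict, capacity, interrupt...); retry a bit.
        }
        return locked_read(n);      // slow path
    }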

Has anyone seen any other approaches out there for native code?

The complexity and tradeoffs involved are why I don't think we'll be seeing a concurrent map in standard C++ any time soon.

---

[1] Not all of these may be strictly lock-free, and certainly not wait-free - but an approximation that works well in practice is often much better than the 100% authentic real deal when it comes to this kind of thing. That compromise is why you hear much more about "lock-free" than "wait-free", even though the latter is strictly stronger in a theoretical sense and Herlihy proved that wait-free algorithms generally exist wherever lock-free ones do (and indeed provided a recipe for turning the latter into the former).



