L1 cache lines are typically 64 bytes long, and you're only reading 16 bytes of it per level of the tree. And a random pointer tree traversal is the worst case for CPU branch prediction, too.
There are methods for optimising the use of various tiers of cache, trying to pack multiple nodes into one 64 byte line, arranging nodes such that left and right children are usually adjacent in memory so prediction losses are minimised, and other tricks.
There are methods for optimising the use of various tiers of cache, trying to pack multiple nodes into one 64 byte line, arranging nodes such that left and right children are usually adjacent in memory so prediction losses are minimised, and other tricks.