I prefer an expressive API to a smart one. Which is to say that I wouldn’t mind specifying the expected “history” up front when allocating.
I’ve been thinking of writing a library for “roles”, which would let you describe how a value is expected to change over time, and thereby encode some of the intent about what a variable or bit of code is for, with the happy side effect of helping the runtime system select fast paths.
For example, the “accumulator” role would be for values that are expected to have a property that grows monotonically, like the size of a vector or the value of a counter. For a vector, you might additionally encode whether its growth should be linear or exponential, what its minimum and maximum sizes should be, and various growth factors.
Violating a role would not be an error, though it might be slow, and in debug mode you might get a diagnostic message to help you adjust the roles in your program.
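I haven't built it, but roughly I imagine the API looking something like this sketch in C (every name here is hypothetical, just to illustrate the shape of the idea):

    /* Hypothetical sketch of a "roles" API; none of these names exist
       in any real library. The hint describes expected behavior; the
       allocator may use it to pick a fast path, or ignore it. */
    #include <stddef.h>

    typedef struct {
        size_t min_size;       /* never expected to shrink below this */
        size_t max_size;       /* expected upper bound on growth */
        double growth_factor;  /* e.g. 2.0 for exponential growth */
    } role_accumulator;

    /* Allocate a buffer whose size is expected to grow monotonically.
       Violating the role would be slow, not an error. */
    void *alloc_with_role(size_t initial, const role_accumulator *hint);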
Some role-like features are already present in languages, libraries, and hardware, such as “restrict” to encode the expectation of non-aliased buffers, “__builtin_expect” to encode likelihood, and “__builtin_prefetch” to encode the intent to access a value soon.
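For illustration, here's roughly how those existing hints look in GCC-flavored C:

    /* Role-like hints that already exist in C and GCC. */
    #include <stddef.h>

    /* restrict: promise that dst and src never alias, enabling
       vectorization that would otherwise be unsafe. */
    void scale(float *restrict dst, const float *restrict src,
               size_t n, float k)
    {
        for (size_t i = 0; i < n; i++) {
            /* __builtin_prefetch: declare intent to touch this soon. */
            if (i + 16 < n)
                __builtin_prefetch(&src[i + 16]);
            dst[i] = src[i] * k;
        }
    }

    int process(int fd)
    {
        /* __builtin_expect: encode that failure is the unlikely path. */
        if (__builtin_expect(fd < 0, 0))
            return -1;
        return 0;
    }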
Yep, I think this is the correct answer for the general case. With structures close to or larger than a page, and on 64-bit systems, you can go nuts with overallocation, because the virtual address space is so much larger than the amount of physical memory.
It's quite possible to go overboard even on a 64-bit system.
For example, I've seen people allocating 4GB buffers for everything, just because that made it impossible to have buffer overflows as long as they indexed with 32-bit integers. But on x86-64 you can only fit about 65k such buffers in the address space (2^48 / 2^32 = 65,536), and it's not that hard to exhaust that.
I guess my definition of going nuts was a bit more conservative. Both x86-64 and ARM's AArch64 currently only use 48 bits for virtual addressing.
Still, even if you design for computers with 256GB (2^38 bytes) of physical memory, you still have a virtual address space that's 2^9 = 512 times larger (assuming addresses with the MSB set are reserved for the kernel's memory space, leaving 2^47 bytes for user space). Contrast that with 32-bit systems, where physical memory is close to, or even larger than, the virtual address space: e.g. high-end smartphones sold in the last few years have 2GB of RAM but only 3GB of virtual address space.
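To make the overallocation idea concrete, here's a minimal 64-bit Linux sketch: reserve a huge virtual range up front and let demand paging back only the pages you actually touch (error handling trimmed):

    /* Minimal sketch: reserve far more address space than physical
       memory on 64-bit Linux; only pages actually written consume RAM. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t reserve = (size_t)64 << 30;  /* 64GB of address space */
        char *base = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                          -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        base[0] = 1;             /* first page becomes resident... */
        base[reserve - 1] = 1;   /* ...and so does the last one */
        munmap(base, reserve);
        return 0;
    }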
Heap allocators are already ridiculously smart. But there are just too many divergent use cases for any one allocator to serve well.
I wrote a simple string heap allocator (grab a 16MB block and hand out 256-byte chunks) and the resulting code was 85x* faster than jemalloc ( http://pastebin.com/dAqa4dbN ). [* I know the way I'm benchmarking is crude and probably way off from real-world usage; it's hard to benchmark this sort of thing well, though.]
But of course, mine wasn't thread safe, couldn't handle larger blocks (those fell through to malloc), suffered badly from fragmentation, and couldn't detect double-frees.
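The actual code is in the pastebin, but a stripped-down sketch of that approach (simplified for illustration, with the same limitations) looks like:

    /* Sketch: carve a 16MB block into 256-byte chunks threaded onto a
       free list. Like the original, this is not thread safe, can't
       serve larger sizes, and doesn't detect double-frees. */
    #include <stdlib.h>

    #define POOL_SIZE  (16u << 20)   /* one 16MB backing block */
    #define CHUNK_SIZE 256u

    static char *pool;
    static void *free_list;          /* singly linked list of free chunks */

    static void pool_init(void)
    {
        pool = malloc(POOL_SIZE);
        for (size_t i = 0; i + CHUNK_SIZE <= POOL_SIZE; i += CHUNK_SIZE) {
            *(void **)(pool + i) = free_list;   /* push chunk onto list */
            free_list = pool + i;
        }
    }

    static void *pool_alloc(void)
    {
        void *p = free_list;
        if (p) free_list = *(void **)p;   /* pop the head in O(1) */
        return p;                         /* NULL when the pool is empty */
    }

    static void pool_free(void *p)
    {
        *(void **)p = free_list;          /* push back onto the list */
        free_list = p;
    }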
Similarly, I found a massive speed boost for strings by using my own hand-rolled memcpy [[ while(l--) *t++ = *s++; ]] ... undoubtedly because all the trickery in setting up SSE and handling lengths not evenly divisible by the register sizes was more expensive than just dumbly transferring the data, when most strings are very small. Though I am sure memcpy would destroy me if I wanted to copy 2GB of data in one go.
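In other words, something like this (the 64-byte cutoff is an illustrative guess, not a measured threshold):

    #include <string.h>

    /* Dumb byte loop for the common tiny-string case; defer to
       memcpy's SIMD machinery only when the copy is large enough
       to amortize its setup cost. */
    static void copy_str(char *t, const char *s, size_t l)
    {
        if (l < 64) {
            while (l--) *t++ = *s++;
        } else {
            memcpy(t, s, l);
        }
    }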
All of the extra logic to be smart will help you when it actually works, but will make your worst case substantially more painful.
Let's say you had to create 100 objects. Would you rather create all 100 objects in 10ns each, or 95 objects in 1ns each and the remaining 5 objects in 1000ns each? (The predictable option totals 1,000ns; the "smart" one totals 5,095ns, with far worse tail latency.) What if you were writing a particularly weird app that ended up hitting the 1000ns path on every allocation it did?
One immediate concern I'd have with a smart exponential allocator is what happens when you're memory constrained and your application is a web server or SQL database.
Back when we used 32-bit machines, exponentially growing containers would kill us: we couldn't come close to using all of the memory, because a doubling would blow past what was available. When you're at 1GB of space, do you really want to double that on insertion? Yeah, yeah, demand paging; it doesn't work in practice. My instinct is that you want a doubling allocator with a max: some sigmoid that responds to free space and allocation rates.
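Roughly something like this (the cap fraction and taper factor are illustrative guesses, a crude stand-in for a real sigmoid):

    #include <stddef.h>

    /* Doubling with a max: double while far from the cap, taper the
       growth factor as you approach it, never exceed it. */
    static size_t next_capacity(size_t current, size_t cap)
    {
        size_t grown;
        if (current == 0)
            grown = 16;                     /* arbitrary starting size */
        else if (current < cap / 4)
            grown = current * 2;            /* far from the cap: double */
        else
            grown = current + current / 4;  /* near the cap: grow ~1.25x */
        return grown > cap ? cap : grown;
    }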
For Firefox, they might be better off just making a couple of memory pools per page and bump allocating: throw the pool away when the page is closed, and be able to migrate a long-running tab into a dynamically allocated pool.
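A minimal sketch of that kind of bump pool (names illustrative):

    #include <stdlib.h>

    /* Bump allocator: hand out memory by advancing a pointer, then
       throw the whole arena away at once when the page goes away. */
    typedef struct {
        char  *base;
        size_t used, size;
    } arena;

    static arena arena_new(size_t size)
    {
        arena a = { malloc(size), 0, size };
        return a;
    }

    static void *arena_alloc(arena *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;         /* keep 16-byte alignment */
        if (!a->base || a->used + n > a->size)
            return NULL;
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_destroy(arena *a)     /* "throw the pool away" */
    {
        free(a->base);
        a->base = NULL;
    }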
I suspect a large portion of your performance improvement was just from code inlining. Your simple allocator was probably inlined, whereas malloc requires a function call (as the code lives in a shared library).