You can bind processing for a consistent subset of requests to an individual CPU core - essentially sharding requests across CPU cores and benefiting from very high L1 and L2 cache utilization. The idea is to treat a multi-core system as a bunch of single-CPU nodes connected via a bus instead of a network, without the unnecessary overhead of thread- and process-related context switching.
Instead of sharing code at runtime (which is what an OS does with shared libraries), you could just as easily share code at compile time, i.e. statically link a library.
Because code is sharable this way, "implementing * in application" should always be at least as performant as the best generic implementation (i.e. the implementation you find in a general-purpose OS). And when appropriate, customizing the implementation for the application allows it to become even more performant.