
I think the JVM TCP stack is extremely mature, but I'm really surprised to see that Go is ten times slower.

It would have been interesting to have a C/C++ benchmark for reference.




If the JVM and/or the Scala runtime does user-space buffering and Go forwards straight to the read/write syscalls, that alone would probably be sufficient to explain the difference when reading/writing buffers this small.

If you don't do buffering in user space, the context switches in/out of the kernel will kill you when you do many small reads or writes.

No idea if that's it, but that's the first place I tend to look when troubleshooting networking app performance as so many people just blindly use syscalls as if they're free.
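A minimal Go sketch of the contrast (hypothetical code, not from the benchmark): wrapping the connection in bufio gives you a user-space buffer, so repeated small reads are served from memory after one read() syscall and small writes are batched until Flush.

    package pingpong

    import (
        "bufio"
        "io"
        "net"
    )

    // serve echoes 4-byte messages; bufio keeps most Read/Write calls in
    // user space instead of crossing into the kernel every time.
    func serve(conn net.Conn) error {
        r := bufio.NewReader(conn) // one read() syscall fills a ~4 KiB buffer
        w := bufio.NewWriter(conn) // small writes accumulate until Flush
        msg := make([]byte, 4)
        for {
            if _, err := io.ReadFull(r, msg); err != nil {
                return err
            }
            if _, err := w.Write(msg); err != nil {
                return err
            }
            if err := w.Flush(); err != nil { // in lockstep, still one write() per pong
                return err
            }
        }
    }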


The better performance of the Scala client+server, if anything, suggests less buffering, not more, since the next ping can't be written until the previous pong is received.

p.s. to parent: "the JVM TCP stack?" really?


I admit I haven't checked the example thoroughly - if it goes in lockstep then buffering won't be the culprit.

But you're wrong that better performance implies less buffering. A typical way to write such applications with fewer syscalls is to call select() or poll() (or equivalent) followed by a large non-blocking read, and then pick your data out of that buffer.

As pointed out above, if this "benchmark" does ping/pongs in lockstep across a single connection, then buffering vs. no buffering will make exactly no difference, as there's no additional data available to read. But in scenarios where the amount of data is larger, the time saved from fewer context switches quickly adds up and gives you far more time to actually process the data. Usually your throughput will increase, but your latency will also tend to drop despite the buffering, as long as the buffers are of a reasonable size.

Buffering is a problem when the buffers grow to large multiples of the typical transaction size, but for typical application-level protocols that takes really huge buffers.
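A hedged sketch of the pattern described above, in Go (where the runtime's netpoller plays the role of select()/poll() behind a blocking Read): issue one large read, then pick complete messages out of the buffer in user space. The 4-byte framing is an assumption for illustration.

    package pingpong

    import "net"

    // drain grabs as much as the kernel has ready in one syscall, then
    // slices complete 4-byte messages out of the user-space buffer.
    func drain(conn net.Conn, handle func([]byte)) error {
        buf := make([]byte, 64*1024) // one read() can return many messages
        pending := 0
        for {
            n, err := conn.Read(buf[pending:])
            if err != nil {
                return err
            }
            pending += n
            consumed := 0
            for pending-consumed >= 4 {
                handle(buf[consumed : consumed+4])
                consumed += 4
            }
            pending = copy(buf, buf[consumed:pending]) // keep any partial tail
        }
    }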


My comment was specific to this benchmark, which has 4-byte messages that cannot be buffered due to the ping (wait for pong) ping ... repetition in each client. Of course buffering matters for full-pipeline throughput.
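For reference, a sketch of that lockstep client (hypothetical Go, assuming 4-byte messages): each write must be answered before the next is issued, so there is never a second message available to coalesce into a buffer.

    package pingpong

    import (
        "io"
        "net"
    )

    // client runs strict ping -> wait-for-pong rounds; at most one message
    // is ever in flight, so user-space buffering has nothing to batch.
    func client(conn net.Conn, rounds int) error {
        ping := []byte("ping")
        pong := make([]byte, 4)
        for i := 0; i < rounds; i++ {
            if _, err := conn.Write(ping); err != nil {
                return err
            }
            if _, err := io.ReadFull(conn, pong); err != nil {
                return err
            }
        }
        return nil
    }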


Your comment makes me wonder whether it matters on which JVM the benchmark is run. Are Oracle, GNU Classpath and Dalvik equally fast?


With Go using 20 times less memory, I'd take Go any day since I can just throw in 20 more processes and up my throughput 5-10x. But of course, scaling like this takes a bit of design and the processes have to share nothing.


I would take the memory measurement with a grain of salt. A 200 MiB JVM process doesn't mean all 200 MiB are actually in use; the memory could simply have been reserved up front (preemptive allocation).


I agree. I've seen clueless people complain about the JVM's memory footprint for tiny "benchmarks" so many times. In practice, it's never an issue for long-running web applications. That, and people who don't understand how the JIT works and why it has a start-up penalty.


In practice, it's never an issue for long-running web applications.

You might want to clarify that statement a bit. It almost sounds like you are implying that memory pressure is never an issue for long running web applications in java. Did you mean to say something else?


The initial amount of memory required by the VM, while significant for small command-line utilities, is only a fraction of the total memory required by an application. In a web application tuned for performance, a lot of the memory will be used for caching anyway. Also, I'd gladly pay a small memory hit upfront if that means I get a top-quality GC and a very low probability of memory leaks in the long run.

As for the startup time, MojoJolo summed it up.


I note that the memory pressure question got swept under the rug there :-)

That's ok. It's not a revelation to anybody here (I hope) that there's an enormous memory overhead to pay for acceptable performance from the JVM. That "top quality GC" basically requires a 2x overhead (on top of your actual cache payload) to perform with reliably low latency and high throughput.


I agree completely. And in spite of that "top quality" GC and all the tuning in the world, you're still running the risk of having the world stop on you for tens of seconds on larger systems.

The JVM (at least OpenJDK, probably not Azul) is quickly becoming untenable as server memory goes into the hundreds of GBs. I'm reluctantly moving back to C++ for that reason alone.


How do you get around heap fragmentation? I know that the JVM (Oracle's, I believe) is really limited to about 32 GB of RAM before it has real issues. But the nice thing is that the GC will compact the heap for better future performance.

As a possible workaround to the JVM limit, a distributed architecture with N JVMs each running a portion of the task could solve the small-memory-space problem with minimal system overhead. What I mean is: say you need 64 GB of memory for your app. Given the comment above, Java would not do well with this. But you could have four 16 GB JVMs, each handling 1/4 of the work. The GC would prevent the fragmentation you'd see in long-running C++ apps and still provide you with operational capacity.
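As a sketch of that partitioning idea (hypothetical Go code, since the split itself is language-agnostic): route each key to one of N share-nothing workers by hash, so each worker's heap stays at a fraction of the total.

    package sharding

    import "hash/fnv"

    const numShards = 4 // e.g. four 16 GB heaps instead of one 64 GB heap

    // shardFor maps a key to one of numShards share-nothing workers.
    func shardFor(key string) int {
        h := fnv.New32a()
        h.Write([]byte(key)) // FNV-1a: cheap, stable key hashing
        return int(h.Sum32()) % numShards
    }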


Heap fragmentation hasn't been a big problem for me. Using multiple JVMs would mean reimplementing all my data structures in shared memory and creating my own memory allocator or garbage collector for that memory. It's a huge effort.

Many applications can distribute work among multiple processes because they don't need access to shared data or can use a database for that purpose. But for what I'm doing (in-memory analytics) that's not an option.


You've probably since moved on from this conversation, but I wonder if Tuple Space might help [1]. It provides a distributed-memory feel to applications. Apache River provides one such implementation [2].

Another question about in-memory analytics: do you have to be in-memory? I'm currently working on an analytics project using Hadoop. With the help of Cascading [3] we're able to abstract the MR paradigm a lot. As a result we're doing analytics across 50 TB of data every day once you count workspace data duplication.

[1] https://en.wikipedia.org/wiki/Tuple_space
[2] http://river.apache.org/index.html
[3] http://cascading.org


Thanks for the links. The reason why we decided to go with an in-memory architecture for this project is that we have (soft) realtime requirements and complex custom data structures. Users are interactively manipulating a medium-size (hundreds of gigs) dataset that needs to be up-to-date at all times.

The obvious alternative would be to go with a traditional relational database, but my thinking is that the dataset is small enough to do everything in memory and avoid all serialization/copying to/from a database, cache or message queue. Tuple Spaces, as I understand it, is basically a hybrid of all those things.


For server programs that require a lot of RAM, why not just use a concurrent and non-blocking garbage collector, or multiple JVM instances, or find ways to reduce GC pressure?


I don't have access to a pauseless garbage collector (Azul costs tons of money) and reimplementing all data structures in shared memory is unproductive.


This is absolutely provably false. Anyone who has spent any time doing low-latency systems in any language knows that they need to be allocation-free and lock-free.

Regardless of whether it is C, C++, or a JVM language, you are going to be reusing data structures, directly accessing memory, and in the case of JVM systems using off-heap memory. If you are doing this correctly, your JVM heap can be quite small and never GC (or, more usually, GC right after initialization/configuration).
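A hedged Go sketch of that "allocate once, reuse" discipline (the analogue of the object pools and off-heap buffers mentioned above): recycle buffers so the steady-state hot path creates no garbage for the collector.

    package lowlatency

    import "sync"

    // bufPool hands out reusable 4 KiB buffers; after warm-up the hot path
    // allocates nothing, so the GC has essentially nothing to collect.
    var bufPool = sync.Pool{
        New: func() any { return new([4096]byte) },
    }

    func process(handle func([]byte)) {
        buf := bufPool.Get().(*[4096]byte) // reuse instead of allocating
        defer bufPool.Put(buf)             // recycle; no garbage created
        handle(buf[:])
    }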


@chetanahuja, I think he means that long-running web apps are initialized once and keep running until they are stopped or restarted. So even though starting up the JVM may take time, that overhead is paid only once.


It doesn't matter whether the JVM actually uses them or not; to the OS, they are allocated and gone. Startup time is a moot point in long-running apps.


Allocated memory that's paged out and not used is nearly irrelevant. I believe this is true even though a 200 MB JVM heap will likely see some unnecessary paging in/out due to GC; ideally you'd manually configure a smaller heap if you intended to push total system memory with many processes.


With Go using 20 times less memory, I'd take Go any day since I can just throw in 20 more processes and up my throughput 5-10x. But of course, scaling like this takes a bit of design and the processes have to share nothing.

That doesn't even make sense. The Scala version can simply spin up more threads. It's not an issue of parallelism: the Scala version just happens to be faster, and throwing more Go processes at the problem won't help. Go is already multithreaded; you don't need multiple processes to use the cores on the box. The JVM scales memory usage (heap size etc.) by default based on the amount of memory on the host. If you're going to worry about 100 MB of RAM, it's easy to constrain the JVM with -Xmx.



