I think you're on the right track in supposing that there can't be a huge performance difference in such a simple task, given that both languages are compiled and reasonably low-level. The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT. Your suggestion to try server-{a,b} x client-{a,b} is also a good one.
Your modified Go server doesn't return "Pong" for "Ping". It returns "Ping". And the "a small change" version is nonsense. It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency of the more common RPC-style request-response chain, which is a real problem.
You speculate a lot ("hiding some magic" "likely doesn't need to observe the answers") when you haven't offered any insight.
EDIT: Nagle doesn't matter here - it doesn't delay any writes once you read (you're waiting for the server response). It only affects 2+ consecutive small writes (here I'm trusting http://en.wikipedia.org/wiki/Nagle's_algorithm - my own recollection was fuzzy). If Go sleeps client threads between the ping and the read-response call then I suppose it would matter (but only a little? and other comments say that Go defaults to no Nagle alg. anyway).
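For what it's worth, here's the knob in Go terms (a sketch only; Go already disables Nagle on new TCP connections, so the SetNoDelay(true) call just makes the default explicit - the port is the one from this benchmark):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        // Go sets the equivalent of SetNoDelay(true) on new TCP connections,
        // so Nagle is already off; pass false to turn it back on.
        if err := conn.(*net.TCPConn).SetNoDelay(true); err != nil {
            log.Fatal(err)
        }
    }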
> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT.
Really, the most plausible explanation? I'd say the most plausible explanation is that M:N scheduling has always been bad at latency and fair scheduling. That's why everybody else abandoned it where those things matter. It's basically only good for cases where fair and efficient scheduling doesn't matter, like maths for instance, which is why it's still used in Haskell and Rust. I wouldn't be surprised to see Rust at least abandon M:N soon, though, once they start really optimizing performance.
Interestingly, both the go client and the scala client perform the same speed when talking to the scala server (~3.3s total), but the scala client performs much faster when talking to the go server (~1.9s total), whereas the go client performs much worse (~23s total, ~15s with GC disabled).
I thought the difference might partly be in socket buffering on the client, so I printed the size of the send and receive buffers on the socket in the scala client, and set them the same on the socket in the go client. This didn't actually bring the time down. Huh.
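For reference, the Go side of that experiment is roughly this (a sketch; the 128 KiB value is just a stand-in for whatever the scala client reported):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        tcp := conn.(*net.TCPConn)
        // Match the kernel send/receive buffer sizes seen on the scala socket.
        if err := tcp.SetReadBuffer(128 * 1024); err != nil {
            log.Fatal(err)
        }
        if err := tcp.SetWriteBuffer(128 * 1024); err != nil {
            log.Fatal(err)
        }
    }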
My next thought was that scala is somehow being more parallel when it evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201` seems to confirm this. The scala client has a lot more parallelism (judging by packet sequence ids). Is that really because go's internal scheduling of goroutines is causing lock contention or lots of context switching?
> Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.
Bear in mind that was written before Go 1.1; additionally, Dmitry has made steps to address CPU underutilization and has been working with the rest of the Go team on preemption. I think these improvements will make it into Go 1.2, fingers crossed.
Best response here. I spent weeks trying to get a go OpenFlow controller on par with Floodlight (java). I finally gave up on tcp performance and moved on when I realized scheduling was the problem.
Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
The comments on the article page have a different report which doesn't suffer from this implausibility:
> Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
I've been curious about that as well. The major slowdown seems to be related to a specific combination of go server and client. I don't have a good explanation. I'd love to hear from someone familiar with go internals.
> go server + go client 22.02125152
> ...
> scala server + go client 4.766823392
I'm curious: are you saying Go is M:N and JVM is not? I had to look up M:N - http://en.wikipedia.org/wiki/Thread_(computing)#M:N_.28Hybri... - but ultimately I don't know anything about JVM or Go threading, and your comment didn't go enough into detail for me to follow your reasoning.
Yes, I forget the audience. Go uses M:N scheduling, meaning that the OS has M threads and Go multiplexes N of its own threads on top of these. The JVM uses 1:1 like basically every other program, where the kernel does all scheduling.
The basic problem with M:N scheduling is that the OS and program work against each other because they have imperfect information, causing inefficiencies.
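A rough illustration of the two sides in Go terms (a sketch, not the scheduler itself): GOMAXPROCS caps the M side (OS threads executing Go code), and the goroutines you spawn are the N side multiplexed on top of them.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        runtime.GOMAXPROCS(4) // the "M" side: at most 4 OS threads run Go code

        var wg sync.WaitGroup
        for i := 0; i < 1000; i++ { // the "N" side: cheap user-level goroutines
            wg.Add(1)
            go func() {
                defer wg.Done()
                // a tiny bit of work would go here
            }()
        }
        wg.Wait()
        fmt.Println("goroutines remaining:", runtime.NumGoroutine())
    }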
Yes, but can Go actually use anything else? Fine-grained concurrency in the CSP style is, after all, the whole driving force behind it, and it's in the language spec.
Are hybrid approaches worth it (exposing some details so that Go network server can get the right service from the OS)? I'm not sure how much language complexity Go-nuts will take, so they'll probably look for clever heuristic tweaks instead.
You can turn off M:N on a per-thread (really per-thread-group) basis in Rust and we've been doing that for a while in parts of Servo. For example, the script/layout threads really want to be separate from the GL compositing thread.
Userland scheduling is still nice for optimizing synchronous RPC-style message sends so that they can switch directly to the target task without a trip through the scheduler. It's also nice when you want to implement work stealing.
Can you just have 1 thread per running task and give the thread back to a pool when the task waits for messages? Then for synchronous RPC you can swap the server task onto the current thread without OS scheduling and swap it back when it's done. You just need a combined 'send response and get next message' operation so the server and client can be swapped back again. This seems way easier and more robust, and you don't need work stealing since each running task has its own thread... what am I missing?
It doesn't work if you want to optimistically switch to the receiving task, but keep the sending task around with some work that it might like to do if other CPUs become idle. (For example, we've thought about scheduling JS GCs this way while JS is blocked on layout.)
Is the OS not scheduling M runnable threads on N cores? Blocking/non-blocking is just an API distinction, and languages implement one in terms of the other.
They are threads. Technically they are "green threads". The runtime does not map them one-to-one onto OS threads, although technically if it chose to it could, because goroutines are abstract things and the mapping to real threads is a platform decision.
Buffering impacts performance when it transforms many small writes into one big write (same for reads). In that case since you are waiting for the answer at every iteration, I'm not sure I see how it could have an impact.
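To make the mechanism concrete, a standalone sketch where a counting writer stands in for the socket - 200 tiny application-level writes reach the "wire" as a single big write:

    package main

    import (
        "bufio"
        "fmt"
    )

    // countingWriter counts how many Write calls reach the underlying "wire".
    type countingWriter struct{ calls int }

    func (c *countingWriter) Write(p []byte) (int, error) {
        c.calls++
        return len(p), nil
    }

    func main() {
        var wire countingWriter
        w := bufio.NewWriter(&wire) // stands in for the TCP connection
        for i := 0; i < 200; i++ {
            w.WriteString("ping") // 200 small writes into the buffer
        }
        w.Flush()
        fmt.Println("writes that reached the wire:", wire.calls) // prints 1
    }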
> Your modified Go server doesn't return "Pong" for "Ping".
The program doesn't read the result, so it doesn't matter. Returning Pong isn't harder, but why write all that code if it's going to be ignored anyway?
> It's fundamentally different. - you're firing off all your requests before waiting for any replies, and so hiding the latency in the more common RPC style request-response chain, which is a real problem.
As I said, the program isn't correlating the responses with the requests in the first place -- or even validating it got one. I don't know scala, but I've done enough benchmarking to watch even less sophisticated compilers do weird things with ignored values.
I made a small change that produced semantically the same program (same validation, etc...). It had similar performance to the scala one. If you don't think that's helpful, then add further constraints.
Compilers do not restructure a causal chain of events between a client and server in a different process. It's very easy to understand this when you realize that send -> wait for response and read it will result in certain system calls, no matter the language.
[Send 4 bytes * 200, then (round trip latency later) receive 4 bytes * 200] is fundamentally different than [(send 4 bytes, then (round trip latency later) receive 4 bytes) * 200]. Whether the message content is "ignored" is irrelevant.
Or, put another way, it's ridiculous for you to modify the Go program in that way (which will very likely send and receive only a single TCP segment over the localhost "network") and report the faster time as if it means anything. If you modify both programs in that way, fine. But it's something completely different.
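To spell out the difference, here's a sketch of the two shapes in Go, assuming the benchmark's 4-byte ping/pong framing and the localhost:1201 address mentioned elsewhere in the thread:

    package main

    import (
        "io"
        "log"
        "net"
    )

    // lockstep is the RPC-style chain: every iteration pays a full round trip.
    func lockstep(c net.Conn, n int) error {
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := c.Write([]byte("ping")); err != nil {
                return err
            }
            if _, err := io.ReadFull(c, buf); err != nil {
                return err
            }
        }
        return nil
    }

    // pipelined fires off all requests first, then drains the replies, so the
    // round-trip latency is paid roughly once instead of n times.
    func pipelined(c net.Conn, n int) error {
        for i := 0; i < n; i++ {
            if _, err := c.Write([]byte("ping")); err != nil {
                return err
            }
        }
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := io.ReadFull(c, buf); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        c, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            log.Fatal(err)
        }
        defer c.Close()
        if err := lockstep(c, 200); err != nil {
            log.Fatal(err)
        }
        if err := pipelined(c, 200); err != nil {
            log.Fatal(err)
        }
    }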
> The experiments were performed on a 2.7GHz quad core MacBook Pro with both client and server running locally, so as to better measure pure processing overhead. The client would make 100 concurrent connections and send a total of 1 million pings to the server, evenly distributed over the connections. We measured the average round trip time.
Another let's rape localhost:8080 on a MacBook Pro™ benchmark
GP:
Please use the word rape to the full extent provided by the english language, and let those who don't like the word deal with it.
As for the benchmark, who writes their own load balancer? Isn't this generally a solved problem? If the point is to extract max performance from a simple ping-pong server then I'd go right to C and epoll/libevent/etc. I'm guessing that the team is somehow trying to extrapolate data from a ping-pong server to the actual problem they're trying to solve which is dare I say, stupid.
In general the best way to solve this problem is to use whatever language the team / person writing the software likes, because having them write it faster generally outweighs whatever server costs one will run into. Whenever this is not the case, the best answer is 99% of the time: write it in C.
It doesn't matter whether there are other meanings. The fact remains that there is a single meaning which is understandably deeply upsetting to some people. I'd imagine you'd avoid topics that upset your friends in real life, and some of us like to extend this courtesy to strangers on the internet as well. If you feel your "full extent of the English language" is more important than the emotions of strangers, that's your choice, but forgive me and others for judging you as an asshole.
>It doesn't matter whether there are other meanings. The fact remains that there is a single meaning which is understandably deeply upsetting to some people.
People getting upset by mere words (not even uttered against them) are hardly worth a hacker's time.
Also notice how you, the oh-so-sensitive to the "emotions of strangers" and the "meanings that upset people", called him "an asshole" (for merely suggesting the use of a word). Way to go for tolerance.
People can get upset by words for a variety of reasons. Should they make an effort to handle their emotions better? Absolutely. But that doesn't mean you shouldn't also make an effort to avoid language which might upset people.
And perhaps I'm being hypocritical by insulting people, but I get angry when I see people with this ignorant "Not my problem" attitude to offensive language, and communication in general.
Communication is a two way street. If your messages are not being received as you'd like, is it so absurd to suggest that you consider changing what you say before criticising how others listen?
>People can get upset by words for a variety of reasons. Should they make an effort to handle their emotions better? Absolutely. But that doesn't mean you shouldn't also make an effort to avoid language which might upset people.
Well, I'll make any effort needed to avoid language that might upset people -- except for the kind of people who are upset by language itself.
Those I'll make every effort to upset.
I mean, I would not use words that might hurt a regular person or somebody who is actually sensitive to them because of their state/past/gender/color/etc.
But I would very much use all the words that annoy the kind of people who get annoyed by words all the time, i.e. the PC crowd. I'm with Carlin and Lenny Bruce on this one.
Obviously you have to draw the line somewhere, but if you don't think there are people who are "actually sensitive" to this particular use of language, then you are sorely mistaken.
> People getting upset by mere words (not even uttered against them) are hardly worth a hacker's time.
Rape is a trigger word unlike most other words.
I for one don't censor myself around my friends; I tend to cross the line all the time. My friends see me in context, smiling or laughing, and I know most of their life stories. It's safe to push the limits of hyperbole.
Here we have little/no context. Someone sees the word and they think of something horrible, the place, the person, the smells. All the things that are blocked out on an everyday basis.
So why not avoid the word? Just to be kind to a stranger.
People getting upset by mere words (not even uttered against them) are hardly worth a hacker's time.
Get over yourself. Taking on the title of hacker isn't some prestigious achievement, it's a self-aggrandizing social signal for tech hipsters. Beyond that, every caliber of individual can be upset by 'mere words', including yourself.
>Taking on the title of hacker isn't some prestigious achievement, it's a self-aggrandizing social signal for tech hipsters.
This forum is called "Hacker News" for a reason. And that reason predates "tech hipsters" by 40+ years. It's not an achievement, I'll give you that. But it IS a culture, and that culture doesn't take self-censorship and puritan values very well...
>Beyond that, every caliber of individual can be upset by 'mere words', including yourself.
Being upset when some words are targeted at you or at people you do not think deserve such treatment is normal. It's being upset just because of the use of words that's prudish and bad.
"that reason predates "tech hipsters" by 40+ years"
"it IS a culture, and that culture doesn't take self-censorship and puritan values very well..."
These days, the title of 'hacker' is akin to the title 'patriot': everybody knows what a real one looks like and they're all too happy to monkey patch their own arbitrary components into the definition. Last I heard, there is no general stance regarding self-censorship in the hacker community.
Also, I think it's an impressive type conversion for you to cast what is commonly described as overzealous liberalism to puritanical religiosity. There is nothing puritanical about respecting the sensitivities of sexual assault victims.
>Also, I think it's an impressive type conversion for you to cast what is commonly described as overzealous liberalism to puritanical religiosity.
Well, I don't consider it that impressive.
Political correctness is just one method the liberals found to maintain the puritanical religiosity of their past. Just the secular side of the same coin.
You cannot get puritanism out that easily, you just divert it from religious thinking to other endeavours.
We have the same kind of conversions in Europe too -- not to mention that it's a well-discussed topic in literature and psychology.
When those words can cause traumatic flashbacks for certain people, I think it is absolutely justified to be upset by their usage, regardless of how they are "targeted". This is not about "puritan values" it's about making an effort understanding the emotions of trauma victims.
Are you saying that coldtea is a tech hipster desperately signalling for the purpose of self-aggrandizement? How kind. Is a direct insult to a real person more civil than a fantasy assault on an inanimate object? Because I'm not sure of the rules.
Dismissing people as hardly worth a hacker's time is pretty self-aggrandizing, yes. Invoking a nebulous definition of hacker as a guideline for appropriate behavior is about as hip as it gets; following a trend for the sake of its title.
Let me get this straight. So he's an "asshole" because you and possibly some other people refuse to accept that certain English words have multiple distinct definitions, and because you also refuse to take into account context when reading such words? It's a rather unusual stance to take.
I'm not refusing to accept anything. Of course words have multiple meanings, but that doesn't mean the other meanings magically disappear just because they're not implied by context.
Interpretation of natural language is strongly influenced by lexical connotations. If you use a negative word in your sentence, people will have negative reactions to it, regardless of your intent.
In practice most people who do this are merely profoundly ignorant and unsympathetic, rather than overtly being assholes. But yes, if someone intentionally chooses to use words with upsetting connotations, I consider them an asshole.
> As for the benchmark, who writes their own load balancer? Isn't this generally a solved problem?
You'd be surprised, but no, it isn't a solved problem because the problem can't be solved in a general way and a lot of people are writing their own load balancers.
> If the point is to extract max performance from a simple ping-pong server then I'd go right to C and epoll/libevent/etc.
For a ping-pong server, sure, but a ping-pong server has no value in real life. The point of building your own load balancer is for doing custom load balancing and routing, based on individual needs of the project.
The complexity grows exponentially, and while big companies are building such things in C, mere mortals do not have the resources for it. And in fact building on top of libevent is not going to give you performance advantages over Java NIO. The only thing that sucks about building on top of the JVM (compared to C) is the garbage collector, which can lead to unpredictable latencies from blocking the whole process during the marking phase, even with CMS; although it is manageable and much better productivity-wise than dealing with glibc incompatibilities or with multi-threading in C. Plus I hear that G1 in JDK7, for large heaps and CPUs with multiple cores, and Azul's pauseless GC are awesome.
There's always a tradeoff of course, but sometimes the best path for many projects is the solution that makes the best compromises between productivity and performance and that's why I'm a happy Scala user and love the JVM.
I'm always amused by the slight fear that web developers treat C with (and even more so by comments like "does anyone actually write anything in C/C++ any more?"[1]), although I do appreciate the compliment that us embedded software engineers aren't "mere mortals" ;-)
Sure, even for people who are experienced with it it takes a bit longer to write robust code in C than it does in a higher level language, but the performance benefits can be enormous. If you've got a well defined block of functionality that needs to be quick, you might well be better off taking the time to do a C implementation than you would be spending ages optimising a Scala implementation.
Of course I spend a lot of my time arguing the opposite at work - we write a lot of C++ out of habit, but it's not performance critical at all and it's running on Linux on an Intel Atom. We'd be a lot more productive writing the app in (for example) Python.
[1] Short version: yes, lots more than in any single language you use (cheating slightly here by counting C and C++ together). Go a few levels down in the stack of whatever you're writing and you'll get to C/C++ probably quite quickly, or worst case at the OS (and if you don't have an OS, then you're writing embedded code, and you started in C).
I'm not a web developer and I'm not scared of C, but I very frequently find that optimizing the JVM is much easier than writing a similar piece of code in C for my real world problems.
The short answer for me is all the great tooling that comes with a JVM solution as well as the simplicity of deployment. Getting a C/C++ native solution to play nice with a continuous integration environment is often very painful (especially compared to a system that is JVM only). Unit testing tools are better for the JVM, IDE/Debugger/Profiling support is better, etc.
All of these things (and the fact that good JVM code can be very close to C/C++ performance) means I don't roll with C because it's scary, rather because it's a pain.
> I do appreciate the compliment that us embedded software engineers aren't "mere mortals" ;-)
With all due respect to embedded software, the stuff that you have to deal with is limited in scope. I'm not saying it isn't hard or challenging, but nonetheless it is limited in scope. And this matters - I also worked a lot with C/C++, and while C/C++ developers can easily reason about things like cache locality or branch prediction or data structures, if you want to block them at an interview the easiest thing to do is to ask them to do some string processing. The point of having high-level abstractions is to build bigger, more complex things, and with C/C++ the pain starts straight from the get-go.
> even for people who are experienced with it it takes a bit longer to write robust code in C than it does in a higher level language, but the performance benefits can be enormous
Err, no, not really. Tackling multi-threading issues in C is absolutely horrible, coupled with the fact that C/C++ do not have their own memory model, so you end up with really hard-to-reproduce bugs just by updating some packages on the host OS. Libevent has never been a joy to work with in the context of multi-threading, the result being a whole generation of insecure and freakishly sensitive servers.
On top of the JVM you've got a good memory model, you've got multi-threading done right, you've got async I/O without headaches and as far as the concurrency model is concerned, you can take your pick from light-weight actors, futures/promises, shared transactional memory, parallel collections, map/reduce and so on, without any of the associated headaches.
There's also the issue of using all the hardware that you have. We are only scratching the surface of what can be done in terms of GPUs, but a high-level language running on top of the JVM or .NET makes it feasible to use the available GPUs or other available hardware resources by using high-level DSLs. The Liszt DSL is a really cool example of what I'm talking about: http://liszt.stanford.edu/
So if you mean "performance" as in measuring the time it takes for a while-loop to finish, then sure C is the best, but go further than that to efficient usage of all available resources and the problem is not really black and white.
> you would be spending ages optimising a Scala implementation
That's not true in my experience. At the very least with the JVM you can quickly and easily attach a profiler to any live production instance. In so far as profiling and debugging tools are concerned, the JVM is the best. And that's where the real difference comes from my experience in the real world - optimizing without real profiling is premature optimization and low level optimizations many times stay in the way of more high-level architectural optimizations with better benefits.
Speaking of spending ages on stuff and the server-side, good luck debugging what happened from a 3 GB core dump generated on a segfault, because somebody thought that an Int is actually a Long.
Also, just last week I tested our servers on top of Amazon's c1.xlarge instances, which have to be 64-bit, servers which normally run on c1.medium instances, which are very cost efficient but are better left running a 32-bit OS. In 15 minutes I was able to basically migrate from 32-bit to 64-bit. Good luck doing that with C/C++.
Not really. Ravish might. But other than on, say, gamer message boards, 'rape' doesn't have many other meanings than forcible sexual assault. Take a look at the usage in print and find casual uses of 'rape' in that context.
That's right. A lot of words have metaphorical uses. Now, would you say 'it lynched that server' or 'it made that server its bitch'? You could, surely. The meaning will be clear. You'll still sound like a complete douchebag. And that's really the point - not getting into some three mile debate about denotation and connotation - it's to not sound like a douchebag.
If you find this "weird", then you have a poor understanding of natural languages. It is well understood that lexical connotations affect people's interpretation of text.
You wouldn't expect people to react positively to violent body language while you deliver a nonviolent sentence, why should you expect them to react positively to you using violent metaphors in a nonviolent sentence?
Context, my friend. Why are you offended when someone is clearly not using it in the context you seem to be offended by? It's the intention behind the communication that you should react to. If there is no intention then it's pointless crying over it.
Humans are not emotionless robots who can interpret the intention of a sentence without being influenced by the way in which it is constructed. If you use a word with negative connotations, people will react more negatively to your sentence than if you don't, regardless of your intent. It might be "pointless" but it is a simple reality of human behaviour, and it is even more pointless to complain about people being human.
I thought this. While I don't think that benchmark is entirely worthless, it's hardly conclusive enough to be basing a business decision on:
* What's the concurrency like between two servers? (you need to test both low and high concurrent loads)
* What's the difference between the two in creating new connections and/or threads?
* What's the performance like when you up the number of packets per connection? (a simple ping-pong hides buffering performance, etc.)
* What's the performance like on the host platforms? (I'm going to assume they're not running the same OS on their MacBooks as they will on their load balancers.)
Of course, arguing a case of using Scala because it's a language they're already proficient in is itself a perfectly valid reason. But my issue is they seemed more interested in picking either language based on the best performance without actually evaluating the performance of either in any detail.
The way I see it, instead of basing a decision on what's currently fashionable, they've gone for some empirical data and have actually conducted an experiment.
Sure this is a fairly simplistic first step, but so should all first steps be as you gain understanding, and it's proved its worth by producing a surprising result in a very simple case, that's worth understanding before moving on to more complicated tests.
What you've got here is engineering, rather than just making it up as you go along.
The only conclusion that can be reached from this is that a certain implementation of a TCP client/server for sending 4-byte payloads over the OS X loopback interface written in Scala is faster than another implementation written in Go. That conclusion in itself is absolutely useless, because nobody would ever use that particular code to perform that particular task. But it's also dangerous since it generalizes the result of this particular test to imply that one language is better than another at all tasks related to pumping bytes over the wire.
And that aside, it still fails as an experiment because the results are neither provable nor repeatable. There is not enough information about the environment used to run the test and there is no detailed test output. We get two numbers ("average round trip" for each implementation), but we can't check that they are correct, and we can't look at the distribution or calculate other, potentially more useful metrics. Even the code provided is different from the code used in the actual test.
So even if Scala is ahead in this (flawed) benchmark, that's not how you write a TCP server in Scala, because you want to do it non-blocking. Not doing it based on asynchronous I/O means that in a real-world scenario the server will choke under the weight of slow connections, not to mention be susceptible to really cheap DoS attacks like Slowloris [1].
Seriously, it goes beyond the underlying I/O API that you're using. If anywhere in the code you're reading from an InputStream or you're writing to an OutputStream that's connected to an open socket, then that's a blocking call that can crush your server. Right now, every Java Servlets container that's not compatible with the latest Servlets API 3.1 can be brought down with Slowloris, even if under the hood they are using NIO.
Option A for writing a server in Scala is Netty [2].
Option B for writing a server in Scala is the new I/O layer in Akka [3].
>The experiments were performed on a 2.7GHz quad core MacBook Pro with both client and server running locally
No no no. Assuming your production code runs on Linux THAT is where you need to do this test. It is extremely naive to assume that either the JVM or the Go runtime will perform system interfacing tasks even remotely similar between OSX and Linux. Linux is what you will use in production. Linux is almost always faster (more effort from devs both on the kernel TCP/IP side and the runtime/userspace side).
Write your Scala, Java, Go wherever you want, but please, benchmark it in a clone of your production environment!
P.S. In production I assume your client and server will not be local... don't do this, kernels do awesome/dirty optimizations over loop-back interfaces, sometimes even bypassing large parts of the TCP/IP stack, parts you want included in any meaningful benchmark.
Go server vs Go client: 10ms
Scala server vs Scala client: 3ms
Go server vs Scala client: 4ms
Scala server vs Go client: ????
Scala server against the Go client is really slow (?). I reduced the ping count by a factor of 100, and extrapolating I think it would have reported around 670ms. What gives?
I don't know much about Scala Futures, but isn't Scala's client doing something completely different than the Go client? Scala's client with Future.sequence looks like it's calling each `ping` method sequentially.
Printing the connection identifier {0,100} on open & close shows that while it isn't completely sequential, only about a handful of connections are open at a time.
On the other hand, the Go client appears to switch amongst goroutines more frequently. All the connections open before any connection closes.
In other words, I think the difference in performance is due to the difference in how randomly the connections are shuffled. The terrible performance in the last case, I think, shows a bottleneck in the Scala server rather than the Go client.
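For anyone who wants to reproduce that observation on the Go side, the instrumentation is roughly this (a sketch; the ping loop itself is elided):

    package main

    import (
        "fmt"
        "net"
        "sync"
        "sync/atomic"
    )

    var open int32 // connections currently open

    func worker(id int, wg *sync.WaitGroup) {
        defer wg.Done()
        c, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            fmt.Println(id, "dial error:", err)
            return
        }
        fmt.Printf("open  %3d (now open: %d)\n", id, atomic.AddInt32(&open, 1))
        // ... the pings for this connection would go here ...
        c.Close()
        fmt.Printf("close %3d (now open: %d)\n", id, atomic.AddInt32(&open, -1))
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go worker(i, &wg)
        }
        wg.Wait()
    }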
If the eventual production app would run on Linux (which I'm only guessing based on the context), this benchmark should probably be run there. Darwin's surprisingly higher system call and context switch overhead can be deceptive for apps that are OS-bound.
A quick look at the language benchmarks game shows Go is not 10x slower than Java for most tests. Java is often 3-5x slower than C/C++/etc. (although maybe it didn't get enough time to warm up).
Well given both C and Go are native compiled languages, I would think C is a realistic (if distant) goal for Go performance, at least for algorithmic stuff where you're not bouncing in and out of the runtime. I was commenting on how far it has to go.
Ok, I see. Technically a JIT compiles to native code too, so for a long running app there shouldn't be much difference. Both Go and Java are garbage collected, but the JVM has the more sophisticated GC.
Ok, so. When you need to write a load balancer and want to test different languages for the task, you don't do a benchmark like that one.
Writing a "ping-pong"? And not using the same client to test both servers?
It would not have been too hard to write a simple proxy in both languages. Not even worrying about parsing HTTP headers, just testing TCP, that is really easy.
Now, if you really want to test the performance, you have to implement it differently. Just two small features you would need in both:
* non blocking IO: right now, you're starting a new future in Scala; that's easy to write but not really efficient (it might work better with goroutines)
* zero copy: if you're load balancing, you will spend your time moving bits, so you'd better make sure that you don't copy them too much. It is possible with Scala, but it looks like Go does not support it
Now, when you have reasonable testing grounds (that wouldn't be more than a hundred lines in both languages), you'd better get your statistics right.
"The client would make 100 concurrent connections and send a total of 1 million pings to the server, evenly distributed over the connections. We measured the average round trip time" -> that is NOT how you should test. Here, you would want to know what is the nominal pressure the balancer could handle, so you must measure a lot of metrics:
* RTT
* bandwidth (per user and total)
* time to warm up (the JVM optimizes a lot of things on the fly; you have to wait for it)
* operational limits (what is the maximal bandwidth for which the performance crashes? same for number of users)
And then, you don't measure only the average values. You must measure the standard deviation. Because you could have a good average, but wildly varying data, and that is not good.
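Recording per-ping samples and reporting both numbers is cheap. A minimal Go sketch (the sample data here is made up):

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    // meanStddev returns the mean and standard deviation of RTT samples.
    func meanStddev(samples []time.Duration) (time.Duration, time.Duration) {
        if len(samples) == 0 {
            return 0, 0
        }
        var sum float64
        for _, s := range samples {
            sum += float64(s)
        }
        mean := sum / float64(len(samples))
        var sq float64
        for _, s := range samples {
            d := float64(s) - mean
            sq += d * d
        }
        return time.Duration(mean), time.Duration(math.Sqrt(sq / float64(len(samples))))
    }

    func main() {
        samples := []time.Duration{
            900 * time.Microsecond, 1100 * time.Microsecond, 5 * time.Millisecond,
        }
        mean, stddev := meanStddev(samples)
        fmt.Println("mean:", mean, "stddev:", stddev)
    }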
Last thing: the macbook may not be a good testing system.
The benefit of Go is the cheap threads and CSP, which make it scale well for complex servers.
I think we'll see the Go runtime gaining Single System Image distribution, performance improvements and libraries not services (e.g. groupcache) making it a very different world to develop in than Scala.
You can't really come to a conclusion since no information is given about the configuration of the JVM. Most likely the minimum heap size was unset which often means it will default to 1/64 of the total system memory.
>200Mb for such a crappy server with only hundred connections.
Yes, and almost a megabyte of disk space for a "hello world" program. C++ is so inefficient, right?
That's not the way you judge the performance of a program and/or its memory use -- and neither is comparing the 200MB of the JVM with the 10MB of Go here...
If the JVM and/or the Scala runtime does user-space buffering and Go forwards straight to the read/write syscalls, that alone would probably be sufficient to explain the difference when reading/writing buffers this small.
If you don't do buffering in user space, the context switches in/out of the kernel will kill you when you do many small read or writes.
No idea if that's it, but that's the first place I tend to look when troubleshooting networking app performance as so many people just blindly use syscalls as if they're free.
The better performance of the Scala client+server if anything suggests less buffering, not more, since the next ping can't be written until the previous pong is received.
I admit I haven't checked the example thoroughly - if it goes in lockstep then buffering won't be the culprit.
But you're wrong that better performance implies less buffering. A typical way to write such applications with fewer syscalls is to do select() or poll() or equivalent, followed by a large non-blocking read, and then to pick your data out of that buffer.
As pointed out above, if this "benchmark" does ping/pong's in lockstep across a single connection, then buffering vs no-buffering will make exactly no difference, as there's no additional data available to read. But in scenarios where the amount of data is larger, the time saved from fewer context switches quickly adds up and gives you far more time to actually process the data. Usually your throughput will increase, but your latency will also tend to drop despite the buffering as long as the buffers are of a reasonable size.
Buffering is a problem when the buffers grow to large multiples of the typical transaction size, but for typical application level protocols that takes really huge buffers.
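In Go terms the pattern looks roughly like this (a sketch of just the read path, assuming the benchmark's 4-byte messages; writes happen elsewhere):

    package main

    import (
        "bufio"
        "io"
        "log"
        "net"
    )

    // readLoop: the buffered reader issues one large read to the kernel whenever
    // its buffer is empty, and the small 4-byte reads below are then served from
    // user space. With only one message in flight (the lockstep ping/pong case)
    // this changes nothing, but with more data queued it collapses many syscalls
    // into one.
    func readLoop(conn net.Conn) {
        r := bufio.NewReaderSize(conn, 64*1024)
        msg := make([]byte, 4)
        for {
            if _, err := io.ReadFull(r, msg); err != nil {
                if err == io.EOF {
                    return
                }
                log.Fatal(err)
            }
            // handle msg...
        }
    }

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        readLoop(conn)
    }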
My comment was specific to this benchmark, which has 4 byte messages that cannot be buffered due to the ping (wait for pong) ping ... repetition in each client. Of course buffering matters for full-pipeline throughput.
With Go using 20 times less memory, I'd take Go any day since I can just throw in 20 more processes and up my throughput 5-10x. But of course, scaling like this takes a bit of design and the processes have to share nothing.
I would take the memory measurement with a grain of salt. A 200 MiB large JVM doesn't mean all the 200 MiB are used. They could have been reserved (preemptive memory allocation).
I agree. I've seen so many times clueless people complain about JVM memory footprint for tiny "benchmarks". In practice, it's never an issue for long running web applications. That, and people who don't understand how JIT works and why it has a start-up penalty.
In practice, it's never an issue for long running web applications.
You might want to clarify that statement a bit. It almost sounds like you are implying that memory pressure is never an issue for long running web applications in java. Did you mean to say something else?
The initial amount of memory required by the VM, while significant for small command line utilities, is only a fraction of the total memory required by an application. In a web application tuned for performance, a lot of the memory will be used for caching anyway. Also, I'd be glad to pay a small memory hit upfront if that means I get a top quality GC and a very low probability of memory leaks in the long run.
I note that the memory pressure question got swept under the rug there :-)
That's ok. It's not a revelation to anybody here (I hope) the enormous cost in memory overhead you have to pay for acceptable performance from the JVM. That "top quality GC" basically requires a 2X overhead (on top of your actual cache payload) to perform with reliable low latency and high throughput.
I agree completely. And in spite of that "top quality" GC and all the tuning in the world you're still running the risk of having the world stop on you for tens of seconds on larger systems.
The JVM (at least OpenJDK, probably not Azul) is quickly becoming untenable as server memory goes into the hundreds of GBs. I'm reluctantly moving back to C++ for that reason alone.
How do you get around heap fragmentation? I know that the JVM (Oracle I believe) is really limited to about 32 GB of RAM before it has real issues. But the nice thing is that the GC will compact the heap for better future performance.
As a possible workaround to the JVM limit, a distributed architecture with N JVMs each running a portion of the task could solve the small-memory-space problem with minimal system overhead. What I mean by this: let's say you need 64 GB of memory for your app. Given the comment above, Java would not do well with this. But you could have four 16 GB JVMs, each handling 1/4 of the work. The GC would prevent the fragmentation that you'd see in long running C++ apps and still provide you with operational capacity.
Heap fragmentation hasn't been a big problem for me. Using multiple JVMs means reimplementing all data structures in shared memory and creating my own memory allocator or garbage collector for that memory. It's a huge effort.
Many applications can distribute work among multiple processes because they don't need access to shared data or can use a database for that purpose. But for what I'm doing (in-memory analytics) that's not an option.
You've probably since moved on from this conversation, but I wonder if Tuple Space might help [1]. It provides a distributed memory feel to applications. Apache River provides one such implementation.
Another question about in-memory analytics is do you have to be in-memory? I'm currently working on an analytics project using Hadoop. With the help of Cascading [3] we're able to abstract the MR paradigm a lot. As a result we're doing analytics across 50 TB of data everyday once you count workspace data duplication.
Thanks for the links. The reason why we decided to go with an in-memory architecture for this project is that we have (soft) realtime requirements and complex custom data structures. Users are interactively manipulating a medium size (hundreds of gigs) dataset that needs to be up-to-date at all times.
The obvious alternative would be to go with a traditional relational database, but my thinking is that the dataset is small enough to do everything in memory and avoid all serialization/copying to/from a database, cache or message queue. Tuple Spaces, as I understand it, is basically a hybrid of all those things.
For server programs that require a lot of RAM, why not just use a concurrent and non-blocking garbage collector, or multiple JVM instances, or find ways to reduce GC pressure?
I don't have access to a pauseless garbage collector (Azul costs tons of money) and reimplementing all data structures in shared memory is unproductive.
This is absolutely provably false. Anyone who has spent any time doing low latency systems in any language, knows that it needs to be allocation and lock free.
Regardless of whether it is C, C++, or a JVM language you are going to be reusing data structures, directly accessing memory, and in the case of JVM systems using off heap memory. If you are doing this correctly your JVM can be quite small and never GC (or more usually, GC right after initialization/configuration).
@chetanahuja, I think he means that long running web apps are initialized once and will continue running until they are stopped or restarted. So even though starting up the JVM may take time, that overhead is only incurred once.
Allocated memory that's paged out and not used is nearly irrelevant (I believe this is true even though a 200mb JVM heap will likely see some unnecessary rotation in/out due to GC - ideally you'd manually configure the heap appropriately (smaller) if you intended to challenge total system memory w/ many processes).
With Go using 20 times less memory, I'd take Go any day since I can just throw in 20 more processes and up my throughput 5-10x. But of course, scaling like this takes a bit of design and the processes have to share nothing.
That doesn't even make sense. The Scala version can simply spin up more threads. It's not an issue of parallelism; the Scala version just happens to be faster, and throwing more Go processes at the problem won't help. Go is already multithreaded. You don't need multiple processes to use the cores on the box. The JVM scales memory usage by default (heap size etc.) based on the amount of memory on the host. If you're going to worry about 100 MB of RAM, it's easy to constrain the JVM with -Xmx.
Ugh, after reading all the comments here I wonder how a mere-mortal programmer gets multi-threaded, network programming done right. It's not clear to me if there is a clear winner between Go and Scala/JVM. Are the majority of programs out there crappy, memory-hogging and non-performant?
It's one at a time. But since the server only performs a one-shot task, with no "hang on and wait 30 seconds"-style long connections, and the default socket backlog of TcpServer is > 100, every client gets served within a delay of (0~99)*6ms.
In short, every round is fully served, and the concurrency level is 100.
The first one may be your localhost not being mapped to 127.0.0.1?
The second one may be a problem of backlog size. On my system (ruby 2.1.0dev (2013-08-06) [x86_64-darwin12.4.0]) it's 128 and there's no reset. But you can manually change it with s.listen(1000).
As a word of advice -- you are almost certainly wasting your time writing a load balancer. There are close to zero cases where someone can legitimately justify such an exercise.
In any case, your benchmark is flawed (as virtually all benchmarks are). The reason the Go client is slower is because it tries to pervasively "thread" via the M:N scheduler -- every wait check causes it to yield the actual thread and switch goroutines, creating a large amount of overhead. The Scala case, on the other hand, is dramatically more limited and will not incur this overhead.
The Go server does not have this fault, and likely performs near the top. And aren't we talking about a server anyways?
Now as to the client, while we could naively criticize M:N scheduling based upon this, try giving it a more realistic workload (unless you seriously plan on load balancing pongs): instead of ping/pong, return larger payloads - e.g. 32KB - preferably over actual network connections (not localhost).
The Go client will catch up if not shoot into the lead. M:N scheduling is optimal for most real-world workloads, though it is less optimal for spin-off-a-million-goroutines that do nothing type tests.
This is not a test of TCP overhead, or a realistic test, but instead demonstrates the small overhead of goroutines when you give each a minuscule amount of wait work.
Edit: Go has e.g. net.Dial("tcp", "localhost:1201"), and as someone pointed out elsewhere, for a more accurate bench why not use the numerical address instead in both clients?
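i.e. something like this on the Go side (a sketch), which skips the resolver entirely so name lookup differences between the two clients can't skew the numbers:

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "127.0.0.1:1201") // numeric address, no DNS
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }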
EDIT. OK, so. I ran this benchmark myself on an 8-core Xeon running Linux. 2.13 GHz, CentOS 6.2. Kernel was 2.6.32-220.el6.x86_64. 50 gigs of RAM.
I got somewhere between 2.0 and 2.2 "milliseconds per ping" for Scala 2.9, and somewhere between 3.5 and 3.7 for Go 1.1. This is not the 10x difference that the authors reported, but it is something. The difference may be due in part to the different platform and hardware I am using.
Contrary to what I wrote earlier, I noticed that GOMAXPROCS=8 did seem to be slower than GOMAXPROCS=4 here. I got around 4 "milliseconds per ping" with GOMAXPROCS=8. Using a mutex and explicit condition variable shaved off maybe 0.2 milliseconds on average (very rough estimation).
Again contrary to what I wrote earlier, Nagle on versus off didn't seem to matter in the Go code. I still think you should always have it off for a test like this, but on my setup I did not see a difference.
I still don't think this benchmark is showing what they think it is. I have a hunch that this is more of a scheduler benchmark than a TCP benchmark at all. I think I'd have to haul out vtune to get any further, and I'm getting kind of tired (after midnight here).
> First of all, they're testing on MacOS, which is not going to be the platform they're actually using for the server code. The backends can be very different.
Unless you have evidence that Go has MacOS-only bugs, this is meaningless speculation (though I agree that it's possible that there's no problem on Linux and that people don't generally run mac servers, I'm not sure why we should privilege your hypothesis).
Agreed about GOMAXPROCS=4 - that seemed questionable to me (I don't know what it does, precisely, but I don't see a 4-anything limit in the Scala code).
Thanks to parent (after edit) for really testing. It turns out that he was right that Go on Mac is substantially worse than Linux - my bad. Maybe the Go lib authors didn't put much effort into reading the subtly different BSD/Darwin vs Linux syscall semantics.
To explain the Nagle algorithm's irrelevance to this case, we have to understand how it works. It doesn't delay any writes once you read. It only affects two consecutive small writes (my memory was fuzzy so I checked http://en.wikipedia.org/wiki/Nagle's_algorithm ). Odds of preemption between the client's write and its read seem small, so it shouldn't matter whether you Nagle or not.
Yeah, I always seem to forget the exact details of Nagle (probably because every project I've worked on just turns it off). Write-write-read is the killer, I guess-- just doing a small write followed by a small read, or vice versa, should not be affected by Nagle. So the results I got make sense.
Re: MacOS, I do know that some of the Go developers use Macs as their primary desktops. So I don't think they neglect it, but given that they're targeting the server space, it makes sense to optimize Linux more.
I still haven't seen any really good explanation of these results. I don't buy the argument that the JVM is providing the advantage here. The main thing that the JVM is able to do is dynamically recompile code, and this shouldn't be a CPU-bound task.
It's also hard to accurately profile Go programs on OS X because of bugs in its now quite stale kernel. Specifically, SIGPROF on OS X isn't always sent to the currently executing thread. Afaik this isn't a problem on newer FreeBSD kernels.
Also, that's kind of a lot of code. Here's my rewrite of the server: http://play.golang.org/p/hKztKKQf7v
It doesn't return the exact same result, but since you're not verifying the results, it is effectively the same (4 bytes in, 4 bytes back out). I did slightly better with a hand-crafted one.
A little cleanup on the client here: http://play.golang.org/p/vRNMzBFOs5
I'm guessing scala's hiding some magic, though.
I made a small change to the way the client is working, buffering reads and writes independently (can be observed later) and I get similar numbers (dropped my local runs from ~12 to .038). This is that version: http://play.golang.org/p/8fR6-y6EBy
Now, I don't know scala, but based on the constraints of the program, these actually all do the same thing. They time how long it takes to write 4 bytes * N and read 4 bytes * N. (my version adds error checking). The go version is reporting a bit more latency going in and out of the stack for individual syscalls.
I suspect the scala version isn't even making those, as it likely doesn't need to observe the answers.
You just get more options in a lower level language.
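For the curious, the "buffering reads and writes independently" change is roughly this shape (a sketch from memory, not the exact linked code): the main goroutine fires pings as fast as it can while a separate goroutine drains and checks the pongs.

    package main

    import (
        "io"
        "log"
        "net"
    )

    const pings = 200

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        done := make(chan error, 1)
        go func() { // reader: drain and check every reply independently
            buf := make([]byte, 4)
            for i := 0; i < pings; i++ {
                if _, err := io.ReadFull(conn, buf); err != nil {
                    done <- err
                    return
                }
            }
            done <- nil
        }()

        for i := 0; i < pings; i++ { // writer: don't wait for replies
            if _, err := conn.Write([]byte("ping")); err != nil {
                log.Fatal(err)
            }
        }
        if err := <-done; err != nil {
            log.Fatal(err)
        }
    }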