> you would have to compare KDB and Java/Spark both running on the Xeon Phis, and/or running both on 11x m3.xlarge AWS instances - and even then, if Java/Spark does poorly on the Xeon Phi test...
If Spark can solve the business problem in less real (wall-clock) time some other way, I think that would be worth talking about, but it's my understanding that a bunch of mid-to-large machines connected to shared storage is the typical Spark deployment, and that the hardware costs are similar to the Phi solution.
So my larger question still stands: What is the value in this approach, if it's not faster or cheaper?
If "this approach" is using Java/Spark, instead of something that is a smaller binary, then there are some easy answers to your question:
- people don't want to write C (or K, or whatever yields a small binary)
- the cost of switching languages is not worth the speed-up
- it's already fast enough
I don't think you're wrong that kdb can be much faster than an equivalently sized Spark cluster, but simply being faster does not invalidate other approaches, which is what you seem to be arguing.
You're comparing KDB running on 4x Intel Xeon Phi 7210 CPUs, totaling 256 physical cores.
Compare that to the best result for Java/Spark, which ran on 11x m3.xlarge instances on AWS. That's only 44 vCPUs, and it's AWS rather than 100% dedicated hardware, so it's tough to tell what sort of impact the virtualization + EBS has on performance. Plus, from the AWS page: "Each vCPU is a hyperthread of an Intel Xeon core except for T2 and m3.medium", so those 44 vCPUs amount to roughly 22 physical cores, which does not do anything good for the results.
Yes, technically, KDB was 199.80x faster (not 1000!) than Java/Spark, when it was given vastly superior, dedicated hardware without virtualization, and when tackling a problem that the hardware setup is optimized for. Note that the author calls this out by saying "This isn't dissimilar to using graphics cards" when talking about the setup he was using for the KDB benchmarks.
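To put rough numbers on the hardware gap, here is a back-of-the-envelope sketch that divides the raw speedup by the difference in core counts. The figures are the ones quoted above; the helper name `per_unit_speedup`, the assumption of linear scaling with core count, and the assumption of two hyperthreads per physical Xeon core are mine, and a Xeon Phi core is not really comparable to a Xeon core, so treat the output as an illustration rather than a result.

```python
# Back-of-the-envelope normalization; input figures come from the comment above.
PHI_CHIPS = 4
CORES_PER_PHI_7210 = 64          # 4 x 64 = 256 physical cores
AWS_INSTANCES = 11
VCPUS_PER_M3_XLARGE = 4          # 11 x 4 = 44 vCPUs
RAW_SPEEDUP = 199.80             # KDB vs. best Java/Spark result


def per_unit_speedup(raw_speedup, fast_units, slow_units):
    """Speedup left over after dividing out the difference in compute units.

    Hypothetical helper: assumes performance scales linearly with unit
    count, which is optimistic for both systems.
    """
    return raw_speedup / (fast_units / slow_units)


kdb_cores = PHI_CHIPS * CORES_PER_PHI_7210
spark_vcpus = AWS_INSTANCES * VCPUS_PER_M3_XLARGE
spark_cores = spark_vcpus / 2    # assumption: two hyperthreads per physical core

print(f"KDB cores: {kdb_cores}, Spark vCPUs: {spark_vcpus}")
print(f"per vCPU: {per_unit_speedup(RAW_SPEEDUP, kdb_cores, spark_vcpus):.1f}x")
print(f"per physical core: {per_unit_speedup(RAW_SPEEDUP, kdb_cores, spark_cores):.1f}x")
```

Under those assumptions the gap shrinks from 199.80x to roughly 34x per vCPU, or roughly 17x per physical core, and that is before accounting for virtualization, EBS, or how well either system is tuned for its hardware.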
To get a sensible idea of the relative difference in performance, you would have to compare KDB and Java/Spark both running on the Xeon Phis, and/or running both on 11x m3.xlarge AWS instances - and even then, if Java/Spark does poorly on the Xeon Phi test, that might just mean that the Java/Spark developers haven't optimized for that particular setup.