Python is king for analysis, but for big data engineering, most of the building blocks (as mentioned elsewhere in this thread: Akka, Kafka, Flink, Spark, etc.) are written in Scala.
PySpark might be the go-to language for data scientists playing with the Spark REPL or MLlib, but for production data engineering, Scala is still king.
Besides performance, and the obvious fact that not knowing Scala makes it difficult to understand the underlying Spark code, there are multiple ways in which Scala is more natural to develop in (many libraries are Scala-only, for example).
I don't think so. Python and data frames are arguably more natural to think about and reason with than Scala.
I have no doubt that Scala is more performant, and the "fat" jar mechanism makes dependency management and code shipping very easy (it's still tricky to install Python dependencies on your Spark nodes), but the pandas ecosystem is definitely more intuitive to understand.
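For context, here is a minimal sketch of what the fat-jar setup can look like with sbt-assembly (the project name, versions, and the extra dependency below are illustrative assumptions, not anything from the thread):

    // build.sbt: minimal sbt-assembly setup. Spark is marked "provided" so the
    // fat jar only bundles your own dependencies; `sbt assembly` then produces a
    // single jar you hand to spark-submit.
    name := "my-pipeline"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
      "com.typesafe"      % "config"    % "1.3.4"
    )

    // project/plugins.sbt additionally needs something like:
    // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")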
I have the impression you are leaning towards data analytics (pandas, data frames, etc.), whereas I and some other commenters may be thinking more of data pipelining architectures, where you can't afford wrong typing, the scale is quite large, and you are not even doing the kind of operations pandas dataframes are useful for.
It depends on the use case. Our work primarily revolves around extending Spark with custom pipelines, models, ensembles, etc. to be deployed into our production systems (petabyte scale). Scala was really the only way to go for us.
I can understand the performance difference, but I have not generally seen a difference in building custom pipelines and ensembles... although I grant I'm not at your scale yet.
What specific kinds of pipelines did you have trouble with in PySpark?
Although we decided to start using Scala specifically because PySpark was not as performant (2.0 was not so long ago), a reasonable use case I always keep in mind is aggregation (and, in general, any API that is still not solid, experimental, or under active development). Python bindings are always the last to become available, because all the groundwork is done in Scala. We have a relatively large-scale process that takes advantage of custom-built aggregation methods on top of grouped Datasets, where we can pack a good deal of logic into the merge and reduce steps of the aggregation. We could replicate this in Python using reducers, but aggregating makes more sense semantically, which makes the code easier to understand. Also, the testing facilities for Spark code under Scala are a bit more advanced than under Python (they are not great, but they are better), even before considering that strong typing makes a whole class of errors impossible, right out of the compiler.
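To make that concrete, here is a minimal sketch of the kind of typed Aggregator you can plug into a grouped Dataset in Scala (the Event case class and the sum-of-durations logic are made-up stand-ins, not our actual pipeline):

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
    import org.apache.spark.sql.expressions.Aggregator

    // Hypothetical input type for illustration.
    case class Event(userId: String, durationMs: Long)

    // Typed aggregator: the reduce step folds rows into a per-partition buffer,
    // and the merge step combines partial buffers across partitions.
    object TotalDuration extends Aggregator[Event, Long, Long] {
      def zero: Long = 0L
      def reduce(acc: Long, e: Event): Long = acc + e.durationMs
      def merge(a: Long, b: Long): Long = a + b
      def finish(acc: Long): Long = acc
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    val spark = SparkSession.builder.appName("agg-example").getOrCreate()
    import spark.implicits._

    val events = Seq(Event("a", 100), Event("a", 250), Event("b", 40)).toDS()
    val totals = events.groupByKey(_.userId).agg(TotalDuration.toColumn.name("totalMs"))
    totals.show()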
I very, very rarely think of using PySpark (and I have way more experience with Python than with Scala) when working with Spark. In a kitchen setting, it would be like having to prepare a cake and having to choose between a fork and a whisk. I can get it done with the fork, but I'll do a better and faster job with the whisk.
I only checked the implementation of the "Arrow UDFs" recently, out of curiosity about the Arrow interaction, so I still don't have a strong opinion. My main concern is that a lot of the PySpark work revolves around how to interact with and speed up the system while still sitting on top of the Scala base.
I'd recommend Dask (I haven't tried it much, but from all I've seen it is top-notch) to anyone who wants Python all the way down (at least until you hit the C at the bottom) ;)
Well, we run a hundred-machine cluster on Dataproc for our workloads. Dask is still not battle-tested or cloud-ready (or even available there), and it is generally harder to work with than PySpark.
In general, I will happily stay in the Spark world using PySpark rather than go to Dask right now.
Being able to pass data through Arrow is a big improvement, but there is still a lot of serialisation going on that you pay for in Python. Also, if you want to do anything in the fancy areas (like writing your own optimisation rule for the Spark SQL optimiser), it's Scala. Even something as simple as writing a custom aggregator is impossible in Python (at least it was in 2.2; I haven't checked 2.3 or the current 2.4).
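For what it's worth, here is roughly what hooking a custom rule into the optimiser looks like on the Scala side, against the Spark 2.x API (the rule itself, which just rewrites "expr * 1" to "expr", is a toy example):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Toy optimisation rule: rewrite `expr * 1` to just `expr`.
    object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
        case Multiply(child, Literal(1, _)) => child
      }
    }

    val spark = SparkSession.builder.appName("custom-rule").getOrCreate()

    // Register the rule through the experimental extension point; it now runs
    // as part of the Catalyst optimiser for this session.
    spark.experimental.extraOptimizations = Seq(SimplifyMultiplyByOne)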
Scala is still primarily used for data engineering workloads because it is a JVM language. (There's Java too, but no one wants to write Java code.)
PySpark is often used for data science experimentation, but it is not as frequently found in production pipelines due to the serialization/deserialization overhead between Python and the JVM. In recent years this problem has become less pronounced thanks to Spark DataFrames, which largely erase the performance difference between PySpark and Scala Spark, but for UDFs, Scala Spark is still faster.
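As a point of contrast, a Scala UDF stays entirely inside the JVM executor, so rows never have to be shipped to a separate Python worker process and back (the UDF below is just a made-up example):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder.appName("scala-udf").getOrCreate()
    import spark.implicits._

    // A Scala UDF runs inside the JVM, so there is no per-row
    // serialization out to a Python worker and back.
    val normalize = udf((s: String) => s.trim.toLowerCase)

    val df = Seq("  Foo", "BAR ").toDF("raw")
    df.select(normalize($"raw").as("clean")).show()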
A newer development that may change all this is the introduction (in Spark 2.3) of Apache Arrow, an in-memory columnar data format that lets Python UDFs work with the in-memory data without serializing/deserializing it. This is very exciting, as it lets Python get closer to the performance of JVM languages.
I've played around with it on Spark 2.3 -- the Arrow interface works, but it is still not quite production-ready. I expect it will only get better.
Many folks are making strategic bets on Arrow technology due to the AI/GPU craze (an in-memory standard enables multiple parties to build GPU-based analytics [1]), so there is tremendous momentum there.
At some point I expect the relative importance of Scala on Spark to decrease with respect to Python (even though the Spark APIs are Scala-native).