Could you talk more about your phone screens? What criteria are you filtering on? What sorts of questions have you found to be effective at stratifying candidates?
I'm mostly looking for people who want to learn about engineering with largish datasets (~100 GB/day for us), and have some of the prerequisite skills. Our codebase is mostly in Spark/Scala and uses functional programming idioms, so I'm looking for people who either know or want to learn how to use those. I'm also specifically trying to filter out people who mostly want a stats-heavy, machine-learning-heavy job, since that's not what we do.
An engineer who wants to learn data science is a great fit for us, an academic who wants to write R all day is not (though an academic who wants to learn engineering/functional programming is fine!)
Beyond that, I ask some questions about projects they've worked on, and in particular, how their approach would change if assumptions were different. Here I'm looking for the ability to reason backwards from a business goal, as opposed to somewhat blindly applying statistical techniques.
If they do well on these, we send the take-home exam. As previously noted, this is specifically designed to require relatively little knowledge but heavily test analysis skills, and lightly test programming skills. It's almost impossible to complete this exam without using Google effectively, so that's another thing I'm testing.
Can I ask why functional programming in particular? I can see why you might want to avoid Java for big data, but isn't the average ML algorithm more in the procedural mould?
Wouldn't Python with NumPy be a better fit? Or Fortran with some hand-waved interface code?
Back in the early '80s, when I did map-reduce, we used PL1/G.
Have you had performance issues getting things to conform to functional paradigms?
For example, I've found that as a pipeline gets optimized for production use, it needs to preallocate all of its output space and then modify it in place at each step (like a one-hot encoder flipping a few bits in specific rows of a zeroed array instead of allocating new arrays and copying them in).
I find it difficult to reconcile this sort of code with a "pure functions without side effects" philosophy and still have it perform at an acceptable level.
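To make that contrast concrete, here is a rough Scala sketch of the two styles for a dense one-hot encoder. All names and shapes are made up for illustration, not taken from any real pipeline:

    // Hypothetical sketch of the contrast described above: an in-place
    // one-hot encoder writing into a preallocated, zeroed buffer versus a
    // "pure" version that allocates a fresh array per row.
    object OneHotSketch {
      // Imperative: the caller preallocates rows * numClasses doubles,
      // and the encoder just flips the relevant entry for each row.
      def encodeInPlace(labels: Array[Int], numClasses: Int, out: Array[Double]): Unit = {
        var row = 0
        while (row < labels.length) {
          out(row * numClasses + labels(row)) = 1.0
          row += 1
        }
      }

      // Pure: each row gets a freshly allocated array, no mutation visible
      // to the caller, at the cost of one allocation per row.
      def encodePure(labels: Array[Int], numClasses: Int): Array[Array[Double]] =
        labels.map { label =>
          Array.tabulate(numClasses)(idx => if (idx == label) 1.0 else 0.0)
        }
    }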
We're mostly doing ETL on large datasets, so the code needs to parallelize well, but beyond that performance isn't really a big concern. We use ML in research, but no models in production, because the costs of increased maintenance/lost transparency generally outweigh the benefits in our use case.
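For a sense of what that looks like in practice, here is a minimal Spark/Scala sketch of such an ETL step. The paths, schema, and column names are placeholders, not the actual pipeline:

    // Minimal sketch of a Spark ETL step of the kind described above.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object EtlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

        // Each transformation is a pure description of work; Spark handles
        // partitioning the data and running it in parallel across the cluster.
        val events = spark.read.json("s3://example-bucket/raw/events/")
        val cleaned = events
          .filter(col("userId").isNotNull)
          .withColumn("eventDate", to_date(col("timestamp")))

        cleaned.write.partitionBy("eventDate").parquet("s3://example-bucket/clean/events/")
        spark.stop()
      }
    }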
In jobs that were heavy on ML, I would use high-performance tools for the models (imperative code, numeric computing packages, etc.) and functional code for the ETL, which worked pretty well. No need to be dogmatic about it: a 70% pure codebase is still generally easier to reason about than a 20% pure codebase.
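As a rough illustration of that mixed style (purely a sketch, not code from any of those jobs): a function can be pure from the caller's point of view while using local mutation for the numeric hot loop, and the functional layer just composes such kernels.

    object MixedStyleSketch {
      // Externally pure: the same input always yields the same output and
      // the input array is never modified.
      def normalize(xs: Array[Double]): Array[Double] = {
        // Imperative kernel: single pass for the sum, preallocated output.
        var sum = 0.0
        var i = 0
        while (i < xs.length) { sum += xs(i); i += 1 }
        val out = new Array[Double](xs.length)
        i = 0
        while (i < xs.length) { out(i) = xs(i) / sum; i += 1 }
        out
      }

      // The functional pipeline layer composes such kernels without caring
      // how they are implemented internally.
      def pipeline(batches: Vector[Array[Double]]): Vector[Array[Double]] =
        batches.map(normalize)
    }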
For a lot of numerical computing, functional programming maps more naturally to mathematical notation. However, Scala is usually a worse choice than Java for numerical computing, since everything is a boxed type.
This is straight up false; why do you think Scala doesn't have primitive values? A Long will be either a value or a reference type as needed, despite being spelled only one way instead of two different ways as in Java.
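To illustrate (a small sketch; whether a given Scala Int/Long ends up boxed depends on context, and the emitted bytecode can be checked with javap):

    object BoxingSketch {
      // Backed by a primitive long[] on the JVM: contiguous, unboxed storage.
      val primitiveArray: Array[Long] = Array(1L, 2L, 3L)

      // Generic collections erase to Object, so these values are boxed
      // as java.lang.Long behind the scenes.
      val boxedList: List[Long] = List(1L, 2L, 3L)

      // In generic code, type parameters normally erase to Object and box;
      // @specialized asks the compiler to also emit unboxed variants.
      class Cell[@specialized(Int, Long) A](val value: A)
    }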
A 'primitive type' is one which can be operated on directly by intrinsic CPU instructions. My understanding of Scala was that all values (such as Long, Int, ...) are encapsulated inside an object.
Therefore an array of boxed types will not be memory-aligned, and vector instructions (which are very important in scientific computing) cannot be used.
Perhaps something has changed in Scala land since I last looked?