
I am a data scientist and I care. The time when you could get away with just proofs of concept or a PowerPoint presentation is long behind us. Now we have to take our work into production, which means we get the exact same problems software engineering has always had.

Iff Rust helps us take it into production, we will use it.

But there's a lot of ground to cover to reach Python's libraries, so I'm not holding my breath.

That said, Python's performance is slow even when shuffling data over to NumPy.
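To illustrate what I mean by the shuffle cost (my own rough sketch, nothing authoritative): crossing from plain Python objects into NumPy forces a per-element conversion and a copy, while the vectorized work afterwards is comparatively cheap.

    import time
    import numpy as np

    # Crossing the Python -> NumPy boundary with plain Python objects
    # forces a copy and a per-element conversion.
    data = list(range(1_000_000))

    t0 = time.perf_counter()
    arr = np.array(data)      # copies and converts every Python int
    t1 = time.perf_counter()
    doubled = arr * 2         # vectorized, stays inside NumPy's C code
    t2 = time.perf_counter()

    print(f"list -> ndarray: {t1 - t0:.4f}s")
    print(f"vectorized op:   {t2 - t1:.4f}s")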



I must be missing something. Modern data science workloads involve fanning out data and code across dozens to hundreds of nodes.

The bottlenecks, in order, are: inter-node comms, GPU/compute, on-disk shuffling, serialisation, pipeline starvation, and finally the runtime.

Why worry about optimising the very top of the perf pyramid, where it will make the least difference? Why worry that you spent 1ms pushing data to NumPy when that data just spent 2500ms on the wire? And why are you even pushing from the Python runtime to NumPy instead of using Arrow?
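(Rough sketch of what I mean by using Arrow, assuming pyarrow and primitive, null-free columns; in that case the hand-off to NumPy is zero-copy.)

    import numpy as np
    import pyarrow as pa

    # A column that arrived as Arrow data (from IPC, Flight, Parquet, ...)
    # can be handed to NumPy without copying when the type allows it.
    arrow_col = pa.array(np.arange(1_000_000, dtype=np.int64))

    np_view = arrow_col.to_numpy(zero_copy_only=True)  # no copy for primitive, null-free data
    print(np_view[:5])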


Not everyone operates at that scale, and not every data science workload is DNN-based.

I agree with your general point, but the role I'd hope for Rust is not optimizing the top level; it's replacing the mountains of C++ with something safer and equally performant.


But the title of this post is Python vs Rust, not C++ vs Rust. Maybe BLAS could be made safer, but I don't think that's what's happening here.


A big push for NNs is to get them running in real time on local GPUs so we can make AI cars and other AI tech a reality. 2500ms could be life and death in many scenarios.


Every small thing counts when you have big data, which is exactly why you need performance everywhere. If Rust can help with that, I don't mind switching my team to it.

The problems usually arise when you do novel feature engineering, not in the actual model training.

But before this I was a C++ dev who checked the assembly for performance optimization, so I guess I have a bit more room to see when things are not up to snuff. If I got a cent for every "you are not better than the compiler writers, you can't improve this"...

Especially from the Java folks. They simply don't want to learn shit, which would be fine if they just weren't so quick with the lies/excuses when proven wrong.


This is just not true. The Python runtime is not the bottleneck. DL frameworks are DSLs written on top of piles of highly optimized C++ code that executes as independently from the Python runtime as possible. Optimizing the Python, or swapping it out for some other language, is not going to buy you anything except a ton of work. We can argue about using Rust to implement the lower-level ops instead of C++; that might be sensible, though not from a performance perspective.

In a "serving environment" where latency actually matters there are already a plethora of solutions for running models directly from a C++ binary, no python needed.

This is a solved problem and people trying to re-invent the wheel with "optimized" implementations are going to be disappointed when they realize their solution doesn't improve anything.


Yes. Let's say you want certain features of a voice sample. You need to do that feature engineering every time before you send the sample to the model. Doesn't it make sense to do it in C++ or Rust? This is already being done today. So if you are already doing parts of the feature engineering in Rust, why not continue?
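To make that concrete, here's roughly the kind of per-request preprocessing I mean, sketched in Python/NumPy (purely illustrative; the function and numbers are made up). This is exactly the hot path you'd port to C++ or Rust:

    import numpy as np

    def log_mag_spectrogram(samples: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
        """Per-request feature engineering: runs on every voice sample before inference."""
        window = np.hanning(frame)
        n_frames = 1 + (len(samples) - frame) // hop
        frames = np.stack([samples[i * hop : i * hop + frame] * window for i in range(n_frames)])
        return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

    # e.g. one second of 16 kHz audio
    features = log_mag_spectrogram(np.random.randn(16_000).astype(np.float32))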

Yeah it’s not reasonable right now because Python has the best ecosystem. But that will not always be the case!


I can't tell exactly what you mean, but I think you're confusing two levels of abstraction here. C++ (or Rust) and Python already work in harmony to make training efficient.

1. In TensorFlow and similar frameworks, the Python runtime is used to compose highly optimized operations into a trainable graph.

2. C++ is used to implement those highly optimized ops. If you have some novel feature engineering and need better throughput than a pure Python op can give you, you'd implement the most general viable C++ (or Rust) op and then use that wrapped op from Python, roughly as sketched below.
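A sketch of point 2 (the library path and op name are made up; the general pattern is a compiled shared library, C++ today or Rust behind a C ABI, registered with the framework and composed from Python):

    import tensorflow as tf

    # Hypothetical compiled op library: the heavy lifting lives in C++ (or Rust
    # behind a C ABI), built as a shared library and registered as TF kernels.
    feature_ops = tf.load_op_library("./libmy_feature_ops.so")  # path is made up

    @tf.function
    def preprocess(batch):
        # The graph just composes the optimized op; Python never touches the data.
        return feature_ops.my_feature_transform(batch)  # op name is made up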

This is how large companies scale machine learning in general, though it applies to all ops, not just feature-engineering-specific ones.

There is no way that Instagram is using a pure Python image-processing lib to prep images for their porn-detection models. That would cost too much money and take way too much time. Instead they almost certainly wrap some C++ in some Python and move on to more important things.


I know. That's how we do it too. You don't see any benefit in doing it all in Rust instead of Python wrappers + C++? Especially when handling large data like voice, iff there were a good ecosystem and toolbox in place?


Maybe, but then we're no longer making an argument about performance, which is what I was responding to in your initial claims that "everything counts" and that the NumPy shuffle is slow. That's a straw-man argument that has zero bearing on actual engineering decisions.

EDIT: clarification in first sentence


> The Python runtime is not the bottleneck.

This smells like an overgeneralization. Often things that aren’t a bottleneck in the context of the problems you’ve faced might at least be an unacceptable cost in the context of the 16.6 ms budget someone else is working within.


In what circumstance would one measure an end-to-end time budget in training? What would that metric tell you? You don't care about latency, you care about throughput, which can be scaled almost completely independently of the "wrapper language", for lack of a better term; in this case that's Python.

It seems some commenters on this thread have not really thought through the lifecycle of a learned model and the tradeoffs existing frameworks exploit to make things fast _and_ easy to use. In training we care about throughput. That's great, because we can use a high-level DSL to construct a graph that trains in a highly concurrent execution mode or on dedicated hardware. Using the high-level DSL is what allows us to abstract away these details and still get good training throughput. Tradeoffs still bleed out of the abstraction (batch size, network size, architecture, etc. affect how efficient certain hardware will be), but that is inevitable when you're moving from CPU to GPU to ASIC.

When you are done training and want to use the model in a low-latency environment, you use a C/C++ binary to serve it. Latency matters there, so you exploit the fact that you're no longer defining a model (no need for a fancy DSL) and just serve it from a very simple but highly optimized API.
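Concretely, and only as a sketch assuming TensorFlow: the Python side's last job is to export the trained model, and the serving side is a separate, already optimized C++ binary (e.g. TensorFlow Serving) that loads the exported directory, so Python never sits on the latency path.

    import tensorflow as tf

    # Training side: export the trained model once from Python.
    inputs = tf.keras.Input(shape=(4,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.Model(inputs, outputs)
    # ... model.compile(...) and model.fit(...) happen here ...
    tf.saved_model.save(model, "/tmp/my_model/1")

    # Serving side: a C++ binary (e.g. TensorFlow Serving) points at
    # /tmp/my_model and answers requests; no Python in the request path.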



Looks good. I've tried Numba and it was extremely limited.

In our current project we can't use GPUs in production, so we can only use them for development. Not my call; that's operations. They have a Kubernetes cluster and a take-it-or-leave-it attitude.

We did end up using C++ for some things and Python for most. I'd feel comfortable with C++ or Rust alone if there were a great ecosystem for DS, though.


I see a graph on... a logarithmic scale? No units? I don't know what that benchmark means.


Good lord, hopefully latency isn't 2.5 seconds!


Latency in training literally does not matter; you care about throughput. In serving, where latency matters, most DL frameworks allow you to serve the model from a highly optimized C++ binary, no Python needed.

The poster you are replying to is 100% correct.


The quote was "data just spent 2500ms on the wire". That's not latency. For a nice 10GbE connection, that's optimistically 3 GB or so worth of data. Do you have 3 GB of training data? Then it will spend 2500 ms on the wire being distributed to all of your nodes as part of startup.
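Back-of-the-envelope, in case the numbers look odd (my own arithmetic):

    # 10 GbE moves roughly 1.25 GB/s of payload, ignoring protocol overhead.
    link_bytes_per_s = 10e9 / 8           # ~1.25e9 bytes/s
    seconds_on_wire = 2.5
    payload_bytes = link_bytes_per_s * seconds_on_wire
    print(f"{payload_bytes / 1e9:.2f} GB in {seconds_on_wire} s")   # ~3.12 GB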


I can't even. How could you ever get 2500 ms in transit? That's like circling the globe ten times.


Maybe a bunch of SSL cert exchanges through some very low bandwidth connections? ;)

Still, it's more likely a figure used for exaggeration, for effect.


You are a data scientist who seems to lack an understanding of how deep models are productionized.

I do so not infrequently, and I don't see how Rust bindings would help me at all.


Or maybe, just maybe, you know nothing of optimizations, so you just go: impossible!


The optimization you are describing is premature. You don't need Rust to productionize your models, and in most cases you don't even need to be coding in a low-level language at all.


I don’t think I’ll change your mind.

You clearly have more experience than I do and can solve everything in Python instead of C++ and/or CUDA.

You win.


I mean, it'd be easier to change my mind if you had a single reason behind anything you've claimed.

I would love to understand why Rust would be more effective for productionizing ML models than the existing infrastructure written in Python/C++/CUDA.



