Overall a good article with some insightful points. One thing strikes me as a bit off, though: the recommendation for one or two CPU cores per GPU seems not quite right. Looking only at CPU<->GPU performance, that might be reasonable, and as the author mentions, you can use the spare core to prep the next mini-batch and handle those sorts of tasks. However, training the model is only one part of the system, and I tend to value the overall performance of the system more than any one facet.
For example, despite training on GPUs being very computationally intensive, I find one of the most onerous tasks to be the custom data prep/transformation/augmentation pipeline. Because these things are usually pretty application specific, there often isn't a ready-made toolkit that does all the heavy lifting for you (unlike the GPU training, which has Torch, Caffe, pylearn, cuda-convnet, lasagne, cxxnet...), so you end up having to roll it yourself. You also tend to run this code repeatedly, and with large data that isn't trivial. Usually you won't invest--at least, I don't--in writing custom CUDA code for this sort of thing, if it's even possible, so having lots of fast CPU cores is a win. I usually write multi-threaded routines for my processing steps and run them on 8-32 cores for huge gains. So my point is that "one or two cores per GPU" is a bit of a narrow recommendation.
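A rough sketch of the kind of multi-core preprocessing I mean, using only Python's standard multiprocessing module; the transform() body, the file paths, and the core count are just placeholders for whatever application-specific prep you actually need:

    import glob
    from multiprocessing import Pool

    import numpy as np


    def transform(path):
        # Placeholder for an application-specific step: load, decode, resize,
        # normalize, augment, etc. Here it just loads an array and standardizes it.
        x = np.load(path)
        return (x - x.mean()) / (x.std() + 1e-8)


    if __name__ == "__main__":
        paths = sorted(glob.glob("data/raw/*.npy"))  # hypothetical input location
        # Fan the work out over many CPU cores; scale 'processes' to your machine.
        with Pool(processes=32) as pool:
            processed = pool.map(transform, paths, chunksize=16)
        np.save("data/train_preprocessed.npy", np.stack(processed))

The same pattern works for pretty much any per-example transformation, and the speedup is roughly linear in the number of cores as long as you aren't I/O bound.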
The same applies if you want to do 'real time' data augmentation (this is hinted at later in the article) and/or if you want to deploy with CPU only. Sure, you need the GPU to do the training in a reasonable amount of time, but once you've fit your model, it might not be worth it to deploy to GPU-enabled machines if all you're doing is forward passes.
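A rough sketch of what 'real time' augmentation on the CPU might look like: a few worker threads keep a queue of augmented mini-batches topped up while the GPU consumes them. The augment() and train_step() bodies here are placeholders, not any particular framework's API:

    import queue
    import threading

    import numpy as np

    BATCH_SIZE = 128
    batch_queue = queue.Queue(maxsize=8)  # small buffer so workers stay a few batches ahead


    def augment(batch):
        # Placeholder augmentation: random horizontal flips on (N, C, H, W) image arrays.
        flip = np.random.rand(len(batch)) < 0.5
        batch[flip] = batch[flip, :, :, ::-1]
        return batch


    def producer(data):
        while True:
            idx = np.random.choice(len(data), BATCH_SIZE, replace=False)
            batch_queue.put(augment(data[idx].copy()))


    def train_step(batch):
        # Placeholder for the actual GPU forward/backward pass.
        pass


    if __name__ == "__main__":
        data = np.load("data/train_preprocessed.npy")  # hypothetical preprocessed file
        for _ in range(4):  # a few CPU workers feeding one GPU
            threading.Thread(target=producer, args=(data,), daemon=True).start()
        for _ in range(10000):
            train_step(batch_queue.get())

With only one or two cores per GPU, the producers can't keep the queue full and the GPU ends up waiting on the CPU.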
PS: This is also a place where running on EC2 can be a win. Maybe it's more economical to build a workstation, but once you're in the cloud you can spin up a few 32-core boxes to run your preprocessing quickly, shut them down and spin up some GPU instances for training, then shut those down and spin up some mid-tier boxes to run the models over a bunch of data without breaking the bank. All in 'one place'.
Thanks for sharing your experience – this is a fair point. Often it is possible to pre-process your data and save it to disk so that you can skip the decompression/conversion/transformation step once you start training your net, but I can imagine applications where this is impractical or just does not work. I will add a small note to my blog about this.