I tried to do this once with Theano, and found that the latency of the round trip to the GPU and back made it not worthwhile for a single image. A batch of images at once might make it worthwhile. Admittedly, this isn't what Theano is intended for; custom CUDA might do a better job.
I got curious about the numbers so I did a napkin calculation:
In a 2012-vintage Nvidia article[1] they get 5-6 GB/s in both directions (4 MB array size), which works out to around 1500 Mpix/s with 8-bit RGBA pixels.
For a 15 Mpix image: transfers both ways would take 20 ms, and assuming the GPU kernel runs at ~5x CPU speed (CPU 30 Mpix/s, GPU 150 Mpix/s), the computation itself takes 100 ms. So 120 ms on the GPU vs 500 ms on the CPU.
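For anyone who wants to poke at the numbers, here's the arithmetic spelled out (the bandwidth and throughput figures are the assumptions from the comment, not measurements):

```python
# Napkin math for resizing a 15 Mpix 8-bit RGBA image on the GPU.
# All figures below are assumptions from the comment, not benchmarks.
PCIE_GBPS = 6.0          # ~5-6 GB/s each way, per the 2012 Nvidia article
BYTES_PER_PIXEL = 4      # 8-bit RGBA
MPIX = 15.0

# Upload + download of the full image over PCIe.
transfer_ms_one_way = MPIX * 1e6 * BYTES_PER_PIXEL / (PCIE_GBPS * 1e9) * 1e3
transfer_ms_both = 2 * transfer_ms_one_way

# Kernel time, assuming GPU runs ~5x CPU speed.
GPU_MPIX_PER_S = 150.0
CPU_MPIX_PER_S = 30.0
gpu_compute_ms = MPIX / GPU_MPIX_PER_S * 1e3
cpu_ms = MPIX / CPU_MPIX_PER_S * 1e3

gpu_total_ms = transfer_ms_both + gpu_compute_ms
print(f"GPU: {gpu_total_ms:.0f} ms, CPU: {cpu_ms:.0f} ms")  # GPU: 120 ms, CPU: 500 ms
```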
Interesting, thanks. So it seems like it'd still be a pretty heavy win for the GPU.
Also, a common use-case on the web today is to have one input image and then a large number of output images (usually smaller) for different screen resolutions & thumbnails. Seems like you could save a lot of time by uploading the input image once, running a bunch of resize convolutions for different output sizes while it's still in GPU memory, and then downloading the outputs as a batch.