FYI regarding highly-tuned code -- An ex ATI/AMD GPU core designer told me that ...

cronin101 · on April 23, 2013

Unlike previous versions, OpenCL 2.0 has been shown to be only about 30% slower than CUDA [1], and it can approach comparable performance given enough optimisation.

Since I am working on code generation of kernels to perform dynamic tasks, I can't afford to write at the lowest level available. (I'm accelerating Python/Ruby routines, though, so OpenCL gives a significant bonus without much pain at all.)

[1] http://dl.acm.org/citation.cfm?id=2066955 (Sorry about the paywall; I access it through my university VPN.)