Speaker adaptation, unlikely to be deployed? There are plenty of really big production systems with deployed speaker adaptation, whether that's just saving CMVN stats or saving i-vectors. I've worked on a couple of them.
w.r.t. run time, though, agreed. Hearing the IBM folks say "... 10" in response to the "what's the RTF" question was funny.
(and, agreed, at this point the Switchboard announcements are definitely just marketing.)
Yup! Good production systems shoot for an RTF (real-time factor) of ~1.0, i.e. processing takes about as long as the audio itself. Since recognition is streaming, that means they can usually answer almost as soon as the speech ends.
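For concreteness, RTF is just wall-clock decode time divided by audio duration. A minimal sketch (the `recognize_stream` callable here is a hypothetical stand-in for whatever streaming decoder you're timing):

```python
import time

def real_time_factor(recognize_stream, audio, audio_duration_s):
    """Compute RTF = wall-clock decode time / audio duration.

    RTF <= 1.0 means the recognizer keeps up with the incoming audio,
    so a streaming system can answer almost as soon as speech ends.
    """
    start = time.monotonic()
    recognize_stream(audio)  # hypothetical streaming decode call
    elapsed = time.monotonic() - start
    return elapsed / audio_duration_s
```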
And it's _really easy_ to increase accuracy by taking more time: build bigger DNN acoustic models, explore a larger search space of hypotheses, use a slower language model (like an RNN) to rescore hypotheses, consider more possible pronunciations, etc.
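As a rough sketch of the rescoring idea, assuming hypothetical inputs (n-best tuples of text plus log-prob scores, an `rnnlm_logprob` scorer, and an interpolation weight you'd tune on dev data):

```python
def rescore_nbest(nbest, rnnlm_logprob, lm_weight=0.5):
    """Re-rank first-pass hypotheses with a slower (e.g. RNN) language model.

    nbest: list of (text, acoustic_score, first_pass_lm_score) tuples,
           all scores as log-probabilities.
    rnnlm_logprob: callable mapping text -> log-probability under the RNN LM.
    """
    def combined(hyp):
        text, am_score, lm_score = hyp
        # Interpolate the first-pass and RNN LM scores; keep the acoustic score as-is.
        mixed_lm = (1 - lm_weight) * lm_score + lm_weight * rnnlm_logprob(text)
        return am_score + mixed_lm

    return max(nbest, key=combined)
```

Every extra knob like this (bigger beams, bigger models, second-pass rescoring) buys accuracy by spending more compute per second of audio, which is exactly why RTF matters.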
(ML is usually a space / time / accuracy trade-off, so if you get phat accuracy gains at the cost of a significant slowdown, I'm usually unimpressed. The DeepMind TTS paper _was_ impressive because it went beyond the best we can do, so even though it took 90 minutes to generate 1 second of speech, it's cool because it shows where we can go. TBH, all of these Switchboard papers don't do a ton of new stuff, they just get more aggressive about system combination and tuning hyperparameters.)