That line refers to training the model from scratch. You can still run the trained model very quickly with one "cheap" GPU.
That said, I'm not sure why you wouldn't get a similar result by training on the EC2 or GCE instances that have 8 V100s, or even by training with fewer GPUs and accumulating gradients to reach the same effective batch size (rough sketch of that below).
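For concreteness, here's a minimal sketch of gradient accumulation in PyTorch. The model, sizes, and hyperparameters are made up for illustration; the point is just that gradients from several small batches are summed before a single optimizer step, so one GPU can mimic the batch size of several.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                        # stand-in model (hypothetical)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 8        # e.g. 1 GPU simulating an 8-GPU batch
micro_batch = 32       # per-step batch; effective batch = 8 * 32 = 256

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_batch, 512)             # dummy data
    y = torch.randint(0, 10, (micro_batch,))
    loss = loss_fn(model(x), y) / accum_steps     # scale so summed grads
    loss.backward()                               # match one big batch
optimizer.step()                                  # one update per effective batch
optimizer.zero_grad()
```

The trade-off is wall-clock time: you take `accum_steps` forward/backward passes per update instead of running them in parallel across GPUs, but the gradient the optimizer sees is (numerically, up to batch-norm-style statistics) the same.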