This now takes 240 seconds on a Google Cloud N2D instance. I successfully recreated my inference worker on a preemptible instance, and my monthly cost went from $40 on DigitalOcean to ~$16. It's much slower, though.
Nice big cost reduction! Is this using BudgetML from the post?
Have you tried optimizing the model (e.g. quantization and converting to something like ONNX)? I know this can bring big speed gains for T5, another generative model, on CPU (5x faster). More info here: https://discuss.huggingface.co/t/speeding-up-t5-inference/18...
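In case it's useful, here's a minimal sketch of the quantization half of that, assuming a Hugging Face T5 checkpoint running on CPU; the t5-small checkpoint and the prompt are just placeholders, and the ONNX export would be a separate step on top of this:

    # Dynamic int8 quantization for CPU inference (sketch; t5-small is a placeholder).
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()

    # Swap the Linear layers for int8 versions; activations are quantized
    # on the fly at run time, which is what "dynamic" means here.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    inputs = tokenizer("translate English to German: Hello, world!",
                       return_tensors="pt")
    with torch.no_grad():
        out = quantized.generate(**inputs, max_length=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

The int8 weights alone shrink the model and speed up the matrix multiplies on CPU; the bigger gains reported in that thread come from pairing quantization with an ONNX Runtime export.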