Hacker News

65 seconds



This is now 240 seconds on a Google Cloud N2D instance. I successfully recreated my inference worker on a Google Cloud preemptible instance. My monthly cost went from $40 on DigitalOcean to ~$16. It's much slower, though.


Nice cost reduction! Is this using BudgetML from the post? Have you tried optimizing the model (e.g. quantization and converting to something like ONNX)? I know this can bring big speed gains (around 5x) on CPU for T5, another generative model. More info here: https://discuss.huggingface.co/t/speeding-up-t5-inference/18...
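For reference, dynamic int8 quantization in PyTorch is a one-liner. This is a minimal sketch using a toy stack of Linear layers as a stand-in; with a real model you would pass the loaded T5/transformer model instead, and the same call quantizes its Linear weights:

```python
import torch
import torch.nn as nn

# Toy model standing in for a transformer's feed-forward layers.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization: Linear weights are converted to int8 up front,
# activations are quantized on the fly at inference time. This mainly
# helps CPU inference, which is the BudgetML/preemptible-instance case.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    out = qmodel(x)
print(out.shape)  # torch.Size([1, 512])
```

The actual speedup depends on how much of the runtime is spent in the quantized Linear layers; the HuggingFace thread linked above combines this with an ONNX export for the larger gains.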



