Show HN: Deploy ML Models on a Budget (github.com/ebhy)
117 points by htahir111 on Feb 1, 2021 | 47 comments



What does this offer that cortex[1] does not? With cortex supporting scale-to-zero, it costs me less than $1/mo to idle, and it autoscales up to a cap only when I need it, even using spot instances when available if so configured.

Disclaimer: not affiliated, I just enjoy the entire experience of using cortex. We are moving to Kubeflow because we need scale and more features, but cortex is great for what it does.

[1]: https://www.cortex.dev/


Big fan of cortex, and I'm also on its boards often :-)

However, this works without a Kubernetes cluster, while Cortex does not. A managed Kubernetes cluster on GCP, for example, costs $100/month[1]; AWS and Azure are similar. As for bare-metal, self-managed solutions: if you can manage a cluster on your own like that, then yes, there is no need for BudgetML :-)

[1]: https://cloud.google.com/kubernetes-engine/pricing


> A managed kubernetes cluster on GCP e.g. costs $100/month

As the pricing link notes, it's worth pointing out your first cluster is free (aside from VM costs), and people interested in this type of project probably don't have any other clusters.


Chiming in here to +1 Cortex for fast deployments at a small-ish startup. We have 'spike-y' loads and Cortex removes the cognitive overhead of autoscaling, spot instance management, etc.


How does it compare to Hydrosphere? https://docs.hydrosphere.io/


I rolled my own version of this for a side project. I ended up on a Digital Ocean droplet with RAM upgrades.

https://lucidbeaming.com/blog/fine-tuning-a-gpt-2-language-m...

The extra monthly $40 for a novelty project has me searching for cheaper alternatives. Google preemptibles scared me off because they can scale dynamically and generate runaway costs if my project suddenly got popular on Reddit or something.


Preemptible instances should not have any associated auto-scaling. It is common for people to run pre-emptible instances as part of a "Managed Instance Group", with which you can associate some kind of autoscaling policy.

You can run preemptible instances without any kind of management or with only very simple management and you should be okay. I recommend a simple managed group with a policy of, "I want maximum one of these instances running at all times and, if it gets preempted, then please start a new one."
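
Something like this, roughly (just a sketch; the template/group names, machine type, and zone are placeholders):

    # One preemptible VM, recreated automatically by the managed group after preemption
    gcloud compute instance-templates create inference-template \
        --machine-type=n1-standard-2 --preemptible

    gcloud compute instance-groups managed create inference-group \
        --template=inference-template --size=1 --zone=us-central1-a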

If you're using Google Cloud, don't deny yourself the cost savings of preemptible instances. When I was using the same infrastructure, preemptions were very rare!


Nice! That's a cool side project. Not sure how Google preemptibles can scale, though. They are not auto-scaling, so it's a fixed cost.


I swear I saw something in the terms about charges from sustained uptime, which is related. A popular resource would trigger that. It starts and stops on demand, so that automation is what I am leery of. I will look into it again though; maybe I misinterpreted the fine print.


Ah no, that's more for AWS spot instances I think. In Google, preemptible rates are constant, which makes them way more deterministic.


What's the average inference time like for your model on DO?


65 seconds


This is now 240 seconds on a Google Cloud N2D instance. I successfully recreated my inference worker on a Google preemptible instance. My monthly cost went from $40 on Digital Ocean to ~$16. It's much slower though.


Nice big cost reduction! Is this using BudgetML from the post? Have you tried optimizing the model (e.g., quantization and converting to something like ONNX)? I know this can bring big speed gains for T5 on CPU (5x faster), another generative model. More info here: https://discuss.huggingface.co/t/speeding-up-t5-inference/18...
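
For reference, a rough sketch of dynamic int8 quantization with onnxruntime (this assumes the model has already been exported to model.onnx, e.g. via torch.onnx.export; file names are placeholders):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Weights are stored as int8; activations are quantized dynamically at runtime
    quantize_dynamic(
        model_input="model.onnx",
        model_output="model-int8.onnx",
        weight_type=QuantType.QInt8,
    )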


100% agreed that Kubernetes is overkill for many if not most deployments. We constantly see small startups prematurely adopting Kubernetes, which is a costly investment in terms of building up internal knowledge and maintaining the cluster. Gradient (https://gradient.paperspace.com) is a push-to-deploy service built on Kubernetes, but you don't need to know anything about Kubernetes to use it. We feel this is the right way to leverage the power of Kubernetes unless you're at Netflix or Lyft scale. In Gradient, you just provide a model, select an instance type (several affordable GPU options are offered), and a Docker container. Everything else, e.g. autoscaling, auth, rolling updates, etc., is handled automatically. Kubernetes does an amazing job providing the backend for these operations, but data scientists and even devops teams at startups should not be wasting time rolling their own Kubernetes cluster and installing/maintaining an inferencing service on top.


And what about just using BudgetML, which is free and open-source ;-) cheeky grin :-P


They are just two different solutions that have pros and cons, like any two solutions :) A few differences that jump out:

- Setup time: Setting up GCP, setting up a certificate, adding a static IP, etc. is not seamless/adds friction

- Autoscaling and rolling updates (no downtime)

- Team management and collaborative environment with usage tracking, permissions, etc.

- Optional integration with a pipelining service for training, tuning, deploying models in a single tool

And a point of clarification: Practically speaking, neither tool is free. Both require a cloud instance so they will cost roughly the same for the end user (Gradient also supports preemptible instances).


if you wanna pay for it, better use ZenML! Ah wait, it's also open-source https://github.com/maiot-io/zenml


I have recently been wondering how people deploy high-throughput models in production that require a GPU to really hit their intended performance. My favorite host (Hetzner) doesn't offer dedicated GPU servers anymore, and OVH doesn't seem to either; in general the problem seems to be around fraud w.r.t. coin mining, etc. So what do people use instead? Dedicated EC2 instances still seem expensive.


Hi there, I work at OVH (product manager, data/AI) :)

In fact we do offer GPUs in several ways:

- Dedicated servers (bare metal): hidden from the website today, as we are refreshing the hardware. New ones with GPUs are planned in a few weeks. But you will need to commit to a GPU server for a minimum term. Good, but not so flexible. Not good for an AI training budget imho, except if it's 24/7 training.

- Public cloud / VMs: like AWS/GCP/Azure, we provide VMs with 1 to 4x NVIDIA Tesla V100 16GB. The good thing is flexibility (pay-as-you-go, hourly). The "bad thing" is performance when dealing with large datasets, and it also requires sysadmin tasks (SSH, dist upgrades, you know it). You'll also need tools such as Kubeflow to give a team pipelines, orchestration, and debugging tools.

- (NEW) Public Cloud AI Training: it's basically GPU as a service, like Paperspace for example. You start a "job" via UI/CLI/API with 1 to 4 NVIDIA Tesla V100 GPUs, plugged into provided images (notebooks + TensorFlow, notebooks + PyTorch, fastai, Hugging Face, ...), or your own images (hello Docker public/private registries).

The good thing is: you have full flexibility (you can launch as many jobs as you want), pay per minute, and start in 15 seconds. No SSH required, no drivers to install, ... And one last good thing is the hidden part: specific "cache" storage near the GPUs to play with large datasets (several TBs) without bottlenecks such as latency. It's as good as local NVMe storage. The bad thing is: nothing :)

I'm not here for hidden advertisement, but fuck yeah, we do have something to offer for GPUs :p The links and prices are all public on our website; for a free voucher, just DM me! Have a good day.


Advanced scheduling with K8s... inference requests distributed across a ton of spot/preemptible VMs. We've achieved the best performance on CPU with MKL.

It's often a misconception that you need a GPU for inference. In many cases, the overhead of transferring data to the GPU makes it much slower than a well-tuned CPU.
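
A rough sketch of the kind of knobs involved on the PyTorch side (resnet18 is just a stand-in model, and the thread counts should match the physical cores you actually have):

    import torch
    import torchvision

    # Match intra-op threads to the cores available to the VM/pod; avoid oversubscription
    torch.set_num_threads(4)
    torch.set_num_interop_threads(1)

    model = torchvision.models.resnet18(pretrained=True).eval()
    batch = torch.randn(8, 3, 224, 224)

    with torch.no_grad():
        output = model(batch)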



Intel has also recently announced AVX512 VNNI (Vector Neural Network Instructions), but it is only available for the Xeon CPU family, not desktop parts. This should significantly improve inference performance on the CPU.


Even AVX512F (Foundation) is powerful: https://NN-512.com

(It depends on the CPU, how many cores, and the structure of the neural net, but MKL will generally only achieve between 80% and 90% of NN-512's AVX-512 performance)


How do you tune a CPU for inference?


You can generate custom SIMD inference code for your neural net (tensor shapes, etc.) and CPU cache hierarchy (L1, L2 per thread, etc.)

For example, here is ResNet50 for a particular Skylake-X CPU:

https://nn-512.com/browse/ResNet50


I didn't know this exists, thank you!


You usually tune the software that runs on it (compiler, libraries, build options, etc.). You can also play the overclocking game if you run dedicated hardware.


I wondered what those options were. The other comment mentioning SIMD is interesting. Thanks :)


You should check out Lambda GPU cloud. It's around 40% of the price of AWS/GCP/Azure. The default instance image uses the Lambda Stack repository, which is designed for deep learning (PyTorch/TensorFlow/Caffe/drivers pre-installed), saving you time upfront.

If you're still using EC2, you should just switch to us; your monthly bill will thank you.

https://lambdalabs.com/service/gpu-cloud


OP here. Probably some form of container orchestration system like Kubernetes to manage the deployments + resources. There is also specialized ML tooling like Kubeflow to help out. It's still a developing landscape, though.

We made BudgetML not to replace the above but to make it easy for scenarios where the need would be to just "get out there and deploy". Do you see value in that too?


OVH does, but they don't put it front and centre. It's an optional add-on when you're configuring your server.


What kind of fraud wrt coin mining does a GPU make possible?


Card fraud. Use a stolen card to mine Bitcoin and run like hell. You can't reverse the crypto, so it doesn't matter if the card gets detected a day later.


Thanks everyone for the great recommendations!


How does this compare to the MLflow Models project? I like MLflow's relatively full ML life-cycle management, but would also like something simpler.

https://mlflow.org/docs/latest/models.html


How long have you used MLflow, and for what kind of models? Are you using it personally, or as a team?

As far as I can tell, it cannot usefully deploy models that take high-dimensional input. It can log these models, and it can deploy them, but the "deployed" model can only accept two-dimensional data.
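
To illustrate (a rough sketch with a placeholder run id; pyfunc's predict takes a DataFrame, which is the same tabular interface the scoring server wraps, so each sample has to be one flat row):

    import mlflow.pyfunc
    import pandas as pd

    model = mlflow.pyfunc.load_model("runs:/<run_id>/model")  # placeholder URI

    # Tabular (2D) input works; a sample that is itself a 3D/4D tensor doesn't fit this interface
    df = pd.DataFrame([[0.1, 0.2, 0.3]], columns=["f0", "f1", "f2"])
    print(model.predict(df))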

This is an open issue[0] with open pull requests. We're experimenting with MLflow for our platform (https://iko.ai), but we automatically detect models, metrics, and parameters and then track that with MLflow, instead of having people pollute their notebook code with MLflow code. We don't like leaking that detail (tracking code) into the notebook.

We circumvented these issues (https://iko.ai/docs/appbook/#deploying-a-model). The models can either be deployed, or you can build a Docker image and push it to a registry and do whatever you want with it (run a container and invoke the model's endpoint, for example).

- [0]: https://github.com/mlflow/mlflow/issues/3570


Yes, this is aimed at being a simpler API rather than a complicated one. Also, MLflow does not orchestrate self-starting preemptible instances, so you'd have to pay the full price.


Thank you htahir111 for mentioning ZenML! Let's integrate BudgetML -> https://github.com/maiot-io/zenml


Hope you two aren't self-promoting your own project (this comment sounds as though htahir111 has no relation to ZenML, while in the comment linked below htahir111 talks about ZenML as "we").

https://news.ycombinator.com/item?id=25989317


Yeah, good catch. We're different teams. Sorry, didn't wanna fool you :-)


I think every enterprise tries to build this themselves and ends up buying a very expensive platform when their internal project fails. Thanks for creating this.


A cost-effective way of deploying models on AWS is to use 'zappa' (auto-templating for Lambda + API Gateway).

This just works, and costs my team less for a year than one week of SageMaker or Kubernetes-based architectures.

https://www.corbettanalytics.com/deploy-machine-learning-mod...
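
If it helps, a minimal sketch of the shape of this: a small Flask app that Zappa wraps into Lambda + API Gateway (the model file and route are made-up placeholders, and you'd still run `zappa init` / `zappa deploy` on top):

    # app.py
    from flask import Flask, request, jsonify
    import joblib

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # loaded once per warm Lambda container

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]
        pred = model.predict([features])
        return jsonify(prediction=pred.tolist())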


If you have a large model, in excess of 500 MB, any advice on the next best way of deploying it as cost-effectively as possible?


I was searching for tools like this a few days ago. Trying it now.


If you want to do "budget ML" for training or long-running GPU jobs, you should check out Lambda GPU cloud.

Our on-demand instances are 50% the cost of GCP/AWS. Our 8x V100 instances are $12/hr and our 1x Quadro RTX 6000 instances are $1.25/hr. They come set up with TensorFlow/PyTorch/Jupyter notebook so you don't need to do much DevOps work to get started.

https://lambdalabs.com/service/gpu-cloud


Hey @sabalaba, that sounds pretty interesting. Would you be willing to integrate with ZenML? https://github.com/maiot-io/zenml . We're looking for solutions like that which have synergies with ours.



