You already need very high-end hardware to run useful local LLMs, so I don't know if a 200 GB vector database would be the dealbreaker in that scenario. But I wonder how small you could get it with compression and quantization on top.
I've worked in other domains my whole career, so I was astonished this week when we put a million 768-dimensional embeddings into a vector DB and it was only a few GB. Napkin math said ~25 GB, and intuition said a long list of widely distributed floats would be fairly incompressible. HNSW is pretty cool.
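For a rough sense of scale, here is the kind of back-of-the-envelope arithmetic involved (my own numbers, not the parent's; a real index like HNSW adds graph overhead on top of the raw vectors):

```python
# Rough storage arithmetic for 1M x 768-dimensional embeddings.
# These figures are illustrative only and ignore index overhead.
n_vectors, dims = 1_000_000, 768

sizes = {
    "float32": n_vectors * dims * 4,   # 4 bytes per dimension, uncompressed
    "int8":    n_vectors * dims * 1,   # scalar quantization to 1 byte per dimension
    "PQ-96":   n_vectors * 96,         # product quantization at e.g. 96 bytes per vector
}

for label, nbytes in sizes.items():
    print(f"{label:8s} ~{nbytes / 2**30:.2f} GiB")
# float32  ~2.86 GiB
# int8     ~0.72 GiB
# PQ-96    ~0.09 GiB
```

So even before any index-level tricks, quantization alone answers a lot of the "how small could you get it" question.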
You can already do A LOT with an SLM running on commodity consumer hardware. Also it's important to consider that the bigger an embedding is, the more bandwidth you need to use it at any reasonable speed. And while storage may be "cheap", memory bandwidth absolutely is not.
> You already need very high end hardware to run useful local LLMs
A basic MacBook can run gpt-oss-20b and it's quite useful for many tasks. And fast. Of course, Macs have a huge advantage for local LLM inference due to their unified memory architecture.
The mid-spec 2025 iPhone can run “useful local LLMs” yet has 256GB of total storage.
(Sure, this is a spec distortion due to Apple’s market-segmentation tactics, but due to the sheer install-base, it’s still a configuration you might want to take into consideration when talking about the potential deployment-targets for this sort of local-first tech.)
It's fun. I think it needs queues for different game modes, because with 150 players you almost always get mobbed by neighbours. Being able to queue for a team game would make it a bit easier to learn, I think.
I've had this too, especially it getting stuck at the very end and just... never finishing. Once the usage-based billing comes into effect I think I'll try Cursor again.
What local models are you using? The local models I tried for autocomplete were unusable, though based on Aider's benchmark I never really tried with larger models for chat. If I could, I would love to go local-only instead.
I've been digitising family photos using this. I scanned the photo itself and the text written on it, then passed both to an LLM for OCR and used tool calls to get the caption verbatim, the location mentioned, and the date in a standard format. That was going to be the end of it, but the OpenAI docs https://platform.openai.com/docs/guides/function-calling?lan... suggest letting the model guess coordinates instead of just grabbing names, so I did both and it was impressive. My favourite was a picture looking out to sea from a pier, where it pinpointed the exact pier.
I showed the model a picture and any text written on that picture and asked it to guess a latitude/longitude using the tool use API for structured outputs.
That was in addition to having it transcribe the handwritten text and extract location names, which was my original goal until I saw how good it was at guessing exact coordinates. It would guess within ~200 km on average, even on pictures with no information written on them.
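For anyone curious what that looks like in practice, here's a minimal sketch using the OpenAI function-calling API; the model choice, field names, and the `photo_b64` variable are my own placeholders, not details from the comment above:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema: field names are illustrative, not from the comment above.
tools = [{
    "type": "function",
    "function": {
        "name": "record_photo_metadata",
        "description": "Record metadata extracted from a scanned family photo.",
        "parameters": {
            "type": "object",
            "properties": {
                "caption": {"type": "string", "description": "Handwritten caption, transcribed verbatim"},
                "location_name": {"type": "string", "description": "Place name mentioned, if any"},
                "date": {"type": "string", "description": "Date in YYYY-MM-DD format, if determinable"},
                "latitude": {"type": "number", "description": "Best-guess latitude of the photo location"},
                "longitude": {"type": "number", "description": "Best-guess longitude of the photo location"},
            },
            "required": ["caption"],
        },
    },
}]

# photo_b64 is assumed to hold the base64-encoded scan.
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the caption verbatim and guess where the photo was taken."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
        ],
    }],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_photo_metadata"}},
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["caption"], args.get("latitude"), args.get("longitude"))
```

The nice part is that latitude/longitude come back as plain numbers in the tool-call arguments, so there's no parsing of free-form text.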
The ACLs might look a bit scary at first, but they are actually quite intuitive once you've written a rule or two.
It basically works by tagging machines (especially those deployed with an API key) and grouping users. Then you set up rules specifying which groups and tags are allowed to communicate with each other on specific ports. Since the default is DENY, you only need to write rules for the communication you actually want to allow.
For instance, you would create a tag for `servers` and a group for `sre`. Then you set up an ACL rule like this to allow SRE to SSH into the servers:
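(A sketch of what that could look like in the tailnet policy file; the user emails are placeholders.)

```jsonc
{
  // Who may apply the tag, and who is in the group (placeholder users).
  "tagOwners": { "tag:servers": ["group:sre"] },
  "groups":    { "group:sre": ["alice@example.com", "bob@example.com"] },

  // Allow members of group:sre to reach port 22 on machines tagged tag:servers.
  // Everything not explicitly accepted is denied.
  "acls": [
    { "action": "accept", "src": ["group:sre"], "dst": ["tag:servers:22"] }
  ]
}
```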
This looks a lot more impressive than a lot of the GitHub Copilot alternatives I've seen. I wonder how hard it would be to port this to VS Code; using remote models for inline completion always seemed wrong to me, especially with server latency and network issues.
Based on the blog post, this appears to be hosted remotely on Baseten. The model just happens to be released openly, so you can also download it, but the post doesn't mention any intention to help you run it locally within the editor. (I agree that would be cool; I'm just commenting on what I see in the article.)
On the other hand, network latency itself isn't really that big of a deal... a more powerful GPU server in the cloud can typically run so much faster that it can make up for the added network latency and then some. Running locally is really about privacy and offline use cases, not performance, in my opinion.
If you want to try local tab completions, the Continue plugin for VS Code is a good way to do that, but Zeta is the first open model I'm aware of that goes beyond plain fill-in-the-middle (FIM) completion.
I'm stuck using somewhat unreliable Starlink to a datacenter ~90 ms away, but I can run 7B models fine locally. I agree though, cloud completions aren't unusably slow or unreliable for me; it's mostly about privacy and it being really fun.
I tried Continue a few times and could never get consistent results; the models were just too dumb. That's why I'm excited about this model: it seems like a better approach to inline completion and might be the first okay-enough™ model for me. Either way, I don't think I can replace Copilot until a model can automatically fine-tune itself in the background on the code I've written.
> Either way, I don't think I can replace Copilot until a model can automatically fine-tune itself in the background on the code I've written
I don't think Copilot does this... it's really just a matter of the editor plug-in being smart enough to grab all of the relevant context and provide that to the model making the completions; a form of RAG. I believe organizations can pay to fine-tune Copilot, but it sounds more involved than something that happens automatically.
Depending on when you tried Continue last, one would hope that their RAG pipeline has improved over time. I tried it a few months ago and I thought codegemma-2b (base) acting as a code completion model was fine... certainly not as good as what I've experienced with Cursor. I haven't tried GitHub Copilot in over a year... I really should try it again and see how it is these days.
I think it would be more interesting doing this with smaller models (33B-70B) and seeing if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the level of 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this.
Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 tokens/sec; high memory bandwidth is just too expensive. Strix Halo might get you there once it comes out, but it will probably cost significantly more than $1k for 64 GB of RAM.
I guess it will become a bigger issue the longer it's been since they stopped making them, but most people I've heard from (including me) haven't had any issues. Mining rigs don't necessarily wear out GPUs faster, because miners care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad it to keep the cooling under control.
How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while
You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
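As a rough illustration of what partial offload looks like in code (the model path, layer count, and context size below are placeholders, not anything from the parent comment), llama-cpp-python exposes it as a single knob:

```python
# Minimal partial-offload sketch using llama-cpp-python (pip install llama-cpp-python).
# A 70B model has roughly 80 layers, so n_gpu_layers=40 puts about half of them
# into the 3090's 24 GB of VRAM; the remaining layers run on the CPU from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-70b-q4_k_m.gguf",  # a 4-bit quantized GGUF, ~40 GB on disk (placeholder path)
    n_gpu_layers=40,                     # layers offloaded to the GPU; the rest stay on CPU
    n_ctx=4096,                          # context window; the KV cache also consumes memory
)

out = llm("Explain partial offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Everything that doesn't fit in VRAM stays in system RAM and runs on the CPU, which is why even a partial offload helps so much.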
So the more GPU memory we have, the faster it will be, and the model doesn't have to run solely on CPU or GPU; the load can be split between them. Very cool. I think that's how it's running now with my single 4090.
Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64 GB and it can run up to 70B models at very useful speeds. Newer M-series chips are going to be faster, and they offer more RAM now.
Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?
For non-portables, I bet a big desktop or server CPU with fast RAM beats the Mac Mini and Studio on price/performance, but I'd be curious to see benchmarks comparing a fast many-core CPU to a large M-series GPU with unified RAM.
As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + Open WebUI at ~35 tok/s in an OpenAI-like web app, basically as fast as 4o.
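If you'd rather script against that setup than use the web UI, the same model can also be hit through Ollama's Python client; a minimal sketch, assuming the model has already been pulled:

```python
# Talk to a locally served Ollama model from Python instead of the web UI.
# Assumes `pip install ollama` and that `ollama pull deepseek-r1:32b` has been run.
import ollama

response = ollama.chat(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```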
That's 1.2 t/s for the 14B Qwen fine-tune, not the real R1. Unless you add a GPU at extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.
I put together a $350 build with a 3060 12GB and it's still my favorite build. I run Llama 3.2 11B Q4 on it; it's a really efficient way to get started and the tok/s is great.
That would work if inference on the Pi were faster. Right now it takes about 2.5 s per image, and the planes are in view for maybe 3 s. By the time the next frame is fetched, the plane is already out of view.
I've had the opposite experience: I tried continue.dev and for me it doesn't come close to Copilot, especially with Copilot Chat offering o1-preview and Sonnet 3.5 so cheaply that I might single-handedly bankrupt Microsoft (we can hope). But I tried it before those were available, and the inline completions were laughably bad in comparison.
I used the recommended models and couldn't figure it out. I assume I did something wrong, but I followed the docs and triple-checked everything. It'd be nice to use the GPU I have locally for faster completions and privacy; I just haven't found a way to do that.
The last couple of times I tried Continue it felt like "Step 1" in someone's business plan: bulky and seconds away from converting into a paid subscription model.
Additionally, I've tried a bunch of these (even the same models, etc) and they've all sucked compared to Copilot. And believe me, I want that local-hosted sweetness. Not sure what I'm doing wrong when others are so excited by it.
I just tried Continue and it was death by 1000 paper cuts. And by that I mean 1000 accept/reject blocks.
And at some point I asked it to change a pretty large file in some way. It started processing, very, very slowly, and I couldn't figure out a way to stop it. I had to restart VS Code because it was still changing the file 10 minutes later.
Copilot was also very slow when I tried it yesterday but at least there was a clear way to stop it.