You already need very high-end hardware to run useful local LLMs, so I don't know if a 200 GB vector database would be the dealbreaker in that scenario. But I wonder how small you could get it with compression and quantization on top.
I've worked in other domains my whole career, so I was astonished this week when we put a million 768-dimensional embeddings into a vector DB and it was only a few GB. Napkin math said ~25 GB, and intuition said a long list of widely distributed floats would be fairly incompressible. HNSW is pretty cool.
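For a rough sense of scale, here is the kind of back-of-the-envelope arithmetic involved (my own numbers, not the parent's; a real index like HNSW adds graph overhead on top of the raw vectors):

```python
# Rough storage arithmetic for 1M x 768-dimensional embeddings.
# These figures are illustrative only and ignore index overhead.
n_vectors, dims = 1_000_000, 768

sizes = {
    "float32": n_vectors * dims * 4,   # 4 bytes per dimension, uncompressed
    "int8":    n_vectors * dims * 1,   # scalar quantization to 1 byte per dimension
    "PQ-96":   n_vectors * 96,         # product quantization at e.g. 96 bytes per vector
}

for label, nbytes in sizes.items():
    print(f"{label:8s} ~{nbytes / 2**30:.2f} GiB")
# float32  ~2.86 GiB
# int8     ~0.72 GiB
# PQ-96    ~0.09 GiB
```

So even before any index-level tricks, quantization alone answers a lot of the "how small could you get it" question.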
You can already do A LOT with an SLM running on commodity consumer hardware. Also it's important to consider that the bigger an embedding is, the more bandwidth you need to use it at any reasonable speed. And while storage may be "cheap", memory bandwidth absolutely is not.
> You already need very high end hardware to run useful local LLMs
A basic MacBook can run gpt-oss-20b and it's quite useful for many tasks. And fast. Of course, Macs have a huge advantage for local LLM inference due to their unified memory architecture.
The mid-spec 2025 iPhone can run “useful local LLMs” yet has 256GB of total storage.
(Sure, this is a spec distortion due to Apple’s market-segmentation tactics, but due to the sheer install-base, it’s still a configuration you might want to take into consideration when talking about the potential deployment-targets for this sort of local-first tech.)
It's fun. I think it needs queues for different game modes, because with 150 players you almost always get mobbed by neighbours. Being able to queue for a team game would make it a bit easier to learn, I think.
I've had this too, especially it getting stuck at the very end and just... never finishing. Once the usage-based billing comes into effect I think I'll try Cursor again.
What local models are you using? The local models I tried for autocomplete were unusable, though based on Aider's benchmark I never really tried with larger models for chat. If I could, I would love to go local-only instead.
I've been digitising family photos using this. I scanned the photo itself and the text written on it, then passed both to an LLM for OCR and used tool calls to get the caption verbatim, the location mentioned, and the date in a standard format. That was going to be the end of it, but the OpenAI docs https://platform.openai.com/docs/guides/function-calling?lan... suggest letting the model guess coordinates instead of just grabbing names, so I did both and it was impressive. My favourite was a picture looking out to sea from a pier, where it pinpointed the exact pier.
I showed the model a picture and any text written on that picture and asked it to guess a latitude/longitude using the tool use API for structured outputs.
That was in addition to having it transcribe the handwritten text and extract location names, which was my original goal until I saw how good it was at guessing exact coordinates. It would guess within ~200 km on average, even on pictures with no information written on them.
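For anyone curious what that looks like in practice, here's a minimal sketch using the OpenAI function-calling API; the model choice, field names, and the `photo_b64` variable are my own placeholders, not details from the comment above:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical schema: field names are illustrative, not from the comment above.
tools = [{
    "type": "function",
    "function": {
        "name": "record_photo_metadata",
        "description": "Record metadata extracted from a scanned family photo.",
        "parameters": {
            "type": "object",
            "properties": {
                "caption": {"type": "string", "description": "Handwritten caption, transcribed verbatim"},
                "location_name": {"type": "string", "description": "Place name mentioned, if any"},
                "date": {"type": "string", "description": "Date in YYYY-MM-DD format, if determinable"},
                "latitude": {"type": "number", "description": "Best-guess latitude of the photo location"},
                "longitude": {"type": "number", "description": "Best-guess longitude of the photo location"},
            },
            "required": ["caption"],
        },
    },
}]

# photo_b64 is assumed to hold the base64-encoded scan.
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the caption verbatim and guess where the photo was taken."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
        ],
    }],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_photo_metadata"}},
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["caption"], args.get("latitude"), args.get("longitude"))
```

The nice part is that latitude/longitude come back as plain numbers in the tool-call arguments, so there's no parsing of free-form text.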
The ACLs might look a bit scary at first, but they are actually quite intuitive once you've written a rule or two.
It basically works by tagging machines (especially those deployed with an API key) and grouping users. Then you set up rules specifying which groups and tags are allowed to communicate with each other on specific ports. Since the default is DENY, you only need to write rules for the communication you actually want to allow.
For instance, you would create a tag for `servers` and a group for `sre`. Then you set up an ACL rule like this to allow SRE to SSH into the servers:
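(A sketch of what that could look like in the tailnet policy file; the user emails are placeholders.)

```jsonc
{
  // Who may apply the tag, and who is in the group (placeholder users).
  "tagOwners": { "tag:servers": ["group:sre"] },
  "groups":    { "group:sre": ["alice@example.com", "bob@example.com"] },

  // Allow members of group:sre to reach port 22 on machines tagged tag:servers.
  // Everything not explicitly accepted is denied.
  "acls": [
    { "action": "accept", "src": ["group:sre"], "dst": ["tag:servers:22"] }
  ]
}
```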
This looks a lot more impressive than a lot of the GitHub Copilot alternatives I've seen. I wonder how hard it would be to port this to VS Code; using remote models for inline completion always seemed wrong to me, especially with server latency and network issues.
Based on the blog post, this appears to be hosted remotely on Baseten. The model just happens to be released openly, so you can also download it, but the post doesn't mention any intention to help you run it locally within the editor. (I agree that would be cool; I'm just commenting on what I see in the article.)
On the other hand, network latency itself isn't really that big of a deal... a more powerful GPU server in the cloud can typically run so much faster that it can make up for the added network latency and then some. Running locally is really about privacy and offline use cases, not performance, in my opinion.
If you want to try local tab completions, the Continue plugin for VS Code is a good way to do that, but Zeta is the first open model I'm aware of that goes beyond plain fill-in-the-middle (FIM) completion.
I'm stuck using somewhat unreliable Starlink to a datacenter ~90 ms away, but I can run 7B models fine locally. I agree though, cloud completions aren't unusably slow or unreliable for me; it's mostly about privacy and it being really fun.
I tried Continue a few times and could never get consistent results; the models were just too dumb. That's why I'm excited about this model: it seems like a better approach to inline completion and might be the first okay-enough™ model for me. Either way, I don't think I can replace Copilot until a model can automatically fine-tune itself in the background on the code I've written.
> Either way, I don't think I can replace Copilot until a model can automatically fine-tune itself in the background on the code I've written
I don't think Copilot does this... it's really just a matter of the editor plug-in being smart enough to grab all of the relevant context and provide that to the model making the completions; a form of RAG. I believe organizations can pay to fine-tune Copilot, but it sounds more involved than something that happens automatically.
Depending on when you tried Continue last, one would hope that their RAG pipeline has improved over time. I tried it a few months ago and I thought codegemma-2b (base) acting as a code completion model was fine... certainly not as good as what I've experienced with Cursor. I haven't tried GitHub Copilot in over a year... I really should try it again and see how it is these days.
I think it would be more interesting doing this with smaller models (33B-70B) and seeing if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the level of 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this.
Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 tokens/sec; high memory bandwidth is just too expensive. Strix Halo might get you there once it comes out, but it will probably cost significantly more than $1k for 64 GB of RAM.
I guess it will become a bigger issue the longer it's been since they stopped making them, but most people I've heard from (including me) haven't had any issues. Mining rigs don't necessarily wear out GPUs faster, because miners care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad it to keep the cooling under control.
How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while
You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
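As a rough illustration of what partial offload looks like in code (the model path, layer count, and context size below are placeholders, not anything from the parent comment), llama-cpp-python exposes it as a single knob:

```python
# Minimal partial-offload sketch using llama-cpp-python (pip install llama-cpp-python).
# A 70B model has roughly 80 layers, so n_gpu_layers=40 puts about half of them
# into the 3090's 24 GB of VRAM; the remaining layers run on the CPU from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-70b-q4_k_m.gguf",  # a 4-bit quantized GGUF, ~40 GB on disk (placeholder path)
    n_gpu_layers=40,                     # layers offloaded to the GPU; the rest stay on CPU
    n_ctx=4096,                          # context window; the KV cache also consumes memory
)

out = llm("Explain partial offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Everything that doesn't fit in VRAM stays in system RAM and runs on the CPU, which is why even a partial offload helps so much.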
So the more GPU memory we have, the faster it will be, and the model doesn't have to run solely on CPU or GPU; the load can be split between them. Very cool. I think that's how it's running now with my single 4090.
Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64 GB and it can run up to 70B models at very useful speeds. Newer M-series chips are going to be faster, and they offer more RAM now.
Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?
For non-portables, I bet a big desktop or server CPU with fast RAM beats the Mac Mini and Studio on price/performance, but I'd be curious to see benchmarks comparing a fast many-core CPU to a large M-series GPU with unified RAM.
As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + Open WebUI at ~35 tok/s in an OpenAI-like web app, basically as fast as 4o.
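If you'd rather script against that setup than use the web UI, the same model can also be hit through Ollama's Python client; a minimal sketch, assuming the model has already been pulled:

```python
# Talk to a locally served Ollama model from Python instead of the web UI.
# Assumes `pip install ollama` and that `ollama pull deepseek-r1:32b` has been run.
import ollama

response = ollama.chat(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```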
That's 1.2 t/s for the 14B Qwen fine-tune, not the real R1. Unless you add a GPU at extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.
I put together a $350 build with a 3060 12GB and it's still my favorite build. I run Llama 3.2 11B Q4 on it; it's a really efficient way to get started and the tok/s is great.
That would work if inference on the Pi were faster. Right now it takes about 2.5 s per image, and the planes are in view for maybe 3 s. By the time the next frame is fetched, the plane is already out of view.
I've had the opposite experience: I tried continue.dev and for me it doesn't come close to Copilot, especially with Copilot Chat offering o1-preview and Sonnet 3.5 so cheaply that I might single-handedly bankrupt Microsoft (we can hope). But I tried it before those were available, and the inline completions were laughably bad in comparison.
I used the recommended models and couldn't figure it out. I assume I did something wrong, but I followed the docs and triple-checked everything. It'd be nice to use the GPU I have locally for faster completions and privacy; I just haven't found a way to do that.
The last couple of times I tried Continue it felt like "Step 1" in someone's business plan: bulky and seconds away from converting into a paid subscription model.
Additionally, I've tried a bunch of these (even the same models, etc) and they've all sucked compared to Copilot. And believe me, I want that local-hosted sweetness. Not sure what I'm doing wrong when others are so excited by it.
I just tried Continue and it was death by 1000 paper cuts. And by that I mean 1000 accept/reject blocks.
And at some point I asked it to change a pretty large file in some way. It started processing, very, very slowly, and I couldn't figure out a way to stop it. I had to restart VS Code because it was still changing the file 10 minutes later.
Copilot was also very slow when I tried it yesterday but at least there was a clear way to stop it.