
genpfault 3 months ago | on: Qwen3: Think deeper, act faster

~18.6 GiB, according to nvtop.

ollama 0.6.6 invoked with:

    # server
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

    # client
    ollama run --verbose qwen3:30b-a3b

~19.8 GiB with:

    /set parameter num_ctx 32768
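For anyone scripting this rather than using the interactive client, here is a minimal sketch of the same request made through Ollama's REST API. It assumes the server is listening on the default local address and that the model has already been pulled. The helper names (`build_request`, `tokens_per_second`, `generate`) are made up for illustration. `num_ctx` is passed through the request's `options` field, and the throughput figure is derived from the `eval_count` and `eval_duration` fields of the response.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="qwen3:30b-a3b", num_ctx=32768):
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed from a response; eval_duration is in nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(prompt, **kwargs):
    """Send the request to the local server and return the parsed JSON reply."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `generate("Why is the sky blue?")` against a running server and passing the result to `tokens_per_second` should give a number comparable to the eval rate that `ollama run --verbose` prints.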

tgtweak 3 months ago

Very nice, should run nicely on a 3090 as well.

TY for this.

Update: wow, it's quite fast: 70-80 t/s in LM Studio with a few other applications using the GPU.
