Hacker News

I'm getting 56.38 ms per token on my 32GB M1 Max using this code on the 7GB model.

Very usable!
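For a sense of scale, a per-token latency in milliseconds converts directly to tokens per second; a minimal sketch of that arithmetic (the function name is mine, not from the project):

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency (ms) to throughput (tokens/sec)."""
    return 1000.0 / ms_per_token

# 56.38 ms/token works out to roughly 17.7 tokens/sec
print(round(tokens_per_second(56.38), 1))
```

At ~17 tokens/sec the model emits text faster than most people read, which is why the latency feels "very usable" interactively.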




what model are you using?

edit: I mean 7B, 13B, or 30B?


It'll be 7B they're referring to. On my M1 Max 32GB with a 4000-token output request I get 67 ms/token on 7B (4-bit) and 154 ms/token on 13B (4-bit). I've made a tweak to the code to increase the context size, but it doesn't seem to change performance.

  main: mem per token = 22357508 bytes
  main:     load time =  2741.67 ms
  main:   sample time =   156.68 ms
  main:  predict time = 11399.12 ms / 154.04 ms per token
  main:    total time = 14914.39 ms


This was generating 2000 tokens, so maybe it gets slightly faster on longer generation runs?
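The "ms per token" figure in the log is just the predict time divided by the number of tokens timed, so that count can be recovered from the two numbers; a quick check on the quoted run (plain arithmetic, not tied to the project's internals):

```python
predict_ms = 11399.12   # "predict time" from the log
ms_per_token = 154.04   # reported per-token latency

# tokens covered by this timed segment
n_tokens = predict_ms / ms_per_token
print(round(n_tokens))  # about 74 tokens
```

So the quoted timing block covers on the order of 74 predicted tokens, a much shorter segment than the full 2000-token generation it came from.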





