Hacker News

I'm getting 56.38 ms per token on my 32GB M1 Max using this code on the 7GB model.

Very usable!
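For a sense of scale, a per-token latency in milliseconds converts directly to tokens per second; a minimal sketch of that arithmetic (the function name is mine, not from the project):

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency (ms) to throughput (tokens/sec)."""
    return 1000.0 / ms_per_token

# 56.38 ms/token works out to roughly 17.7 tokens/sec
print(round(tokens_per_second(56.38), 1))
```

At ~17 tokens/sec the model emits text faster than most people read, which is why the latency feels "very usable" interactively.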




what model are you using?

edit: I mean 7B, 13B, or 30B?


It'll be 7B they're referring to. On my M1 Max 32GB with a 4000-token output request I get 67 ms/token on 7B (4-bit) and 154 ms/token on 13B (4-bit). I've made a tweak to the code to increase the context size, but it doesn't seem to change performance.

  main: mem per token = 22357508 bytes
  main:     load time =  2741.67 ms
  main:   sample time =   156.68 ms
  main:  predict time = 11399.12 ms / 154.04 ms per token
  main:    total time = 14914.39 ms


This was generating 2000 tokens, so maybe it gets slightly faster on longer generation runs?
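The "ms per token" figure in the log is just the predict time divided by the number of tokens timed, so that count can be recovered from the two numbers; a quick check on the quoted run (plain arithmetic, not tied to the project's internals):

```python
predict_ms = 11399.12   # "predict time" from the log
ms_per_token = 154.04   # reported per-token latency

# tokens covered by this timed segment
n_tokens = predict_ms / ms_per_token
print(round(n_tokens))  # about 74 tokens
```

So the quoted timing block covers on the order of 74 predicted tokens, a much shorter segment than the full 2000-token generation it came from.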





