It'll be 7B they're referring to. On my M1 Max (32 GB), with a 4000-token output request I get 67 ms/token on 7B (4-bit) and 154 ms/token on 13B (4-bit). I've made a tweak to the code to increase the context size, but it doesn't seem to change performance.
main: mem per token = 22357508 bytes
main: load time = 2741.67 ms
main: sample time = 156.68 ms
main: predict time = 11399.12 ms / 154.04 ms per token
main: total time = 14914.39 ms
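As a sanity check on those numbers (a quick back-of-the-envelope sketch, not part of the original tool's output): dividing predict time by ms/token recovers the token count, and inverting ms/token gives throughput.

```python
# Timings copied from the 13B (4-bit) run above.
predict_ms = 11399.12   # total predict time, ms
per_token_ms = 154.04   # reported ms per token

# Number of tokens actually predicted in this run.
tokens = predict_ms / per_token_ms
print(f"tokens predicted: {tokens:.0f}")        # ~74 tokens

# Throughput in tokens per second.
tok_per_s = 1000.0 / per_token_ms
print(f"throughput: {tok_per_s:.2f} tok/s")     # ~6.49 tok/s
```

So the 154 ms/token figure works out to roughly 6.5 tokens/s on 13B, versus roughly 15 tokens/s at 67 ms/token on 7B.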
Very usable!