Oh, the repetition issue is only on the non-dynamic quants :) If you do dynamic quantization and use the 1.58-bit dynamic quantized model, the repetition issue fully disappears!
Min_p = 0.05 was a way I found to counteract the 1.58-bit model generating singular incorrect tokens, which happen at a rate of around 1 in every 8000 tokens!
I think most model creators suggest temperatures as high as 0.6-0.7 in their usage examples simply because that's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively, so with min_p I almost always use temp=0.2.
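For anyone curious why min_p needs some temperature: min_p keeps only tokens whose probability is at least min_p times the top token's probability, so temperature has to be applied first for the threshold to mean anything. Here's a rough sketch of the idea (not any particular library's implementation, just the math):

```python
import numpy as np

def min_p_filter(logits, min_p=0.05, temperature=0.2):
    # Temperature is applied BEFORE the min_p cutoff
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    probs /= probs.sum()
    # Keep only tokens with prob >= min_p * (probability of the top token)
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# Toy example: 4-token vocab, one clearly dominant token
logits = np.array([5.0, 4.0, 1.0, -2.0])
probs = min_p_filter(logits, min_p=0.05, temperature=0.2)
# At temp=0.2 the gap between tokens is sharpened, so the rare
# low-probability tokens fall below the cutoff and get zeroed out.
```

This is why a "rare random wrong token" gets suppressed: it sits far below the top token's probability, so the relative cutoff removes it entirely instead of just making it unlikely.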
Ye, lower temp is also good :) Tbh it's all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k-type workloads - i.e. calling the LLM multiple times and aggregating.
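The aggregation part can be as simple as a majority vote over k samples. A minimal sketch (the `sample_fn` here is a stand-in for whatever client call you actually use, with temperature=1.5 and min_p=0.1 set in its sampling params):

```python
from collections import Counter
from itertools import cycle

def aggregate_answers(sample_fn, prompt, k=8):
    # Call the model k times and majority-vote the answers.
    # sample_fn is a hypothetical stand-in for your real LLM call.
    answers = [sample_fn(prompt) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / k  # winning answer + agreement ratio

# Deterministic toy "sampler" for illustration: 3 out of 4 calls agree
_fake = cycle(["42", "42", "42", "41"])
def fake_sampler(prompt):
    return next(_fake)

answer, agreement = aggregate_answers(fake_sampler, "What is 6*7?", k=16)
# answer == "42", agreement == 0.75
```

The high temperature gives you diverse samples so the k calls aren't all identical, while min_p stops the diversity from degenerating into gibberish tokens.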
temp=0 is also good for singular outputs. For classification tasks, it's better to actually inspect the logits.
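For the classification case, "inspecting the logits" just means skipping sampling entirely: restrict attention to the label tokens and pick the highest-probability one. A rough sketch of what I mean (token IDs here are made up for illustration):

```python
import numpy as np

def classify_from_logits(logits, label_token_ids):
    # No sampling: compare the model's next-token logits restricted
    # to the candidate label tokens, and take a softmax over just those.
    label_logits = logits[label_token_ids]
    probs = np.exp(label_logits - label_logits.max())
    probs /= probs.sum()
    return int(label_token_ids[np.argmax(probs)]), probs

# Toy 10-token vocab; pretend "yes" is token id 3 and "no" is id 7
np.random.seed(0)
logits = np.random.randn(10)
logits[3] = 5.0  # force "yes" to have the highest logit
label_ids = np.array([3, 7])
pred, label_probs = classify_from_logits(logits, label_ids)
# pred == 3, and label_probs gives you a calibrated-ish confidence
# instead of a single sampled token
```

This also gives you a confidence score for free, which sampling a single token at temp=0 never does.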
But my go-to is always a min_p of at least 0.01 or 0.05! It vastly suppresses incorrect rare random tokens from being generated, and it helps massively!