One thing I do suspect people are running into is sampling issues. Gemma probably doesn't play well with sampler defaults tuned for Llama, given its 256K vocab.
Many Chinese LLMs have a similar default-sampling issue.
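For anyone hitting this, the usual fix is to set the sampler explicitly instead of inheriting whatever the frontend ships. A minimal sketch with Hugging Face transformers; the model id and the sampling values here are illustrative placeholders, not an official recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute whichever Gemma build you're running.
model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Explain KV caching in one paragraph.", return_tensors="pt")

# Override the sampler explicitly rather than inheriting defaults that were
# tuned for a different model family / vocab size. Values are illustrative.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_k=64,
    top_p=0.95,
    max_new_tokens=256,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```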
But our testing was done at zero temperature with constrained single-token responses, so that shouldn't be an issue.
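To make that concrete, here's roughly what that kind of eval looks like (a sketch of the general technique, not our actual harness; the A/B/C/D answer set is an assumed example). Reusing `model` and `tokenizer` from the sketch above, you score only the candidate answer tokens and take the argmax, so sampler settings never enter the picture:

```python
import torch

# Reuses `model` and `tokenizer` from the sketch above.
prompt = "Question: ...\nAnswer with a single letter (A, B, C, or D): "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_logits = model(**inputs).logits[0, -1]  # logits for the next token only

# Restrict to the four answer tokens and take the argmax. This is
# temperature 0 by construction: no sampler setting can change the result.
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in "ABCD"]
answer = "ABCD"[int(torch.argmax(next_logits[choice_ids]))]
print(answer)
```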