
As a random thought, this seems to be about the same order of magnitude of compute as Karpathy's recent GPT-2 work:

https://github.com/karpathy/llm.c/discussions/677

You could take the final checkpoint from that page, run it for some additional steps, and see if it improves. You could also publish the final checkpoint and training curves - someone might find them useful.
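A rough sketch of what that continued training could look like, assuming the llm.c checkpoint has already been exported to a HuggingFace-compatible GPT-2 format (llm.c uses its own binary checkpoint layout, so some conversion step is needed first). The checkpoint path, learning rate, and training data below are all placeholders:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Hypothetical path to an llm.c checkpoint converted to HF format.
    model = GPT2LMHeadModel.from_pretrained("path/to/exported_gpt2_checkpoint")
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()

    # Placeholder data: each batch is a chunk of raw training text.
    my_text_batches = ["replace this with your own training text ..."]

    # Run a handful of additional steps from the final checkpoint.
    for step, batch_text in enumerate(my_text_batches):
        inputs = tokenizer(batch_text, return_tensors="pt", truncation=True)
        # For causal LM training, labels are the input ids; the model
        # shifts them internally when computing the loss.
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

You'd want to log the loss each step and compare against the published training curve to see whether the extra steps actually buy anything.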


