I suspect OpenAI’s advantage is their ability to synthesize a good fine-tuning dataset. My question is whether this is leaking data from the fine-tuning dataset or from the initial training of the base model. The base model's training data is likely nothing special.