In my experiments with the DeepSeek Qwen-32B distill model, it did not follow the edit instructions: the output format was wrong. I know the distill models are not at all the same as the full model, but that could provide a clue. Combine that with the scores and you have a reasonable hypothesis.