In science, measuring things until you find a benefit is called p-hacking. Every...

In science, measuring things until you find a benefit is called p-hacking. Every extra test you do that splits your data along a different dimension, is another independent opportunity for "random chance" to look like positive signal.

There is no programming project in existence with enough developers working on it, that developer-productivity data derived from a change to it would not be considered "underpowered" for the sake of proving anything.