Those models aren't worth much if you don't test them, so these efforts will have to go hand in hand. The way it tends to go now is RCT -> try to explain findings by hypothesizing about mechanisms -> think about an RCT that can test this hypothesis -> repeat. It's a slow and expensive process and not always successful, but I would say that it's progress.
As a generous guess, at most 40 of these models are wrong, since they all describe basically the same few events with vastly different mechanisms. I guess that the majority of interesting but wrong models is what you are after?
The space of incorrect-but-reasonable models is infinitely larger than the space of correct models. The way to distinguish between the two is experiment. Experiment is therefore the much more important side of the balance.