"Chef works atop ssh, which – while the gold standard for cryptographically secure systems management – is computationally expensive to the point where most master servers fall over under the weight of 700-1500 clients. Salt’s approach was far simpler."
Does that assertion about Chef somehow not apply to Ansible?
On the use case:
"I have this command I want to run across 1,000 servers. I want the command to run on all of those systems within a five second window. It failed on three of them, and I need to know which three."
Well, Ansible by default runs with paramiko, which is a Python implementation of the SSH protocol. It also keeps connections open across multiple commands, has a pull mode, and has a fireball mode which uses 0mq.
However, you're not forced to use this. In the beginning, you can just seed your CentOS or Debian install with a Kickstart or preseed file and then do your initial run with Ansible simply over SSH (using all the goodies: ssh-agent, passwordless SSH, etc.).
One huge plus for Ansible is also that it uses YAML, which is rather simple. I've been following both projects for more than a year, and it seems that Ansible has picked up a lot recently and will probably win the "race" (IMO).
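To make the quoted use case concrete (run one command everywhere within a time window, then name exactly which hosts failed), here is a minimal sketch of the fan-out-and-report pattern in plain Python. It is not how Ansible or Salt implement it; `fan_out`, `fake_run`, and the host names are hypothetical, and `fake_run` only simulates a remote call instead of opening a real SSH or 0mq connection.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fan_out(hosts, run_on_host, window_seconds=5.0, max_workers=64):
    """Call run_on_host(host) for every host; return {host: reason} for failures."""
    failures = {}
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(run_on_host, host): host for host in hosts}

    # Block until everything finishes or the time window closes.
    done, not_done = wait(list(futures), timeout=window_seconds)

    for fut in done:
        exc = fut.exception()
        if exc is not None:
            failures[futures[fut]] = repr(exc)

    # Anything still pending missed the window and counts as a failure.
    for fut in not_done:
        fut.cancel()
        failures[futures[fut]] = "timed out"

    # Do not wait for stragglers (cancel_futures needs Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    return failures


def fake_run(host):
    # Hypothetical stand-in for a real remote call (SSH, 0mq, ...).
    if host in {"web017", "web404", "web981"}:
        raise RuntimeError("exit status 1")
    return 0


hosts = [f"web{i:03d}" for i in range(1000)]
failed = fan_out(hosts, fake_run)
print(sorted(failed))
```

With a real transport, the cost per host (handshake, crypto, process spawn) is what decides whether 1,000 hosts fit in the window; the bookkeeping itself is the cheap part.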
Salt also uses YAML for its configuration backend (by default). You can also write your states in Python if you prefer, with all the power that brings (including pulling data from databases, remote API calls, or whatever you like).
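As a rough illustration of a Python-written state: Salt's `py` renderer expects an `.sls` file that starts with `#!py` and defines a `run()` function returning the same data structure a YAML state would produce. The sketch below assumes that renderer; the package list and the `install_*` state IDs are made up for the example, and in practice the list could come from a database or an API call.

```python
#!py
# Sketch of a Salt state written for the `py` renderer.
# PACKAGES is a hypothetical, hard-coded stand-in for data you might
# fetch from a database or remote API instead.

PACKAGES = ["vim", "git", "htop"]


def run():
    """Return state data: one pkg.installed state per package."""
    states = {}
    for name in PACKAGES:
        states[f"install_{name}"] = {"pkg.installed": [{"name": name}]}
    return states
```

The returned dictionary is equivalent to three small YAML states, each with its own ID and a `pkg.installed` declaration.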
I'm going to go out on a limb here and say that the 700-1500 client limitation is a non-problem for the vast majority. It's like the C10K "problem" all over again. Newsflash: most shops don't have more than a few dozen servers, if that, and the few that do must have already done their homework.