Efficiency is the expected answer. I'm just wondering if there's a more theoretical reason, such as "every function that can be computed by a non-layered acyclic network can be computed by a complete layered network using only a small number of extra nodes/layers."
I think that it can. With some weights of 0 and some weights of 1, you can map 'jumps' that skip from a node in one layer to a node a couple of layers away by routing them through intermediate relay nodes that take in no other inputs, right? One catch: the sigmoid of 1 is about 0.73, not 1, so a sigmoid relay doesn't copy its input exactly; the relays would need an identity (linear) activation to pass values through unchanged. Once you have those, it's just a matter of how many layers you need for any acyclic structure, I think.
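To make the relay idea concrete, here is a minimal sketch in plain Python. The two tiny networks, their weights, and the names `skip_net`, `layered_net`, and `identity` are all made up for illustration; nothing here comes from a particular library. It compares a net with a skip edge against a fully connected layered net whose relay node has weight 1 on the skipped input and weight 0 on everything else.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    return z

def skip_net(x1, x2):
    """Non-layered net: x1 feeds the output directly, skipping the hidden layer."""
    h = sigmoid(0.5 * x1 - 1.0 * x2)      # ordinary hidden node
    return sigmoid(2.0 * h + 3.0 * x1)    # output sees h and the skipped x1

def layered_net(x1, x2, relay_activation):
    """Fully connected layered net; the unwanted edge carries weight 0."""
    h = sigmoid(0.5 * x1 - 1.0 * x2)                # same hidden node as above
    relay = relay_activation(1.0 * x1 + 0.0 * x2)   # relay node: copies x1 only
    return sigmoid(2.0 * h + 3.0 * relay)

x1, x2 = 1.0, 0.3
exact = skip_net(x1, x2)
with_identity_relay = layered_net(x1, x2, identity)
with_sigmoid_relay = layered_net(x1, x2, sigmoid)

print(sigmoid(1.0))                        # roughly 0.731, not 1
print(abs(exact - with_identity_relay))    # the identity relay reproduces the skip edge
print(abs(exact - with_sigmoid_relay))     # the sigmoid relay does not
```

With an identity relay the two nets agree; with a sigmoid relay the copied value is squashed, so the outputs differ.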
Although if you wanted to come up with difficult scenarios, it's not hard to think of structures that would make some of those middle layers really tall, or add a lot of middle layers.
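As a rough way to see how much padding a given structure needs, the sketch below counts relay nodes for an acyclic graph. It rests on two assumptions that are mine, not part of the thread: each node sits on the layer given by its longest path from an input, and all skip edges leaving the same node share one chain of relays. The function names and the example graph are hypothetical.

```python
from functools import lru_cache

def layer_depths(edges, nodes):
    """Layer index of each node = length of the longest path reaching it."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def depth(n):
        return 0 if not preds[n] else 1 + max(depth(p) for p in preds[n])

    return {n: depth(n) for n in nodes}

def relay_count(edges, nodes):
    """Relay nodes needed when each source node shares a single relay chain."""
    d = layer_depths(edges, nodes)
    longest_span = {}
    for u, v in edges:
        # An edge spanning k layers needs k - 1 relays along the way.
        longest_span[u] = max(longest_span.get(u, 1), d[v] - d[u])
    return sum(span - 1 for span in longest_span.values())

# Hypothetical example: a chain a -> b -> c -> d plus a skip edge a -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]

print(layer_depths(edges, nodes))
print(relay_count(edges, nodes))
```

In this example the skip edge a -> d crosses three layers, so it needs one relay on each of the two layers in between. A structure with many long skip edges from different sources would stack up many relays in the same middle layers, which is the "really tall" case.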
As I mentioned in another branch of this thread, selectively choosing edges between nodes isn't an option, because in the standard model you have complete incidence between nodes in adjacent layers.