Usually the adjacency between consecutive layers is complete (each layer is fully connected to the next), so your modification isn't quite without loss of generality.
I also cannot imagine a situation in which you'd know that much detail, but using a general network would allow one to, for example, dynamically modify the topology of the graph (as real neural networks do regularly).
EDIT: I guess what I'm asking for is a rigorous proof that the two models are equivalent with as little overhead as you say there should be.
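To make concrete what such an equivalence argument would have to account for: any general feedforward (acyclic) network can be layered by assigning each node the length of the longest path reaching it, but edges that then skip a layer need dummy pass-through units, and that is exactly where overhead can creep in. Here is a minimal sketch of the layer-assignment step; the node names and edge list are made up for illustration.

```python
# Sketch: layering a general feedforward (DAG) network via longest-path
# layer assignment (Kahn's algorithm with a max update). Illustrative only.
from collections import defaultdict

def layer_assignment(edges):
    """Assign each node the length of the longest path reaching it,
    so every edge goes from a lower layer to a strictly higher one."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    layer = {n: 0 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]  # sources sit at layer 0
    while ready:
        u = ready.pop()
        for v in succs[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return layer

# In this example the edge a -> d skips layers 1 and 2, so a strictly
# layered version of the network needs two dummy pass-through units
# to carry a's value forward -- that is the overhead in question.
layers = layer_assignment([("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")])
assert layers == {"a": 0, "b": 1, "c": 2, "d": 3}
```

Counting how many dummy units this insertion costs in the worst case would be one way to make the claimed overhead precise.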