Usually the adjacency between consecutive layers is complete (each layer is fully connected to the next), so your modification isn't quite without loss of generality.
I also cannot imagine a situation in which you'd know that much detail, but using a general network would allow one to, for example, dynamically modify the topology of the graph (as real neural networks do regularly).
EDIT: I guess what I'm asking for is a rigorous proof that the two models are equivalent with as little overhead as you say there should be.
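To make concrete what such an equivalence argument would have to account for: any general feedforward (acyclic) network can be layered by assigning each node the length of the longest path reaching it, but edges that then skip a layer need dummy pass-through units, and that is exactly where overhead can creep in. Here is a minimal sketch of the layer-assignment step; the node names and edge list are made up for illustration.

```python
# Sketch: layering a general feedforward (DAG) network via longest-path
# layer assignment (Kahn's algorithm with a max update). Illustrative only.
from collections import defaultdict

def layer_assignment(edges):
    """Assign each node the length of the longest path reaching it,
    so every edge goes from a lower layer to a strictly higher one."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    layer = {n: 0 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]  # sources sit at layer 0
    while ready:
        u = ready.pop()
        for v in succs[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return layer

# In this example the edge a -> d skips layers 1 and 2, so a strictly
# layered version of the network needs two dummy pass-through units
# to carry a's value forward -- that is the overhead in question.
layers = layer_assignment([("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")])
assert layers == {"a": 0, "b": 1, "c": 2, "d": 3}
```

Counting how many dummy units this insertion costs in the worst case would be one way to make the claimed overhead precise.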