I keep dreaming of a world where memcpy is an instruction to the memory controller only and doesn't hold up the CPU at all. It feels kind of silly to have this complicated piece of silicon, with all its amazing processing capabilities, just sit there reading RAM into registers and then writing it back, doing precisely nothing.
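To make the complaint concrete, here is a minimal sketch of what a CPU-driven copy boils down to. `naive_copy` is a hypothetical name used only for illustration; real memcpy implementations use wider loads and other tricks, but the data still passes through CPU registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte-at-a-time copy: every byte passes through a CPU register. */
void *naive_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        unsigned char tmp = s[i];  /* load:  RAM -> register */
        d[i] = tmp;                /* store: register -> RAM */
    }
    return dst;
}
```

The CPU does no computation on `tmp` at all; it only shuttles it from one address to another, which is the work the comment above would like to hand off.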
This is done, but mostly for peripheral I/O (DACs/ADCs, disk, video or audio codecs, and even transfers between the non-shared memories of separate processors). It can be, and has been, used in the main memory hierarchy too. http://en.wikipedia.org/wiki/Direct_memory_access
Some game systems did this too. For instance, the Game Boy Advance had DMA, and it was used all the time: if you were eating up your cycles copying memory, you would run out of time to process your game logic before the vertical blank ended and the system started drawing onscreen.
I also saw it used to save memory. Onscreen sprites had to be placed into a small memory-mapped address range, so instead of putting every character animation frame into this space and using it all up, I saw just one block designated for the character, and it was animated by DMAing the frames sequentially into that space.
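The single-slot animation trick can be sketched roughly like this. It is a host-side simulation, not GBA code: `rom_frames`, `sprite_slot`, `dma_copy32`, and the sizes are all hypothetical, and `dma_copy32` just calls memcpy where real hardware would program a DMA channel with a source, a destination, and a length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_WORDS = 8, FRAME_COUNT = 4 };  /* hypothetical sizes */

/* Every animation frame lives in plentiful storage (e.g. cartridge ROM)... */
uint32_t rom_frames[FRAME_COUNT][FRAME_WORDS];
/* ...while only one frame's worth of scarce sprite memory is reserved. */
uint32_t sprite_slot[FRAME_WORDS];

/* Stand-in for a DMA transfer; simulated here with memcpy (an assumption). */
void dma_copy32(uint32_t *dst, const uint32_t *src, size_t words)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, words * sizeof *dst);
}

/* Fill frame i with the value i + 1 so the frames are distinguishable. */
void load_demo_frames(void)
{
    for (unsigned f = 0; f < FRAME_COUNT; f++)
        for (unsigned w = 0; w < FRAME_WORDS; w++)
            rom_frames[f][w] = f + 1;
}

/* Once per animation tick, overwrite the single slot with the next frame. */
void show_frame(unsigned tick)
{
    unsigned frame = tick % FRAME_COUNT;
    dma_copy32(sprite_slot, rom_frames[frame], FRAME_WORDS);
}
```

Calling `show_frame` with an increasing tick cycles the character through its frames while sprite memory only ever holds one of them.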
Yeah, sure, you could implement that in the memory controller, but it seems like a waste of silicon for something you really shouldn't need to do very often anyways. And even if you did, the copy would be saturating the bus bandwidth that the CPU would otherwise be waiting on. So you're most likely to just end up with an idle CPU anyways, and you might as well use the silicon we already have.
Edit: Ohh, I forgot to mention that you will have to update/invalidate any resident cache lines on the CPU also, which requires communicating with the CPU.
I believe the Amiga was the first computer system to do just this - although with the massive on-board CPU caches that we have today, I don't think this would be as useful.
The memory controller is not a great place for this in general - for instance, imagine memcpying to an 8kB region whose 2nd page has been flushed to disk. Now the memory controller needs to be able to resolve page faults...
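A rough sketch of that 8kB scenario, assuming 4kB pages and a deliberately simplified page table. `struct pte`, `offload_copy`, and `PAGE_BYTES` are hypothetical names, and a real MMU is far more involved. The point is only that an offloaded copy has to walk the destination page by page and stops dead at the first page that is not resident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PAGE_BYTES = 4096, NUM_PAGES = 4 };  /* hypothetical sizes */

/* Hypothetical, simplified page table entry. */
struct pte {
    bool     present;  /* false: page has been flushed to disk */
    uint8_t *frame;    /* backing physical frame when present  */
};

/*
 * Copy len bytes from src into the virtual range starting at vaddr.
 * Returns the number of bytes copied before hitting a non-present page;
 * a result shorter than len means a page fault that only the OS can resolve.
 */
size_t offload_copy(struct pte *table, size_t vaddr,
                    const uint8_t *src, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t page = (vaddr + done) / PAGE_BYTES;
        size_t off  = (vaddr + done) % PAGE_BYTES;
        assert(page < NUM_PAGES);

        if (!table[page].present)
            break;  /* the copy engine is stuck here without OS help */

        /* Copy up to the end of this page, then translate again. */
        size_t chunk = PAGE_BYTES - off;
        if (chunk > len - done)
            chunk = len - done;

        memcpy(table[page].frame + off, src + done, chunk);
        done += chunk;
    }
    return done;
}
```

With the first page resident and the second swapped out, an 8kB copy comes back half done, and something has to fault the page in and restart it. On a CPU that path already exists; a memory controller would need to grow one.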