AFAIK this is not quite right. They take only 1/4 of a cycle on average, but that is a throughput figure: thanks to pipelining and multiple ALUs, the core can have several independent operations in flight at once. If an instruction depends on the result of an ALU operation, you still have to wait the full latency (1 cycle) before you can continue.
This makes sense, as the clock is also somewhat of the 'driving force' that pushes signals through the chip from one part to another. (Some architectures have 'zero cost' operations, I believe, but these are usually baked into the pipeline and have to be turned on or off depending on need.)
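
To make the latency vs. throughput distinction concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes an idealized core with 4 ALUs and a 1-cycle ALU latency (matching the 1/4-cycle figure above), and the function names are made up for illustration. Real cores have many other limits (decode width, ports, etc.).

```python
import math

def cycles_independent(n_ops, alus=4, latency=1):
    """Cycles to finish n_ops operations that do not depend on each other.

    Assumed model: up to `alus` operations can start every cycle.
    """
    if n_ops == 0:
        return 0
    # The last batch starts in cycle ceil(n/alus) - 1 and needs `latency` cycles.
    return math.ceil(n_ops / alus) - 1 + latency

def cycles_dependent(n_ops, latency=1):
    """Cycles for a chain where every operation consumes the previous result."""
    # No overlap is possible: each operation waits for the full latency of the one before.
    return n_ops * latency

independent = cycles_independent(1000)  # 250 cycles, i.e. 1/4 cycle per op on average
dependent = cycles_dependent(1000)      # 1000 cycles, i.e. 1 cycle per op
```

So the same 1000 additions cost about 250 cycles when they are independent and 1000 cycles when each one needs the previous result, under these assumptions.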
You can do so, but on modern processors it's ~4 cycles to access the L1 cache and
takes 3 cycles (cycle 1: do the first & and the ^; cycle 2: do the +; cycle 3: do the second &).
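
The cycle count above is just the length of the longest dependency chain. A small Python sketch of that counting follows. The exact expression is not shown here, so `((a & b) + (c ^ d)) & e` is a hypothetical stand-in with the same shape (two &, one ^, one +); the op names and the 1-cycle latency are assumptions too, and the model assumes at least two ALUs so the first & and the ^ can run in the same cycle.

```python
# Hypothetical expression: ((a & b) + (c ^ d)) & e
# Each operation maps to the operations whose results it consumes.
# The inputs a..e are assumed to be ready in registers at cycle 0.
DEPS = {
    "and1": [],              # a & b
    "xor":  [],              # c ^ d
    "add":  ["and1", "xor"], # and1 + xor
    "and2": ["add"],         # add & e
}

def finish_cycle(op, deps=DEPS, latency=1):
    """Cycle in which `op` completes, assuming enough ALUs are free."""
    # An operation can start only once all of its inputs have completed.
    start = max((finish_cycle(d, deps, latency) for d in deps[op]), default=0)
    return start + latency

schedule = {op: finish_cycle(op) for op in DEPS}
total = max(schedule.values())
```

Here `schedule` puts `and1` and `xor` in cycle 1, `add` in cycle 2 and `and2` in cycle 3, so `total` is 3 even though there are four operations.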