STM under concurrent load can be expensive because of how modern CPUs work. Since you are updating the same values from different CPUs, you are going to get cache invalidation. There is also the issue that an expensive calculation inside a transaction makes that transaction likely to conflict and fail repeatedly, which can lead to starvation.
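To make the starvation point concrete, here is a rough single-threaded simulation of optimistic commit-and-retry. It is not any real STM implementation: `VersionedCell`, `run_long_transaction`, and the `interfering_commits` schedule are all made up for illustration, and the "other threads" are just commits replayed inside the slow transaction's compute window.

```python
class VersionedCell:
    """Hypothetical stand-in for an STM ref: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def try_commit(self, read_version, new_value):
        # Optimistic validation: commit only if nobody committed since we read.
        if self.version != read_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def run_long_transaction(cell, interfering_commits, max_attempts):
    """Retry a slow read-modify-write (doubling the value).

    interfering_commits[i] is the number of short transactions that commit
    while attempt i is still computing (an assumed, fixed schedule).
    Returns the number of attempts used, or None if it never committed.
    """
    for attempt in range(max_attempts):
        seen_version, seen_value = cell.version, cell.value
        # The "expensive calculation" window: short transactions slip in here.
        n = interfering_commits[attempt] if attempt < len(interfering_commits) else 0
        for _ in range(n):
            cell.try_commit(cell.version, cell.value + 1)
        # Validation fails if any short transaction committed in the window.
        if cell.try_commit(seen_version, seen_value * 2):
            return attempt + 1
    return None


# Contention that tapers off: the long transaction gets through eventually.
tapering = VersionedCell(1)
attempts_used = run_long_transaction(tapering, [3, 2, 1, 0], max_attempts=10)

# Constant contention: every attempt is invalidated, so it is starved.
busy = VersionedCell(1)
starved_result = run_long_transaction(busy, [1] * 5, max_attempts=5)
```

With the tapering schedule the long transaction commits on its fourth attempt; with even one short commit per window it never commits. The longer the calculation, the wider that window is, which is the whole problem.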
What exactly "isn't particularly fast"? :)
In any real-world application the cost of the STM machinery itself is, to a first approximation, 0.
There is a cost to immutable data structures, but I don't think that's what you were talking about.