Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.
But in general the answer to your question is yes: we use AMD-specific builtins where they're available and efficient to make things work. Without them, many things would be unrepresentable, not just slow!
Adding one is trivial: if the instruction exists and the backend knows about it, a compiler builtin can carry it from the frontend through to the backend.
If no suitable instruction exists at all, you can write a C++ function that replicates the behaviour and codegen a call to it. Since PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (it depends on the situation).