I believe this is more of an optimization layer to be utilized by libraries like TensorFlow and JAX: a simplification of the interaction with traditional CUDA instructions.
I imagine these libraries and possibly some users would implement libraries on top of this language and reap some of the optimization benefit without having to maintain low-level CUDA specific code.
XLA is a domain-specific compiler for linear algebra. Triton generates and compiles an intermediate representation for tiled computation. That IR allows more general functions and also claims higher performance.
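For context on what "tiled computation" means here, a minimal NumPy sketch of blocking a matmul into tiles. This is purely illustrative of the block-level structure Triton's IR is organized around; the function name, tile size, and triple loop are my own, not Triton's API:

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Multiply a (m x k) by b (k x n) one output tile at a time.

    Illustrative sketch only: Triton programs are written in terms of
    tiles like these, and its compiler then maps each tile onto GPU
    threads and shared memory automatically.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Accumulate one output tile from tile-sized slices of a and b.
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c
```

The payoff of expressing the computation this way is that each tile is a small, cache- or shared-memory-sized unit of work the compiler can schedule and optimize independently.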
Without reading the paper, I think you have it a little backwards: the IR doesn't itself allow for more general functions. More general functions are possible (in theory) because the frontend (the Triton language) is decoupled from the backend (CUDA) through the IR as an interface. In this sense the Triton IR is no less domain-specific than XLA, because both are IRs that represent sequences of operators that run on a GPU (or TPU, or whatever). In theory Triton could be eschewing all of, e.g., cuDNN, but most likely it isn't, since NVIDIA's closed-source kernels perform best on their closed-source hardware.
Edit: I should've read the post before commenting. It looks like they are in fact using LLVM's PTX backend (i.e., generating CUDA kernels from scratch). Kudos to them.