I wonder how they test the code? Maybe they could write a meta VM in a testable environment (e.g. in C) and transpile it into the instructions that the library uses?
If I were them, I’d test each part of the toolchain (which I assume is some sort of high-level compiler targeting their RISC VM) independently, as you would for any component of this type. For the actual exploits themselves, it’s probably a regular debugger with facilities tailored to their VM.
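To make "test each part independently" a bit more concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the three opcodes, the `assemble` and `run` helpers, and the four-register layout are my assumptions, not their actual instruction set or tooling. The point is only that the front end (here a toy assembler) and the VM can each be checked in isolation before you check them together.

```python
# Hypothetical toy VM: the opcodes and register layout are invented for
# illustration and are NOT the real instruction set discussed above.

def assemble(source):
    """Turn one-instruction-per-line text into (op, a, b, c) tuples."""
    program = []
    for line in source.strip().splitlines():
        op, *args = line.split()
        # Pad missing operands with 0 so every instruction has three.
        args = [int(x) for x in args] + [0] * (3 - len(args))
        program.append((op, *args))
    return program


def run(program, regs=None, max_steps=1000):
    """Execute a list of (op, a, b, c) tuples and return the final registers."""
    regs = list(regs) if regs is not None else [0] * 4
    pc = 0
    steps = 0
    while pc < len(program):
        if steps >= max_steps:
            raise RuntimeError("step limit exceeded")
        steps += 1
        op, a, b, c = program[pc]
        if op == "li":       # load immediate: regs[a] = b
            regs[a] = b
        elif op == "add":    # regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
        elif op == "jnz":    # jump to instruction index b if regs[a] != 0
            if regs[a] != 0:
                pc = b
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs
```

With something like this, the assembler gets its own tests on the tuples it emits, the VM gets its own tests on hand-written tuples, and only then do you run end-to-end programs through both. A debugger for the VM would presumably hook into the same loop in `run` (single-stepping, dumping `regs`), but that part is a guess on my side.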