I think the better statement is that if you run, say, 10 rounds of the Miller-Rabin test, you can be confident that an error in one round is uncorrelated with an error in the next, so it's easy to dial the accuracy as close to 1 as desired. Whereas with an LLM, correlated errors seem much more likely: if it failed three times parsing the same piece of data, I would have no confidence that attempts 4 through 10 would succeed at the same rate they would on a fresh piece of data. LLMs seem much more like the Fermat primality test, where a Carmichael number fools every coprime base so repeating the test buys you nothing, except that the LLM's "Carmichael numbers" are a lot more common.
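
To make the contrast concrete, here's a minimal Python sketch (the function names and the 10-round default are just my choices for illustration): each Miller-Rabin round draws an independent random base, and for a composite n at most 1/4 of bases fail to witness compositeness, so k rounds push the error probability below 4^-k. The Fermat check has no such per-round guarantee, which is exactly the Carmichael-number problem.

```python
import random

def miller_rabin(n: int, rounds: int = 10) -> bool:
    """Each round uses an independent random base; for composite n at most 1/4
    of bases are 'liars', so the error probability is at most 4**-rounds."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # write n - 1 as d * 2**s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)   # fresh, independent base each round
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue                     # this round is inconclusive
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # found a witness: n is definitely composite
    return True                          # probably prime; wrong with prob. <= 4**-rounds

# The Fermat test lacks this guarantee: 561 = 3 * 11 * 17 is the smallest
# Carmichael number, so a**560 == 1 (mod 561) for every base a coprime to 561,
# i.e. the rounds' errors are perfectly correlated.
print(pow(2, 560, 561) == 1)    # True: the Fermat check with base 2 is fooled
print(miller_rabin(561))        # False (with overwhelming probability)
print(miller_rabin(2**61 - 1))  # True: 2**61 - 1 is a Mersenne prime
```

The point of the sketch is just that the repeated Miller-Rabin rounds are genuinely independent trials, whereas re-running the Fermat test on a Carmichael number, like re-prompting an LLM on the same troublesome input, keeps hitting the same systematic blind spot.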