I've tested a few of the self-hosted models, and also compared GPT-3, GPT-4, PaLM, and PaLM 2.
Currently, GPT-4 blows everything else out of the water; there is simply no comparison.
More importantly, the limited context size of the open-source models precludes them from being used for this type of task. Some of the longer Wiki pages alone take well over 10K tokens to load, before you even account for the prompt, a reference, and the output. The GPT-4 32K model could handle most such scenarios.
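To make the budgeting concrete, here is a rough sketch of the arithmetic. The ~4 characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer count, and the context limits and model names below are illustrative assumptions:

```python
# Hypothetical context limits (tokens) for a few models, for illustration.
CONTEXT_LIMITS = {
    "gpt-4": 8_192,       # base GPT-4 context window
    "gpt-4-32k": 32_768,  # extended-context variant
    "open-7b": 2_048,     # typical self-hosted model of the time
}

def estimate_tokens(text_chars: int) -> int:
    """Approximate token count using the ~4 chars/token heuristic."""
    return text_chars // 4

def fits_in_context(page_chars: int, prompt_tokens: int,
                    output_budget: int, model: str) -> bool:
    """Check whether page + prompt + reserved output fit the window."""
    needed = estimate_tokens(page_chars) + prompt_tokens + output_budget
    return needed <= CONTEXT_LIMITS[model]

# A long wiki page (~60,000 characters, roughly 15,000 tokens) plus a
# 500-token prompt and 2,000 tokens reserved for the output overflows an
# 8K window but fits comfortably in 32K.
print(fits_in_context(60_000, 500, 2_000, "gpt-4"))      # False
print(fits_in_context(60_000, 500, 2_000, "gpt-4-32k"))  # True
```

The same arithmetic rules out a 2K-context self-hosted model even for the page text alone, which is the point about the open-source options above.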
Wikimedia is perfectly capable of deploying LLMs without external help.