This take is overly cynical in this case: the original and primary use case for LLMs is to model language in a comprehensive way. It's literally in the name.
Hallucinations rarely produce invalid grammar or invent non-existent words. What we're concerned about is facts, and facts aren't relevant at all when the goal is language preservation.
Depending on how powerful the language modeling is, I suspect it could lead to an LLM that will confidently and convincingly tell you how to say things like "floppy disk" and "SSD" in every extinct language, even ones that went extinct long before computers existed, which is... interesting, but not exactly truth.
I've seen LLMs hallucinate nonexistent things in programming languages. It's hard to believe they won't do the same to human ones.
Importantly, when a model hallucinates non-existent things in a programming language, it's still stringing together valid English words into something that looks like it ought to be a real concept. It doesn't construct new words from scratch.
If a language model were asked what the word for "floppy disk" was in an extinct language and it invented a decent circumlocution, I don't think that would be a bad thing. People who are just engaging with the model as a way of connecting with their cultural heritage won't mind much if there is some creative application, and scholars are going to be aware of the limitations.
Again, the misapplication of language models as databases is why hallucinations are a problem. This use case isn't treating the model as a database; it's treating the model as a model, so the consequences of hallucination are much smaller, to the point of irrelevance.
I also don't think it was ever much of a problem for machine translation to begin with. And modern conversational systems are already addressing the problem (with things like contrastive/counterfactual fine-tuning and RAG/in-context learning) and will just tell you when they don't know something instead of hallucinating. But I'm pretty sure op doesn't know the difference between a language model and a conversational model anyway; it's just the generic "anti-LLM" opinion without much substance.
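To make the in-context part concrete, here's a toy sketch of the idea, assuming you have some searchable reference corpus; the corpus, the retrieve() helper and the abstention wording are made-up placeholders, not any vendor's actual API:

    # Toy sketch of the RAG / in-context idea behind "will just tell you when it
    # doesn't know": ground the model in retrieved reference text and explicitly
    # allow it to abstain. Corpus and retrieval here are placeholders.
    CORPUS = [
        "placeholder reference sentence with a gloss for 'disk'",
        "placeholder reference sentence with a gloss for 'computer'",
    ]

    def retrieve(query, corpus=CORPUS, k=2):
        # Toy keyword match; a real system would use an index or embeddings.
        words = query.lower().split()
        return [s for s in corpus if any(w in s for w in words)][:k]

    def build_prompt(question, passages):
        context = "\n".join(f"- {p}" for p in passages) or "- (nothing found)"
        return ("Answer using ONLY the reference sentences below. If they do not "
                "contain the answer, reply exactly: I don't know.\n\n"
                f"Reference sentences:\n{context}\n\n"
                f"Question: {question}\nAnswer:")

    print(build_prompt("How do you say 'floppy disk'?", retrieve("floppy disk")))

If the retrieval comes back empty, the prompt itself pushes the model toward "I don't know" rather than an invention.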
How would terms like "floppy disk" and "SSD" even appear in the target language if those terms didn't exist while speakers of the language were alive? Or are you thinking of a multilingual LLM that tries to automatically translate terms it never actually saw in the source/target language during training?
It's common in living languages too. If your language is small, you're bound to have a lot of "foreign words" in it. To this day I find it funny how my native-language teachers would start speaking with tons of English and French words when they wanted to "showcase the complexity/beauty of our native language" or appear smart.
An LLM should do fine with that since it's usually the foreign word spelled in a way that makes sense in that language. I'm more curious about the inverse though. It's sometimes quite difficult to explain the meaning of a word in a language that does not have an equivalent, be it because it has a ton of different meanings or because it's some very specific action/object.
They're kind of bad at pretty much all languages except simpler forms of English and Python. The tonality in the big LLMs tends to be distinctly inhuman as well.
I suspect it'll be hard to find more material in some obscure, dying language than there is of either of those in the common training sets.
What is "they"? Are you saying transformer architecture somehow is biased towards English? Or are you saying that existing LLMs have that bias?
The only way this project is going to make sense is to train it fresh on text in the language being preserved, to avoid accidentally contaminating the model with English. And if it's trained fresh on only target-language content, I'm not sure how we can generalize from the whole-internet models you're familiar with.
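For what it's worth, "training fresh" on a small corpus isn't exotic. A minimal sketch, assuming a plain-text corpus file; the path, vocab size and model dimensions are placeholder guesses, not a recipe for this project:

    # Sketch: learn a tokenizer and a small causal LM from the target-language
    # corpus alone, so no English vocabulary or weights leak in.
    from tokenizers import ByteLevelBPETokenizer
    from transformers import (GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    # 1. Tokenizer learned only from the preservation corpus.
    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["corpus.txt"], vocab_size=8000, min_frequency=2,
              special_tokens=["<|endoftext|>"])
    bpe.save("target_tokenizer.json")
    tokenizer = PreTrainedTokenizerFast(tokenizer_file="target_tokenizer.json")
    tokenizer.pad_token = tokenizer.eos_token = "<|endoftext|>"

    # 2. Chunk the corpus into model inputs.
    ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    # 3. A deliberately tiny GPT-2-style model, initialised from scratch.
    model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512,
                                       n_embd=256, n_layer=6, n_head=8))

    # 4. Standard next-token-prediction training.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    args = TrainingArguments(output_dir="target-lm", num_train_epochs=3,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()

How well such a tiny, from-scratch model generalizes is exactly the open question; the point is only that it shares nothing with the whole-internet models.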
I don't really care about the minutiae of the technical implementations; I'm talking from the experience of pushing text into LLMs, locally and remotely, and getting translation in some direction or other back.
To me it doesn't make sense. It seems like an awful way to store information about a fringe language, but I'm certainly not an expert.
> and getting translation in some direction or other back.
This seems to upset a lot of English speakers: from the perspective of primarily non-English speakers, LLM output reads like it was translated. But hey, it's n>=2 even on HN now.
I don't know, the translation errors are often pretty weird, like 'hot sauce' being translated to the target equivalent of 'warm sauce'.
Since LLMs work by probabilistically stringing together sequences of words (or tokens or whatever), I don't expect them to ever become fully fluent, unless natural language degenerates and loses a lot of its flexibility, flourish, analogy and so on. Then we might hit some level of expressiveness that they can actually simulate fully.
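For the record, "probabilistically stringing tokens together" boils down to something like this toy sketch; the vocabulary and logits are invented numbers, not from any real model:

    # Toy illustration: the model only ever samples the next token from a
    # probability distribution, one step at a time.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "sauce", "is", "hot", "warm", "."]
    logits = np.array([0.1, 0.2, 0.3, 2.5, 2.3, 0.5])  # pretend scores after "the sauce is"

    def sample_next(logits, temperature=0.8):
        z = logits / temperature              # lower temperature sharpens the choice
        p = np.exp(z - z.max())
        p /= p.sum()
        return rng.choice(len(logits), p=p)

    print(vocab[sample_next(logits)])  # "hot" and "warm" come out almost equally
                                       # often, which is roughly how you end up
                                       # with "warm sauce"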
The current boom is different from, but also very similar to, the previous age of symbolic AI. Back then they expected computers to be able to automate warfare and language and whatnot, but the prime use case turned out to be credit checks.
Which languages have you tried that they're bad at? I've tried a bunch of European languages and they are all perfect (or perfect enough for me to never know otherwise).
Swedish, German, European Spanish, French and Luxembourg French.
Sometimes they do a decent translation, but too often they trip up on grammar or vocabulary, or just assume that a given string of bytes always means the same thing. I find they work best in extremely formal settings, like documents produced by governments and lawyers.
Have you had the opportunity to interact with less wrapped versions of the models?
There's a lot of intentionality behind the way LLMs are presented by places like ChatGPT/DeepSeek/Claude; you're distinctly talking to something that's actively limited in how it can speak to you.
It's not exactly nonexistent outside of them, but they make it worse than it is.
Does it matter? Even most Chinese models are trained on <50% Chinese data, last I checked, and they still manage to show an AliExpress accent that would be natural for a Chinese speaker with ESL training. They're multilingual but not language-agnostic; they can just grow English-to-$LANG translation ability so long as English stays the dominant, defining language in the mix.
I've run a bunch locally, sometimes with my own takes on system prompts and other adjustments, because I've tried to make them less insufferable to use: not as absurdly submissive, not as eager to do things I've not asked, not as censored, stuff like that.
I find they struggle a lot with things like long sentences and advanced language constructs, regardless of the natural language they try to simulate. When accuracy doesn't matter they're useful anyway: I can get a rough idea of the contents of documents in languages I'm not fluent in, or make the bulk of a data set queryable in another language. But it's a janky hack, not something I'd put in front of the people paying my invoices.
Maybe there's a trick I ought to learn, I don't know.
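In case it helps, this is roughly the kind of local setup I mean, assuming an Ollama server on its default port; the model name and system prompt are just examples, not recommendations:

    # Local chat call with a custom system prompt to tone down the default persona.
    import requests

    SYSTEM = ("Answer plainly. Do not flatter the user, do not do work that was not "
              "asked for, and say so when you are unsure.")

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.1",
            "stream": False,
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": "Translate 'hot sauce' into Swedish."},
            ],
        },
        timeout=120,
    )
    print(resp.json()["message"]["content"])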
This sounds like a grounded use of LLMs. Presumably they're feeding indigenous-language text into the NN to build a model of how the language works. From this model one may then derive starting points for grammar, morphology, vocabulary, and so on. Like, how would you say "large language model" in Navajo? If fed data on Navajo neologisms, an LLM might come up with some long word that means "the large thing by means of which one can teach metal to speak" or similar. And the tribal community can take, leave, or modify that suggestion, but it's based on patterns that are manifest in the language and which statistical methods can elicit.
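A low-tech version of "deriving starting points" is just learning subword statistics from the raw text. A minimal sketch, where the corpus path and sizes are placeholders and the resulting pieces are statistical guesses for a human linguist to review, not verified morphemes:

    # Learn a unigram subword model and inspect how it segments words; frequent,
    # reusable pieces often line up with productive affixes and stems.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="indigenous_corpus.txt",   # placeholder path to the language data
        model_prefix="lang_unigram",
        vocab_size=4000,
        model_type="unigram",
    )

    sp = spm.SentencePieceProcessor(model_file="lang_unigram.model")
    print(sp.encode("a target-language word or sentence here", out_type=str))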
Machine learning techniques are really, really good at finding statistical patterns in data they're trained on. What they're not good at is making inferences about facts they haven't been specifically trained to accommodate.
The next few decades are going to be really, really weird.