This sounds like an interesting way to ensure accurate quotes:
> To accurately generate the exact passages in the given corpus, we employ a trie-based constrained decoding algorithm (Chen et al., 2020; Cao et al., 2021; Lu et al., 2021) in which the generated tokens can be constrained in the dynamic vocabulary. Specifically, instead of generating a token from the entire target vocabulary at each step, we use a prefix tree (trie) to constraint the target vocabulary and ensure that the generated content is within the corpus. During the construction of trie, we remove stop words from the initial token to improve semantic representation of the trie.
Projects like llama.cpp/ollama should make this automatic and dynamic: just rely on triple-backtick sections, where you enter those constraint modes automatically.
I.e. every time you enter a section starting with "```json", you automatically switch to a JSON BNF grammar.
Every time you enter "```json:Foo", you get JSON BNF plus the JSON Schema for the Foo object definition.
"```python" for Python grammar, etc.
"```quote:documentRef" and you enter trie-based constraints.
"```llm:otherllm" and you hand over to another LLM.
"```whatever:whatever" and you enter whatever you want.
If you want just JSON output, you start the output with "```json" and that's it.
As it's all at inference time, it could be plugin-based, i.e.:
1. character based - given input (from the start of opening "```foo") it returns allowed next characters, or
2. token based - same as above but returns allowed native tokens (not sure how performance would behave here, would it be acceptable?)
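To make the character-based variant concrete, here is a minimal sketch of what such a plugin hook could look like. Everything in it is hypothetical: the `register` decorator, the `allowed_next_chars` function and the toy "digits" tag are made up for illustration and are not an existing llama.cpp or ollama API.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical plugin signature: given the text produced since the opening
# fence line, return the set of characters allowed next.
CharConstraint = Callable[[str], Set[str]]

PLUGINS: Dict[str, CharConstraint] = {}
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def register(tag: str):
    """Register a constraint plugin under a fence tag (illustrative only)."""
    def wrap(fn: CharConstraint) -> CharConstraint:
        PLUGINS[tag] = fn
        return fn
    return wrap


@register("digits")
def digits_only(body: str) -> Set[str]:
    """Toy grammar: one line of digits, then the closing fence."""
    if body.endswith("\n") or body.endswith("`"):
        return {"`"}  # only the closing fence may follow
    allowed = set("0123456789")
    if body:
        allowed.add("\n")  # at least one digit before the line may end
    return allowed


def allowed_next_chars(output: str) -> Optional[Set[str]]:
    """Return the allowed next characters, or None when unconstrained."""
    parts = output.split(FENCE)
    if len(parts) % 2 == 1:  # even number of fences seen: outside any block
        return None
    header, newline, body = parts[-1].partition("\n")
    if not newline:  # the tag line is still being written
        return None
    plugin = PLUGINS.get(header.split(":")[0].strip())
    return plugin(body) if plugin else None
```

A sampler would call `allowed_next_chars` on the text generated so far and drop every candidate whose next character is not in the returned set. A token-based plugin would have the same shape but return token ids instead of characters.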
IMHO another very interesting area would be exploring stable AST representations for programming languages (à la darklang, I believe?), where variable names are detached from the AST itself, i.e. differently named functions that otherwise have the same structure have precisely the same AST representation. This would dramatically reduce the space the model has to navigate.
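A rough illustration of the name-detached AST idea, using Python's standard `ast` module. This is just a sketch of the concept, not how darklang does it: the `canonical_form` helper is made up here, and it naively renames every identifier (including builtins and globals) in order of first appearance.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Rename functions, arguments and variables to positional placeholders."""

    def __init__(self) -> None:
        self.names: dict = {}

    def _canon(self, name: str) -> str:
        # The first identifier seen becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = self._canon(node.name)
        self.generic_visit(node)  # then the arguments and the body
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._canon(node.id)
        return node


def canonical_form(source: str) -> str:
    """Dump the AST with all identifiers replaced by placeholders."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)


a = "def add(x, y):\n    return x + y\n"
b = "def plus(first, second):\n    return first + second\n"
c = "def sub(x, y):\n    return x - y\n"

same = canonical_form(a) == canonical_form(b)       # names differ, structure matches
different = canonical_form(a) != canonical_form(c)  # the operator differs
```

With this, `add` and `plus` collapse to the same representation while `sub` stays distinct, which is the kind of reduction in search space I mean.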
This same technique, extended, can also work well for detecting plagiarism from the underlying corpus, by tracking a trie of "good" completions in the n-gram sense and a longer trie of "no-good" completions. This technique was (to my knowledge) first shown in [0], and [1] in particular is a really interesting video discussing these topics around max-order n-grams, even in a Markovian setting. I used this technique a bit in symbolic music generation and was quite pleased with the results; I always planned to work it into whatever models came next.
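For the "no-good" side, a small sketch of how a max-order check might look. To keep it short this uses a plain set of n-gram tuples instead of a trie, and the names `build_forbidden` and `blocked_tokens` are made up for the example; it is not the method from [0].

```python
from typing import Iterable, Sequence, Set, Tuple


def build_forbidden(corpus: Iterable[Sequence[str]], max_order: int) -> Set[Tuple[str, ...]]:
    """Collect every (max_order + 1)-gram that occurs verbatim in the corpus."""
    n = max_order + 1
    forbidden: Set[Tuple[str, ...]] = set()
    for seq in corpus:
        for i in range(len(seq) - n + 1):
            forbidden.add(tuple(seq[i:i + n]))
    return forbidden


def blocked_tokens(
    history: Sequence[str],
    forbidden: Set[Tuple[str, ...]],
    max_order: int,
    vocab: Iterable[str],
) -> Set[str]:
    """Tokens that would extend a copied run beyond max_order tokens."""
    if len(history) < max_order:
        return set()
    context = tuple(history[len(history) - max_order:])
    return {tok for tok in vocab if context + (tok,) in forbidden}


corpus = [["a", "b", "c", "d"]]
forbidden = build_forbidden(corpus, max_order=2)
blocked = blocked_tokens(["a", "b"], forbidden, 2, {"a", "b", "c", "d"})
```

Here `blocked` contains only "c": copying "a b" from the corpus is tolerated, but continuing with "c" would reproduce a 3-gram verbatim, so the sampler would remove it from the candidates.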
I think there are a lot of methods from these older Markovian setups that can be employed in the output samplers of modern models, as well as the inclusion of structured searches and so on. Parts of deep learning have always focused on structured output search, but historically the LLM-style generative setting has not employed these approaches (though I find beam search for generative settings needs tweaking, it usually works pretty well in smaller-scale problems for me).
I wonder if anyone has tried giving an LLM a tool that ensures that it copies a quote from a document accurately? It seems like it would be a pretty simple way to avoid some hallucinations.
Do I understand correctly that the idea here is to insert some logic into the inference loop?
I might not have a clear idea how the inference loop works but it sounds like a whole host of solutions could present themselves if it was easy to plug in various types of logic at inference.
Yes, on every iteration the inference loop produces a list of “logits”: one score per vocabulary token, indicating how likely that token is to come next. LLM parameters like temperature and top_p configure the default logic that picks a token from those scores, but that logic can be replaced with anything you want. It just has to select the next token and feed it back into inference.
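As a toy illustration of that loop, with a stand-in for the model (the `toy_model` function below is invented for the example and is not a real LLM, and none of these names come from llama.cpp):

```python
import math
import random
from typing import Callable, List, Sequence

Logits = List[float]
Selector = Callable[[Logits, List[int]], int]


def toy_model(tokens: Sequence[int], vocab_size: int = 5) -> Logits:
    """Stand-in for a forward pass: favours (last token + 1) mod vocab_size."""
    favoured = (tokens[-1] + 1) % vocab_size
    return [2.0 if t == favoured else 0.0 for t in range(vocab_size)]


def greedy(logits: Logits, tokens: List[int]) -> int:
    """Pick the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


def temperature_sampler(temperature: float, rng: random.Random) -> Selector:
    """The usual default: sample from softmax(logits / temperature)."""
    def select(logits: Logits, tokens: List[int]) -> int:
        weights = [math.exp(score / temperature) for score in logits]
        return rng.choices(range(len(logits)), weights=weights)[0]
    return select


def greedy_even_only(logits: Logits, tokens: List[int]) -> int:
    """Custom logic: mask out odd token ids, then take the best that is left."""
    masked = [s if t % 2 == 0 else -math.inf for t, s in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


def generate(prompt: Sequence[int], select: Selector, steps: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_model(tokens)             # one score per vocabulary token
        tokens.append(select(logits, tokens))  # pluggable selection step
    return tokens


plain = generate([0], greedy, 4)
constrained = generate([1], greedy_even_only, 4)
sampled = generate([0], temperature_sampler(0.8, random.Random(0)), 4)
```

The only thing `generate` requires of `select` is that it returns a token id, so a grammar, a trie or any other constraint can live entirely inside that function.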
Llama.cpp and its downstream users like ollama have long supported using BNF grammars to constrain the output that way.
This is interesting, but unless I'm misreading the paper, it looks like they're training an LLM on the corpus. I can easily see why that would result in better performance than an off-the-shelf embeddings model, but... it won't work for a corpus that changes frequently, since you'll constantly have to retrain the LLM. That's sort of the point of RAG: how do you get the right information into an LLM as context, for data that changes so frequently that you can't directly train on it?
There are some interesting corpora that I would like to search smartly, that shouldn't change too often. For instance "all Norwegian newspapers printed before 1980".
This was my takeaway as well. None of the other retrieval methods this paper benchmarks against are specifically trained on the corpus. I think it would be more fair to compare "Self-Retrieval" against models which have been fine-tuned on the corpus.
"Specifically, we treat each original sentence in the document as an index and the document itself as the object of the index, allowing the LLM to memorize documents and build indexes through self-supervised learning"
This is clever. I haven't seen an effective way to train an LLM to search a document yet, and I can imagine this being very effective. I suppose this relies on the overfitting you get when fine-tuning on a very small dataset.
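One possible reading of that quoted sentence, as a data-preparation sketch. This is my interpretation, not the paper's code: the function name is made up, the sentence splitter is a naive regex, and how the pairs are actually fed to the model during training is not shown.

```python
import re
from typing import List, Tuple


def sentence_to_document_pairs(docs: List[str]) -> List[Tuple[str, str]]:
    """Pair every sentence (the "index") with its full document (the "object")."""
    pairs: List[Tuple[str, str]] = []
    for text in docs:
        text = text.strip()
        # Naive split after ., ! or ? followed by whitespace (an assumption).
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence:
                pairs.append((sentence, text))
    return pairs


docs = ["Oslo is the capital. It is in Norway.", "Bergen is rainy."]
pairs = sentence_to_document_pairs(docs)
```

Each pair would then presumably become one self-supervised example where the model sees the sentence and has to produce the document it came from.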
I like methods of this flavour, pioneered as I understand it by Fabio Petroni. Very elegant, particularly because you can change the distribution over substrings in O(params) time instead of O(index size).
What's funny about them is that it's a fairly involved procedure that turns your language model into an actual stochastic parrot, both showing that such a model is useful and demonstrating that the original parrot concept was rather ill-conceived.
This seems compute-intensive for RAG utilities, or I'm missing something; I don't really see the application when you could get the same results with sparse embeddings (keywords) followed by re-ranking. It is promising, though: with context length issues, amnesia becomes a problem because the retrieved docs often bloat the context window. This is the right direction for personal LLMs.
i need someone to explain this paper in direct language, because this paper keeps saying things like "internalizes the corpus to retrieve into a LLM via a natural language indexing architecture." but never ever explains what the heck "internalizing" means, and where the "natural language index" sits in the architecture. Figure 2 is the only thing somewhat resembling an arch diagram and just uses these vague undefined terms to describe it. am i missing something or just voicing the same confusion everyone else has?
> They train the LLM directly on the corpus so that the documents are embedded in its weights.
they do not outright say that in the paper as far as i could tell. i only got it from reading hn comments. just very confused why they use a nonstandard term like "internalize" which just pisses me off because ML is hard enough without inventing your own terms
I wonder how they would calculate the metrics if the result is generated instead of retrieved? Is it likely that the LLM can generate exactly the same output as the desired result?
> To accurately generate the exact passages in the given corpus, we employ a trie-based constrained decoding algorithm (Chen et al., 2020; Cao et al., 2021; Lu et al., 2021) in which the generated tokens can be constrained in the dynamic vocabulary. Specifically, instead of generating a token from the entire target vocabulary at each step, we use a prefix tree (trie) to constraint the target vocabulary and ensure that the generated content is within the corpus. During the construction of trie, we remove stop words from the initial token to improve semantic representation of the trie.
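For anyone wondering what that looks like mechanically, here is a minimal sketch of trie-constrained greedy decoding over word-level tokens. It is an illustration under my own assumptions, not the paper's implementation: the `score` callback stands in for the model's logits, the `EOS` marker is made up, and the stop-word handling they mention is left out.

```python
from typing import Callable, Dict, Iterable, List, Sequence

EOS = "<eos>"  # hypothetical end-of-passage marker


class Trie:
    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}

    def insert(self, tokens: Sequence[str]) -> None:
        node = self
        for tok in list(tokens) + [EOS]:
            node = node.children.setdefault(tok, Trie())

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep the prefix inside some corpus passage."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        return sorted(node.children)


def build_trie(passages: Iterable[Sequence[str]]) -> Trie:
    trie = Trie()
    for passage in passages:
        trie.insert(passage)
    return trie


def constrained_decode(
    trie: Trie,
    score: Callable[[List[str], str], float],
    max_len: int = 64,
) -> List[str]:
    out: List[str] = []
    for _ in range(max_len):
        allowed = trie.allowed_next(out)
        if not allowed:
            break
        # Only tokens the trie permits compete, whatever the model prefers.
        best = max(allowed, key=lambda tok: score(out, tok))
        if best == EOS:
            break
        out.append(best)
    return out


passages = ["the cat sat".split(), "the cat ran away".split(), "a dog barked".split()]
trie = build_trie(passages)


def toy_score(prefix: List[str], tok: str) -> float:
    # Stand-in for model logits: loves "dog", likes "ran" and "the".
    return {"dog": 3.0, "ran": 2.0, "the": 1.0}.get(tok, 0.0)


result = constrained_decode(trie, toy_score)
```

Even though the toy scorer likes "dog" best overall, `result` comes out as the passage "the cat ran away", because at every step only tokens that continue an existing passage are eligible. That is why an exact-match metric is meaningful here.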