
Text embeddings don't capture inferred data: for example, "the second letter of this text" does not embed close to "e". LLM chain of thought is required to deduce the meaning more completely.



Given current SOTA, no, they don’t.

But there’s no reason why they couldn’t — just capture the vectors of some of the earlier hidden layers during the RAG encoder’s inference run, and append these intermediate vectors to the final embedding vector of the output layer to become the vectors you throw into your vector DB. (And then do the same at runtime for embedding your query prompts.)
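
To make that concrete, here's a minimal sketch of the capture step, assuming Hugging Face Transformers, a MiniLM sentence encoder, mean pooling, and arbitrary layer choices (all of these are illustrative picks, not prescribed by the above):

    # Build an augmented embedding by concatenating pooled vectors from a few
    # intermediate layers onto the usual final-layer embedding. Run the same
    # function over documents at index time and over query prompts at runtime.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder encoder
    EARLY_LAYERS = [2, 4]                                   # placeholder layer picks

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
    model.eval()

    def mean_pool(hidden, attention_mask):
        # Average token vectors, ignoring padding.
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    @torch.no_grad()
    def augmented_embedding(text: str) -> torch.Tensor:
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**enc)
        # hidden_states[0] is the embedding layer; [-1] is the final layer.
        final_vec = mean_pool(out.hidden_states[-1], enc["attention_mask"])
        early_vecs = [mean_pool(out.hidden_states[i], enc["attention_mask"])
                      for i in EARLY_LAYERS]
        # Concatenate [final | layer 2 | layer 4] into one vector for the vector DB.
        return torch.cat([final_vec, *early_vecs], dim=-1).squeeze(0)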

Probably you’d want to bias those internal-layer vectors, giving them an increasingly-high “artificial distance” coefficient for increasingly-early layers — so that a document closely matching in token space or word space or syntax-node space improves its retrieval rank a bit, but not nearly as much as if the document were a close match in concept space. (But maybe do something nonlinear instead of multiplication here — you might want near-identical token-wise or syntax-wise matches to show up despite different meanings, depending on your use-case.)
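
One way to get that effect at scoring time, keeping the per-layer vectors separate rather than concatenated: down-weight the early layers so a surface-level match can only nudge the combined score. The layer ids and weights below are made-up assumptions, and swapping the weighted sum for a thresholded or max-style term would give the nonlinear behavior described above:

    # Combine per-layer distances so that early-layer (token/syntax-level)
    # similarity nudges the ranking while final-layer (concept-level) similarity
    # dominates. Lower return value = better match.
    import numpy as np

    LAYER_WEIGHTS = {2: 0.05, 4: 0.15, "final": 1.0}  # earlier layers count for less

    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def biased_distance(query_vecs: dict, doc_vecs: dict) -> float:
        # query_vecs / doc_vecs map a layer id to that layer's pooled vector.
        # A perfect token-level match improves the score a bit, but never as
        # much as a close match in concept space.
        return sum(
            w * cosine_distance(query_vecs[layer], doc_vecs[layer])
            for layer, w in LAYER_WEIGHTS.items()
        )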

Come to think, you could probably build a pretty good source-code search RAG off of this approach.

(Also, it should hopefully be obvious here that if you fine-tuned an encoder-decoder LLM to label matches based on criteria, some of which are only recoverable from earlier layers, then you'd be training pass-through vector dimensions into the intermediate layers of the encoder; using such an encoder on its own for RAG embedding should produce the same effect as capturing and weighting the intermediate layers of a non-fine-tuned LLM.)



