This sounds like an interesting idea. Can you elaborate, maybe with a concrete example? I am wondering if this can be implemented easily as a plugin in optillm.
One could argue TF-IDF is a special case of an attention layer... but it's not quadratic in inference/training, and it's kinda just a quotient. Yeah, maybe we should go back.
Now have it mark blocks of text on or off, so it can ignore irrelevant or, worse, erroneous material; there's no need to include it in the context window.
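To make the on/off idea concrete, here is a rough sketch in plain Python. It assumes the gate is a TF-IDF cosine score against the query, which is just one cheap way to decide relevance and does nothing about erroneous material. The names `gate_blocks` and `build_context` and the `threshold` value are made up for illustration; none of this is optillm's API.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def gate_blocks(blocks, query, threshold=0.1):
    """Return (block, is_on) pairs; is_on is True when the block's
    TF-IDF cosine similarity to the query reaches the threshold.
    The threshold is an arbitrary illustrative default."""
    docs = [Counter(tokenize(b)) for b in blocks]
    n = len(docs)

    # Document frequency: in how many blocks each term appears.
    df = Counter()
    for d in docs:
        df.update(d.keys())

    def idf(term):
        # Smoothed IDF so unseen query terms do not divide by zero.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vec(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        if na == 0.0 or nb == 0.0:
            return 0.0
        return dot / (na * nb)

    q = vec(Counter(tokenize(query)))
    return [(b, cosine(q, vec(d)) >= threshold) for b, d in zip(blocks, docs)]


def build_context(blocks, query, threshold=0.1):
    # Only blocks marked "on" make it into the context window.
    gated = gate_blocks(blocks, query, threshold)
    return "\n\n".join(b for b, on in gated if on)
```

You would call `build_context(blocks, query)` before sending the prompt, so blocks marked off never reach the model. A real version would probably want a better relevance signal than term overlap, but the gating shape stays the same.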