Bias in the statistical sense is usually $E[\hat{\beta} - \beta]$. By which I mean there's a specific quantity the estimator is supposed to recover, and the bias is how far off you are in expectation. The whole field of causal inference exists because if you estimate things naively, you mix your signals: linear regression gives you biased or unbiased coefficients depending on the setting. Sometimes you need something like instrumental variables (IV), because just plugging in your data will tell you that ambulances are bad, since taking one predicts the patient is more likely to die even after conditioning on everything else catalogued.
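A minimal simulation sketch of that ambulance story (every variable and number here is invented for illustration, including the instrument): an unobserved severity variable confounds the naive regression, so OLS says ambulances kill, while two-stage least squares using the instrument recovers the true, negative effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.normal(size=n)                      # unobserved confounder
z = rng.binomial(1, 0.5, size=n)                   # instrument: random dispatch availability
ambulance = ((0.8 * z + severity + rng.normal(size=n)) > 0.5).astype(float)

# True causal effect is NEGATIVE: ambulances help (-0.5 on the mortality score).
death = -0.5 * ambulance + 2.0 * severity + rng.normal(size=n)

def slope(y, x):
    """OLS coefficient on x, with an intercept."""
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("naive OLS :", slope(death, ambulance))      # positive: "ambulances are bad"

# 2SLS: regress treatment on the instrument, then the outcome on the fit.
Z = np.column_stack([np.ones(n), z])
ambulance_hat = Z @ np.linalg.lstsq(Z, ambulance, rcond=None)[0]
print("IV (2SLS) :", slope(death, ambulance_hat))  # close to the true -0.5
```

The instrument does the work because random dispatch availability moves ambulance use without touching severity, so it isolates the part of the variation that isn't confounded.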
Bias in prediction, rather than in parameter estimation, is a perfectly well-established sense of the term. In particular, people doing language modeling are practically never concerned with identifiability, because you can't pick one weight out of a trillion-parameter model and say what it ought to be in the limit of infinite data.
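To make the two senses concrete, here's a toy sketch (numbers invented): with an omitted variable, the OLS coefficient is badly biased as an estimate of the true effect, yet the fitted model's predictions are unbiased on the same distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.normal(size=n)                         # omitted variable
x = u + rng.normal(size=n)
y = 1.0 * x + 2.0 * u + rng.normal(size=n)     # true coefficient on x is 1.0

X = np.column_stack([np.ones(n), x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print("coefficient on x :", coef[1])                     # ~2.0, badly biased
print("mean prediction error :", (X @ coef - y).mean())  # ~0, predictions unbiased
```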
But when people use the term "bias" in NLP, that's the sense they're reaching for: they don't want some aspect of the model to do something it ought not to do. It's a case of omitted-variable bias causing things like the word-analogy issues you hear about, not an issue of bias in predicting the masked word.
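Here's a toy sketch of how that leakage surfaces in analogy arithmetic. The vectors below are invented for illustration, not real embeddings; the real-world version is the well-known "man : programmer :: woman : homemaker" result. Because the training text confounds occupation with gender, the gender direction bleeds into the occupation vectors, and $a - b + c$ retrieves the stereotyped word.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 = "gender", axis 1 = "occupation-ness".
# The gender component of the occupation words stands in for what skewed
# training co-occurrences would teach a real model.
emb = {
    "man":        np.array([ 1.0,  0.0]),
    "woman":      np.array([-1.0,  0.0]),
    "programmer": np.array([ 0.7,  1.0]),
    "engineer":   np.array([ 0.6,  1.0]),
    "homemaker":  np.array([-0.7,  1.0]),
    "violin":     np.array([ 0.0, -1.0]),
}

def analogy(a, b, c):
    """Nearest word (cosine) to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))

print(analogy("programmer", "man", "woman"))   # -> "homemaker"
```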
It’s not opinions I disagree with, it’s aspects and behavior I don’t want, which is the statistical sense.