My understanding is that they are multiple samples of the same underlying distribution (the applicant's subjective characteristics). Several references who know or have interacted with the applicant attest to his or her personality strengths and weaknesses, and averaging them should give a reasonably accurate measure of the person. If one of your "sensors" consistently reads lower than the rest of the group, and does so over a large sample size, you look into why that one is giving faulty readings.
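
To make the sensor analogy concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the 1-10 rating scale, the names `ref_1`..`ref_3`, the made-up scores, and the threshold are all hypothetical, and this is only one simple way to spot a reference who reads low, not an established method. For each applicant it compares a reference's rating with the average of the other references, then flags anyone whose average gap is strongly negative across enough applicants.

```python
from statistics import mean

def mean_deviation_per_reference(scores):
    """scores maps applicant -> {reference: rating}.
    Returns reference -> (mean gap vs. the other references, sample count)."""
    gaps = {}
    for ratings in scores.values():
        for ref, rating in ratings.items():
            # Average of everyone except this reference (leave-one-out).
            others = [r for name, r in ratings.items() if name != ref]
            if others:
                gaps.setdefault(ref, []).append(rating - mean(others))
    return {ref: (mean(g), len(g)) for ref, g in gaps.items()}

def flag_low_references(scores, threshold=2.0, min_samples=3):
    """Return references that rate at least `threshold` points below the
    others on average, over at least `min_samples` applicants.
    Both cutoffs are arbitrary, hypothetical choices."""
    stats = mean_deviation_per_reference(scores)
    return sorted(
        ref for ref, (gap, count) in stats.items()
        if count >= min_samples and gap <= -threshold
    )

# Made-up ratings on an assumed 1-10 scale.
scores = {
    "applicant_a": {"ref_1": 8, "ref_2": 7, "ref_3": 5},
    "applicant_b": {"ref_1": 6, "ref_2": 7, "ref_3": 4},
    "applicant_c": {"ref_1": 9, "ref_2": 8, "ref_3": 6},
}

flagged = flag_low_references(scores)
print(flagged)  # ['ref_3']
```

Three applicants is far too few to conclude anything in practice; the `min_samples` check is there to stand in for the "large sample size" condition, and a flag only means "go look into why", not that the reference is wrong.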