It's easy to recognize a cat 95% of the time. I can write a program in 30 seconds that will recognize a cat 95% of the time. No, wait, this just in! My program will recognize a cat 100% of the time! The program has just one line:
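In, say, Python:

    print("cat")   # whatever it is shown, it answers "cat"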
Tutorial: With that program, whenever the picture is a cat, the program DOES recognize it. So the program DOES recognize a cat 100% of the time; the OP only claimed 95%.
Uh, we need TWO (2), that's TWO numbers:
(1) the conditional probability of recognizing a cat when there is one (the detection rate), and
(2) the conditional probability of claiming there is a cat when there isn't one.
The second is the false alarm rate, a.k.a. the conditional probability of a false alarm, the conditional probability of Type I error, or the significance level of the test; it is what a p-value, the most heavily used quantity in all of statistics, gets compared against.
One minus the detection rate is the conditional probability of Type II error.
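In symbols (notation mine), with alpha the false alarm rate and beta the probability of Type II error:

    \alpha = P(\text{say cat} \mid \text{no cat})
    1 - \beta = P(\text{say cat} \mid \text{cat}) = \text{detection rate}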
Typically we can adjust the false alarm rate, and, if we are willing to accept a higher false alarm rate, then we can get a higher detection rate.
With my little program, the false alarm rate is also 100%. So, as a detector, my little program is worthless. But the program does have a 100% detection rate, and that's 5% better than the OP claimed.
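Here is a minimal sketch of that trade-off, with made-up numbers: suppose the detector computes a score that is N(0,1) when there is no cat and N(2,1) when there is one, and it says "cat" when the score exceeds a threshold t. Lowering t raises the detection rate and the false alarm rate together:

    # trade-off sketch: score ~ N(0,1) with no cat, ~ N(2,1) with a cat (made up)
    from scipy.stats import norm

    for t in (2.5, 2.0, 1.5, 1.0, 0.5):
        false_alarm = norm.sf(t, loc=0, scale=1)   # P(score > t | no cat)
        detection   = norm.sf(t, loc=2, scale=1)   # P(score > t | cat)
        print(f"t = {t:.1f}: false alarm {false_alarm:.3f}, detection {detection:.3f}")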
If we focus ONLY on detection rate, that is, on recognizing a cat when there is one, then it's easy to get a 100% detection rate with just a trivial test: just say everything is a cat, as I did.
What's tricky is to have the detection rate high AND the false alarm rate low. The best way to do that is given by the classic Neyman-Pearson lemma. A good proof uses the Hahn decomposition, which comes from the Radon-Nikodym theorem of measure theory, with the famous proof by von Neumann, as in W. Rudin, Real and Complex Analysis.
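What the lemma says: for the false alarm rate you will tolerate, the test that maximizes the detection rate declares "cat" exactly when the likelihood ratio exceeds a threshold. A sketch, with the same made-up Gaussians as above:

    # Neyman-Pearson: declare "cat" when f_cat(x) / f_nocat(x) > threshold,
    # with the threshold set so the false alarm rate is the alpha we tolerate.
    from scipy.stats import norm

    def likelihood_ratio(x):
        return norm.pdf(x, loc=2, scale=1) / norm.pdf(x, loc=0, scale=1)

    # With equal variances the ratio is increasing in x, so thresholding the
    # ratio is the same as thresholding x; pick the cutoff from the null tail.
    alpha = 0.05
    cutoff = norm.isf(alpha, loc=0, scale=1)       # about 1.645
    print("say cat when x >", cutoff, "; ratio there =", likelihood_ratio(cutoff))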
My little program was correct and not a joke.
Again, to evaluate a detector, we need TWO, that's two, or 1 + 1 = 2, numbers.
What about a detector that is overall 95% correct? That's easy, too: Just show my detector cats 95% of the time.
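The arithmetic, to check:

    # overall accuracy of the always-"cat" program when 95% of pictures are cats
    prevalence, detection, false_alarm = 0.95, 1.0, 1.0
    accuracy = prevalence * detection + (1 - prevalence) * (1 - false_alarm)
    print(accuracy)   # 0.95, i.e., "overall 95% correct" while discriminating nothing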
If we are to be good at computer science, data science, ML/AI, and dip our toes into a huge ocean of beautifully done applied math, then we need to understand Type I and Type II errors. Sorry 'bout that.
Here is statistical hypothesis testing 101 in a nutshell:

Say you have a kitty cat, and your vet does a blood count, say, whatever that is, and gets a number. Now you want to know if your cat is sick or healthy.

Okay. From a lot of data on what appear to be healthy cats, we know what the probability distribution is for the blood count number. So, we make a hypothesis that our cat is healthy. With this hypothesis, presto, bingo, we know the distribution of the number we got. We call this the null hypothesis because we are assuming that the situation is null, that is, nothing wrong, that is, that our cat is healthy.
Now, suppose our number falls way out in a tail of that distribution.
So, we say: either (A) our cat is healthy and we have observed something rare, or (B) the rare event is too rare for us to believe, so we reject the null hypothesis and conclude that our cat is sick.
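A minimal sketch of that procedure, with made-up numbers: assume healthy blood counts are N(100, 15) and we look at the upper tail:

    # null hypothesis: "the cat is healthy", with healthy counts ~ N(100, 15)
    from scipy.stats import norm

    count = 145                                   # our cat's number
    p_value = norm.sf(count, loc=100, scale=15)   # P(count this high | healthy)
    alpha = 0.01                                  # rarer than this, we don't believe
    if p_value < alpha:
        print(f"p = {p_value:.4f}: too rare; reject 'healthy', call the cat sick")
    else:
        print(f"p = {p_value:.4f}: plausible for a healthy cat; don't reject")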
Historically, that procedure worked great for testing whether a roulette wheel was crooked.
So, like many before you, if you think about that little procedure too long, then you start to have questions! A lot of good math people don't believe in statistical hypothesis testing; but typically, if it is their father, mother, wife, cat, son, or daughter being tested, then they DO start to believe!
Issues:
(1) Which tail of the distribution, the left or the right? Maybe in some context, with some more information, we will know. E.g., for blood pressure in the elderly, we consider the upper tail, that is, blood pressure too high. For a sick patient, maybe we consider blood pressure too low, unless they are sick from, say, cocaine, in which case we may consider too high. So, which tail to use is not settled by the little two-step dance I gave. Hmm, purists may be offended, as is often the case when statistics is looked at too carefully! But, again, if it's your dear, total angel of a perfect daughter, then ...!
(2) If we have data on healthy kitty cats, what about data on sick ones, too? Could we use that data? Yes, and we should. But in some real situations, all we have a shot at getting is data on the healthy case -- e.g., maybe we have oceans of data on the healthy case (e.g., a high-end server farm) but darned little data on the sick cases, e.g., the next really obscure virus attack.
(3) Why the tails at all? Why not just any region of low probability? Hmm .... Partly because we worship at the altar of central tendency?
Another reason is a bit heuristic: by going for the tails, that is, the regions of lowest probability density, we make the rejection region as large as possible for any selected false alarm rate, which helps the detection rate.
Okay, then we can generalize that to multidimensional data, e.g., as we might get from several variables from a kitty cat, dear angel of a perfect daughter, or a big server farm. That is, the distribution of the data in the healthy case looks like the Catskill Mountains. Then we pour in water to create lakes (assume they all seek the same level). The false alarm rate is the probability, under the healthy distribution, of the ground area under the lakes. A detection is a point in a lake. For a lower false alarm rate, we drain out some of the water. So we maximize the geographical area of the lakes for the false alarm rate we are willing to tolerate.
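A sketch of the lakes, assuming all we have is healthy data and (hypothetically) estimating the mountains with a kernel density estimate; the water level is set so that the lakes hold probability alpha under the healthy distribution:

    # pour in water: lakes = the low-density region holding probability alpha
    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(2, 1000))            # 2-D healthy data (made up)
    mountains = gaussian_kde(healthy)               # density = altitude

    alpha = 0.05                                    # false alarm rate we tolerate
    water_level = np.quantile(mountains(healthy), alpha)

    new_point = np.array([[4.0], [4.0]])            # a new observation
    in_a_lake = mountains(new_point)[0] < water_level
    print("detection!" if in_a_lake else "dry ground; looks healthy")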
Well, I cheated -- that same nutshell also covers some of semester 102.
For more, the big name is E. Lehmann, long at Berkeley, e.g., his Testing Statistical Hypotheses.