These concepts "clicked" for me thanks to this video, really great work and thank you so much for sharing.
Many thanks for this intuitive yet detailed and well-reasoned explanation!
Awesome explanation, thank you very much. It shines because it gives the motivation for each metric. Thanks!
Probably the best explanation on the net.
Another good explanation of the F1 score is that it weights FN and FP equally:
TP/(TP + 1/2(FN + FP)). So this is another way to see that it's a good mix of precision and recall.
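As a quick sanity check, here is a small Python sketch (with made-up counts, not numbers from the video) showing that the harmonic-mean definition and the TP/(TP + (FN+FP)/2) form give the same value:

```python
# Illustrative counts only: 30 true positives, 10 false positives, 20 false negatives.
TP, FP, FN = 30, 10, 20

precision = TP / (TP + FP)            # 0.75
recall = TP / (TP + FN)               # 0.60

f1_harmonic = 2 * precision * recall / (precision + recall)
f1_counts = TP / (TP + 0.5 * (FN + FP))

print(f1_harmonic)  # 0.666...
print(f1_counts)    # 0.666... -- identical, the two forms are algebraically the same
```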
Great explanation because it includes motivation for the metrics and how they relate to each other.
Really clear explanation, well done.
this is sooooo helpful!! thank you!
I couldn't get one thing, putting it here in case you are able to answer ---> In the 'Precision' slide, you mentioned that it solves the problem of accuracy cheating, where if 99% of people are healthy and 1% unhealthy and your model predicts all as 'Healthy', then precision would be 0. But how? Let's say 100 people, 99 healthy (positive), 1 unhealthy (negative), and the model predicts all positive (cheated), thus 99 true positives and 1 false positive. Precision = TP / (TP + FP) = 99 / (99 + 1) = 0.99 = 99%, right? How is it 0?
Your confusion comes from how positive and negative are defined. If you are "looking for" the unhealthy case, then unhealthy = positive. Thus, TP = 0 in your example. Hope that helps.
As Scarletts noted in their reply, you have the "unhealthy" and "healthy" classes mixed up. If you look at one of the early slides about terminology, the class we are interested in is the one we call positive, and the one we aren't interested in is the negative class. In this case of infected and uninfected people, we are interested in the infected people, hence this is the positive class.
So with the accuracy metric, we could cheat by just predicting the negative class (all healthy people, of which we have 99) and still achieve an accuracy of 99%, which is very high but misleading.
With precision, defined as TP/(TP + FP), if we predict all negatives then there are no true positives: a TP requires the model to predict positive and the label to match, but here the model predicts all negatives. Consequently TP = 0, and hence precision is 0.
Hopefully this comment adds more detail on why the precision is 0.
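To make the counting concrete, here is a minimal Python sketch of the same 99/1 example, assuming unhealthy = positive (label 1) and a model that "cheats" by predicting everyone as healthy:

```python
# 99 healthy (negative, 0) and 1 unhealthy (positive, 1); model predicts all healthy.
labels = [0] * 99 + [1]
predictions = [0] * 100

TP = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
FP = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)

accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0  # treat the 0/0 case as 0 here

print(accuracy)   # 0.99 -- looks great, but is misleading
print(TP, FP)     # 0 0
print(precision)  # 0.0
```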
@@jeromeeusebius I get it, thanks for spending your time writing this all out, I appreciate it!!!
@@ScarlettsLog I have a further question on this if you can help. With Healthy = Negative & Unhealthy = Positive: if the model predicts all as healthy (negative), then TP = 0 and FP = 0. So Precision = TP / (TP + FP) = 0 / 0 = undefined. How can we call it 0% precise? Thanks.
Along the same lines, in the slide @6:35 you say "Cheating precision 100% => 0% recall". How can the model cheat to be 100% precise? For that, TP must equal (TP + FP) != 0. If the model makes all negative guesses, then Precision = TP/(TP+FP) = 0/0 = undefined again, as above. If the model makes all positive guesses, then precision is some fraction that is not necessarily 100%.
@@MenTaLLyMenTaL You are correct, it would be considered undefined; it is a special/edge case that defaults to 0. For anyone else confused, I will provide two case studies to show how precision and recall normally operate and how they behave in edge cases. Say we have a dataset of 100 people, where 1 person is unhealthy and 99 are healthy. If we consider Unhealthy = Positive and Healthy = Negative, we can see that:
True positive here means when we guess that someone is unhealthy, and they are actually indeed unhealthy.
False positive here means when we guess that someone is unhealthy, but they are actually healthy.
True negative here means when we guess that someone is healthy, and they are actually healthy.
False negative here means when we guess that someone is healthy, but they are actually unhealthy.
Now let's do some experiments. If we have a model that predicts everyone as unhealthy, what will the precision and recall scores be? Remember, our dataset has 99 healthy people and 1 unhealthy person.
Precision: (TP)/(TP + FP)
For precision, since our model predicts everyone as unhealthy, we correctly guess the one person who is unhealthy as unhealthy (TP of 1) and guess 99 people who are actually healthy as unhealthy (FP of 99). Thus precision is 1/100 = 0.01, or 1%.
Recall: (TP)/(TP + FN)
For recall, since our model predicts everyone as unhealthy, we correctly guess the one unhealthy person as unhealthy (TP of 1) and guess 99 actually-healthy people as unhealthy. As we saw for precision, these 99 incorrect guesses are false positive (FP) cases. This means we never make any FN guesses (defined as guessing someone is healthy when they are actually unhealthy), so FN = 0. Recall is then 1/(1 + 0) = 1, or 100%.
Now let's do another experiment. Let's have our model predict everyone as healthy.
Precision: (TP)/(TP + FP)
TP assumes we guess someone as unhealthy and they are actually unhealthy. Since we predict everyone as healthy, we get a true positive count of 0: there is 1 person who is actually unhealthy, but we always guess "healthy", so we never make any true positive guesses. For FP we also get 0: FP assumes we guess someone as unhealthy but they are actually healthy, and since we never guess that anyone is unhealthy, we make no false positive guesses either. Thus precision = 0/0. This is a special case, and implementations typically default it to 0 as the output. Since YT deletes links, search for sklearn.metrics.precision_recall_fscore_support, click the first result, and look at the notes section. However, in this specific scenario we've devised, you could argue precision is 100%, because we never falsely guess someone as unhealthy when they are actually healthy (FP = 0). Basically, when this situation happens, a library or framework will report precision as 0, but as ML practitioners you need to decide what this undefined precision means in the context of your problem.
Recall: (TP)/(TP + FN)
By now, I will just state that TP is 0; it is the same reasoning as above. Remember that a false negative (FN) happens when we guess someone as healthy but they are actually unhealthy. This happens once: we guess the sick person as healthy. Thus we get 0/(0 + 1) = 0 for recall.
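If you want to reproduce both experiments, here is a small sketch using scikit-learn. The zero_division argument of precision_score controls what the 0/0 case returns (this is the behaviour described in the notes of precision_recall_fscore_support); everything else is just the counting above:

```python
from sklearn.metrics import precision_score, recall_score

# 1 unhealthy person (positive = 1) and 99 healthy people (negative = 0)
labels = [1] + [0] * 99

# Experiment 1: predict everyone as unhealthy
all_positive = [1] * 100
print(precision_score(labels, all_positive))  # 0.01  (1 TP, 99 FP)
print(recall_score(labels, all_positive))     # 1.0   (no FN)

# Experiment 2: predict everyone as healthy
all_negative = [0] * 100
# 0/0 case: zero_division sets the reported value (the default warns and returns 0)
print(precision_score(labels, all_negative, zero_division=0))  # 0.0
print(recall_score(labels, all_negative))                      # 0.0  (1 FN)
```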
Great explanation, thank you very much
Excellent explanation
Nice explanations. When you say negative labels are left out of the picture in the precision calculation, I don't quite get it, because the precision formula contains the term FP, which represents a quantity from the negative labels. Am I off track?
Leaving it out of the picture refers only to the guesses the model made. Yes, you need the negative labels to check whether a guess was an FP, but precision does not concern your pool of negative guesses. Hope that makes sense, and sorry if the wording in the slides caused confusion.
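One way to see this concretely: precision is computed only over the positions where the model guessed positive, so the negative guesses never enter the formula; negative labels only show up when they land inside that positive-guess pool as FPs. A small sketch with hypothetical labels:

```python
labels      = [1, 0, 1, 0, 0, 1]
predictions = [1, 1, 0, 0, 0, 1]

# Precision only looks at the positions where the model guessed positive.
positive_guesses = [(y, p) for y, p in zip(labels, predictions) if p == 1]
TP = sum(1 for y, _ in positive_guesses if y == 1)

precision = TP / len(positive_guesses)  # 2 / 3
print(precision)
# The negative guesses (p == 0) were never consulted; a negative label only
# matters here when it appears inside the positive-guess pool (the FP at index 1).
```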
Great video!
Good work 👍🧡
Loved it!, thanks a lot!
very well explained
Thank you
Love you!
tnks
Good stuff
Enjoyed the video - keep up the great work!
I do find myself with one question though: what about the "null" case, did it actually get considered?
In one example you describe type B as the "positive" case amongst types A, B, C and D; this makes sense, but...
...you then describe the "negatives" as being only A, C and D. What about the case of "not A, B, C or D"?
This is the "null" set, and it occurs quite regularly in computer vision object detection models.
Imagine this scenario:
An image of fruit on a table - 3 red apples, 1 yellow apple, and 2 bananas.
Say you're looking to detect apples, but your model only detects the 3 red apples and utterly misses the 1 yellow apple.
This is a case of there being an apple that you didn't mis-classify as a banana - you missed detecting it completely!
How would you describe this common scenario if you're only assuming that your model will classify everything in the image as either an apple or a banana, but you didn't expect it to ignore the yellow apple altogether?
It's been 2 years since you posted, so I'm not expecting a reply; I am hoping that other viewers will ponder the explanations presented - there's a bit more going on...
Cheers,
-Mark Vogt, Solution Architect/Data Scientist - AVANADE
I think you hit on a very important point, which also highlights why the F1 score is rarely used for multi-class predictions. If we only consider the binary case, the issues you highlight go away: scores at or above 0.5 are treated as positive, and the rest as negative. This is a great reason to investigate metrics that are better suited to the multi-class case, even if accuracy and F1 give you "reasonable" results.
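If it helps, the binary thresholding mentioned above looks like this in practice; 0.5 is just the conventional cut-off and the scores below are hypothetical:

```python
probs = [0.93, 0.48, 0.50, 0.12, 0.77]      # hypothetical model scores
preds = [1 if p >= 0.5 else 0 for p in probs]
print(preds)  # [1, 0, 1, 0, 1] -- every score is forced into positive or negative
```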
I'm not a data scientist, but... aren't we calculating the precision, recall, accuracy, etc. of the classifications that are *made* here? I think you could always create a confusion matrix for each object on a binary basis, "detected" and "not detected", so that you measure the accuracy/precision etc. of the model's detection capability rather than its classification of the objects. These feel like two different things to measure.
Awesome! I love the cheating cases brought in!
Excellent explanation