Why is the L(R_1) + L(R_2) = 100 + 0 at 14:50? I thought that p-hat-c is the proportion of examples in R that are of class C, not the actual number of examples in class C. Based on the misclassification loss definition L(R_1) + L(R_2) = (1 - 700/800) + (1 - 200/200) = 0.125 and L(R_1') + L(R_2') = (1 - 400/500) + (1 - 500/500) = 0.1.
Actually I think you're right; your answer matches the Decision Trees lecture notes, CS229, 2021.
He simply omitted the denominator: 1 - 700/800 ==> 100/800, and 1 - 400/500 ==> 100/500. Not defending him, just explaining his thinking process, but strictly speaking, your calculation is the precise one.
I think in this case he is talking about the absolute number of examples that were misclassified. Btw, you're right, it is a bit confusing what he's referring to.
Also please refer to 25:30 in this lecture, where he calculates p-hat instead of the absolute count.
Yeah, it's what @tariqkhan1518 said: we are only looking at the absolute values at that point, since this is also his main argument for switching to a different measure later on.
He defines the classification loss as L(R) = 1 - max(p^c) (over all c in C classes) at 10:19. Next, he defines L(parent) - sum(L(children)), initially stating that it needs to be minimized at 12:05 but later changing it to be maximized at 18:46.
However, when comparing the two trees, he disregards the mentioned concepts and simply sums the values that will be misclassified for each child (100 + 0 for the first tree and 100 + 0 for the second) at 15:00. This approach of summing misclassified values without considering proportions is not accurate.
Taking proportions into account according to L(R) = 1 - max(p^c), it turns out that the quantity to be maximized won't be the same for the two trees. For the first tree: L(R_1) + L(R_2) = (1 - 700/800) + (1 - 200/200) = 0.125. For the second tree: L(R_1') + L(R_2') = (1 - 400/500) + (1 - 500/500) = 0.1.
The reason these values differ is that the quantity he defined as L(parent) - sum(L(children)) is incorrect. To have both splits come out the same, a weighted average by the number of points in each region should be taken. Thus we ditch misclassification loss and go for cross-entropy.
Thanks that makes a lot of sense
I mean you’re right but that’s exactly his point and it’s why he is criticizing misclassification loss, since there we would only be looking at absolute values. It’s also the reason he proposes cross-entropy loss and it automatically addresses your concerns as well.
small nitpick, L(R_1') + L(R_2') = (1 - 400/500) + (1 - 500/500) = 0.2 not 0.1. But otherwise i agree with you.
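To make the numbers in this thread concrete, here is a minimal sketch (hypothetical helper name; per-class counts taken from the lecture's 900/100 example) comparing the unweighted sum of child losses with the size-weighted version:

```python
# Misclassification loss of a region: L(R) = 1 - max_c p_hat_c,
# where p_hat_c is the proportion of examples in R of class c.
def misclass_loss(counts):
    n = sum(counts)
    return 1 - max(counts) / n

# Per-class counts [positives, negatives] for each child region.
split_1 = [[700, 100], [200, 0]]    # R1, R2   (first tree)
split_2 = [[400, 100], [500, 0]]    # R1', R2' (second tree)

for split in (split_1, split_2):
    # Unweighted sum of child losses -- the quantity debated above:
    unweighted = sum(misclass_loss(r) for r in split)
    # Weighted by region size -- makes both splits evaluate identically:
    total = sum(sum(r) for r in split)
    weighted = sum(sum(r) / total * misclass_loss(r) for r in split)
    print(f"unweighted={unweighted:.3f}  weighted={weighted:.3f}")
```

Both splits give a weighted loss of 0.1, equal to the parent's own loss (1 - 900/1000), which is exactly why misclassification loss can't distinguish them and a different criterion is needed.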
@9:40 - Misclassification loss: R is a region, the number of classes is capital C, and p-hat-c is defined as the proportion of examples in region R that are of class lowercase c.
Is there an updated link to download the PDF of the notes? The link given doesn't work.
shouldn't the cross entropy loss at 20:00 be the negative sum of y log p(hat), where y is its true classification? I'm not sure why p hat appears twice in this equation
Yep. This person seems confused between Shannon entropy and cross entropy.
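For what it's worth, the quantity at 20:00 is the entropy of the region's empirical label distribution: plugging p-hat into both slots of the cross-entropy collapses it to Shannon entropy, which is why p-hat appears twice. A minimal sketch, with a hypothetical function name:

```python
import math

def region_entropy(counts):
    """Entropy of a region's label distribution:
    H(R) = -sum_c p_hat_c * log2(p_hat_c),
    with the 0 * log 0 term taken as 0 by convention."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(region_entropy([500, 500]))  # 1.0: a 50/50 region is maximally impure
print(region_entropy([200, 0]))    # 0.0: a pure region
```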
Why did the loss of the parent at 23:37 become the point projected upwards?
he explained right after
I have a doubt: while defining the loss, isn't it more like a proportion, so it should be 100/(900+100)?
These days the level of presentations is so good thanks to open coursewares. Even videos prepared by random people are quite good due to the high effort put into preparing and refining them. This particular lecture might look a little below today's standards, but come on, it is still pretty decent. I am pretty sure he improved after this; after all, experience and practice are the only way.
@15:20 -- Cross Entropy Loss
What is the loss function for Decision Tree?
it is the Entropy or Gini
@@happytrigger3778 And for SVM, XGBoost, and Random Forest?
@@happytrigger3778 Those are for classification trees; you can also find other loss functions, like the **mean squared error**, for regression trees.
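A minimal sketch of the two criteria mentioned in this thread (hypothetical function names): Gini impurity for classification trees, and the mean-squared-error criterion for regression trees.

```python
def gini(counts):
    """Gini impurity: 1 - sum_c p_c^2 over the class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def region_mse(values):
    """Regression-tree criterion: mean squared error
    around the region's mean prediction."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

print(gini([500, 500]))       # 0.5: maximally impure for two classes
print(gini([200, 0]))         # 0.0: pure region
print(region_mse([2, 2, 2]))  # 0.0: constant target, nothing left to split
```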
Nice lecture, which gives me a big picture of tree models.
In cross-entropy, what if p-hat goes to 0? Does that also decrease the loss?
Yes. As x approaches 0, the limit of x log x is 0.
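A quick numerical check of that limit (assumed snippet), which is why the 0 · log 0 term in the entropy sum can safely be treated as 0:

```python
import math

# x * log(x) -> 0 as x -> 0 from the right, so a class with
# p_hat = 0 contributes nothing to the entropy sum.
for x in (1e-2, 1e-6, 1e-12):
    print(x * math.log(x))  # magnitudes shrink toward 0
```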
It should be the decrease in entropy, not cross-entropy, imo. You are using the entropy formula but calling it CE.
I agree. This is an important point that might mislead a lot of people.
Thanks so much for this!
Great easy-to-follow lecture.
How do you know, while partitioning, which split becomes the root?
Andrew is indeed much easier to understand
He has more than 20 years of experience in teaching students lol
There are still better instructors
@@lugia8888 please suggest some..
@@nagendranaidu1933 I would recommend Kilian Weinberger, he's a professor at Cornell and has a youtube channel with all his machine learning lectures. He's an excellent teacher, one of the best, imo.
@@T_SULTAN_ A plus of Kilian is also that he is very funny and engaging - even often unintentionally so... :-D
So many markers 33:30 daamn
I got a high level understanding of DTs but it stayed too qualitative and hand wavy. Add to it the confusion of the first few minutes, there's little to take away in the entire lecture.
40:59
20 minutes in and this is all over the place. Could have been much more thoroughly thought through before the lecture.
As a PhD student who has taught lectures for my advisor, I will say that sometimes it doesn't matter how much you prepare for the lecture, you will stumble a bit. It's certainly not an amazing lecture, but your comment isn't helpful and only seems ignorant.
@@mrpotatohed4 Providing feedback is absolutely paramount to improve performance. This is helpful for both the lecturer and the students. I would have explained it better and in a much more clearer way if I had spent even half a day learning about it from scratch but this is just pathetic at best.
@@HimanshuSharma-ro6ym you said "more clearer" i don't think your reading comprehension would allow for clarity buddy.
In deep learning we have a thing called dropout, whyn't we use it here?
You must be talking about pruning
whyn't isn't a word.
really clear😆
worst lecture till now in the series
i agree
Good lecture, but waiting for him to finish writing on a whiteboard when ML is done on a computer is incredibly frustrating. What is this, 2003?