This is very clear explanation better than most of the articles that I could find online, thanks! I have one question though: when getting the global shapley value (average across all the instances), why do we sum up the absolute value of the Shapley value of all the instances? Is it how we need to keep the desirable properties of the Shapley value? Is there any meaning of summing up the plain value of the Shapley value (e.g. positive and negative will now cancel off each other)? Another question is, when you said the expected value of the difference, is it just an arithmetic average of all the difference from all those permutations? I remember seeing something that Shapley value is actually the "weighted" average of the difference, which is related to the ordering of those features. Is the step 1 already taking into this into consideration, such that we only need the arithmetic average to get the final Shapley value for that instance?
Hi Rithvik, Really a nice video. Please cover advanced concepts like Fast gradient sign method . Ur way of explaining those concepts would be really helpful for everyone.
13:59 I did not fully understand how the values in the chart give you the contribution of variables to difference b/w given and avg prediction. I think what you were doing all along was take the difference in predictions b/w two vectors (x1 and x2) you generated from an OG vector and a randomly chosen vector from data. How does this give you the difference in prediction from OG vector and the mean cones sold (which is what you started with)?
Thank you for your explanation but with the SHAP library, one only gives the trained model without the training set. How the sampling from the original dataset can be done with only the trained model ?
The SHAP function explainer expects a data set input called "background data". Is this the data set used to create the "Frankenstein" Vectors explained in the video?
i would say that this vid points out the fact that most of the ML tools are black boxes; but now, people want ' black boxes' to be explained! it's a pb you don't have when you use statistics and/or econometrics as to me it's rather curious to calculate an average value in models that are supposed to be non linear; well in ann there is the sensitivity ( based on the gradient); can be a good start of course, but one have to be cautious
How well do Shapley values align with the composition of various Principal Components? Is there a mathematical relationship between the two, or is it just wholly dependent on the features of the dataset?
Thank you for the great explanation but I have one doubt here, how we get 200 there for temperature ? you said it is the expected difference so say when we run the sample 100 time and each time we get some difference so how that 200 number came out from those 100 difference , did we took average or what math's we applied there? Any response on this would be highly appreciated.
great. How doy think we can apply it for stacking where we can create a stacknet of network of multiples layers with multiple models and for big data problems cuz this approach is based in monte Carlo to "approximate" the shapley values?
Shapley values were also taught in the AI for Medicine specialization online. There, it was intended for use with individual patients as opposed to groups or aggregates of patients. You would use Shapley to make individualized prognoses for patients, like what is the best course of treatment for this specific individual patient. Clearly valuable information, however it was super computationally expensive, requiring all permutations to have a different model trained. Therefore only the simplest of model was used, particularly linear regression. I have not yet watched Ritvikmath's video, and I'm curious how much different his material is from the AI for Medicine courses.
In this video there was only one model trained. Inferencing (predicting) was re-run as many times as needed with different inputs to the same trained model. Very interesting. Much more efficient, but I'm wondering about the correctness and if it's solving a slightly different problem than in the AI for Med course --- not sure.
Thank you for the video, really appreciate it! I have a question about Step3: Is it necessary to 'undo' the permutation after creating the Frankenstein Samples and before feeding them in the model, since the model expects Temp to be in the first position from the training? Thank you very much for clarification
So, if you had x features, say 50, instead of 4, would you randomly subset 15 (half) of them and create x1...x25? And in each of these x1...25, the differences will be that feature 1:i will be conditioned on the random vector whereas feature[i+n] will not be conditioned on the random vector? Trying to visualize what happens when more than 4 features are available.
In Idea-1 slide: Aren't we getting more composite effect instead of isolated effect? As the feature is correlated, the second order interactions with other features is also lost by randomly sampling on this dimension.
Bhai, haven't finished this video but I am sure it's gonna be informative like all of your DS videos that I have watched. Just curious, why have you tattooed Mumbai's coordinates on your arm? :D
I like both the whiteboard and the paper. But I think it's even better to use something like a Powerpoint because it lets you reveal only important information at that moment, hiding future information which can distract you.
Hello. In a linear regression model are SHAP values equivalent to the partial R^2 for a given variable? Don't they take into account the variance as the p-values do?
we stored difference in array or list after step 3 (must be many values). How can SHAP at T=80 can be a single value(200) in your example. Did we take average of that? this E(diff) value how it can be a single value basically?
OK, SHAP is better than PDP but... What are the advantages of SHAP vs LIME (Local Interpretable Model Agnostic Explanation) and ALE (Accumulated Local Effects)?
Hey Ritvikmath, great video as always. I have a doubt, on step 5 the contributions of each of the features adds up to the difference btw the actual and predicted values. Will they always add up perfectly?
I have the same question! I don't think they do. We are randomly creating the Frankenstein samples and taking the difference in their outputs, then doing this many many times and finding the average difference. This gives the Shapley value of just one feature for that sample. Because of the random nature of this process, and because this is done for each feature separately from the other features, I don't think the sum of the Shapley values for each feature necessarily add up to the difference between the expected and the sample output.
Please note that this method approximates the Shapley values, so I'd not expect the efficiency property to hold. If you were to compute exactly the Shapley values, their sum would certainly amount to the difference between the predicted value and the average response. However, the exact computation involves powersets (which increase exponentially w.r.t. the number of features), so we have to settle with approximations.
what is the difference between the work done by Shapley value and the feature selection technique(filter,wrapper and embedded method)? aren't both of them trying to find the best feature?
Hey Ritvikmath, grateful for your content. Wanted to ask you how many data science / machine learning methods someone needs to know to start a career in data science ? I know the more the better lol
Question: Which of the following two questions is the shown algorithm really answering: "How much does Temp=80 contribute to the prediction FOR THIS PARTICULAR EXAMPLE vs mean prediction?" versus "How much does Temp=80 contribute to the prediction FOR ALL REALISTIC EXAMPLES vs mean prediction?" Is there a link to the source reference used by Ritvikmath here? Thanks!
Hello, Ritvik! Thank you for the video! The marker style works great! I'm curious, how to deal with the situation when a feature can have a great importance, but we lack of observations? Following the Ice-cream example, let's add a feature for the time of the day (ToD). And let assume for some reason, that 03:00AM-04:00AM there is a line of airport workers and passengers willing to buy. If we operate the shop at that time, we could sell 5000 cones in one hour regardless other features values. But among our observations there are only working hours (9AM-5PM), and the importance of this feature is quite low. It may sound as an imaginary problem, but in medicine field for rare diseases that's the case.
these are my two cents. You can't use that that are outside of your training data. Mainly because the prediction would not be reliable and as a consequence your explanation won't be reliable either. Let's remember that one of the assumptions of any machine learning model is that the production data must come from the same distribution of our training data. Hence using data for which you have no observations whatsoever would be dangerous. Different is the case in which you have very few data but you still have something. In that case I think you can still be able to solve the problem
Thank you very much for the intuitive and clear explanation! One question is, so is Step1~5 basically the classic Shapley value and is Step6 SHAP (Shapley Additive exPlanation )?
Thanks for the informative video. I just have one issue, I thought Shapley values measure the impact of feature absence. Is this correct? If so, how this was realized here?
Hi, i know it's late for you but I want to give my understanding in case someone else will have the same question. We are realizing this because we are taking different samples. Hence the interested feature will be random hence it won't provide any meaningful information. I'm not 100% sure of this though
I would assume that for a classification problem, the approach remains the same. The only thing that differs for the classification problem, is that you would choose and observe the prediction for a single class value.
Hi, i know it's late for you but I want to give my understanding in case someone else will have the same question. Instead of considering the class as the output we can use the exact same concept by taking the probabilities generated by the last softmax layer (in case of a nn or any probabilistic like model) Or eventually I think we can compute that probability by checking how many times that class has been "outputted"
Excellent explanation! The only question I have is that, sure, in practice you can (and probably should) probably calculate all these through random sampling of feature interactions (random permutations from step 1) because as the number of features increases, you would have a exponentially increasing number of feature interactions to have to be handled, rendering random sampling of features as the only viable method. My question is wouldn't you have to iterate through all possible feature interactions and all data set points for each in order to calculate exact Shapley values? In other words, is the method you proposed just an approximation of the correct values?
i know it's late but this is my understanding of it in case someone else has the same question. Yes, we are getting an approximation of the correct values. But if the sample is large enough and considering that we are taking the expected value, according to the law of big numbers we are pretty confident to get an appropriate estimation of the measure
Hello Ritvik, I love your videos. I was wondering if there is a way to contact you. I had a couple questions about learning data science. Hope to hear from you soon, thank you.
Let me try to put it into my own words. In order to make it easy to understand, I have to simplify it by lying first. So here's a soft lie version: you have a sample with temperature 80, you replace it by a temperature from a random sample. So if the random sample has temperature of 70, then replace 80 by 70. Then you ask a question "If I convert this 70 back to 80, what will be the predicted difference?" If the difference is positive, it means the temperature of 80 is increasing prediction value. If it's negative, it's decreasing the prediction value. And this difference is called the SHAP value. We call a feature with large absolute SHAP value as important. Now let's fix the lie a little bit: instead of only replacing the temperature, we also replace a few other features from the random sample to the original sample. But we still only try to convert back the temperature. Then we average the SHAP value by doing many random sampling to reduce variance. Another thing to do even more is to calculate SHAP value for every sample, then you will have a global SHAP value instead of a local SHAP for a specific sample. So this is pretty much an intense iterative process. And that's it done.
No fancy tools, yet you are so effective!!
You must know that you provide deeper insights that even the standard books do not.
Appreciated!
Thank you so much! This is literally the best video about SHAP on TH-cam!
Great explanation!! Love how you managed to explain the concept so simply! ❤️
Thanks!
by far the best channel ever.. on youtube.. i can watch any video from this channel and feel time well spent 😊😊
Most underrated channel ever
one of the most, but yes
Thank you for the drawing and intuition explanation, which really help me understand Shapley value.
Thank you, it is because of nice teachers like you, I can learn very much everything from utube
So nice of you
I prefer the marker pen style. Here, my complete focus is on the paper in focus and not the surrounding region.
Thanks for the feedback!!
one of the easiest + thorough explanation thank you
Hats off to you. Understood most of the explanability techniques
Glad to hear that
Great video. The whiteboard is the better because of all the non-verbal communication: facial expressions, gestures,...
+1
Awesome video, I don't have a preference on paper or whiteboard just keep the vids coming! First time I learn about Shapley Values, thank you for that
Wonderful explanation! You explained a very difficult concept simply and concisely! Thanks
very crisp explanation. liked it
Nicely explained. Thanks!
Thanks!
This is very clear explanation better than most of the articles that I could find online, thanks! I have one question though: when getting the global shapley value (average across all the instances), why do we sum up the absolute value of the Shapley value of all the instances? Is it how we need to keep the desirable properties of the Shapley value? Is there any meaning of summing up the plain value of the Shapley value (e.g. positive and negative will now cancel off each other)?
Another question is, when you said the expected value of the difference, is it just an arithmetic average of all the difference from all those permutations? I remember seeing something that Shapley value is actually the "weighted" average of the difference, which is related to the ordering of those features. Is the step 1 already taking into this into consideration, such that we only need the arithmetic average to get the final Shapley value for that instance?
Great video, simple and easily comprehensible
Thanks!
This explanation was beautiful 🥲
Thank you for a very well-explained video on Shapley values :D. It helped me.
what a great video! such a simple and effective explanation. Thank you very much for that
This format is great
Thanks!
Hi Rithvik, Really a nice video. Please cover advanced concepts like Fast gradient sign method . Ur way of explaining those concepts would be really helpful for everyone.
13:59 I did not fully understand how the values in the chart give you the contribution of variables to difference b/w given and avg prediction. I think what you were doing all along was take the difference in predictions b/w two vectors (x1 and x2) you generated from an OG vector and a randomly chosen vector from data. How does this give you the difference in prediction from OG vector and the mean cones sold (which is what you started with)?
Thank you for your explanation but with the SHAP library, one only gives the trained model without the training set. How the sampling from the original dataset can be done with only the trained model ?
The SHAP function explainer expects a data set input called "background data". Is this the data set used to create the "Frankenstein" Vectors explained in the video?
i would say that this vid points out the fact that most of the ML tools are black boxes; but now, people want ' black boxes' to be explained! it's a pb you don't have when you use statistics and/or econometrics
as to me it's rather curious to calculate an average value in models that are supposed to be non linear; well in ann there is the sensitivity ( based on the gradient); can be a good start of course, but one have to be cautious
Thanks for your notes!
How well do Shapley values align with the composition of various Principal Components? Is there a mathematical relationship between the two, or is it just wholly dependent on the features of the dataset?
8:16 I don't get Step 2. It seems you're lucky to get H = 8. What if the second sample is [200, 5, 70, 7]?
Why is H=8 a lucky thing? H can be anything. The original H is 4. The new H is 8. Just the fact that it changes is what's important.
@@offchan so the H value for form vectors is from the random sample ??
@@harshavardhanachyuta2055 yes
Beautifully Explained!
Thanks!
Beautiful Explanation.
Thanks!
Thank you for the great explanation but I have one doubt here, how we get 200 there for temperature ? you said it is the expected difference so say when we run the sample 100 time and each time we get some difference so how that 200 number came out from those 100 difference , did we took average or what math's we applied there?
Any response on this would be highly appreciated.
great. How doy think we can apply it for stacking where we can create a stacknet of network of multiples layers with multiple models and for big data problems cuz this approach is based in monte Carlo to "approximate" the shapley values?
Shapley values were also taught in the AI for Medicine specialization online. There, it was intended for use with individual patients as opposed to groups or aggregates of patients. You would use Shapley to make individualized prognoses for patients, like what is the best course of treatment for this specific individual patient. Clearly valuable information, however it was super computationally expensive, requiring all permutations to have a different model trained. Therefore only the simplest of model was used, particularly linear regression. I have not yet watched Ritvikmath's video, and I'm curious how much different his material is from the AI for Medicine courses.
In this video there was only one model trained. Inferencing (predicting) was re-run as many times as needed with different inputs to the same trained model. Very interesting. Much more efficient, but I'm wondering about the correctness and if it's solving a slightly different problem than in the AI for Med course --- not sure.
What does it mean in your example that SHAP is a "local" explanation?
Definitely the marker pen style!
Thank you for the video, really appreciate it!
I have a question about Step3:
Is it necessary to 'undo' the permutation after creating the Frankenstein Samples and before feeding them in the model, since the model expects Temp to be in the first position from the training?
Thank you very much for clarification
So, if you had x features, say 50, instead of 4, would you randomly subset 15 (half) of them and create x1...x25? And in each of these x1...25, the differences will be that feature 1:i will be conditioned on the random vector whereas feature[i+n] will not be conditioned on the random vector? Trying to visualize what happens when more than 4 features are available.
Funny, I was part of a project which dealt with this exact thing!
Cool!
In Idea-1 slide: Aren't we getting more composite effect instead of isolated effect? As the feature is correlated, the second order interactions with other features is also lost by randomly sampling on this dimension.
Excellent explaination, thanks.
nice explanation! loved it!
Bhai, haven't finished this video but I am sure it's gonna be informative like all of your DS videos that I have watched. Just curious, why have you tattooed Mumbai's coordinates on your arm? :D
I like both the whiteboard and the paper. But I think it's even better to use something like a Powerpoint because it lets you reveal only important information at that moment, hiding future information which can distract you.
If I am not mistaken isn't this how you calculate SHAP values, not Shapley values?
Hello.
In a linear regression model are SHAP values equivalent to the partial R^2 for a given variable?
Don't they take into account the variance as the p-values do?
we stored difference in array or list after step 3 (must be many values). How can SHAP at T=80 can be a single value(200) in your example. Did we take average of that? this E(diff) value how it can be a single value basically?
I haven't understood how you decide what variables to keep fixed and what to change.
Imagine you get the permutation [F,T,D,H] or [F,H,D,T]
OK, SHAP is better than PDP but...
What are the advantages of SHAP vs LIME (Local Interpretable Model Agnostic Explanation) and ALE (Accumulated Local Effects)?
Great explanation! Thanks a lot
Hey Ritvikmath, great video as always. I have a doubt, on step 5 the contributions of each of the features adds up to the difference btw the actual and predicted values. Will they always add up perfectly?
I have the same question!
I don't think they do. We are randomly creating the Frankenstein samples and taking the difference in their outputs, then doing this many many times and finding the average difference. This gives the Shapley value of just one feature for that sample. Because of the random nature of this process, and because this is done for each feature separately from the other features, I don't think the sum of the Shapley values for each feature necessarily add up to the difference between the expected and the sample output.
Please note that this method approximates the Shapley values, so I'd not expect the efficiency property to hold. If you were to compute exactly the Shapley values, their sum would certainly amount to the difference between the predicted value and the average response. However, the exact computation involves powersets (which increase exponentially w.r.t. the number of features), so we have to settle with approximations.
what is the difference between the work done by Shapley value and the feature selection technique(filter,wrapper and embedded method)? aren't both of them trying to find the best feature?
Hey Ritvikmath, grateful for your content. Wanted to ask you how many data science / machine learning methods someone needs to know to start a career in data science ? I know the more the better lol
hello, I guess it should be combination instead of permutation according to the coalitional game theory where SHAP method originates
Question: Which of the following two questions is the shown algorithm really answering: "How much does Temp=80 contribute to the prediction FOR THIS PARTICULAR EXAMPLE vs mean prediction?" versus "How much does Temp=80 contribute to the prediction FOR ALL REALISTIC EXAMPLES vs mean prediction?" Is there a link to the source reference used by Ritvikmath here? Thanks!
Can you put out similar content for other interpretable techniques like PDP, ICE etc.
Good suggestion! As a start, you can check out my PDP video linked in the description of this video!
Very interesting video, Ritvik. Also very curious about your tattoo.
Hello, Ritvik! Thank you for the video! The marker style works great! I'm curious, how to deal with the situation when a feature can have a great importance, but we lack of observations? Following the Ice-cream example, let's add a feature for the time of the day (ToD). And let assume for some reason, that 03:00AM-04:00AM there is a line of airport workers and passengers willing to buy. If we operate the shop at that time, we could sell 5000 cones in one hour regardless other features values. But among our observations there are only working hours (9AM-5PM), and the importance of this feature is quite low.
It may sound as an imaginary problem, but in medicine field for rare diseases that's the case.
these are my two cents.
You can't use that that are outside of your training data. Mainly because the prediction would not be reliable and as a consequence your explanation won't be reliable either.
Let's remember that one of the assumptions of any machine learning model is that the production data must come from the same distribution of our training data. Hence using data for which you have no observations whatsoever would be dangerous.
Different is the case in which you have very few data but you still have something. In that case I think you can still be able to solve the problem
@@justfacts4523 Thank you very much! Your content is the best!
I didn't get the part of how he got the 2000, c^
Great video as always !
Would prefer more pen style
Difference between adj r2 and shapley?
simply aswesome!
Could you create a video on Gain and Lift Charts. That would be really helpful.
Thank you very much for the intuitive and clear explanation! One question is, so is Step1~5 basically the classic Shapley value and is Step6 SHAP (Shapley Additive exPlanation )?
Thanks for the informative video.
I just have one issue, I thought Shapley values measure the impact of feature absence. Is this correct? If so, how this was realized here?
Hi, i know it's late for you but I want to give my understanding in case someone else will have the same question.
We are realizing this because we are taking different samples. Hence the interested feature will be random hence it won't provide any meaningful information.
I'm not 100% sure of this though
@@justfacts4523 thanks for your reply
I enjoy both!
Hey Ritvik! May I know how to do this process for a multi-class classification problem? You have taken a regression problem as an example.
I would assume that for a classification problem, the approach remains the same. The only thing that differs for the classification problem, is that you would choose and observe the prediction for a single class value.
Can anyone recommend any good books for Data science in general and for such concepts and beyond? Thanks in advance!
Hi, great explanation. Can you please explain me how does Shapley values are calculated for classification problems?
Hi, i know it's late for you but I want to give my understanding in case someone else will have the same question.
Instead of considering the class as the output we can use the exact same concept by taking the probabilities generated by the last softmax layer (in case of a nn or any probabilistic like model)
Or eventually I think we can compute that probability by checking how many times that class has been "outputted"
I still do not understand how this would be applied to text
Yes, this is the best !
amazing video
Pen and paper is better. It would be awesome if you can share the notes. Thank you.
Thanks for the feedback!
Excellent explanation! The only question I have is that, sure, in practice you can (and probably should) probably calculate all these through random sampling of feature interactions (random permutations from step 1) because as the number of features increases, you would have a exponentially increasing number of feature interactions to have to be handled, rendering random sampling of features as the only viable method. My question is wouldn't you have to iterate through all possible feature interactions and all data set points for each in order to calculate exact Shapley values? In other words, is the method you proposed just an approximation of the correct values?
i know it's late but this is my understanding of it in case someone else has the same question.
Yes, we are getting an approximation of the correct values. But if the sample is large enough and considering that we are taking the expected value, according to the law of big numbers we are pretty confident to get an appropriate estimation of the measure
Hello Ritvik, I love your videos. I was wondering if there is a way to contact you. I had a couple questions about learning data science. Hope to hear from you soon, thank you.
Let me try to put it into my own words. In order to make it easy to understand, I have to simplify it by lying first. So here's a soft lie version: you have a sample with temperature 80, you replace it by a temperature from a random sample. So if the random sample has temperature of 70, then replace 80 by 70. Then you ask a question "If I convert this 70 back to 80, what will be the predicted difference?" If the difference is positive, it means the temperature of 80 is increasing prediction value. If it's negative, it's decreasing the prediction value. And this difference is called the SHAP value. We call a feature with large absolute SHAP value as important.
Now let's fix the lie a little bit: instead of only replacing the temperature, we also replace a few other features from the random sample to the original sample. But we still only try to convert back the temperature. Then we average the SHAP value by doing many random sampling to reduce variance.
Another thing to do even more is to calculate SHAP value for every sample, then you will have a global SHAP value instead of a local SHAP for a specific sample.
So this is pretty much an intense iterative process.
And that's it done.
NICE!
Thanks!!
I prefer the marker pen style too!
Thanks for the feedback! Much appreciated
liked and subscribed
cool man!
Great video but I do prefer the whiteboard style
Monginis Cake Shop?
The white board is the best
Not Fahrenheit 😁
Pen and paper is better.
Appreciate the feedback!
Whiteboard better :)
Noted!
for noobs like me trying to learn about new things, your handwriting makes me miss lots of things, Im not getting anything .
Whiteboard
Marker- Pen style does help with focus. But the tattoo on your hand doesn't. :P
I aborted the video mid-way and went on a google map hunt. :/
You made this so simple to understand, that I will get to Python and do this ASAP!! Thank you @ritvikmath