Thank you for making us love math even more.
Glad you enjoy it!
Thank you for bringing this intuitive video, I just had this thought yesterday.
Please keep uploading videos like this; it makes my intuition stronger and brings me closer to statistics.
More to come!
Holy shit! I wish I could have watched this video 6 years ago when I just got into machine learning. You did a great job! Thank you so much!
Thanks so much! Glad it was helpful!
Thanks! I really appreciate these bits of useful, subtle and insightful ideas about common objects in data science.
Glad to hear it!
Makes sense. dy/dx = y(1 - y) if k = e. Great video!
Thanks!
Love this nonchalant explanation :)
Huh, so I guess this is like a tradeoff of annoyances where using e upfront is just less annoying than discovering ln(k) much later.
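To make the ln(k) point concrete, here is a minimal sketch (not from the video) that compares a numerical slope of the base-k sigmoid 1/(1 + k^-x) with the closed form ln(k) * y * (1 - y). The function names are made up for illustration. For k = e the factor ln(k) is 1, which leaves the tidy dy/dx = y(1 - y).

```python
import math


def sigmoid_base(x, k=math.e):
    """Sigmoid with base k: 1 / (1 + k**(-x)). k = e is the standard sigmoid."""
    return 1.0 / (1.0 + k ** (-x))


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def check_derivative(x, k):
    """Return (numeric slope, ln(k) * y * (1 - y)) for the base-k sigmoid at x."""
    y = sigmoid_base(x, k)
    numeric = numeric_derivative(lambda t: sigmoid_base(t, k), x)
    closed_form = math.log(k) * y * (1.0 - y)
    return numeric, closed_form


if __name__ == "__main__":
    for k in (2.0, math.e, 3.0):
        numeric, closed_form = check_derivative(0.7, k)
        print(f"k={k:.4f}  numeric={numeric:.6f}  ln(k)*y*(1-y)={closed_form:.6f}")
```

The same algebra shows that the base-k sigmoid at x equals the standard sigmoid at x * ln(k), so switching the base only rescales the input.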
Nice explanation. Clarifies everything
Glad it was helpful!
Thank you for this explanation!
The best in the game for this kind of content
Thanks!
always high quality content
Are operations with 'e' more expensive than with 2 or 3?
Really good explanation. Keep it up :)
Thanks, will do!
Actually, there is a better line of reasoning, though I am still not sure about it... The sigmoid is derived from a linear regression on the log odds of the two classes. So mx + c = ln(p/(1-p)), which gives p = 1/(1 + e^-(mx+c)).
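A quick way to check the algebra in the comment above is to go from the line mx + c to p with the sigmoid, then back with the log odds. This is only a sketch: the function names and the values of m, c and x are made up for illustration.

```python
import math


def log_odds(p):
    """Log odds ln(p / (1 - p)) for a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Standard sigmoid 1 / (1 + e**(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_from_line(x, m, c):
    """Solve m*x + c = ln(p / (1 - p)) for p, i.e. p = sigmoid(m*x + c)."""
    return sigmoid(m * x + c)


if __name__ == "__main__":
    m, c, x = 0.5, -1.0, 3.0  # hypothetical slope, intercept and input
    p = probability_from_line(x, m, c)
    # The log odds of p should recover the linear value m*x + c.
    print(p, log_odds(p), m * x + c)
```

Taking the log odds of the resulting p recovers mx + c, so the two functions are inverses of each other.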
Thanks for the explanation!
Of course!
This is a nice explanation, however one question is left open for me: We interpret the result of the sigmoid as a probability. So sigmoid(x) gives the probability of something being classified as some category. Let's assume the standard sigmoid(x) results in a value of 0.7. When I change the sigmoid to use some other number k instead of e, this probability would change. Let's say it would now be 0.9 instead of 0.7. This appears to me as semantically completely different from 0.7. So I would conclude that, with respect to the interpretation as a probability, it is not arbitrary whether we choose e or some other number k.
When we use the sigmoid function we are doing so because we can map from the real number space to the [0,1] space. So, in practice, this means that regression values can be mapped to probabilities. So, like you say, you might have some regression (x) value that maps to 0.7. But what you are generally interested in is not the 0.7 itself but rather the value of the sigmoid for the given data point relative to other data points. A concrete example might help to clarify:
Say we have a bunch of predictors (for a linear regression, say), e.g. weather data such as temperature and pressure at some given location. And we want to combine these somehow via a linear relationship y = b0 + b1 * temp + b2 * pressure. We now want to use y (a real value) and map it to a probability of rain. So we use the sigmoid. And so we might get 0.7, like you say, for a given observation of temperature and pressure. Does that mean that we have a model which predicts a 70% chance of rain? Not necessarily - and probably not even close. In practice, you will use the 70% relative to the values of the other observations. You might use a threshold value of 50% and say that all values above 50% should be classified as "expecting rain" and all values below as "not expecting rain". But then you might find that the 50% threshold for classification does not really hold up when you apply your model to historical data with known outcomes. However, if you tune the threshold (and explore other possible values, e.g. 20%, 21%, ..., 69%, 70%), you might find that a threshold of 30% yields very high accuracy (even against data which you set aside and didn't use to train your model).
So, in other words, in practice, you rarely take the sigmoid function as a literal mapping from the real line to the probability line. You just allow it to perform a mapping to the probability line because here it helps to define, in some sense, a classification rule. And when you have this classification rule, you can fine-tune the threshold to optimise your model. A long answer, I know, but I figured I would share this since I had wondered the same thing for quite a long time - until I saw how things worked in practice.
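The threshold sweep described above could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the made-up sigmoid outputs, and the made-up rain labels. It also scores on a single data set, whereas the comment suggests confirming the chosen threshold on data set aside from training.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of observations where (score >= threshold) matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


if __name__ == "__main__":
    # Made-up sigmoid outputs and outcomes (1 = rain), purely illustrative.
    scores = [0.10, 0.22, 0.28, 0.33, 0.41, 0.47, 0.55, 0.72]
    labels = [0, 0, 0, 1, 1, 1, 1, 1]
    candidates = [t / 100 for t in range(20, 71)]  # 20%, 21%, ..., 70%

    print("accuracy at 50%:", accuracy_at(0.5, scores, labels))
    t = best_threshold(scores, labels, candidates)
    print("best threshold:", t, "accuracy:", accuracy_at(t, scores, labels))
```

On these invented numbers a threshold near 30% separates the two groups better than 50% does, which is the kind of effect the comment describes.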
@MalTimeTV Thanks a lot for your answer.
thanks bro, interesting video!
Glad you liked it!
Great info. Thank you.
Glad it was helpful!
Very similar to the logit function.
It looks like the logistic function
amazing
very nice 😎
Thanks!
❤
❤️
nice
i like my curves like that
sorry, you didn't explain anything.