finally, someone decided to record the best videos on Loss functions ... thanks!
Glad it was helpful!
I really liked the fact that you explained this idea using the student example throughout the video. This really helped to "keep it real". Great work!
I really appreciate how you throw in 'sanity checks' in your videos. An important skill to learn and use!
Thank you so much! Soon I have an exam in text mining and coming from humanitarian studies I struggled a bit... You explained it so clearly... Thank you for your work, really appreciated!
Beautifully illustrated. I like how you rephrase or represent an observation to uncover the how and why of things.
Cheers,
b
I am learning AI and this video is the best explanation of Sigmoid function and what we need it for. Thanks
He explains every topic in a simple way with great intuition, and keeps things as simple as possible so learners don't get bored.
You are a gem!! Never had anyone clear up my fundamentals so deeply before.
Thanks a lot for explaining why it's needed in the first place, it helps with the intuition a lot. Loved this!
This video is wonderful. it really helps me understand why people design this function in the first place! Thanks!
HOLY SHIT I'M MINDBLOWN BY THIS VIDEO
Thank you for such a clear explanation of the sigmoid function in terms of information!
Beautifully put together, well done! And thank you so much for your efforts!
Thanks a lot . It was the best explanation that I have heard ever
You're most welcome
Cannot thank you more for your splendid lucid video on sigmoid
This is one of the best videos for understanding the sigmoid function. Thanks a lot Ritvik
Woww .I think you are one of the best teachers I've ever seen. 👏
Thank you! 😃
Excellent explanation, thank you Ritvik! Greetings from Argentina
Glad you enjoyed it!
Simple & intuitive explanation. Cheers!
Really neat and basic way of understanding the purpose
Fantastically explained brother ! really amazed by your teaching method !
Excellent work, can you make a video about the RELU and what makes it so effective?
Great in-depth tutorial on Sigmoid function!
The absolute best. Huge, huge fan of the channel
Thank you so much for very clear explanation.!
You are welcome!
Great explanation! Simplicity is the best policy.
great work, your explanations are better than the explanations of my lecturers at UCL
Probably the best explanation of this function I've seen. I have a question though: when you use the chain rule, AFAIK if you set, say, y = 1+e^-s, then dy/ds = -e^-s. So this would be the numerator and the denominator would be (1+e^-s)^2. So I don't see where that extra minus sign comes from.
Conceptually, the mistake is that you have a composite of two functions where the outer one is a reciprocal. When you differentiate 1/u, the power in the denominator goes up (u becomes u^2) and you pick up a minus sign.
The chain rule is d/dx f(g(x)) = f'(g(x))*g'(x). Here f(u) = 1/u with u = g(x) = 1+exp(-x).
Using the chain rule, f'(u) = -1/u^2 and g'(x) = -exp(-x).
Putting them together, the two minus signs cancel and we get -1/(1+exp(-x))^2 * (-exp(-x)) = exp(-x)/(1+exp(-x))^2
I prefer to use the product rule rather than the chain rule myself for simple calculations like this.
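If anyone wants to sanity-check the sign, here is a minimal Python sketch. It is only an illustration: the function names are made up for this example, and the finite-difference step h is an arbitrary choice. It compares the chain-rule result against a numerical derivative and against p(1-p).

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Chain-rule result from the thread: exp(-x) / (1 + exp(-x))^2."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.5):
        p = sigmoid(x)
        # All three numbers on a row should agree closely, and all are positive.
        print(x, sigmoid_derivative(x), p * (1.0 - p), numeric_derivative(sigmoid, x))
```

The derivative comes out positive everywhere, which matches the curve always rising.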
4:19 why is it a huge change in relative terms?
Because the derivative of p(x) is conveniently p(1-p) = p-p².
If we want to maximize the slope then maximize this function. Take its derivative with respect to p and find where that is 0 (the second derivative of the original with respect to x is this times p(1-p), so it is 0 at the same place)
1 - 2p = 0
1 = 2p
p = 1/2
You can check it numerically too. Try p = 1/2 then tweak it a little above and below.
(0.45)(0.55) = 0.2475
(0.50)(0.50) = 0.25
(0.51)(0.49)= 0.2499
So p = 1/2 is where the probability of the target outcome we're modeling increases most sharply. Where is that exactly?
p(x) = 1/2 when x = 0
You can run through the algebra but just verify:
p(0) = 1/(1 + e^-0)
= 1/(1 + 1) = 1/2
If you didn't have plain x but instead a linear function as in the case of logistic regression
f(x) = B0 + B1·X1
B1 will make the curve rise more steeply or slowly. That answers the question: does increasing the student's score make much difference? The intercept B0 represents a horizontal shift in the curve. Together they determine the point where p(f(x)) = 1/2
f(x) = 0 = B0 + B1·x
x = -B0/B1
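Here is a small Python sketch of the above, in case it helps. The coefficient values B0 = -3 and B1 = 0.5 are made up for illustration and the function names are hypothetical. It checks that the probability is 1/2 at x = -B0/B1 and that the slope there is larger than at neighbouring points.

```python
import math

def p(x, b0, b1):
    """Logistic probability for a linear score f(x) = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def slope(x, b0, b1):
    """dp/dx = b1 * p * (1 - p), by the chain rule through f(x)."""
    prob = p(x, b0, b1)
    return b1 * prob * (1.0 - prob)

def midpoint(b0, b1):
    """The x where f(x) = 0, i.e. where p = 1/2."""
    return -b0 / b1

if __name__ == "__main__":
    b0, b1 = -3.0, 0.5  # assumed values, not from the video
    x_mid = midpoint(b0, b1)
    print("midpoint:", x_mid, "p at midpoint:", p(x_mid, b0, b1))
    for x in (x_mid - 1.0, x_mid, x_mid + 1.0):
        print(x, slope(x, b0, b1))
```

Note the slope at the midpoint is B1/4, so a larger B1 gives a steeper rise, as described.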
Great. Thank you for this explanation. Super.
Glad it was helpful!
Beautiful explanation!!
thank you for the excellent content, I am glad I subscribed. do you cover other subjects? I feel like I need some refresher lessons!
thats fantastic explanation, thanks a lot
will start watching your videos
thanks a lot! keep making this kind of video, you make it so easy to understand!
Thank you, I really understand it easily now, you are awesome, please continue
very smart way of explaining!
Another great video, thanks for your efforts
Many thanks!
Bravo!!.... This gives very interesting intuition.
This isn't just any sigmoid function. It's the logistic function and crux of logistic regression. If you let Y = 1 when your target outcome happens and 0 otherwise, we can model its probability as a function of x (what you called s on the board) thanks to it being bounded between 0 and 1.
Pr(Y = 1 | X) = e^f(x) /(1 + e^f(x))
= 1 / (1 + e^-f(x))
This is also known as the expit function.
The inner f(x) is a linear function such as
B0 + B1·X1
The inverse of expit is logit and applying that will linearize for easier modeling.
logit P = ln( P/(1-P) )
= ln( Pr(Y=1)/Pr(Y=0) )
This is log odds of the target outcome.
logit Pr(Y=1) = f(x) = B0 + B1·X1
Now we can do some more familiar linear modeling.
B1 controls how sharply or slowly the probability curve rises. It's a log odds ratio and additive on the log-odds scale. On the original odds scale it's a multiplicative effect given by e^B1
B0 is a location parameter that will shift the whole thing left or right and helps indicate where that steepest rise occurs.
Since the slope of P (with respect to f) is P·(1-P)
= P - P²
it is maximized when P = 1/2
d/dP (P - P²) = 1 - 2P = 0
Where that occurs on the x-axis is when f(x) = 0
P(0) = 1/(1 + e^-0)
= 1/(1 + 1) = 1/2
f(X) = B0 + B1·X1 = 0
X1 = -B0/B1
The simple one variable example on the board is equivalent to setting B0 = 0 and B1 = 1
-0/1 = 0 so we see P = 1/2 right on x = 0.
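For anyone who wants to play with this, a minimal Python sketch follows. The names expit, logit and odds are just local helper functions defined here, and the coefficients B0 = -1 and B1 = 0.7 are made-up values. It checks that logit undoes expit and that a one-unit increase in X1 multiplies the odds by e^B1.

```python
import math

def expit(z):
    """Logistic function: maps a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(prob):
    """Inverse of expit: maps a probability to log odds."""
    return math.log(prob / (1.0 - prob))

def odds(prob):
    """Odds of the target outcome, Pr(Y=1) / Pr(Y=0)."""
    return prob / (1.0 - prob)

if __name__ == "__main__":
    b0, b1 = -1.0, 0.7  # assumed coefficients for illustration
    x1 = 2.0
    p_before = expit(b0 + b1 * x1)
    p_after = expit(b0 + b1 * (x1 + 1.0))
    print("logit(expit(1.3)) =", logit(expit(1.3)))
    print("odds ratio for +1 in X1:", odds(p_after) / odds(p_before))
    print("e^B1:", math.exp(b1))
```

The probabilities themselves do not change by a constant factor; only the odds do, which is why the modeling happens on the log-odds scale.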
Awesome, you explained it in so simple way. Thank you so much.
Love this channel! Keep it up man
Thanks for making it so simple!
Amazing explanation! May your tribe increase
Fantastic explanation, thank you so much for this video!
I love this guy😭
You're the best!
agree!
He's the best
Thanks :)
Great explanation, thanks! I understand the Sigmoid function limits unbounded numbers to be between 0 and 1 and, at the same time, states the diminishing value of marginal increases. But, why use the Euler’s number e? Would any other number do?
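Any other base greater than 1 would give the same basic shape, and it would even keep the symmetry, since 1/(1 + b^-x) is just the e version with x rescaled by ln(b). What you lose is the convenient derivative form: with e the slope is p(1-p), while with base b it picks up an extra factor of ln(b).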
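One way to check the effect of the base numerically is the short Python sketch below. The function name sigmoid_base is made up for this example and base 2 is an arbitrary choice. It compares a base-2 curve with the ordinary sigmoid evaluated at x·ln(2), and checks the symmetry f(-x) = 1 - f(x).

```python
import math

def sigmoid(x):
    """Ordinary logistic sigmoid with base e."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_base(x, base):
    """Same construction with an arbitrary base > 1 in place of e."""
    return 1.0 / (1.0 + base ** (-x))

if __name__ == "__main__":
    base = 2.0  # arbitrary choice for illustration
    for x in (-3.0, -0.5, 0.0, 1.7):
        rescaled = sigmoid(x * math.log(base))
        symmetric_sum = sigmoid_base(x, base) + sigmoid_base(-x, base)
        # The first two columns after x should match; the last should be 1.
        print(x, sigmoid_base(x, base), rescaled, symmetric_sum)
```

So changing the base only stretches or squeezes the x-axis, which in a regression would be absorbed into the coefficients anyway.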
This is perfect. Thank you.
Fantastic explanation
this man is really awesome , hope he would be my college professor
Hey, thanks for the intuitive explanation. I had a doubt: the cumulative distribution function of the normal distribution is sigmoidal, so would that be a reasonable explanation for why the sigmoid makes sense when it comes to modeling the probability of naturally occurring phenomena?
@Ritvik Is there any video you uploaded answering questions in the comments?
@@anishbabus576 It sucks but he has a very demanding day job (Academia), so gotta be happy with what we get.
Well explained! Great!
Thank you so much for that video!! Question: are there any other functions like the sigmoid function that are used in data science and ML, and how are they used? Thanks
Sigmoid is an activation function. ML uses about a dozen: en.wikipedia.org/wiki/Activation_function
Because neurons are like scales, a squashing function balances out the extreme lows and highs so outliers don't mess up your predictions
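To make the "squashing" point concrete, here is a minimal Python sketch with three common activation functions. The sample scores are made up. Note that ReLU is included for contrast: it clips negatives to 0 but does not squash large positive values.

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Zero for negatives, identity for positives (unbounded above)."""
    return max(0.0, x)

if __name__ == "__main__":
    scores = [-100.0, -1.0, 0.0, 1.0, 100.0]  # made-up inputs with extremes
    for s in scores:
        print(s, sigmoid(s), tanh(s), relu(s))
```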
Thank you for giving your insight!
honest work, good job!
Thank you 😊
Thanks. It's well explained.
Just amazingly explained! Thanks!
You are welcome!
why is it a huge change from 0 to 1 or from 0 to -1 in relative terms?
I do not understand why this is the case, could you please explain that?
Same, I don't really understand this. I guess you could say it's because at 0 we are completely unsure whether the student is going to drop out, but at -1 and 1 we have some idea. But I find that to be a bit of a weak argument.
any book name can you suggest where this sigmoid and softmax is explained?
Great explanation.. you've got a subscriber
Welcome aboard!
'a student got a score of 9, that's why I have very very high evidence that he is gonna drop out' -- so is "a student drops out" the alternative hypothesis, or is it the null hypothesis?
I am glad I have subscribed to your channel. Very informative.
Welcome aboard!
VEEERY NICE keep them coming 🥰
Why specifically use the sigmoid function in logistic regression when there are other probabilistic functions available?
Superb explanation
Very good video. Thanks
Lovely concept 👌
Thank you!!
thank you, a great video again!
amazing video
Just wow, great one, thanks man
Thanks!
thank you !!
You are amazing ❤
I don't understand the rate of change part... If the score changes from 0 to 1, why should the probability increase more than when it changes from 9 to 10? 0 to 1 just makes it a tiny bit more probable that a student will not drop out, but it's still very unsure.
I am glad I am here.
thanx, now i understand!
thanks for this content
Brilliant
thanks!
You earned a subscriber
This is better than university
Thanks man
Thanks Ritvik
It's not entirely clear to me why a jump in s from 0 to 1 is considered more significant than a jump from 9 to 10. Doesn't that depend on many assumptions? For example, a student being absent on a single day might not actually be significant (could be a valid reason), but might tip s from 0 to 1. In general, why is the region around s=0 considered/needed to be more "sensitive"?
He does emphasise upon 0 to 1 being a big relative change as opposed to 9 to 10. What is the relative change when going from 0 to 1 ? It is (1 - 0) / 0 = +ve infinity. While the relative change going from 9 to 10 is (10 - 9) / 9 = 1/9
@@pushkarparanjpe True, but isn't that more the case when the numbers are actually quantifiable? Like in this case the scores don't really mean that much. If the model had happened to output scores between 0 and 20, then our middle point would be 10 and the relative jump already becomes quite a bit less. Idk, it just seems kind of not very scientific to me to give that as the reason to use sigmoid
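For what it's worth, the difference between the two jumps can be looked at in terms of the output probability rather than the raw score. Below is a minimal Python sketch; the helper name change and the center parameter are made up for this example, and centering at 10 is just the hypothetical 0-to-20 scale mentioned above. It shows the sensitive region is wherever the centered score is near 0, not the literal number 0.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def change(s_from, s_to, center=0.0):
    """Change in predicted probability when the score moves from s_from to s_to.

    `center` is the score at which the model is maximally unsure (p = 1/2).
    """
    return sigmoid(s_to - center) - sigmoid(s_from - center)

if __name__ == "__main__":
    print("0 -> 1, centered at 0:   ", change(0.0, 1.0))
    print("9 -> 10, centered at 0:  ", change(9.0, 10.0))
    # Hypothetical 0-to-20 scale whose midpoint is 10:
    print("10 -> 11, centered at 10:", change(10.0, 11.0, center=10.0))
    print("0 -> 1, centered at 10:  ", change(0.0, 1.0, center=10.0))
```

On the shifted scale the big jump happens around 10 instead of around 0, which in logistic regression is what the intercept takes care of.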
You look indeed very interesting in this video.
The shape of S in Sigmoid. I see what you did there
Great
Nice
bariya
based
God 🙏🙏
sigmoid make waifu?