This is a brilliant tutorial on GLM in R with a very good breakdown of all the information in step by step fashion that is understandable for a beginner
Very well explained!!! However, in my opinion, using the coefficients in the summary is by far much easier to understand than the tab_model approach.
Hi Ammar, sorry I missed this comment, but I would like to make a case for odds ratios ;). The benefit of log odds ratios is, I think, only that the sign corresponds to the direction of the effect. But the values themselves are very hard to interpret. With odds ratios you can say things like "for a unit increase in x, the odds of y increase by a factor of 2 (aka twice the odds)". Is there a benefit of using log odds ratios that I'm overlooking?
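To put rough numbers on that last point (a minimal sketch, not from the video):
exp(0.693) ## ~2: a log odds ratio of about 0.69 means the odds of y double per unit increase in x
log(2) ## ~0.693: the same effect expressed on the log odds scale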
Thank you for the tutorial. Is it possible to create a glm model with a dependent variable that has three modalities?
If I understand you correctly, I think it's indeed possible to model a dependent variable with a tri-modal distribution with glm. Actually, you might not even need glm for that. Whether a distribution is multimodal is a separate matter from the distribution family. A tri-modal distribution might be a mixture of three normal distributions, three binomial distributions, etc. Take the following simulation as an example. Here we create a y variable that is affected by a continuous variable x and by a factor with three groups. Since the group has a strong effect on y, this results in y being tri-modal.
## simulate 3-modal data
n = 1000
x = rnorm(n)
group = sample(1:3, n, replace=TRUE)
group_means = c(5,10,15)
y = group_means[group] + x*0.4 + rnorm(n)
hist(y, breaks=50)
m1 = lm(y ~ x)
m2 = lm(y ~ as.factor(group) + x)
summary(m1) ## bad estimate of x (should be around 0.4)
plot(m1, 2) ## error is non-normal
summary(m2) ## good estimate after controlling for group
plot(m2, 2) ## error is normal after including group
Thank you so much for the tutorial.
Thank you for these videos!
Hi, why is RStudio producing different results even though I am using the same call and data?
Hi! Do you mean vastly different results, or very small differences? I do think some of the multilevel models could potentially differ due to random processes in model convergence, but if so, the differences should be really minor.
Do you need to install any packages to run the glm code?
The glm function is in the stats package, which ships with the base R installation, so you don't necessarily need other packages. But in the tutorial I do use some packages for convenience, such as the sjPlot package for making a regression table. If you run this without sjPlot the results are the same, but you'll need to do some calculations yourself. For instance, logistic regression gives log odds ratio coefficients, so you'd need to take the exponent (the exp function) to get the odds ratios (a quick sketch follows below). TL;DR: you don't need to install extra packages, but they do make life easier.
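For instance, a minimal base-R sketch (using the built-in mtcars data as a stand-in, not the tutorial's data):
m = glm(am ~ mpg, data = mtcars, family = binomial) ## logistic regression, no extra packages needed
summary(m) ## coefficients are log odds ratios
exp(coef(m)) ## exponentiate to get odds ratios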
THANK. YOU.
Thank you for this helpful video
Hi Kasper, what/how much does the intercept tell us in this case?
Good question! It's similar to ordinary regression, in that it just means: the expected value of y if x (or all x's in a multiple regression) is zero. This is mainly interpretable if there is a clear interpretation of what x = 0 means. For instance, say your model is: having_fun = intercept + b*beers_drank. In that case, the intercept is the expected fun you have if you haven't drunk any beers.
Now say we have a binomial model. Our dependent variable is binary, namely whether or not a person had a hangover the day after a party. This time, the effect is more like (but not exactly; I'm ignoring the link function): hangover = intercept * b^beers_drank. Notice the ^ in b^beers_drank. That's the multiplicative part: we expect the odds of having a hangover to increase by a factor of b for every unit increase in beers. But what's most relevant for us now is that anything raised to the power of zero is 1! So b^0 (zero beers) is 1. So here as well, when x is zero, the intercept is just our expected value.
If we've transformed our coefficients to odds ratios, then if we haven't had any beers, the intercept represents the odds that someone had a hangover. So if the intercept is 2, it means that the odds that someone who didn't drink any beers has a hangover are 2-to-1, i.e. a probability of about 0.67 (odds of 2-to-1 means 2 people out of 3). That sounds weird, but they probably had whisky instead.
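To make that arithmetic concrete, a quick sketch with a hypothetical intercept of 2 on the odds ratio scale:
intercept_odds = 2
intercept_odds / (1 + intercept_odds) ## probability = odds / (1 + odds) = 0.667
plogis(log(2)) ## same result: inverse logit of the intercept on the log odds scale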
I don't know how much that helped. The key takeaway is that like with ordinary regression, it's mainly interpretable if you have a clear idea of what x=0 means.
Hi Kasper, thank you for the wonderful video. I have a question about the R2 and adjusted R2 of GLM models in R. How can we get R2 and adjusted R2 in the R console? When I run summary(), I cannot find these values. Is there a specific function to get them?
Hi, great question! The thing is, there actually isn't an R2 or adjusted R2 for GLM. Instead, to evaluate model fit, it is more common to compare models (in the second link in the description, see logistic regression -> interpreting model fit and pseudo R2). There ARE, however, also some 'pseudo R2' measures, such as the Tjur R2 seen in the video. These measures try to imitate the property of R2 as a measure of explained variance. You'll never get these scores in the basic glm output, though, because there are many possible pseudo R2 measures. But there are packages that implement them. For instance, the 'performance' package has an r2() function which calculates a (pseudo) R2 for different types of models (see the sketch below).
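A minimal sketch of that approach (again using mtcars as stand-in data):
## install.packages("performance")
library(performance)
m = glm(am ~ mpg, data = mtcars, family = binomial)
r2(m) ## returns a pseudo R2 (Tjur's R2 for logistic regression)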
I'd also recommend reading about the model comparison approach though (if you don't know about it already), because journals often like to see this rather than or in addition to some pseudo R2.
@@kasperwelbers Thank you so much for the quick reply! It was really helpful and easy to understand :)
One more question! I will be conducting a GLM in my master's thesis. Which one would you recommend?
1. Report the AIC value (and write something like "this model had the smallest AIC value")
2. Try calculating pseudo R2 measures and report them
@@朝に弱い人 I'd actually recommend reporting deviance AND some pseudo R2. The pseudo R2 is nice to help along interpretation, but deviance is more appropriate, and it also provides a nice test of whether adding variables to a model gives a significant increase in fit. Say you have models of increasing complexity (i.e. adding variables): m0, m1 and m2. For glm's, you can then use: anova(m0, m1, m2, test = "Chisq"). In the output, the deviance column in the m1 row tells you how much the deviance decreased compared to m0, and the Pr(>Chi) column tells you whether this improvement was significant (and the same for m2 compared to m1). Alternatively, you could use sjPlot's tab_model and add the AIC and/or deviance directly to the table: tab_model(m0, m1, m2, show.aic = TRUE, show.dev = TRUE). See the sketch below.
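Put together as a runnable sketch (mtcars as toy data; m0 is an intercept-only baseline, and the predictors are just placeholders for your own variables):
m0 = glm(am ~ 1, data = mtcars, family = binomial)
m1 = glm(am ~ mpg, data = mtcars, family = binomial)
m2 = glm(am ~ mpg + wt, data = mtcars, family = binomial)
anova(m0, m1, m2, test = "Chisq") ## deviance drop and Pr(>Chi) for each added variable
library(sjPlot)
tab_model(m0, m1, m2, show.aic = TRUE, show.dev = TRUE)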
@@kasperwelbers Thank you so much, Kasper! I will try calculating deviance and pseudo R2 using the code you suggested :) Can I ask another question via email or something? I’m sorry to be a pain, but I think you can answer another big question I have🙇♂️
@@朝に弱い人 No problem! I do however prefer to keep questions based on these videos confined to YouTube (and not too big). Especially at the moment, with the whole corona teaching situation, I'm swamped with emails and need to prioritize my direct students. For bigger questions, I also think it's best to find someone at your uni (ideally your supervisor or someone in the same department). Not only because they can presumably invest more time, but also because for more specific problems there tend to be differences across disciplines/traditions in how to do statistics.
How can I visualize the results if some variables are factors, like yes or no?
I think sjPlot handles those pretty nicely! There's some great explanations on the website, under the regression plots tab: strengejacke.github.io/sjPlot/
Use the function str(yourbasename) to check the variable types. If the variable is not yet a factor, you can transform it like this: yourbasename$your_factor = as.factor(yourbasename$your_factor)
Thank you!
Nice audio bro. Did you record in a bathroom?
Ahaha, not sure whether that's a question or a burn 😅. This is just a Blue Yeti mic in the home office I set up during the COVID lockdowns. The room itself has pretty nice acoustic treatment, but I was still figuring out in a rush how to make recordings for lectures/workshops, and it was hard to get clear audio without keystrokes bleeding through.
tab_model doesn't work 😮
Surely we can make it work. What error do you get?
@@gotnolove923 ah haha, that was me on another account that I was trying to delete.