You should be proud of this lecture. I have looked for weeks, and this is the first video that feels complete and clear.
That's so nice of you to say. I'm glad you found it useful!
Very true. I watched several other videos that only covered bits of what this one covers. This truly is complete. I did, however, wish you had covered the other statistics in the output, e.g. what the standard error means.
For anyone still looking, I cover standard errors (as well as the rest of the regression table output) here: th-cam.com/video/f2ajESgqtcU/w-d-xo.html
Excellent explanation
Thanks a lot for this masterpiece.
This video saved my semester final!😭
Haha, glad to hear it
This is amazing ❤
Thanks! super clear!
Thanks for the great explanation! Just one quick question: the setosa variant has the smallest sepal lengths of them all (this is obvious from a boxplot or scatterplot), so shouldn't the interpretation of the estimates at 18:00 be the other way around? That is: I. versicolor has 1.6 mm larger sepals than I. setosa on average; I. virginica has 2.1 mm larger sepals than I. setosa on average.
Hi Eduardo, nice question!
Setosa definitely has lower sepal length on average, if we do not look at petal length. But an ANCOVA does not compare group means, it compares regression lines. And if you run the following, you can see that the line estimated for setosa has a higher starting point:
plot(Sepal.Length ~ Petal.Length, iris, col = Species)
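You can also see this numerically with the built-in iris data, by comparing the raw group means with the fitted ANCOVA coefficients:

```r
# Raw group means: setosa does have the smallest average sepal length
aggregate(Sepal.Length ~ Species, iris, mean)

# ANCOVA coefficients: the Species terms are intercept shifts at
# Petal.Length = 0, not differences in group means
coef(lm(Sepal.Length ~ Petal.Length + Species, iris))
# Speciesversicolor and Speciesvirginica come out negative: after
# adjusting for petal length, those regression lines start below setosa's
```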
@@Frans_Rodenburg I get it! The complete thing is: when petal length is 0, versicolor is 1.6 mm smaller than setosa. Actually you explained it on 12:20. Sorry about that. Again, thanks for the videos.
No problem, I'm glad it was useful!
Thanks for the video! What is exactly the difference between a multiple linear regression and an ANCOVA? Is it only the dummy coding?
Yes that's right! ANCOVA isn't a different kind of model than multiple linear regression. It is just a name commonly found in literature for a multiple linear regression involving both numeric and categorical explanatory variables.
Chapters have now been included.
That was great! I really liked it. However, I am wondering: in the case where you have a multi-factorial mixed model, how could you apply ANCOVA? When I try with "lmer", including my random effects in the formula along with the covariate I am looking at, it gives me this error: "boundary (singular) fit: see ?isSingular"
Hi Marzieh, a singularity means that one of the estimated variance components was exactly 0, meaning it could not be estimated. This usually happens because (A) the variance component is very small, or (B) you do not have enough data to estimate it. The next step would be to simplify the random effects structure of your model (e.g., switch to a random intercept instead of a slope, get rid of nested random effects). When all else fails, you can resort to a fixed effects model with a dummy variable for the random effect. I explain this to some extent here: th-cam.com/video/Z1sA5ZGzVJI/w-d-xo.html
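As a sketch of those simplification steps (here `d`, `y`, `x` and `g` are placeholders for your own data frame, response, covariate and grouping factor, so the calls are shown as comments):

```r
library(lme4)

# A random-slope model that may produce a singular fit:
# fit <- lmer(y ~ x + (x | g), data = d)

# Step 1: simplify to a random intercept only
# fit <- lmer(y ~ x + (1 | g), data = d)

# Last resort: a fixed-effects model with a dummy variable for the group
# fit <- lm(y ~ x + factor(g), data = d)
```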
What exactly is the difference between the calls:
"lm(Sepal.Length ~ Petal.Length + Species, data = iris)" and
"lm(Sepal.Length ~ Petal.Length + Petal.Length:Species, data = iris)"?
Honestly I find this confusing so I'm trying my best to figure this out.
My interpretation is that I want to perform an ANCOVA of sepal length by petal length, but accounting for how the effect differs depending on the species. Is that not an interaction, as the effect of Petal.Length depends on the Species?
Where "+ Species" adds species as an effect, effectively creating separate coefficients for petal length and species, but assuming the coefficient for sepal ~ petal is the same across species, with a separate coefficient capturing the difference between species.
"Petal.Length:Species" makes the coefficient between petal length and sepal length different depending on species.
Wouldn't the former be more accurate?
I suppose this raises the question: when should you add a variable as a categorical main effect vs as an interaction?
Your interpretation of interaction in a regression model is correct.
The first model gives you a different starting point (intercept) for each of the three species, whereas the second model gives you one starting point for every species, but three different slopes.
I have rarely used the latter; a more realistic way to model an interaction (differences in slopes) is to allow both the intercepts and the slopes to vary.
The shorthand notation for this is: lm(Sepal.Length ~ Petal.Length * Species, iris)
And written in full it would be: lm(Sepal.Length ~ Petal.Length + Species + Petal.Length:Species, iris)
Adding an interaction or not depends on your prior belief about the process as a domain expert (meaning you base the decision on your knowledge of biology for example, instead of statistics), or you can try fitting a model with and without an interaction to see which fits best. I explain that to some extent here: th-cam.com/video/n2kWXqR5nnw/w-d-xo.html
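For the iris example, the second approach (trying both and comparing the fits) could look like this; the two models are nested, so an F-test applies:

```r
fit_add <- lm(Sepal.Length ~ Petal.Length + Species, iris)  # no interaction
fit_int <- lm(Sepal.Length ~ Petal.Length * Species, iris)  # with interaction
anova(fit_add, fit_int)  # a small p-value suggests keeping the interaction
```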
@@Frans_Rodenburg Ooooh, I see. That makes sense. I'm really wondering why my university courses never clarified what the different Rstudio call combinations did.
You want to be sure that if you include an interaction term, you use both the variables in the model. For instance,
lm(Sepal.Length ~ Petal.Length + Petal.Length:Species, data = iris)
should be:
lm(Sepal.Length ~ Petal.Length + Species + Petal.Length:Species, data = iris)
or the shorthand:
lm(Sepal.Length ~ Petal.Length * Species, data = iris)
The reason is that it doesn't really make sense to include an interaction term for a variable without also including the variable it interacts with. You want both main effects as well as the interaction between them.
Also, another quick shorthand if you have a lot of variables that you want interaction terms for... say, it was between addictions to different drugs:
lm(patient_retention ~ (opioid + alcohol + stimulants)^2, data = addiction)
^^^ This would be the same as:
lm(patient_retention ~ opioid + alcohol + stimulants + opioid:alcohol + opioid:stimulants + alcohol:stimulants, data = addiction)
The (........)^2 just means you want interactions between each pair of variables in the parentheses.
If you did this one:
lm(patient_retention ~ (opioid + alcohol + stimulants)^3, data = addiction)
It would also add in opioid:alcohol:stimulants, because you said you wanted all interactions of up to 3 variables.
It's just a quick way to not have to type out all the interaction terms ^~^
Great content. Too quiet!
Hi Melissa, thank you for the feedback! I will try to increase the gain for the next video.
My god. So my new religion is sacrificing a goat every night to you for blessings of statistical education. My university lecturers are ironically incompetent at teaching in a course of statistics and scientific communication.
I just have one question. Is performing ANCOVA in RStudio literally the identical process to fitting a multiple linear regression?
Thank you, but one goat every two nights should suffice.
ANCOVA is indeed identical to multiple linear regression with dummy variables. In fact, so is ANOVA. You can try for yourself by fitting a model with aov and lm and observe that you will obtain identical estimates. (In fact, aov is just a wrapper that performs linear regression and returns an ANOVA table, but it calls lm under the hood.) There's a couple of nice write-ups about that here: stats.stackexchange.com/q/175246/176202
@@Frans_Rodenburg Thank you Deus.