Using tail alone actually results in an R^2 = 0.83 (p = 6E-4) compared to the adjusted R^2 = 0.79 for the multivariate model. Maybe could have been added to the video. Nice explanation regardless (y).
OH BOY love me some stat quest. I have a hard time during lecture to grasp everything when its being kept abstract but with these examples and a good sense of didactics this is a piece of cake! thanks josh! spreading this channel for sure
Your videos are literally saving my career. I don't know whether words would be sufficient to express my gratitude. I have just one request, will you please upload videos on factor analysis also just like you did for PCA ?
Great video as usual. Can you please explain survival analysis and Cox regression models, and how to run them in R. I think you will be the best person to explain them. Thanks in advance.
Yeah correct, adding a new variable weight actually reduces the Adjusted R^2 value,hence it is better to use tail alone to forecast the value as it as higher R^2 value.
Ho Josh - Thanks for the video.. i wanted to understand more on the assumptions of Linear Regression and check the Goodness of fit - pertaining to the Residuals - Normality, equal Variance etc...Looking for a Video... BAM BAM BAM....
The video is super good, the script on github fantastic; I feel so stupid but the output test in R which compares simple vs multiple model for me is pretty counterintuitive.
Thank you. What makes me think is that since weight and tail length are highly correlated, I would think using either one of them in the model is fine and these two predictors are interchangeable. But the results showed otherwise. I am wondering what might have made the difference? According to the previous lecture, I am guessing it is because that the sum of squares of the weight-only model is significantly bigger than the sum of squares of the tail-only model? Also, I remember there is a terminology about multicollinearity that if one predictor is highly correlated with the other, in multi-regression the matrix inverse will fail. I wonder if there is any indicator in the output about multicollinearity here.
I think they are only interchangeable if there is perfect colinearity between them which is not the case. Weight and tail do not provide the same information to our prediction.
Hi Josh - I'm confused. In other video on linear regression you mentioned that when fitting a 'plane' in multiple regression, if adding a second variable doesn't reduce the SSE (i.e. explain more variation in y) then the coefficient of var2 would be set to '0' and the equation essentially ignores it. But this video seems to say something slightly different - that is, if an additional variable doesn't add more explanatory power, then OLS will still assign a coefficient value, but the p value of that coefficient will be high signifying that the variable isn't significant. One way to try bridge the difference, is that the second variable still reduces the SSE (i.e. line fit is better using the plane vs simple line), but that if we plot the second variable alone on a line, the residuals (errors) would be so large that we're not sure if the pattern we're seeing (desribed by the coefficient) is due to pure random chance. Have I got this right?
You've got the main idea correct! Especially what you wrote in the third paragraph. However, I should clarify one thing in your second paragraph.... The p-value reflects a test of whether or not the coefficient is significantly different from zero. So yes, OLS will assign some value to the useless variable's coefficient - but the p-value says that that value is not significantly different from zero. Does that make sense? The reason that p-value is large, however, is exactly what you wrote in the 3rd paragraph. So I think you've got it!
joerich10 - I've been thinking more about what you wrote and now I must apologize for my original response - you had it right all along. You correctly bridged the difference in what I said. The coefficient may not be zero, but there is too much noise so that you can't be sure if the reduction in SSE is due to random chance or not. You were 100% correct!
Thanks :) Think I'm understanding it in more detail. Interesting how even a simple subject like linear regression get's tricky once you get into the granular, granular detail. Thanks for your follow up! :)
At 7:20 you mentioned we can use tail alone rather than weight too to save time. How do we quantify the cost (eg. increased error, reduced R-sq, some other more interpretable business metric) of not using weight? Because I believe practically, people don't make decisions based on p-values alone but there must be some translation to more practical factors along the way. For the 1st two metrics (error and R-sq), must we re-run the model by redefining the RHS of lm() each time we want to test a different predictor combination, or we can say something about the results of other predictor combinations from the summary() of 1 experiment alone. (Seems like both are true, which is contradictory). What would you say about the 3rd metric? (translating p-values to practical considerations) I'm also not sure if using tail alone definitely leads to a worse/equal result (as measured by error, or some other metric) than tail + weight? If I reason from the fact that using more predictors will never reduce R-sq, then this statement is true?
Adding both features will not make worse predictions, but if they are correlated (like in this example) than there may not be a statistical difference in removing one of them (like in this example). However, if there is no statistical difference, then using both will not improve our predictions.
have you done any videos on how to do calculation of p-value from F-value? Also any video on explaining degrees of freedom? For example, why residual standard error is mentioned as 1.19 on 7 degrees of freedom at videotime 03:21?
Hi StatQuest, I have a basic question. You mentioned at the beginning to plot the x and y variables(mouse weight and size in our case) to see whether there exists a linear relationship or not before proceeding with the modeling. Its easy when you have 2-3 variables. Suppose I have 50 variables in my data-set(assuming all are important) I assume it will be a big pain to plot graphs for all of these. What is the more efficient way to go ahead in such a scenario.
You can use lasso or elastic-net regression to pick out the variables that are important. If you want to learn more about this topic, start with my video on Ridge Regression: th-cam.com/video/Q81RR3yKn30/w-d-xo.html
Hey Josh Great Video! say we have 5 independent variables instead of 2 (i.e mouse length, height, weight, color, breed etc). Does this same format of understanding p-values apply: Is the lm function comparing an additive regression model of 4 independent variables (height, weight, color, breed) without the variable of interest (mouse length) to an additive model of the total 5 variables including the variable of interest (mouse length, height, weight, color, breed)? Or is it doing something else?
This is a good question. The p-value for a variable in a multiple regression is always a comparison between the full model, with all the variables, and the reduced model that is missing that specific variable. For example, if we had variables: length, height, weight, then the p-value for "length" is a comparison between the "full model", size = intercept + length + height + weight, and the "reduced model", size = intercept + height + weight. Does that make sense?
Hey Josh, hope your family and friend are doing well considering recent events. I have one more question. I know you work heavily with biology and genetics so this will hit home. I’m dealing with an anaerobic bioreactor. I’m trying to use multiple regression with methane flow from the reactor as the dependent variable, and mcrA (methanogen copy number log normalized by gram of reactor sludge) and temperature as my independent variables. Im using multiple regression in R to predict methane flow the p-value of Temperature is 1.47e-06*** and methanogens p-value is 1.84e-03** similar to your example, and they are not collinear. I would like to say that based on multiple regression analysis the combination of mcrA and temperature is not more statistically significant than temperature alone. That seems justifiable, correct? sorry for the dense questions
@@jahhedouglas242 What I would do is look at the adjusted R-squared for both variables, and then for each variable individually and see how much that changes. That will tell you how important it is to combine both variables for making predictions or not. If the change in R-squared is large, then it is important to use both variables. If not, then you can get away with using just one.
Hi Josh, thanks for your videos, they are very helpful. I am a bit confused. Isn't the explanation at 6:30 backwards? When you have the red square on weight, isn't that line saying you are seeing the effect of weight on the size adjusting for tail? If this is the case, then the formula below should be the following: we are comparing the complex model where "size=y-intercept + slope1 * weight + slope2 * tail" to the simple model "size=y-intercept + slope2*weight", not tail. Wouldn't this be right? And likewise, at 6:55, that's telling you the effect of tail on size controlling for weight. I am a graduate student and I am preparing for a talk where I have to talk about my data, and I use linear regression to model effect of genotype and treatment on certain behavior scores. In case they attack me on my analysis, I'd like to be able to interpret just about every line that R puts out, which this is part of. Your lessons are huge help :) Any feedback from anybody is welcome
The explanation in the video is correct. On the line for "weight", the p-value tells you whether or not "weight" is important or not. Thus, we are comparing size = intercept + weight + tail vs. size = intercept + tail. If there is a big difference, then "weight" is important, and the p-value will be small. If you want to double check this, here's what you do: Step 1) Run the sample code that comes with this video: statquest.org/2017/10/30/statquest-multiple-regression-in-r/#code Step 2) Make sure you understand how that code works - go through it with the video and go through the comments in the code. Step 3) Now let's create a simple model of size = intercept + tail: simple.regression.tail
@@statquest I do have a questions though. So I get that we test whether weight is important or not by comparing the multiple regression with simple regression with tail. What if we have another independent variable, say food intake? in this case, the multiple regression model would look like size~weight+tail+foodintake. for the sake of argument, suppose we have the data and ran lm(size~weight+tail+foodintake), and suppose we take summary of that. We'd have intercept, weight, tail and food intake in the coefficients. In this case, what comparison would R be making to come up with importance of weight, tail and foodintake? In other words, what would R do to test whether weight is important on size or not, with 3 variables? Would it compare size~weight+tail+foodintake to size~tail or size~foodintake? Or would it compare size~weight+tail+foodintake to size~tail+foodintake? I would like to understand what R is thinking in this situation, as you have described it with weight and tail example. Can you help me understand this? Thanks!
If we had weight, tail and foodintake, then the p-value for weight would compare the full model: size = intercept + weight + tail + foodintake to a model without weight: size = intercept + tail + foodintake. In other words, we only leave one variable out of the simple model.
Hello Josh, first of all, thank you for your videos. Could you please answer one question about the multiple regression summary output? You explained p-value of each coefficient one by one. You said that the p-value of `weight` is comparing `weight` + `tail` to `tail`. Was this a mistake? The p-value for `weight` is not for comparing `weight` + `tail` to `weight`?
Thank you for the video! In a previous video you explained how to calculate the pvalue for the R^2. But how to calculate the pvalue for each coefficient?
Hi Josh . I would first like to thank you for your videos which have made complicated concepts easier to understand. I do have a question about the p-values in the multiple regression though. If there were more than two variables, is the p-value derived by comparing the linear regression model with the one variable in question, to the model using all the other variables, or does it always compare it to a single variable (simple regression)? If it is the case that it is always compared to a single variable, how is the variable for the simple regression model chosen? Thanks
The p-value for a specific variable reflects the difference between the full model that contains that variable (and all of the other variables) to the reduced model that contains all variables except for the one of interest.
Hi Josh - Thank you for your video tutorial. I used your data sample and ran a simple regression (size ~ tail) and got a marginal slightly higher Adjusted RSQ (0.808 vs 0.800) and a much lower P-Value for the tail slope (0.0006 vs 0.0219) compare to the multiple regression (size ~ weight + tail). Does that make sense? Does that mean we can just use a simple regression instead? (Assuming the data is already available and no additional effort for measuring the mice tails needed.) - Anthony
@@statquest Hi Josh, Thank you for your prompt reply. Originally, I ran it on Excel (and the results were the same as the ones that you had in this video for multiple regression and simple regression (size ~ weight)), but I just ran it with your code for size~tail and it came out to be the same as the Excel version. --------------------------------------------------------- Code: simple.regression2 |t|) (Intercept) 0.4774 0.5777 0.826 0.435848 tail 1.2355 0.2098 5.889 0.000606 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7844 on 7 degrees of freedom Multiple R-squared: 0.8321, Adjusted R-squared: 0.8081 F-statistic: 34.69 on 1 and 7 DF, p-value: 0.0006059 --------------------------------------------------------- I have always thought that a multiple regression would always yields better RSQ and P-value! I am wondering if I ran a Lasso Regression on the data, the model would drop the weight variable and just use tail instead. - Anthony
Multiple regression doesn't always give you a lower p-value because of the degrees of freedom. However, it should give you at least as good, if not better, R-squared. And that's what I got: When I just run "size ~ tail"... summary(lm(size ~ tail, data=mouse.data)) ...the raw R-squared (not the adjusted R-squared) is 0.83 and the p-value is 0.0006. When I run "size ~ tail + weight"... summary(lm(size ~ tail + weight, data=mouse.data)) ...the raw R-squared (not the adjusted R-squared) is better, 0.85, however, the p-value, 0.003, is worse because because "tail" is highly correlated with "weight" and doesn't add much information compared (however, we have worse degrees of freedom because we need to estimate the extra parameter for weight).
@@statquest Thanks again for your help, Josh! So the multicollinearity problem causes the p-value in the multiple regression to be lower than if we run size~tail along? And shouldn't we normally compare the model by the adjusted R-squared? By the way, since you are here, let me side track a little. In your gene and mouse example in other video (in Logistic Regression?), you said each gene is a variable. I am not in the biology field, so I do not know how to relate a gene can be a variable, could you elaborate and help me to understand? (Nevertheless, As you might or might not be aware, there are typos in your texts in some of your videos? Do you care to know? Or you already knew and do not want to bother?)
@@anthonysun2193 1) Yes, we should always use the adjusted R-squared, and the adjusted R-squared gets smaller with more parameters. I only mentioned the non-adjusted R-squared because you said "I have always thought that a multiple regression would always yields better RSQ", and I wanted to point out that that statement was correct. However, multiple regression does not always yield a better "adjusted R-squared" because the adjustment accounts for the increased number of parameters. Does that make sense? 2) Here's an example: Some people have naturally black hair and some people have naturally brown hair. This is due to differences in the genes that code for hair color. So we can use that gene as a variable to separate people into different groups. 3) I am aware that my videos are full of typos. Because youtube does not allow me to edit my videos once they are posted, I am only interested in hearing about the typos if they result in confusion or in some sort of conceptual error. Depending on how bad the typo, I can mention it in a pinned comment, or I can delete the video.
Thanks for the video. I know this is for numeric data, but is there a model for predicting a categorical value? Eg predicting what career a student will choose based on scores in different subjects in high school?
Hi, in multiple regression and we are using weight, tail and ears (or any 3rd parameter) to predict size, in the R coefficients, the weight line will compare multiple regression vs single regression? BR YG
size ~ weight*tail suggests that there is an "interaction" between weight and tail, which means there is a non-linear relationship between weight and tail.
@@statquest which means there is a non-linear relationship between weight and tail,why? why not the non-linear relationship between weight and Size? Moreover the concept of interaction just as well to qualitative variables or a combination of quantitative and qualitative variables.
Good morning. At the end of the video you prompt viewers for ideas. Well, I would like to propose you this: we are given these woooowww Bioinformatics books, you know, data mining for Bioinformaticians, Introduction to Bioinformatics, introduction to Genomics and Proteomics... and well, I will be honest as for my sensation: I think that their writing style fails to get the reader to the point. We spend the whole day on half the chapter and at the end of the day we have acquired no knowledge and we just got tired without any particular reason. I mean, their stat approaches are qualitative (for example, there is no k-means algorithm or the way it is explained makes no sense) and since an bioinformatician has usually no background in ... Biology or Medicine, we cannot understand what stat method should be used when. Things became really hard in Biomarkers discovery and it seems that 'incomprehensible' stat methods outcrop (something like Rambo, who would suddenly appear from inside mad, imagine something like that). I mean, from a qualitative perspective, all angelic. But not sure about the whole stat procedure. Would it be as a fatigue for you to make some videos on "When we (should) use what?" ? Also using a general, a catholic Diagram would help the most, I think. Thank you!
@@statquest Thank you for consideration. Well, in order not to be unfair towards the authors, I would like to mention the following: Perhaps books are just fine and eventually incomprehension may occur due to many professors' habit to select various chapters from various books, a strategy that modulates/leaves an impression of coherence/consistency. I really do not know! And, you see, there is no time to read two or three books from start to end within a semester! For example, I give a battle to end Fourier Applications book (this specific book is great!) as a free-time self-improvement and still I 've only gone up to chapter two so far due to other obligations. Thank you once again!
Could you please answer one question? I did the SLR for Tail and the p-value was very small (0.0006) against 0.003 for MLR. The variance in R2 was also minimal (0.83 for SLR and 0.85 for MLR). In such case using only tail to predict mouse size seems better and it matches your inference. Now I believe the comparison method shown in the MLR video would also apply to two MLR (one with lesser variables). In such case is it better to carry out the the comparison technique you showed (the one where we compared SLR and MLR to find comparative R2 and p-value)? Or surmising the same from the p-values of each variable of the MLR summary would suffice? Here i believe we can get the p-value but still need to check R2 and would need to carry out another MLR with reduced variables anyway. If so, is there way to carry out the comparative method in R?
thanks for the video,U are the best video I watched so far in this subject matter. However, I don't quite understand the p-values under weight and tail row.
Hello StatQuest, I have a quick syntax question. Lets say you are trying to predict mouse size based off of like 50 features measured from the mouse. Is there a way not to have to the names of all 50 features when you get to the: lm(size ~ weight + tail + ...+ 50 feature) part of the code. I tried saving the column names of the data frame in an array and doing: lm(size ~ column_names_array) but I get an error. Thanks in advanced.
this video was very helpful. quick question - what made you come to the conclusion that an interaction exists between weight and tail through the graph visualization (approx. 5 mins)? is it because a pattern exists? I'm trying to understand how to identify an interaction between two numeric variables graphically. What would a scatter plot look like if no interaction exists - just highly randomized distribution of data with no identifiable pattern? I think that's right, but just want to confirm. Thanks for your help - these videos are a lifesaver.
At 5:00 I say that there is a correlation between weight and tail. Are you asking about the correlation? The term you used, "interaction" is specific statistical term that means something very different, so I want to make I understand what you are asking about.
@@statquest thanks. i'm wondering about interaction and think assumed they were the same. thank you for pointing out there's a difference between the two
When you called the summary of the multiple regression, the computer told you that the multi model was not more significant than the simple model using tail alone. If so, why was the P score in the multi variate regression under "Tail" a little worse than it was when you called a simple regression using tail only? I am referring to 6:45 in the video
The p-value for the simple regression, at 2:52 , is for the model: size = intercept + weight. This p-value is 0.012 and it compares two models, the one we specified, size = intercept + weight, to the null, which is size = mean value for size. The second p-value you, which is the one was asked about at 6:45, is for the model size = intercept + weight + tail compared to the model size = intercept + weight. The p-value, 0.0219, tells us that the the model is significantly better when we add "tail".
This was a great video. Does the same applies for multiple regression models with one numeric and one factor variables (for explaining the coefficients)? I understand that you can't design a pairs plot with factor variables as they need a box plot but are the rest of the steps same as with 2 numeric? Also, don't we need to run a diagnostic plot of our model to check for assumptions validations? Great videos!
The same applies with numeric and factor models. For details: th-cam.com/video/Hrr2anyK_5s/w-d-xo.html And you should always check your assumptions and outliers.
thank u for this. trying to create and plot a system for sales where the recommendation is coded in 0 and 1. tried to plot but I am getting straight lines on two cut offs say price points. would you appreciate your help in this. to predict recommendation this is for a training project for introduction to R
Thanks for these amazing videos. Great explanations. One question, how do I add the output to the plots? Sorry for asking for R code, I find your codes easier to use and understand.
@@kennethssebambulidde9238 See: stackoverflow.com/questions/3761410/how-can-i-plot-my-r-squared-value-on-my-scatterplot-using-r and lukemiller.org/index.php/2012/10/adding-p-values-and-r-squared-values-to-a-plot-using-expression/
how do you plot the multiple regression line? Like in the first example you used abline(regression) and I thought it might also work for the multiple one, but abline(multiple.regression) doen't work.
I don't think there is a straightforward way to do this. However, I found this discussion on how to do it by hand: stackoverflow.com/questions/17615791/plot-regression-line-from-multiple-regression-in-r
Hi Josh. Thanks for the excellent videos, it really helps me to understand statistics. I have a question about different types of statistical methods: If you want to categorize different statistical methods, is it right to say that there are three main categories called:1)Orthodox statistics, 2)Data-Driven methods, and 3)Bayesian inference?
@@nasrintaghavi6877 Ahh! I see. It's a good question, and a rather philosophical one. First you have to define Statistics. Then you have to answer the question, "are all machine learning algorithms a type of statistics?" I don't know the answer to that. I agree that there is a category of "data-driven methods" - however, I'm not sure all of them would also be considered statistics. It is my hope someone else might chime in on this thread and give their thoughts. If not, I'll post it to the StatQuest community page.
@@statquest Yeah, actually that's the problem. I don't know if data-driven methods are in the statistics category at all, and if yes, are they in Frequentist category or they are another category by themselves. I'm reading a review paper in my field and they have categorized machine learning in the statistical category but haven't devided statistical methods into sub-categories.
But looking at the weight x tail graph, seems like there's a correlation between the two variables. So shouldn't one of the variable be removed to prevent multicollinearity?
@@statquest rewatched the end and yeah you did. It wasn't explained it in terms of multicollinearity though (unless I missed it?) which was why I asked haha. On a seperate note, I'm trying to learn more about interaction terms in multiple regression but didn't come across any video from you about it. I'm trying to run a regression model but the interaction terms have high vif (which is probably obvious since it's an interaction term). Should the interaction term still be included?
@@rongxuantan5406 What is the goal of your model? If you just want to make predictions, you can apply regularization to it (Elastic-Net) and that will take care of any multicollinearity problems.
@@statquest thanks for replying! I'm trying to see how a change in the independent variables will affect the dependent variable. So basically I'm trying to find out the coefficients and whether the variable has a statistically significant effect
I have seen in minitab an option where you can get a table that summarizes information (r sq, r sq adjusted, mallows cp and vif) for an dependent variable vs every COMBINATION of independent ones. Is there something similar in R and how can I do that?
@Josh Starmer Can you please throw some light on multicollinearity and Variation Inflation Factor (VIF). These are useful concepts when it comes to multiple regression.
Thank you! I did not have plans for polynomial regression - but I could make a video pretty easily - it's the same as multiple regression except instead of just plugging in "weight" to predict "size", you plug in "weight-squared", or "weight-cubed" to predict "size". That's all there is to it.
This is a great question -- the answer is..."it's complicated". Sometimes you know in advance what the relationship is. If not, you can plot the data and see if the relationship is obvious. If not, you can start with a simple model. For example, say like you want to use "age" to predict "size". The model might be, size = y-intercept + B1 * age + B2 * age^2. So that would be a polynomial of the form, y = a + bx + cx^2. You can then look at the residuals and see if they are relatively random or form a pattern. If they are relatively random, you're then look to make sure the coefficients, B1 and B2 are both significantly different from 0. If so, you are done. If not, then you can probably remove that term from the equation. If the residuals form a pattern, you can add another term... age = y-intercept + B1*age + B2*age^2 + B3*age^3... etc. Does that make sense? The length of this answer is a good argument for me to make a video, but it might be a while since I'm working a lot on the machine learning stuff right now.
Hello Josh please help me to understand below points: 1. If we are considering p value to identify whether variable is significant or not then why don't we remove weight from final model. 2. After adding weight and tail in model our p value is significantly improved what is the reason behind that? 3. If combinations of weight and tail improve model then weight also needs be significant why it is not? Thanks in Advance
Could you go through some other basic data visualisation and stats methods in R? Your PCA video was very handy, but I'd love help with e.g. CVA, MANOVA
after regression for the 3 variables, how do we plot all 3 in a given space together? abline was used for simple regression but here using it, cant be used for 3 variables.
You can't, unless you're using a 3D graph. Otherwise, you can use Principal Component Analysis for reducing the the number of variables from 3 to 2, which you can then plot (there are some videos in the channel explaining how to do this)
Do you have a video explaining what it means to control a variable statistically? Some of the explanations I'm finding on TH-cam just use the words in the term as the definition and it's not intuitively clear to me to sink in :/ Btw thanks a lot for these clearly explained videos 👍😬
I don't have a video on the topic, but the idea is pretty simple. If you want to find out if a drug is effective or not, you give it to one group of people and another group of people do not get the drug and a 3rd group of people get a placebo. This third group of people controls for random things happening that are not associated with the actual drug and it is an example of "controlling for a variable".
@@statquest Thank you. I understand it from that experimental design perspective, but for some reason I can't visualize it when someone says it about controlling variables in a statistical model. Say for data a researcher didn't collect. Are they essentially running a multiple linear regression and comparing the output to see if the model has a better R-squared than one with the single variable like you described in this video?
@@daneshj4013 I'm not sure I understand what you mean. In my example, we have three variables: One variable controls for someone not taking any drug at all by representing that set of measurements. One variable that controls for the placebo affect by representing that set of measurements. And one variable that controls for the effect of the drug, by representing that set of measurements.
@@daneshj4013 If I understand you correctly, you are talking about batch (or confounding variables) effects. Data was collected multiple times, by multiple techs, in multiple labs. Or maybe the data is sequencing data that is from the same experiment, but was run on different machines in order to achieve full read depth. You would use the same design as described in the video, but you should also perform PCA to determine if batch effects exist. Even if you are missing essential information about the samples (say gender or machine type), PCA might force you to ask better questions that can be used to direct future research.
@@statquest I think he meant, "What do they mean when they say to make a linear model and control for [blahblahblah] variable?" As in, including the variable in your model = controlling for the variable in the model. "The effect of blahblahblah variable when controlling for all other variables in the model is ...."
@@statquest thanks. It is defiantly not easy to give all the potential steps to do a real trustworthy regression. I guess the analysts need to get experience themselves.
@@juanwang3705 Alternatively, you can apply ridge or lasso regression techniques do deal with this type of problem automatically. For details, see: th-cam.com/video/Q81RR3yKn30/w-d-xo.html th-cam.com/video/NGf0voTMlcs/w-d-xo.html th-cam.com/video/1dKRdX9bfIo/w-d-xo.html and th-cam.com/video/ctmNq7FgbvI/w-d-xo.html
If the slope is constant, then you can just plug that into the formula. For example y = intercept + 23 * height. In this case, we plug in 23 for the slope, but the intercept is still something we solve for.
Thank you for your video! However, if I use more than 2 variables, e.g. 5 variables, to build a linear model, how could I interpret the summary of the linear regression. Especially the 6 lines for intercept and 5 variables, what will the p-value means for each variable? Will it compare the model to the model without that variable? Thanks so much.
I had the same question. For example if you have 5 variables (a,b,c,d,e) and want to see if you can predict a by multiple regression of b,c,d and e, so your equation would look like this : a = y-intercept + slope1 x b + slope2 x c + slope3 x d + slope4 x e. You will get 4 lines in your coefficient part of summary. Each line would correspond to b,c,d or e and have a p-value (Pr(>|t|) at its end. For example, you would get a p-value = 0.45 for line b, which would mean that you compare the multiple regression model (previous equation) to another multiple regression model in which you take into account all other variables except b! The compared equation would look like this: a = y-intercept + slope2 x c + slope3 x d + slope4 x d. So in some way you're checking how b is helpful in your data to predict a. In my example, the p-value = 0.45 > 0.05 (if alpha = 0.05) so it means that the multiple regression model including the b variable is not significantly better to predict a compared to the multiple regression model without the b variable. In conclusion, you could get rid of b to predict a. etc etc for each line
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Thank you for easily explaining something that my professor nor my book was able to explain well
Hooray! :)
Thank you so much for these video! They were very helpful in helping me to build an MLM for my MS degree! Thank you!
Bam! :)
Using tail alone actually results in an R^2 = 0.83 (p = 6E-4) compared to the adjusted R^2 = 0.79 for the multivariate model. Maybe could have been added to the video. Nice explanation regardless (y).
OH BOY love me some stat quest. I have a hard time during lecture to grasp everything when its being kept abstract but with these examples and a good sense of didactics this is a piece of cake! thanks josh! spreading this channel for sure
Yeah and one sees that, that what makes it so good! You just realize that some proffesors dont really have fun teaching...
I was exactly looking for this. Thank you so much! Very well explained :) saving my thesis :D
Glad it was helpful!
Your videos are literally saving my career. I don't know whether words would be sufficient to express my gratitude. I have just one request, will you please upload videos on factor analysis also just like you did for PCA ?
I'll keep that in mind.
wow this makes so much more sense than what my professor was trying to say thank you!
Hooray!
thanks. Spent like 5 hours trying to understand this
Glad it helped!
This what exactly i needed right now. Thanks a lot
You're welcome! :)
I love your video, I like the way you explain the regression. Thank you
Glad it was helpful!
Great job! Exactly what I searched for!
Awesome, thank you!
I love your intro. The rest is also good but the "totes cray cray" bit was funny.
Thanks! :)
Best video so far
Thanks!
Great video as usual.
Can you please explain survival analysis and Cox regression models, and how to run them in R.
I think you will be the best person to explain them.
Thanks in advance.
I'll keep those topics in mind.
stat teachers have the best personalities, try and change my mind.
:)
These are great videos. So grateful for the clarity. Liked and Subscribed, keep going!!
Awesome, thank you!
This is channel of treasure...
Thank you for this video!
You are welcome!
Yeah correct, adding a new variable weight actually reduces the Adjusted R^2 value,hence it is better to use tail alone to forecast the value as it as higher R^2 value.
:)
Thanks so much for this!
:)
love your work
Thank you so much 😀
very helpful.thank you
Thanks!
Ho Josh - Thanks for the video.. i wanted to understand more on the assumptions of Linear Regression and check the Goodness of fit - pertaining to the Residuals - Normality, equal Variance etc...Looking for a Video... BAM BAM BAM....
Great as usual. Can you do a quest for mixed models?
I'll keep that in mind!
The video is super good, the script on github fantastic; I feel so stupid but the output test in R which compares simple vs multiple model for me is pretty counterintuitive.
You're not alone. It's hard for me to remember the output as well.
Thank you. What makes me think is that since weight and tail length are highly correlated, I would think using either one of them in the model is fine and these two predictors are interchangeable. But the results showed otherwise. I am wondering what might have made the difference? According to the previous lecture, I am guessing it is because that the sum of squares of the weight-only model is significantly bigger than the sum of squares of the tail-only model? Also, I remember there is a terminology about multicollinearity that if one predictor is highly correlated with the other, in multi-regression the matrix inverse will fail. I wonder if there is any indicator in the output about multicollinearity here.
I think they are only interchangeable if there is perfect colinearity between them which is not the case. Weight and tail do not provide the same information to our prediction.
Hi Josh - I'm confused. In other video on linear regression you mentioned that when fitting a 'plane' in multiple regression, if adding a second variable doesn't reduce the SSE (i.e. explain more variation in y) then the coefficient of var2 would be set to '0' and the equation essentially ignores it.
But this video seems to say something slightly different - that is, if an additional variable doesn't add more explanatory power, then OLS will still assign a coefficient value, but the p value of that coefficient will be high signifying that the variable isn't significant.
One way to try bridge the difference, is that the second variable still reduces the SSE (i.e. line fit is better using the plane vs simple line), but that if we plot the second variable alone on a line, the residuals (errors) would be so large that we're not sure if the pattern we're seeing (desribed by the coefficient) is due to pure random chance. Have I got this right?
You've got the main idea correct! Especially what you wrote in the third paragraph. However, I should clarify one thing in your second paragraph.... The p-value reflects a test of whether or not the coefficient is significantly different from zero. So yes, OLS will assign some value to the useless variable's coefficient - but the p-value says that that value is not significantly different from zero. Does that make sense? The reason that p-value is large, however, is exactly what you wrote in the 3rd paragraph. So I think you've got it!
joerich10 - I've been thinking more about what you wrote and now I must apologize for my original response - you had it right all along. You correctly bridged the difference in what I said. The coefficient may not be zero, but there is too much noise so that you can't be sure if the reduction in SSE is due to random chance or not. You were 100% correct!
Thanks :) Think I'm understanding it in more detail. Interesting how even a simple subject like linear regression get's tricky once you get into the granular, granular detail. Thanks for your follow up! :)
No problem!!! :)
At 7:20 you mentioned we can use tail alone rather than weight too to save time. How do we quantify the cost (eg. increased error, reduced R-sq, some other more interpretable business metric) of not using weight? Because I believe practically, people don't make decisions based on p-values alone but there must be some translation to more practical factors along the way.
For the 1st two metrics (error and R-sq), must we re-run the model by redefining the RHS of lm() each time we want to test a different predictor combination, or we can say something about the results of other predictor combinations from the summary() of 1 experiment alone. (Seems like both are true, which is contradictory). What would you say about the 3rd metric? (translating p-values to practical considerations)
I'm also not sure if using tail alone definitely leads to a worse/equal result (as measured by error, or some other metric) than tail + weight? If I reason from the fact that using more predictors will never reduce R-sq, then this statement is true?
Adding both features will not make worse predictions, but if they are correlated (like in this example) than there may not be a statistical difference in removing one of them (like in this example). However, if there is no statistical difference, then using both will not improve our predictions.
have you done any videos on how to do calculation of p-value from F-value? Also any video on explaining degrees of freedom? For example, why residual standard error is mentioned as 1.19 on 7 degrees of freedom at videotime 03:21?
See: th-cam.com/video/nk2CQITm_eo/w-d-xo.html
well done!
Thanks! :)
Hi StatQuest, I have a basic question. You mentioned at the beginning to plot the x and y variables(mouse weight and size in our case) to see whether there exists a linear relationship or not before proceeding with the modeling. Its easy when you have 2-3 variables. Suppose I have 50 variables in my data-set(assuming all are important) I assume it will be a big pain to plot graphs for all of these. What is the more efficient way to go ahead in such a scenario.
You can use lasso or elastic-net regression to pick out the variables that are important. If you want to learn more about this topic, start with my video on Ridge Regression: th-cam.com/video/Q81RR3yKn30/w-d-xo.html
@@statquest thanks a lot.
Can you explain how to use Robust standard errors like in Stata
Do you have an example of the predictor is non-numerical data? Thanks a lot
See: th-cam.com/video/NF5_btOaCig/w-d-xo.html and th-cam.com/video/CqLGvwi-5Pc/w-d-xo.html
Hey Josh Great Video! say we have 5 independent variables instead of 2 (i.e mouse length, height, weight, color, breed etc). Does this same format of understanding p-values apply: Is the lm function comparing an additive regression model of 4 independent variables (height, weight, color, breed) without the variable of interest (mouse length) to an additive model of the total 5 variables including the variable of interest (mouse length, height, weight, color, breed)? Or is it doing something else?
This is a good question. The p-value for a variable in a multiple regression is always a comparison between the full model, with all the variables, and the reduced model that is missing that specific variable. For example, if we had variables: length, height, weight, then the p-value for "length" is a comparison between the "full model", size = intercept + length + height + weight, and the "reduced model", size = intercept + height + weight. Does that make sense?
Perfect answer, thanks so much!
Hey Josh, hope your family and friend are doing well considering recent events. I have one more question. I know you work heavily with biology and genetics so this will hit home. I’m dealing with an anaerobic bioreactor. I’m trying to use multiple regression with methane flow from the reactor as the dependent variable, and mcrA (methanogen copy number log normalized by gram of reactor sludge) and temperature as my independent variables. Im using multiple regression in R to predict methane flow the p-value of Temperature is 1.47e-06*** and methanogens p-value is 1.84e-03** similar to your example, and they are not collinear. I would like to say that based on multiple regression analysis the combination of mcrA and temperature is not more statistically significant than temperature alone. That seems justifiable, correct? sorry for the dense questions
@@jahhedouglas242 What I would do is look at the adjusted R-squared for both variables, and then for each variable individually and see how much that changes. That will tell you how important it is to combine both variables for making predictions or not. If the change in R-squared is large, then it is important to use both variables. If not, then you can get away with using just one.
Thanks so much again Josh!
Hi Josh, thanks for your videos, they are very helpful.
I am a bit confused.
Isn't the explanation at 6:30 backwards? When you have the red square on weight, isn't that line saying you are seeing the effect of weight on the size adjusting for tail? If this is the case, then the formula below should be the following: we are comparing the complex model where "size=y-intercept + slope1 * weight + slope2 * tail" to the simple model "size=y-intercept + slope2*weight", not tail. Wouldn't this be right?
And likewise, at 6:55, that's telling you the effect of tail on size controlling for weight.
I am a graduate student and I am preparing for a talk where I have to talk about my data, and I use linear regression to model effect of genotype and treatment on certain behavior scores. In case they attack me on my analysis, I'd like to be able to interpret just about every line that R puts out, which this is part of. Your lessons are huge help :)
Any feedback from anybody is welcome
The explanation in the video is correct. On the line for "weight", the p-value tells you whether or not "weight" is important or not. Thus, we are comparing size = intercept + weight + tail vs. size = intercept + tail. If there is a big difference, then "weight" is important, and the p-value will be small.
If you want to double check this, here's what you do:
Step 1) Run the sample code that comes with this video: statquest.org/2017/10/30/statquest-multiple-regression-in-r/#code
Step 2) Make sure you understand how that code works - go through it with the video and go through the comments in the code.
Step 3) Now let's create a simple model of size = intercept + tail:
simple.regression.tail
@@statquest Hi Josh, I ran the codes you have suggested, and it makes sense now! Thanks for the timely response!
@@statquest I do have a questions though. So I get that we test whether weight is important or not by comparing the multiple regression with simple regression with tail. What if we have another independent variable, say food intake? in this case, the multiple regression model would look like
size~weight+tail+foodintake.
for the sake of argument, suppose we have the data and ran lm(size~weight+tail+foodintake), and suppose we take summary of that. We'd have intercept, weight, tail and food intake in the coefficients.
In this case, what comparison would R be making to come up with importance of weight, tail and foodintake?
In other words, what would R do to test whether weight is important on size or not, with 3 variables? Would it compare size~weight+tail+foodintake to size~tail or size~foodintake? Or would it compare size~weight+tail+foodintake to size~tail+foodintake?
I would like to understand what R is thinking in this situation, as you have described it with weight and tail example.
Can you help me understand this?
Thanks!
If we had weight, tail and foodintake, then the p-value for weight would compare the full model: size = intercept + weight + tail + foodintake to a model without weight: size = intercept + tail + foodintake. In other words, we only leave one variable out of the simple model.
@@statquest ohh ok it's clear now. Thank you so much!
Hello Josh, first of all, thank you for your videos. Could you please answer one question about the multiple regression summary output? You explained p-value of each coefficient one by one. You said that the p-value of `weight` is comparing `weight` + `tail` to `tail`. Was this a mistake? The p-value for `weight` is not for comparing `weight` + `tail` to `weight`?
What I said was correct. The p-value for "weight" compares the model with "weight" to the model without "weight".
Hi Josh my hero. Can you explain VIF? virance inflation factor?
I'll keep that topic in mind.
Excellent stuff, Josh. I may have missed it somewhere, but have you done/plan to do a break down of multilevel (hierarchical) regression?
I'll keep that in mind.
Is there a way to graph a scatterplot/linear regression in Excel? To have multiple lines for different covariates?
Definitely, but you'll have to google it. I can't remember how to do it off the top of my head.
Thank you for the video!
In a previous video you explained how to calculate the pvalue for the R^2. But how to calculate the pvalue for each coefficient?
I talk about that in th-cam.com/video/nk2CQITm_eo/w-d-xo.html and th-cam.com/video/zITIFTsivN8/w-d-xo.html
@@statquest Thanks!
Hi Josh . I would first like to thank you for your videos which have made complicated concepts easier to understand. I do have a question about the p-values in the multiple regression though. If there were more than two variables, is the p-value derived by comparing the linear regression model with the one variable in question, to the model using all the other variables, or does it always compare it to a single variable (simple regression)? If it is the case that it is always compared to a single variable, how is the variable for the simple regression model chosen?
Thanks
The p-value for a specific variable reflects the difference between the full model that contains that variable (and all of the other variables) to the reduced model that contains all variables except for the one of interest.
@@statquest Thank You.
Thanks !!
Hi Josh - Thank you for your video tutorial. I used your data sample and ran a simple regression (size ~ tail) and got a marginal slightly higher Adjusted RSQ (0.808 vs 0.800) and a much lower P-Value for the tail slope (0.0006 vs 0.0219) compare to the multiple regression (size ~ weight + tail). Does that make sense? Does that mean we can just use a simple regression instead? (Assuming the data is already available and no additional effort for measuring the mice tails needed.) - Anthony
Wow, I'm surprised you got a different result. Did you use my code or write your own?
@@statquest
Hi Josh,
Thank you for your prompt reply. Originally, I ran it on Excel (and the results were the same as the ones that you had in this video for multiple regression and simple regression (size ~ weight)), but I just ran it with your code for size~tail and it came out to be the same as the Excel version.
---------------------------------------------------------
Code:
simple.regression2 |t|)
(Intercept) 0.4774 0.5777 0.826 0.435848
tail 1.2355 0.2098 5.889 0.000606 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7844 on 7 degrees of freedom
Multiple R-squared: 0.8321, Adjusted R-squared: 0.8081
F-statistic: 34.69 on 1 and 7 DF, p-value: 0.0006059
---------------------------------------------------------
I have always thought that a multiple regression would always yields better RSQ and P-value!
I am wondering if I ran a Lasso Regression on the data, the model would drop the weight variable and just use tail instead.
- Anthony
Multiple regression doesn't always give you a lower p-value because of the degrees of freedom. However, it should give you at least as good, if not better, R-squared. And that's what I got:
When I just run "size ~ tail"...
summary(lm(size ~ tail, data=mouse.data))
...the raw R-squared (not the adjusted R-squared) is 0.83 and the p-value is 0.0006.
When I run "size ~ tail + weight"...
summary(lm(size ~ tail + weight, data=mouse.data))
...the raw R-squared (not the adjusted R-squared) is better, 0.85, however, the p-value, 0.003, is worse because because "tail" is highly correlated with "weight" and doesn't add much information compared (however, we have worse degrees of freedom because we need to estimate the extra parameter for weight).
@@statquest
Thanks again for your help, Josh!
So the multicollinearity problem causes the p-value in the multiple regression to be lower than if we run size~tail along? And shouldn't we normally compare the model by the adjusted R-squared?
By the way, since you are here, let me side track a little. In your gene and mouse example in other video (in Logistic Regression?), you said each gene is a variable. I am not in the biology field, so I do not know how to relate a gene can be a variable, could you elaborate and help me to understand?
(Nevertheless, As you might or might not be aware, there are typos in your texts in some of your videos? Do you care to know? Or you already knew and do not want to bother?)
@@anthonysun2193 1) Yes, we should always use the adjusted R-squared, and the adjusted R-squared gets smaller with more parameters. I only mentioned the non-adjusted R-squared because you said "I have always thought that a multiple regression would always yields better RSQ", and I wanted to point out that that statement was correct. However, multiple regression does not always yield a better "adjusted R-squared" because the adjustment accounts for the increased number of parameters. Does that make sense?
2) Here's an example: Some people have naturally black hair and some people have naturally brown hair. This is due to differences in the genes that code for hair color. So we can use that gene as a variable to separate people into different groups.
3) I am aware that my videos are full of typos. Because youtube does not allow me to edit my videos once they are posted, I am only interested in hearing about the typos if they result in confusion or in some sort of conceptual error. Depending on how bad the typo, I can mention it in a pinned comment, or I can delete the video.
Thanks for the video. I know this is for numeric data, but is there a model for predicting a categorical value? Eg predicting what career a student will choose based on scores in different subjects in high school?
Yes, it's called logistic regression. Here's a link to the videos: th-cam.com/play/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe.html
@@statquest great! I’ll take a look. Thanks for the content and your help!
Hi,
in multiple regression and we are using weight, tail and ears (or any 3rd parameter) to predict size, in the R coefficients, the weight line will compare multiple regression vs single regression?
BR
YG
The weight line compares regression with and without weight.
Let's say that instead of writing size ~ weight + tail, we write size ~ weight*tail. how does the output change and how do we interpret that?
size ~ weight*tail suggests that there is an "interaction" between weight and tail, which means there is a non-linear relationship between weight and tail.
@@statquest Thank you so much for getting back to me so fast. Your response was super helpful!
@@statquest which means there is a non-linear relationship between weight and tail,why? why not the non-linear relationship between weight and Size? Moreover the concept of interaction just as well to qualitative variables or a combination of quantitative and qualitative variables.
Good morning. At the end of the video you prompt viewers for ideas. Well, I would like to propose you this: we are given these woooowww Bioinformatics books, you know, data mining for Bioinformaticians, Introduction to Bioinformatics, introduction to Genomics and Proteomics... and well, I will be honest as for my sensation: I think that their writing style fails to get the reader to the point. We spend the whole day on half the chapter and at the end of the day we have acquired no knowledge and we just got tired without any particular reason. I mean, their stat approaches are qualitative (for example, there is no k-means algorithm or the way it is explained makes no sense) and since an bioinformatician has usually no background in ... Biology or Medicine, we cannot understand what stat method should be used when. Things became really hard in Biomarkers discovery and it seems that 'incomprehensible' stat methods outcrop (something like Rambo, who would suddenly appear from inside mad, imagine something like that). I mean, from a qualitative perspective, all angelic. But not sure about the whole stat procedure. Would it be as a fatigue for you to make some videos on "When we (should) use what?" ? Also using a general, a catholic Diagram would help the most, I think. Thank you!
It's a good idea and I'll keep it in mind.
@@statquest Thank you for consideration. Well, in order not to be unfair towards the authors, I would like to mention the following: Perhaps books are just fine and eventually incomprehension may occur due to many professors' habit to select various chapters from various books, a strategy that modulates/leaves an impression of coherence/consistency. I really do not know! And, you see, there is no time to read two or three books from start to end within a semester! For example, I give a battle to end Fourier Applications book (this specific book is great!) as a free-time self-improvement and still I 've only gone up to chapter two so far due to other obligations. Thank you once again!
Could you please answer one question?
I did the SLR for Tail and the p-value was very small (0.0006) against 0.003 for MLR. The variance in R2 was also minimal (0.83 for SLR and 0.85 for MLR). In such case using only tail to predict mouse size seems better and it matches your inference.
Now I believe the comparison method shown in the MLR video would also apply to two MLR (one with lesser variables). In such case is it better to carry out the the comparison technique you showed (the one where we compared SLR and MLR to find comparative R2 and p-value)? Or surmising the same from the p-values of each variable of the MLR summary would suffice? Here i believe we can get the p-value but still need to check R2 and would need to carry out another MLR with reduced variables anyway.
If so, is there way to carry out the comparative method in R?
thanks for the video,U are the best video I watched so far in this subject matter. However, I don't quite understand the p-values under weight and tail row.
What is your question about them?
Hello StatQuest, I have a quick syntax question. Lets say you are trying to predict mouse size based off of like 50 features measured from the mouse. Is there a way not to have to the names of all 50 features when you get to the: lm(size ~ weight + tail + ...+ 50 feature) part of the code. I tried saving the column names of the data frame in an array and doing: lm(size ~ column_names_array) but I get an error.
Thanks in advanced.
I believe you use lm(size ~ ., data=your.data)
this video was very helpful. quick question - what made you come to the conclusion that an interaction exists between weight and tail through the graph visualization (approx. 5 mins)? is it because a pattern exists? I'm trying to understand how to identify an interaction between two numeric variables graphically. What would a scatter plot look like if no interaction exists - just highly randomized distribution of data with no identifiable pattern? I think that's right, but just want to confirm. Thanks for your help - these videos are a lifesaver.
At 5:00 I say that there is a correlation between weight and tail. Are you asking about the correlation? The term you used, "interaction" is specific statistical term that means something very different, so I want to make I understand what you are asking about.
@@statquest thanks. i'm wondering about interaction and think assumed they were the same. thank you for pointing out there's a difference between the two
OK. I may make a statquest on interactions in the future (hopefully near future).
When you called the summary of the multiple regression, the computer told you that the multi model was not more significant than the simple model using tail alone. If so, why was the P score in the multi variate regression under "Tail" a little worse than it was when you called a simple regression using tail only?
I am referring to 6:45 in the video
The p-value for the simple regression, at 2:52 , is for the model: size = intercept + weight. This p-value is 0.012 and it compares two models, the one we specified, size = intercept + weight, to the null, which is size = mean value for size. The second p-value you, which is the one was asked about at 6:45, is for the model size = intercept + weight + tail compared to the model size = intercept + weight. The p-value, 0.0219, tells us that the the model is significantly better when we add "tail".
@@statquest beautifully explained!
This was a great video. Does the same applies for multiple regression models with one numeric and one factor variables (for explaining the coefficients)? I understand that you can't design a pairs plot with factor variables as they need a box plot but are the rest of the steps same as with 2 numeric? Also, don't we need to run a diagnostic plot of our model to check for assumptions validations? Great videos!
The same applies with numeric and factor models. For details: th-cam.com/video/Hrr2anyK_5s/w-d-xo.html And you should always check your assumptions and outliers.
@@statquest Thank you!!
tx sir.
bam! :)
Thank you for your nice and organized presentation. Could you refer some statistics books for biologist which you find helpful for biologist?
Unfortunately, I don't know of any good books for biologists.
Fundamentals of Biostatistics (Rosner, Fundamentals of Biostatics)
thank u for this. trying to create and plot a system for sales where the recommendation is coded in 0 and 1. tried to plot but I am getting straight lines on two cut offs say price points.
would you appreciate your help in this. to predict recommendation
this is for a training project for introduction to R
It sounds like you need to be doing logistic regression. For details, see: th-cam.com/play/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe.html
Thanks for these amazing videos. Great explanations. One question, how do I add the output to the plots? Sorry for asking for R code, I find your codes easier to use and understand.
I'm not sure I understand your question. What time point (minute and seconds) are you asking about?
@@statquest my question is, what R code can I use to add R squared and the associated p value to the plots?
@@kennethssebambulidde9238 See: stackoverflow.com/questions/3761410/how-can-i-plot-my-r-squared-value-on-my-scatterplot-using-r and lukemiller.org/index.php/2012/10/adding-p-values-and-r-squared-values-to-a-plot-using-expression/
how do you plot the multiple regression line? Like in the first example you used abline(regression) and I thought it might also work for the multiple one, but abline(multiple.regression) doen't work.
I don't think there is a straightforward way to do this. However, I found this discussion on how to do it by hand: stackoverflow.com/questions/17615791/plot-regression-line-from-multiple-regression-in-r
@@statquest Many thanks! I'll check that out
Hi Josh. Thanks for the excellent videos, it really helps me to understand statistics. I have a question about different types of statistical methods: If you want to categorize different statistical methods, is it right to say that there are three main categories called:1)Orthodox statistics, 2)Data-Driven methods, and 3)Bayesian inference?
Hmm.... I usually think of two categories of statistical methods: frequentist and bayesian. Can you give me an example of a "data-driven" method?
@@statquest Sure. By data-driven I mean genetic algorithms, ANN, and machine learning.
@@nasrintaghavi6877 Ahh! I see. It's a good question, and a rather philosophical one. First you have to define Statistics. Then you have to answer the question, "are all machine learning algorithms a type of statistics?" I don't know the answer to that. I agree that there is a category of "data-driven methods" - however, I'm not sure all of them would also be considered statistics. It is my hope someone else might chime in on this thread and give their thoughts. If not, I'll post it to the StatQuest community page.
@@statquest Yeah, actually that's the problem. I don't know if data-driven methods are in the statistics category at all, and if yes, are they in Frequentist category or they are another category by themselves. I'm reading a review paper in my field and they have categorized machine learning in the statistical category but haven't devided statistical methods into sub-categories.
@@statquest Also, do you know any good reference that has explained statistical methods' categories clearly?
whats the different with multivariate regression. COuld u please make video about this 🙂
I believe that multivariate regression allows more than one outcome or response variable. I guess it would be like having multiple y-axis variables.
But looking at the weight x tail graph, seems like there's a correlation between the two variables. So shouldn't one of the variable be removed to prevent multicollinearity?
That's what we ultimately do at the end.
@@statquest rewatched the end and yeah you did. It wasn't explained it in terms of multicollinearity though (unless I missed it?) which was why I asked haha.
On a seperate note, I'm trying to learn more about interaction terms in multiple regression but didn't come across any video from you about it. I'm trying to run a regression model but the interaction terms have high vif (which is probably obvious since it's an interaction term). Should the interaction term still be included?
@@rongxuantan5406 What is the goal of your model? If you just want to make predictions, you can apply regularization to it (Elastic-Net) and that will take care of any multicollinearity problems.
@@statquest thanks for replying! I'm trying to see how a change in the independent variables will affect the dependent variable. So basically I'm trying to find out the coefficients and whether the variable has a statistically significant effect
@@rongxuantan5406 In that case, you might still try regularization (lasso) and see what variables remain at the end.
Thx for the video.
But i have one question, where can i see the values for sloope1 and sloope2?
The slopes are in the "Estimate" column of the summary() output. See 5:50.
I have written a code but I ran into a problem. Can I send the code to you so that you could let me know the problem?
I have seen in minitab an option where you can get a table that summarizes information (r sq, r sq adjusted, mallows cp and vif) for an dependent variable vs every COMBINATION of independent ones. Is there something similar in R and how can I do that?
Can you or anyone interprete this :
Residuals:
Min 1Q Median 3Q Max
-84.878 -26.878 -3.827 22.246 99.243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.566e+02 1.232e+02 -4.518 4.34e-05 ***
incpc 7.239e-02 1.160e-02 6.239 1.27e-07 ***
pop 1.552e+00 3.147e-01 4.932 1.10e-05 ***
urbanpop -4.269e-03 5.139e-02 -0.083 0.934
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.47 on 46 degrees of freedom
Multiple R-squared: 0.5913, Adjusted R-squared: 0.5647
F-statistic: 22.19 on 3 and 46 DF, p-value: 4.945e-09
The way to interpret this output is described at 5:45
@Josh Starmer Can you please throw some light on multicollinearity and Variation Inflation Factor (VIF). These are useful concepts when it comes to multiple regression.
Awesome! Do you have plans for polynomial regression?
Thank you! I did not have plans for polynomial regression - but I could make a video pretty easily - it's the same as multiple regression except instead of just plugging in "weight" to predict "size", you plug in "weight-squared", or "weight-cubed" to predict "size". That's all there is to it.
@@statquest How can I determine the proper degree? What's the intuition behind it?
This is a great question -- the answer is..."it's complicated". Sometimes you know in advance what the relationship is. If not, you can plot the data and see if the relationship is obvious. If not, you can start with a simple model. For example, say like you want to use "age" to predict "size". The model might be, size = y-intercept + B1 * age + B2 * age^2. So that would be a polynomial of the form, y = a + bx + cx^2. You can then look at the residuals and see if they are relatively random or form a pattern. If they are relatively random, you're then look to make sure the coefficients, B1 and B2 are both significantly different from 0. If so, you are done. If not, then you can probably remove that term from the equation. If the residuals form a pattern, you can add another term... age = y-intercept + B1*age + B2*age^2 + B3*age^3... etc. Does that make sense? The length of this answer is a good argument for me to make a video, but it might be a while since I'm working a lot on the machine learning stuff right now.
@@statquest thanks!! So it is all about tuning not about interpretation?
I am waiting for more machine learning stuff!! Thank you!
Renee Liu I believe so. More machine learning in the next week or two! :)
Hello Josh please help me to understand below points:
1. If we are considering p value to identify whether variable is significant or not then why don't we remove weight from final model.
2. After adding weight and tail in model our p value is significantly improved what is the reason behind that?
3. If combinations of weight and tail improve model then weight also needs be significant why it is not?
Thanks in Advance
Hi. Is there any way to perform multiple linear regression on raster time series images?
Maybe! I've never done it before.
@@statquest okayy thannxxx
Could you go through some other basic data visualisation and stats methods in R? Your PCA video was very handy, but I'd love help with e.g. CVA, MANOVA
after regression for the 3 variables, how do we plot all 3 in a given space together? abline was used for simple regression but here using it, cant be used for 3 variables.
You can't, unless you're using a 3D graph. Otherwise, you can use Principal Component Analysis for reducing the the number of variables from 3 to 2, which you can then plot (there are some videos in the channel explaining how to do this)
Do you have a video explaining what it means to control a variable statistically? Some of the explanations I'm finding on TH-cam just use the words in the term as the definition and it's not intuitively clear to me to sink in :/ Btw thanks a lot for these clearly explained videos 👍😬
I don't have a video on the topic, but the idea is pretty simple. If you want to find out if a drug is effective or not, you give it to one group of people and another group of people do not get the drug and a 3rd group of people get a placebo. This third group of people controls for random things happening that are not associated with the actual drug and it is an example of "controlling for a variable".
@@statquest Thank you. I understand it from that experimental design perspective, but for some reason I can't visualize it when someone says it about controlling variables in a statistical model. Say for data a researcher didn't collect. Are they essentially running a multiple linear regression and comparing the output to see if the model has a better R-squared than one with the single variable like you described in this video?
@@daneshj4013 I'm not sure I understand what you mean. In my example, we have three variables: One variable controls for someone not taking any drug at all by representing that set of measurements. One variable that controls for the placebo affect by representing that set of measurements. And one variable that controls for the effect of the drug, by representing that set of measurements.
@@daneshj4013 If I understand you correctly, you are talking about batch (or confounding variables) effects. Data was collected multiple times, by multiple techs, in multiple labs. Or maybe the data is sequencing data that is from the same experiment, but was run on different machines in order to achieve full read depth. You would use the same design as described in the video, but you should also perform PCA to determine if batch effects exist. Even if you are missing essential information about the samples (say gender or machine type), PCA might force you to ask better questions that can be used to direct future research.
@@statquest I think he meant, "What do they mean when they say to make a linear model and control for [blahblahblah] variable?" As in, including the variable in your model = controlling for the variable in the model. "The effect of blahblahblah variable when controlling for all other variables in the model is ...."
a small doubt. why don't you check multicollinearity? if we have more than one independent variable, that is needed.
Sure, you can definitely do that.
@@statquest thanks. It is defiantly not easy to give all the potential steps to do a real trustworthy regression. I guess the analysts need to get experience themselves.
@@juanwang3705 Alternatively, you can apply ridge or lasso regression techniques do deal with this type of problem automatically. For details, see: th-cam.com/video/Q81RR3yKn30/w-d-xo.html th-cam.com/video/NGf0voTMlcs/w-d-xo.html th-cam.com/video/1dKRdX9bfIo/w-d-xo.html and th-cam.com/video/ctmNq7FgbvI/w-d-xo.html
hi Josh,loved your video.
Can you hint on a multilinear regression formula with a constant slope and changing intercept.
If the slope is constant, then you can just plug that into the formula. For example y = intercept + 23 * height. In this case, we plug in 23 for the slope, but the intercept is still something we solve for.
th-cam.com/video/LoocDAbgwlM/w-d-xo.html
Thank you for your video! However, if I use more than 2 variables, e.g. 5 variables, to build a linear model, how could I interpret the summary of the linear regression. Especially the 6 lines for intercept and 5 variables, what will the p-value means for each variable? Will it compare the model to the model without that variable? Thanks so much.
I had the same question. For example if you have 5 variables (a,b,c,d,e) and want to see if you can predict a by multiple regression of b,c,d and e, so your equation would look like this : a = y-intercept + slope1 x b + slope2 x c + slope3 x d + slope4 x e. You will get 4 lines in your coefficient part of summary. Each line would correspond to b,c,d or e and have a p-value (Pr(>|t|) at its end. For example, you would get a p-value = 0.45 for line b, which would mean that you compare the multiple regression model (previous equation) to another multiple regression model in which you take into account all other variables except b! The compared equation would look like this: a = y-intercept + slope2 x c + slope3 x d + slope4 x d. So in some way you're checking how b is helpful in your data to predict a. In my example, the p-value = 0.45 > 0.05 (if alpha = 0.05) so it means that the multiple regression model including the b variable is not significantly better to predict a compared to the multiple regression model without the b variable. In conclusion, you could get rid of b to predict a. etc etc for each line
@@melaniee467 Thanks Melanie
@@melaniee467
BAM!! :)
:)
hi Josh can you explain to me why we use towby2 command in R
Sorry, I've never used the twoby2 command.
I wanted to kill myself after that intro but I've grown to love it now. But anyway thanks for the stats help, keep it 100.
Thanks! :)
For each explanatory variable you need to have at least 10 observation for each as of the thumb rule I know. But your sample size is too small!
Noted!
First #craycray
Thank for the video, however, the intro is very bad, especially the singing - why would you ruin the video with that intro?
Because I like to sing.