Best explanation on TH-cam for this topic, thank you.
I agree with other comments. This is the best explanation of this topic on TH-cam
I loved this video. A nice follow up, would be a video where you go much deeper into the theory and explain the math behind these kind of plots. Thank you.
I can't express how good this video was. Thanks 😊
Right vs. left skewness is depicted the opposite way. The picture on the left is skewed to the left, and the picture on the right is skewed to the right.
Are you talking about the plots at about 8:30? The left plot has fewer observations strung out at higher values, which corresponds to right skewed (skew goes in the direction of the long tail). The reverse is true for the plot on the right.
@OpenIntroOrg Thanks for the response. I'm sorry, I was wrong. It seems one cannot judge skewness from a histogram drawn from the first examples in this video, because the value axis runs from high values to low in those histograms. They would need to be "mirrored" first before deciding skewness.
@OpenIntroOrg Skewness typically shows up as the MEAN of the data set differing from the MEDIAN of the data set. Side note for others: on the histogram, lower values of the data are to the left, with higher values of the data to the right.
So a RIGHT skewed data set means that the MEAN of the data set is higher than the MEDIAN of the data set. There will be a higher density of observations to the left on the histogram. This can seem opposite of what the histogram looks like, but the skewness is determined by calculations from the data set.
A LEFT skewed data set means that the MEAN is lower than the MEDIAN. There will be a higher density of observations to the right on the histogram.
In a perfectly normal data set, the MEAN and MEDIAN will be approximately equal.
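The mean-vs-median heuristic above can be checked in a few lines. This is a minimal sketch using only the Python standard library; the exponential sample is just a hypothetical stand-in for a right-skewed data set:

```python
import random
import statistics

random.seed(1)
# A right-skewed sample: exponential draws have a long right tail.
sample = [random.expovariate(1.0) for _ in range(10_000)]

mean = statistics.mean(sample)
median = statistics.median(sample)

# The long right tail pulls the mean above the median.
print(mean > median)  # prints True
```

Mirroring the sample (negating every value) flips the tail to the left and puts the mean below the median, matching the LEFT skewed case.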
@aCllips Thanks for this explanation
Probably the best explanation video out there
I thought so too.
Recipe for QQ-plot (quantile-quantile) in R:
## In R, a key observation is that the "pnorm" and "qnorm" functions are inverses of each other.
## To construct a QQ-plot of N observations (random samples here, as a minimal sketch):
nn <- 100                ## number of observations
y <- sort(rnorm(nn))     ## sorted sample data (a random normal sample)
p <- (1:nn - 0.5) / nn   ## evenly spaced probabilities
x <- qnorm(p)            ## matching standard-normal quantiles via qnorm
plot(x, y)               ## empirical (y) vs. theoretical (x) quantiles
Here is exactly what I was looking for. Thank you very much!
just what I was searching for ....... Nice job !!
so, the x axis here is the z score values and the y axis is the actual values? and plotting it against one another as seen here, we should see how it lines up? the better the linearity, the more 'normal' the distribution?
Basically yes :) The x-values are the Z-scores we would expect if the population and sample were perfectly normal. So the straighter the line, the more encouraging it is that the data are nearly normal. That said, no population is perfectly normal, and even a sample from a truly normal distribution will not be perfectly straight, just due to random sampling. In practice, the main goal of this type of plot is a basic check that nothing too wonky is going on and the population is roughly normal.
seems like it
Simply awesome! Thanks for sharing this!
In the textbook, I found the QQ-plot explanation to be lacking. Here, too, a number of key attributes are missing. First off, we must order the empirical observations (y-axis), as noted in previous comments. An explicit definition of "quantile" in earlier lectures would set the stage here, motivating "theoretical quantiles": the quantiles of the standard normal associated with the empirical probabilities (e.g. regularly spaced probabilities).
Hi Gunning, thanks for the feedback. In short, this is a "special topic" that isn't covered in most intro stat courses (though some do cover it), so we breeze through on theory here and get to the practical application of the method. We don't expect anyone to walk away from this video able to reconstruct this type of plot -- only be able to read one.
I updated my comment to put the "recipe" in a separate comment for curious readers. For context, I'm currently using the text to teach intro stats. This is my first semester with the department, but the department has used this text for several semesters.
I absolutely understand the concern about special topics and coverage. My *personal* feeling is that the text should either include a discussion of QQ-plot along with 3-4 sentences of discussion of construction, or omit it altogether. That said, I would argue that understanding how the plot is constructed is critical to correctly reading it!
Thanks for the video, it's been helpful. Kudos
Simply splendid
Very useful!! Thank you
Thanks for the video. How do you generate the line for non-normally distributed data? I understand that for normally distributed data the line has a slope equal to the SD and an intercept equal to the mean, with the x-axis showing the z-score and the y-axis the actual data value. But what about a non-normal data set? How exactly do you calculate the x-axis value for each data point, and the y-values for the straight line?
Thank you so much for an excellent video
Very instructive, thanks
So why is that? Why don't you explain the reason the data don't fit the line?
Can we use the slope of the probability plot to measure the population variance of a sample?
The line doesn't quite represent this, especially when the distribution has longer tails than a normal distribution, so it is good to calculate the sample variance separately.
Also, sorry to nitpick, but a clarification to avoid confusion for others: we'd describe "population variance of a sample" as "sample variance", and to further remove any ambiguity, we divide by (n-1) when computing the sample variance (while population variance is often computed by dividing by n).
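The n-1 vs. n distinction in a tiny sketch (standard library only; the data values are arbitrary):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
m = sum(data) / n
ss = sum((x - m) ** 2 for x in data)  # sum of squared deviations = 32.0

print(ss / (n - 1))  # sample variance: divides by n - 1
print(ss / n)        # population variance: divides by n
# The statistics module makes the same distinction:
print(statistics.variance(data), statistics.pvariance(data))
```

The gap between the two shrinks as n grows, which is why the distinction matters most for small samples.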
So, my data is skewed and non-normally distributed - What's to be done?
Do I perform some transformation to force normality, or do I rather just perform non-parametric tests?
Unfortunately, it's easier to say "something might be risky or broken here" than "this is how to fix it". What is required depends heavily on the circumstances, both the data and the goals of the analysis:
- If the sample is large enough and/or the skew isn't severe, then non-normality will not matter for some statistical methods. For example, if all your observations are within ~4 SDs of the mean, there are 30+ observations, and the method being applied is a t-confidence interval for the mean, then the skew isn't much of a concern because the Central Limit Theorem will have kicked in to the point that the skew won't matter much.
- A more robust method might help. However, be aware that "nonparametric" does not automatically mean "robust". For instance, the bootstrap percentile method is less robust than t-distribution methods when the sample size is relatively small.
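The Central Limit Theorem point above can be illustrated with a quick simulation (standard library only; the exponential population is a hypothetical stand-in for skewed data):

```python
import random
import statistics

random.seed(3)

# Population: exponential(1) -- clearly right-skewed (mean 1, median ~0.69).
# Draw many samples of size 30 and record each sample mean.
means = [
    statistics.mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(5000)
]

# By the CLT the sample means pile up nearly symmetrically around 1, so
# their mean and median nearly agree even though the population's don't.
print(round(statistics.mean(means), 2))
print(round(statistics.median(means), 2))
```

With a smaller sample size per draw, or a more extreme tail, the distribution of means stays visibly skewed, which is when the t-interval becomes less trustworthy.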
Very Helpful
Well, the name is "normal probability plot". a) Why are they called probability plots? b) Why is the plot of the observed data against z-scores supposed to be a straight line? I can understand that if the data fit well it's a measure of goodness of fit, but I don't understand why it has to be a straight line.
I'm doing my thesis right now, and my data is not normal. What should I do? 😭😭
Data is never perfectly normal, so you're in good company. Check out OpenIntro Statistics Section 7.1, which offers a couple of rules of thumb on the bottom of the first page of that section. The book is free online as a PDF from our website, see:
www.openintro.org/book/os
Thank you!
thank you
Very helpful!
Good one