NOTE: This StatQuest was brought to you, in part, by a generous donation from TRIPLE BAM!!! members: M. Scola, N. Thomson, X. Liu, J. Lombana, A. Doss, A. Takeh, J. Butt. Thank you!!!!
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Why is the standard 0.05? What can you say about changing the significance level to prove a point? Why 5%?
@@lcoandrade He has no answer to your question because p-values are really one big hack that doesn't prove anything at all.
Search for: "Cohen - Things I have learned so far."
I studied industrial engineering; however, engineering in Mexico tends to take a far too "practical" approach, so my statistics teachers were more focused on "formula -> result -> verdict" rather than a true explanation of why we do things. This series has helped me a lot in understanding the whole movement behind some of the everyday uses I give to my statistics knowledge. That's truly worthy of a triple BAM! Thank you for sharing your knowledge.
Muchas gracias!!! :)
@@statquest Just found this site and these videos today; how do you say "BAM!" in Spanish??! :)
@@evianpullman7647 BAM!, Doble BAM!!, y Triple BAM!!! :)
@@statquest :) you the da-man !! later i want to be a Triple Bam member, (when i get my Moneys$$$ stuff in order-repair).
@@evianpullman7647 Thank you!
That's the most enthusiasm I've seen anyone show when explaining hypothesis testing))) Thanks for making this clear!
Thank you! :)
Just can't leave your videos without giving a Like!!!! Thank You for making our life easy with your pedagogy!!
Thank you very much! :)
Good content indeed. Sometimes students find it easy to use the critical value method but have difficulty with the p-value. Your content explains it clearly!
Awesome! Thank you very much. :)
So tempted to send this video to my former PhD director... 🙄
Thank you for your immensely valuable content!!! Learning more on your channel than in class 😅
bam! :)
Him: "Imagine there's a virus"
Me: Yea I'm there
Yep...
Damn virus :(
Congratulations and thank you for the videos! I appreciate the clarity, simplicity, and humor in your content. While you probably don't need more compliments, I wanted to express how much I enjoy it and how I find it to be a brilliant way to convey key concepts in statistics. BAM!
Thank you very much!
I am a great fan of your work. Here is an idea/need to enhance learning: It would be helpful that after a video or a set of videos you give us problems (or homework) to learn better. In further videos, you could give us the answers. Thank you
i second that
That's a great idea, and thank you for supporting StatQuest!!! :)
"Instead of feeling great shame" made me laugh out loud hahah
Great lead-out!
Thank you very much! :)
You are amazing. I have taught statistics, but learn something from your videos every time!
LOVE YOUR VIDEOS!!! After discovering one of your videos, I started to review my statistics from the very basics. Here is a little question. I can't figure out how to calculate the p-value for a statistical test that compares the means of two sets of samples, based on the things I learned from the last one, 'How to Calculate p-values'. In 'How to Calculate p-values' I understand that calculating the p-value of one event shows whether the occurrence of that event is special (
It sounds like you are asking how to calculate a p-value for a t-test. There are two ways to do it. The "normal way" (and I don't have a video for that), and the "more flexible linear model" way. I have a video for that. Here's the playlist: th-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
@@statquest Wow thank you for the reply!
Such a lucid explanation!!!.. You make stats so much more fun!!. Thank you
Thank you! :)
7:09 for "Ohhh Nooo"
:)
Thank you for the video! I have a question though. Since we are testing different drugs, why do we need to consider the p-values of other drugs for False Discovery Rate? I thought these were independent events.
I believe this is an intrinsic problem with the way these screening experiments work. Even if the drugs are independent, given the multiple testing there is still the problem of inflated type I errors. The same would occur when measuring 100 uncorrelated variables in a homogeneous two-sample (null) setup: you'd expect false discoveries even if the variables are themselves independent and equal between the groups. Also, there are FDR procedures for both independent and dependent scenarios, e.g., the BH and BY methods.
Waiting anxiously the Power Analysis video!!! As always Josh, thank you
th-cam.com/video/VX_M3tIyiYk/w-d-xo.html
This is a great non-technical explanation.
Thanks! :)
Really appreciate the videos. Helps me a great deal understanding what's behind the formulas and what their numbers mean.
Thanks!
I almost understand the first example about drugs A to Z, except for one difficulty. If we assume that the distributions for each drug are independent, then from the perspective of drug Z there's only one test, so it wouldn't have the multiple testing problem.
If something has a 5% chance of happening, then 5% of the time we do it, we'll get a false positive. So now imagine we tested each of the 27 drugs 100 times. That means roughly 5 of the 100 tests for drug A are false positives, 5 of the 100 tests for drug B are false positives, etc. Now line the results up in 100 rows, where each row contains one test per drug, so each row is a mixture of true negatives and false positives. When we test each drug one time, we get one of those rows, and it's very likely that it will include at least one false positive.
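For anyone who wants to see this with numbers, here's a minimal Python sketch (not from the video; the sample sizes and group means are made up) that tests 27 useless "drugs" and estimates how often at least one of them looks significant:

```python
# Simulate testing 27 drugs that do nothing: both groups in every test
# come from the same distribution, so every small p-value is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_drugs, n_per_group, alpha, n_sims = 27, 20, 0.05, 1000

count_at_least_one = 0
for _ in range(n_sims):
    got_false_positive = False
    for _ in range(n_drugs):
        a = rng.normal(loc=10, scale=2, size=n_per_group)  # "drug" group
        b = rng.normal(loc=10, scale=2, size=n_per_group)  # "no drug" group
        if stats.ttest_ind(a, b).pvalue < alpha:
            got_false_positive = True
    count_at_least_one += got_false_positive

print("P(at least one false positive) ~", count_at_least_one / n_sims)
print("Theory: 1 - 0.95**27 =", round(1 - 0.95**n_drugs, 3))
```

With 27 tests at the 0.05 threshold, the chance of at least one false positive is about 1 - 0.95^27 ≈ 0.75, which is why adjusting the p-values (e.g., with FDR) matters.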
Thank you so much for explaining probability concepts with such good pedagogy (and positive energy). These lessons are usually soporific.
Question: why do we need to re-test new samples of recovery WITHOUT any drug every time? Can't we just compare the results of each drug to the results of "no drug"?
As in, in practice, we would only have 1 placebo group against which we compare the results of a given drug, wouldn't we?
Sure, if you did all the tests at the same time, you could use the same placebo group for all the tests, however, that doesn't change the possibility of false positives.
Thank you. You are a awesome teacher!!
Wow, thank you!
Great work Josh! Could you also do a video on how to compute the effect size? Would effect size be a better replacement for p-value?
Here's a video that shows one way to compute effect size: th-cam.com/video/VX_M3tIyiYk/w-d-xo.html
I feel like this video needs way more views...
Thank you! :)
In the first example, going through drugs A...Z, you dealt with rejecting the null hypothesis that there's no difference between drug Z and not taking a drug. The next example deals with taking samples from a distribution of people who did not take a drug. At 4:30, what does it mean that the null hypothesis is that the two groups came from the same distribution? What would it mean for one to come from a different distribution?... Would it correspond to a different distribution of people who took medication instead? I'm doing my best to follow along but this one has me confused. P.S. Thank you for the wonderful videos!
It might be helpful to first watch my video on how to calculate p-values: th-cam.com/video/JQc3yx0-Q9E/w-d-xo.html and my video on the null hypothesis: th-cam.com/video/0oc49DyA3hU/w-d-xo.html However, to answer your questions, we're trying to decide if there is a statistically significant difference in the two samples (note: there will always be differences in the two samples due to little random things that happen, so just seeing small differences isn't enough to decide that there is a profound difference between the two groups). If the difference is statistically significant, that suggests there is a true difference - and maybe one drug works better than another. However, the statistical test is not perfect. Sometimes the statistical test will say there is a difference, even when we know there isn't one (because both groups of measurements came from the same "distribution" of people - for example, people that all took the same drug, or people that didn't take any drug at all). When the test fails, it suggests that there is a profound difference in the groups, even though we know there isn't one.
Thanks for your clear explanation !
Thank you! :)
The p-hacking culture is so endemic that sometimes I get so demotivated that I seriously consider leaving the field entirely. It's depressing. Thanks, Josh. Sorry about the vent.
Noted!
Big fan of StatQuest, really appreciate the work and humor you put into this.
Just a question on the approach: why do we need a different control group for each of the drugs for comparison? Could we have one control group that doesn't take any drugs, and compare it with all the treatment groups that take the different types of drugs? Are there pros and cons of doing this compared to what's done in the video?
You could do it that way, but you'll still run into the same problem.
@@statquest Thanks for the prompt response. Yeah, I agree that's by no means a solution to the p-hacking problem, or even an attempt to mitigate it.
I was wondering what the pros and cons are of having a different control group for each drug vs. one control group for all. Under what scenario should I pick one approach over the other?
@@alsonyang230 It all just depends. You want to control for all things other than the drug or "treatment" or whatever you are testing. So, if you can do everything all at the same time, you could just collect one control sample. But if there are changes (like time of year, or location), then you need to get extra controls.
@@statquest Ah yeah, that makes a lot of sense. Thanks for the explanation!
Awesome video as always! Would love to hear your thoughts on moving the standard alpha to 0.005 or some alternatives to p-value reporting (surprisals/s-values) in a future video too!
When we talk about Bayesian stuff, we'll talk about alternatives to the p-value. As for changing the "standard" threshold for significance. That's always been a cost/benefit/risk balance and ends up being field specific. For example, in a lot of medical science, the threshold can be as high as 0.1 because it makes sense in terms of cost/benefit/risk.
Hi Statquest great video! I watched this video and your power analysis video and I have one quick question. If you already collected preliminary data, can you perform a power analysis on that data or would that be considered p-hacking as well? Thanks
You should use your preliminary data for doing a power analysis.
Amazing video, thanks Josh. One question: How did you get the p-value for 2 means ? Is there any video for that ?
You can do it with something called a 't-test'. There are two ways to do t-tests, the traditional way that is very limited, or you can use linear regression and it opens up all kinds of cool possibilities. If you want to learn how to do it with linear regression, see: th-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
@@statquest Thank you so much.
Thanks a lot for your insightful tutorial
That was really useful :)
Glad it was helpful!
I am a little bit confused that, as you mentioned previously, p
When we compute a p-value, we select a "null distribution", which is usually "no difference". If the null hypothesis is correct, and there is no difference, then a p-value threshold of significance of 0.05 means that there is a 5% chance we'll get a false positive.
@@statquest So can I understand it this way: say the p-value is 0.05; when our observed data lands in the 5% tail of the null distribution, there is also a 5% chance we incorrectly reject the null hypothesis if we decide to reject it, but we are still 95% confident in rejecting the null hypothesis.
@@alexlee3511 The p-value = 0.05 means that there's a 5% chance that the null distribution (no difference) can generate the observed data or something more extreme. To be honest, I'm not super comfortable interpreting 1 - 0.05 (alpha) outside of the context of a confidence interval.
This series is fantastic. For the rest of the pandemic, stats instructors should just kick their feet up and redirect their e-campus courses to this channel.
Sounds good to me!
Could you please explain how you calculated the p-value in these comparison examples?
Because these were just examples, I think I just made up values that seemed reasonable.
@@statquest I am not trying to justify the correctness of the number, just wondering how to calculate the p-value in cases like the example. I don't recall you mentioning it in previous videos. Thanks.
@@chendong2197 With this type of data we would use a t-test. I explain these in my playlist on linear models (which sound much scarier than they really are): th-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
Quality content. You've earned a subscriber.
Thanks!
If each drug trial was a different, separate study though, wouldn't that mean the p-value false positive would still stand, because the authors are acting independently? So you get different results depending on whether one person trialled all the drugs or a different person trialled each drug.
With the p-value threshold specified at 0.05, we expect 5% false positives - and, overall, this is a good trade-off of cost/benefit/risk/reward. So, if different people are doing tests, sure, they will get false positives from time to time. But the goal is to limit them within our own study, so we adjust the p-values with FDR.
Great video! I have a question about drug testing. Is it better to perform an experiment that tests all of the drugs at once instead of testing each candidate one by one? Also, if I want to test 6 candidates versus a control, each of them has three technical replicates, and I performed the same experiment three times, should I use the mean of the three technical replicates from the three experiments to calculate the p-value? Or are there better solutions for the experimental design? Looking forward to your reply!
If you do all the tests at the same time, that's called an ANOVA, and you can do that - it will tell you if something is different or not. However, it won't tell you which test was different. In order to determine which test was different, you'll still have to do all the pairwise tests, so it might not be better. So it depends on what you want to do.
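If it helps, here's a minimal sketch of the ANOVA-then-pairwise idea using scipy; the numbers are invented, not from any real trial:

```python
# One-way ANOVA tells you whether *something* differs across the groups,
# but not which group; pairwise tests (plus FDR adjustment) are still needed.
from scipy import stats

control = [10.1, 9.8, 10.4, 10.0, 9.9]
drug_a  = [10.2, 10.0, 9.7, 10.3, 10.1]
drug_b  = [12.1, 11.8, 12.4, 12.0, 12.2]

anova = stats.f_oneway(control, drug_a, drug_b)
print("ANOVA p-value:", anova.pvalue)

print("control vs drug_a:", stats.ttest_ind(control, drug_a).pvalue)
print("control vs drug_b:", stats.ttest_ind(control, drug_b).pvalue)
```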
@@statquest Many thanks for your reply! What about the technical replicates? If I have 6 technical replicates for each biological replicate, should I use the mean of the technical replicates to stand for each biological replicate?
@@yijingwang7308 If you have technical replicates, then you probably want to use a "general linear model." Don't worry, I've got you covered here: th-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
Hey Josh, thanks for the videos. I follow the 66daysofdata playlist and am curious if we'll learn the statistical tests you mentioned in the video
What time point, minutes and seconds, are you asking about?
@@statquest such as 05:02
@@konstantinlevin8651 Yes, I teach how to compare means in this playlist: th-cam.com/play/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU.html
Thank you A LOT for this video!!! Please, answer my question: which correction method for p-values should I use, if I make several (say, 30) comparisons, but each 2 groups of observations come from different distributions? To be specific, I compare two variants of 5 different enhancer sequences, each in 6 cell lines, using luciferase reporter technique.
I should have clarified that it doesn't matter if you mix and match distributions. When we do multiple tests, regardless of the distribution, we run the risk of getting false positives. So I would recommend adjusting your p-values with FDR (false discovery rate).
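In case a concrete example helps, here's a minimal sketch of a Benjamini-Hochberg (FDR) adjustment with statsmodels; the p-values are made-up stand-ins for your 30 comparisons:

```python
# Adjust a set of raw p-values for multiple testing with the BH procedure.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for p, q, r in zip(raw_p, p_adj, reject):
    print(f"raw p = {p:.3f}   FDR-adjusted p = {q:.3f}   significant: {r}")
```

(method="fdr_by" is the Benjamini-Yekutieli version for dependent tests.)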
Can you please explain why increasing the sample size after measuring the p-value would increase the likelihood of a false positive? I would think it would be the opposite. If you're adding 2 observations to each set, it's more likely that each of these observations is closer to the distribution mean than far away from it. This would imply that the sample means of both sets are more likely to come closer together (the p-value would increase) than to move apart (the p-value would decrease).
Unfortunately that is not correct. Even if the new points are closer to the mean, there is a 25% chance that the new point for the observations with higher values will be on the high side of the true mean AND the new point for the observations with the lower values will be on the low side of the true mean.
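To spell out that 25%: if both groups really come from the same distribution, each new point independently lands above the true mean with probability 0.5, so the chance that the "high" group's new point is high AND the "low" group's new point is low is 0.5 × 0.5 = 0.25, and in that case the sample means move further apart rather than closer together.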
Can you clarify what is meant by "one more measurement" versus "adding more data"? I assume they are the same in this context, regarding the "a subtle form of p-hacking" segment.
What time point, minutes and seconds, are you asking about?
What would you say to those who report a p-value = 0.06 as "almost significant" or "suggesting possible significance"?
I'd say "do a power analysis and then re-do your experiment".
intro is a banger
:)
In the experiment from 0:29 - 2:18, what if drug Z was actually a better drug? P-hacking would happen if we tested the results of drug Z itself multiple times until we got a false positive. In the example mentioned above we just tested 1 drug relative to other drugs, which gave a p-value less than 0.05, so why would that be a false positive?
If Drug Z was actually a better drug, then the times when we failed to detect a difference would be false negatives and the times when we detected a difference would have been true positives. So that means it is possible to make both types of mistakes - if Drug Z is not different, then we can have false positives, and if Drug Z is different, then we can have false negatives. The good news is that we can prevent false negatives by doing a power analysis. For details, see: th-cam.com/video/Rsc5znwR5FA/w-d-xo.html and th-cam.com/video/VX_M3tIyiYk/w-d-xo.html
Let's say I am developing a new method of analysis and it gives me a p-value. Is it ok to keep changing my method to get a better p-value? Would this be p-hacking as well? Thank you for your videos! It helps me a lot!
I don't think so. For example, if you ran the wrong test on your data, then realized your mistake and ran the correct one, I don't think you should be penalized for that. However, in your example, you have to make sure that when you modify your test, you're not modifying it just to get a small p-value. Instead you are modifying it in ways that are statistically justifiable in a broad sense.
Hi, I still don't understand the section at 1:47: why is it p-hacking to reject the null hypothesis for drug Z? Isn't this unlike the later example, where the samples are taken from the same population? I presumed we'd call it a different population for each drug the samples are drawn from.
Say we have 100 drugs and none of them are effective - they are all variations on a sugar pill. Then we test all of them, like in this example. Well, due to random sampling of the people taking the pills, there's a good chance that in one of those tests a bunch of healthy people take one pill and a bunch of sick people take the other. This will make it look like there is a difference between the two pills even though it's not the pills - it's just the people that took them. Thus, this is p-hacking because it looks like the pills are different.
Thank you!
You're welcome!
How do you calculate a p-value based on two means (from two samples)? Isn't a p-value calculated between a number and a distribution? What is the exact process of using the two mean parameters to calculate a p-value on a distribution?
We can calculate p-values for two means using something called a t-test. For details, see: th-cam.com/video/nk2CQITm_eo/w-d-xo.html and then th-cam.com/video/NF5_btOaCig/w-d-xo.html
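For a concrete (made-up) example of what such a t-test looks like in practice, here's a minimal Python sketch; the recovery times are invented, not the video's data:

```python
# Compare the means of two small samples with an independent two-sample t-test.
from scipy import stats

no_drug = [6.1, 7.4, 8.0, 6.8, 7.2]   # hypothetical recovery times (days)
drug_a  = [5.2, 5.9, 6.3, 5.5, 6.0]

result = stats.ttest_ind(no_drug, drug_a)
print("t =", result.statistic, " p-value =", result.pvalue)
```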
Great explanation!!:)
Thanks! 😃
@@statquest DOUBLE BAM!!:)
Amazing! I have a question about the Benjamini-Hochberg method. Is this method only applicable to parametric tests such as the t-test and chi-square test, or is it also applicable to non-parametric tests such as the Wilcoxon signed-rank test? Thanks a lot.
It works with all tests.
@@statquest Thanks a lot
Hi, thank you for your great work !
However, there is something that I have a hard time processing. I understand the last example of p-hacking, but we are expected to do the same experiment 3 independent times. How do we analyze and represent these results without p-hacking?
Is it really true that you are expected to do the same experiment 3 times? Or just that you are supposed to have at least 3 biological replicates within a single experiment? If you really are doing the exact same experiment 3 separate times, then you could avoid p-hacking by just requiring all 3 to result in p-values < whatever threshold for significance you are using.
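(For what it's worth, assuming the experiments are independent and the treatment does nothing, each experiment has a 5% chance of a false positive at the 0.05 threshold, so requiring all three to be significant drops the combined false-positive chance to roughly 0.05^3 = 0.000125 - a very conservative criterion.)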
@@statquest Thank you !
A nice lecture.
Thanks!
"...and compare these two means and get a p-value = 0.86."
Wait, how do you compare two means to get a p-value? On the Statistics Fundamentals playlist I'm working through, it's only been explained so far how to determine the p-value of a certain event happening (like a Brazilian woman being a certain height). It hasn't yet explained how I can understand the statistical significance of two sample means being a given distance apart from each other if they're hypothesized to belong to the same overall population.
Sorry about that. In this case, I was just assuming that knowing the concept of a p-value would be enough to understand what was going on. However, if you'd like to know more about how to compare means, see: th-cam.com/video/nk2CQITm_eo/w-d-xo.html and then th-cam.com/video/NF5_btOaCig/w-d-xo.html
@@statquest Thanks Josh! Wow, what a response time.
If example 1 is p-hacking because the drug z result relies on only one test, how do you view all the social experiments that rely on one test (because it would be too costly to reproduce them)?
The drug z result relied on us testing every single drug - repeating the process until we got a significant result.
If you get p-value of less than 0.05 and you set sample size beforehand, are there ways of further interrogating the data to ensure you haven't just got that unlikely but possible '1 in 20' result without accidentally reverse p-hacking (not sure of the right term but taking more samples, or sets of samples, so now the p-value is greater than 0.05)?
You can use a lower p-value threshold and you can adjust for multiple testing.
If I'm doing an experiment then a lot of other people must have already done experiments and didn't get a useful result - so eventually the spare budget falls to me making my test dependent just the same as if I'd done all of them. Does that mean we must increase our sample size based on the area under the curve of the economic capacity for investigating the problem?
No. It is inevitable that some of our results will be false positives. So we just need to focus on our own experiments and do what we can. That said, we can reduce the number of false positives further by using complementary experiments that essentially show the same thing. Like a criminal trial that uses lots of pieces of evidence to convince us that someone is guilty or innocent, our final conclusions should be based on multiple pieces of evidence.
Could you please link the video on the false discovery rate in the video index?
Thanks for the suggestion. I've added it.
I did not get one part:
I understand why if I add 1 more measurement for each group I would be p-hacking.
But does that mean that if I get more measurements than the sample size previously stipulated by the power analysis, I would be p-hacking? So if my power analysis says I need 100 samples, but by the time I did the test I already had 500, what then?
What happens if, instead of adding 1 more measurement for each group, I'm adding 100 more? Is this p-hacking as well?
It's better to use the first set of data for a power analysis and then start over.
I had expected that adding more data would have made it less likely to get such false positives. Why does the p value decrease as we add more data in the 2nd example?
I did some simulations with a normal distribution and when the p-value was between 0.05 and 0.1, adding more observations resulted in a 30% probability of a false positive.
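A minimal sketch of that kind of simulation (not the original script; the group sizes and the amount of added data are guesses, and the exact percentage depends on them):

```python
# Two groups from the SAME distribution; whenever the first p-value is "almost
# significant" (0.05 < p < 0.1), add a couple more observations and re-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_start, n_extra = 5, 2
peeked, false_positives = 0, 0

for _ in range(20000):
    a = rng.normal(0, 1, n_start)
    b = rng.normal(0, 1, n_start)
    if 0.05 < stats.ttest_ind(a, b).pvalue < 0.1:
        peeked += 1
        a2 = np.concatenate([a, rng.normal(0, 1, n_extra)])
        b2 = np.concatenate([b, rng.normal(0, 1, n_extra)])
        if stats.ttest_ind(a2, b2).pvalue < 0.05:
            false_positives += 1

print("P(false positive | we peeked and added data) ~", false_positives / peeked)
```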
@@statquest Excellent video. Thank you! So basically you're saying that observations used in the power analysis cannot be included in the final analysis? I have just worked through the book "Medical Statistics at a Glance". It says that this is OK, and calls it an "internal pilot study"?
Great video
Thanks!
@@statquest Just discovered your channel. Do you cover different distributions?
What I'm finding is explanations of them in isolation, but nothing comparing them, when they're used, or how to test for them.
@@michaeljbuckley Unfortunately I don't have those videos either.
@@statquest ah well. Still looking forward to go through your channel more.
You are my god! I love you
Thanks!
We are testing some drugs to see if they change the recovery-period distribution, right? Why do you repeat that they are from the same distribution? How can we be sure that the drug didn't change the recovery period beforehand?
StatQuest with Josh Starmer thank you for your reply. So if I'm not wrong, we generated those stochastic numbers from the same distribution as a tool to teach this stuff. But in reality we can't be sure whether they are from the same distribution or not. So we will test them, and a power analysis will help us find the right amount of data we need to get a reliable statistical result. Little bam or not a bam at all? :-D
I'm sorry - my previous response was intended for someone else - I have no idea how that got mixed up. Anyway, here's the deal... The drugs can be from any distribution. However, the worst case scenario is when they both come from the same distribution. This means that any small p-value is a false positive (whereas, if they come from other distributions, then any small p-value is a true positive). So we assume the worst to determine how bad things can really get. Does that make sense?
@@statquest it is crystal clear now. Thank you very much. And I also started to listen to your music at the beginning of the videos. :-D
What if I repeat all of the experiments with other samples a number of times, e.g. I do 20 experiments for each drug in total, and then I look at whether anything changed? Is that really bad (since it isn't guaranteed that I don't just get 20 false positives out of nowhere), or might that be useful in some obscure scenario?
You'll get more power if you combine all of the data together into a single experiment, so I'm not sure why you would do it another way.
@@statquest Oh okay, thanks, I didn't think of that.
Hey Josh, but what if my real difference is smaller than what I considered when calculating the experiment Power? Can't I just keep the experiment running to reach a higher sample size, then reach a good experiment Power for the smaller difference and test it? Would that be p-hacking?
That would be p-hacking. Instead, use your data as "preliminary data" to estimate the population parameters and use that for a power analysis: th-cam.com/video/VX_M3tIyiYk/w-d-xo.html
Congratulate myself being a member :)
BAM! Thank you very much for your support! :)
Hi, I wonder: if I remove outliers several times, is that p-hacking too?
It depends. It could be. It could also be just removing junk data.
@@statquest do you have any source for me to be sure not being a p hacker with outliers? Please xs
@@jolojololo3221 Here's something that might help: www.reddit.com/r/AskAcademia/comments/bcop6p/removing_outliers_phacking_or_legitimate_practice/
"I don't do p-value hacking, I raise my alpha level. I'm José Mourinho."
:)
6:21 Is it really a false positive instead of a false negative? The 5% alpha indicates the chance of a type I error, meaning the chance of rejecting a true hypothesis, which is actually the probability of a false negative.
Alpha is the probability that we will incorrectly reject the null hypothesis. Thus, alpha is the probability that we will get a false positive. Thus, at 6:21, we have a false positive.
@@statquest But why would it be the false positive if the null hypothesis is actually true whereas a criterion says the opposite? it's going to be in the left bottom cell of the confusion matrix, which is false negative
I think I understand your confusion. When we do hypothesis testing, the goal is to identify a statistically significant difference between two sets of data. This is considered a "positive" result. A failure to reject the null hypothesis is considered a "negative" result. So, when we get a "positive" result when it is not true, it is called a false positive.
@@statquest I agree with you, however we get a negative result when it actually should be positive at 6:21
@@dmytro1667 When we reject the null hypothesis, we call that a "positive result", regardless of whether or not the null hypothesis is actually true. This is because when we do real tests, we don't actually know if it is true, so we simply define it that way. So, rejecting the null hypothesis, regardless of whether it is true or not, is defined as "positive". At 6:21 we reject the null hypothesis, so it is defined as a "positive result". Because, in this case, we should not reject the null hypothesis, this is a false positive.
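For reference, the standard terminology is: rejecting the null when it is actually true is a false positive (type I error); failing to reject the null when it is actually false is a false negative (type II error); the other two cells of the confusion matrix are true positives and true negatives.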
"Bam?" Bam with a question mark has a separate fan base
Ha! You made me laugh. :)
@@statquest And you make me learn statistics in a fun way 🥺
I don't see how the first example with the different drugs is problematic, given that the drugs actually are different and that we draw a new sample every time.
What if the only difference is the size, and they are all sugar pills?
Now how do I detect purposeful p-hacking (e.g. in a paper I have to review)?
Read the methods section.
@@statquest Thanks! Found some papers on the subject....doesn't seem to be trivial. Great videos BTW. Helps me a lot for refreshing some forgotten basics and also for learning some new stuff as well...:)
@@DrZhuBaJie The methods section is usually a supplemental section that you download separately. However, it depends on the journal.
@@statquest Well, yes, but usually attempts of p-hacking are hidden...
What if I get a p-value of 0.06 using a sample of 29, then do a power analysis AFTER that, and the power analysis says I need a sample size of 30, so I add one more observation to my data? Would this be okay to do?
You should start from scratch.
How do you get the p-value for two different samples?
In this video I used t-tests.
One thing that confuses me: if the threshold 0.05 means there will be 5% false positives (bogus tests in this example), then how do we link a p-value (say we got 0.02) to false positives? Is there anything like a 2% false positive rate involved with that p-value? I think that's not the case.
I watched all of your p-value videos and they made it clear. But the definition of "threshold" confuses me. I hope the p-value has nothing to do with false positives. Correct me if I am wrong.
The p-value tells us the probability that the null hypothesis (i.e., that there is no difference in the drugs th-cam.com/video/0oc49DyA3hU/w-d-xo.html ) will give us a difference as extreme or more extreme than what we observed. When we choose a p-value threshold to make a decision - for example, all p-values < 0.05 will cause us to reject the null hypothesis that there is no difference in the drugs - then there is a 5% chance that the null hypothesis could generate a result as extreme or more extreme than what we observed, and thus a 5% chance that we will incorrectly reject the null hypothesis and conclude that there is a difference between the two drugs when there is none.
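If it helps to see that 5% come out of a simulation, here's a minimal sketch (arbitrary group sizes) showing that when the null hypothesis is true, about 5% of p-values fall below 0.05:

```python
# When both groups come from the same distribution, p-values are spread out
# roughly uniformly, so ~5% of them land below the 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = [stats.ttest_ind(rng.normal(0, 1, 10), rng.normal(0, 1, 10)).pvalue
            for _ in range(10000)]

print("fraction of p-values < 0.05:", np.mean(np.array(p_values) < 0.05))
```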
P-hacking is how you get a high volume of papers published and cited. It's incentivized in all academic fields. Just try asking a researcher if you could take a look at their data and they will ghost you.
noted
It was great
🏅🏅
Thank you! :)
I am really confused about what you mean by the exact same distribution. I mean, only the people we are testing are the same; the drugs are different, so if we get a different result, why do we assume it's a false positive?
What time point, minutes and seconds, are you asking about?
@@statquest 6:22
I am really confused about what "different" and "same" distributions mean here. Yes, the people we are testing on are the same, but the drug is different in each scenario, right? So it can have different effects.
@@shivverma1459 This example starts at 2:38, when I say that we are measuring how long it took people to recover and these people did not take any drugs. So there are no drugs to compare - we just collected two groups of 3 people each and compared their recovery times. When we see a significant difference, this is because of a false positive.
@@statquest ohh now I get it btw love your videos ❤️ love from India.
At 4:12, how is the p-value calculated?
I think I used a t-test.
@@statquest Thank you for the response, Josh. Do you have any StatQuest explaining how to calculate it? Thank you
Question: Why are the drug-free experiments repeated? Why not reuse one set of drug-free results, or even better, aggregate all the drug-free results into one single set?
Even if you re-used the drug-free results, the results would be the same - there's a chance that, due to random noise, you'll get results that look very extreme.
Thank you, very clear.
:)
8:45 "Don't cherry pick your data and only do tests that look good"
politicians: I'm gonna ignore what you just said!
:)
This video starts out tough: "imagine a virus"
Damn, can't you come up with something less hard for me to imagine, such as Santa Claus visiting people's houses all on the same night?
:)
I just finished my RNA-seq homework...
BAM! :)
This should be the stats Bible for genz
bam!
Is it really slowed down 1.5 x?))
Some people like 2x. It really depends on how fluent you are in English.
I don't get how the first example with the drugs is p-hacking. After all you're testing different drugs, not doing the same test again.
If I took the same exact drug and gave it 27 labels, Drug A through Drug Z, and then tested each one, would it look like p-hacking?
@@statquest In that case of course it would, but I find applying this logic to the case of truly different drugs confusing as that is surely not p-hacking if the tested drugs are not the same.
Well, basically the first 8 seconds say it all; the rest is complimentary.
Bam! :)
imagine there was a virus... well, I guess that it is quite easy to imagine that ...
Yep....
love it
:)
Do you take PhD students??
I wish! :)
@@statquest :(
In the hood they say 'snitches get stitches'
In the lab we say 'p-hackers are backward'
:)
I love you
:)
Bam? No. No bam.
:)
Sweet! I was waiting for someone to post a good p-hacking tutorial! Now all my findings will be statistically significant!
oh, I'm just kidding!
;)
I think I'm in love with you
BAM! :)
We already do😍😍
Double BAM😎
To the 3 guys who disliked.... no bam
:)
Oh how I love you
:)
*shameless self promotion* 😂 josh could purposely make a whole video on nonsense and I’d shamelessly go watch it
Bam! Thank you for your support! :)
😇
:)
Hahaha .. Never seen a teacher like you.. Hey Wait a minute...... I didnt see you.. Baam!!!!!
Nice one! :)
@@statquest I am learning from your videos to become data scientist :)