At the end of one Bayesian statistics lecture, the professor ended with approximately this summary: "Frequentist statistics give a mathematically rigorous answer to questions no one asked. Bayesian statistics tells you what you want to know, based on assumptions no one believes."
The reason why frequentist answers to questions no one asked are still somewhat useful is that under some assumptions they are a reasonable approximation of the Bayesian answers to the questions you are interested in.
@@Critical-Smoke The easiest examples are confidence intervals. What people actually want are Bayesian credible intervals. And in most practical examples, the full Bayesian treatment with flat or uninformative priors will result in credible intervals that are indistinguishable from the Frequentist confidence intervals. I have seen constructed examples where this wasn't the case, but that was caused by boundary effects of impossible parameter values.
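A rough way to check the claim about flat priors is to compute both intervals for the same data. The sketch below assumes hypothetical counts (880 good reviews out of 1000, not taken from the video), uses the simple Wald formula for the confidence interval, and approximates the flat-prior posterior on a grid instead of calling a statistics library. The function names are made up for illustration.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Frequentist 95% Wald confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def flat_prior_credible_interval(successes, n, level=0.95, grid_size=20001):
    """Equal-tailed credible interval under a flat prior, via a grid approximation."""
    failures = n - successes
    # Midpoints of the grid cells, so log(0) never occurs.
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    # With a flat prior the posterior is proportional to the binomial likelihood.
    log_post = [successes * math.log(p) + failures * math.log(1 - p) for p in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    tail = (1 - level) / 2
    cumulative = 0.0
    lower = upper = None
    for p, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = p
        if cumulative >= 1 - tail:
            upper = p
            break
    return lower, upper

# Hypothetical data: 880 good reviews out of 1000.
print(wald_interval(880, 1000))
print(flat_prior_credible_interval(880, 1000))
```

With counts this large and a proportion well away from 0 and 1, the two intervals should differ only in the third decimal place. Near the boundaries, or with very little data, they can drift apart, which is the caveat raised in the comment.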
I tend to lean toward the Bayesian approach for two reasons: It tends to be easier to build up complicated models using conditional latent variables, and the prior distribution gives a way to incorporate expert knowledge about a subject. I've worked with many subject matter experts who don't have a firm grasp of statistics, but are very knowledgeable about their own corner of the world. Having the ability to take what amounts to "vibe based reasoning" from them and quantify it using an informative prior distribution gives a lot more power than just using a flat prior.
It seems like having factions that say "Pi is determined by geometry: the ratio of a circle's circumference to its diameter" and "Pi is determined by calculus: the limit of an infinite sum (pick your favourite)".
So off of an hour of YouTube math, I’m assuming the geometry side is frequentist ideals while the calculus side is Bayesian ideals. And the geometry side will be upset at the calculus side over how to interpret it. Let me know if I’m right or wrong.
I always interpret frequentist statistics as "static statistics" and Bayesian statistics as "dynamic statistics," which works well in my field of study, robotics!
Dude, worst time possible for me to read that comment. I just found out that my Asian crush (who reminds me of my Asian ex) at statistics classes in college has a boyfriend. Everything you wrote gave me PTSD.
Idealists will not stop to debate which approach is more valid, pragmatists will just use one or the other depending on which one fits best in any given situation.
One of the advantages of the Bayesian approach is it feels more "natural" to incorporate non-quantitative evidence into your calculations. For example it's pretty easy in a frequentist analysis to calculate the odds of rolling a 6 on a d6 if you have rolled it a hundred times and can see the distribution of prior outcomes (fair or loaded). If instead you tell them we have no prior data but there's double sided tape on the 1 side you can easily "swag" that prior with a Bayesian approach and get better results. I've never actually seen a calculable advantage to either view but if you start fudging some numbers using a frequentist approach it just feels like you are doing something wrong... I don't actually think there is a difference or if there is I clearly don't understand it.
Question about the null hypothesis you selected, as I’ve had this issue come up as well. Why do you select H0: pi = 85%? If you want to make a decision on whether the coffee shop is good or bad, shouldn’t it make more sense to assume pi
Excuse the long explanation, but I'm trying to correct a few potential fundamental misconceptions in your question first. First, the null hypothesis by convention is typically assumed to equal a specific value, though there are some textbooks and sources that use notation like you're suggesting for one-sided tests. The reason for the equals sign even in one-sided tests is because the actual computation of a p-value for a hypothesis test kind of requires this. (In more advanced statistics, there are ways of defining things to get around this and construct a p-value that accounts for a range of null values, but that complexity isn't necessary for this video.) The reality is that the test done in this video is basically trying to compute the p-value using a null hypothesis value that maximizes the probability of a Type I error. If you do a one-sided test, that null value will still be the one that has the equals sign, and any values "to the other side" of the null (in this case, less than 85%) would have a smaller chance of producing a Type I error. The test we're interested in produces a p-value for that maximizing case. To put it more simply from a math standpoint: whether you're doing a one-sided or two-sided test, you still need a specific null value (not a range) to plug into the estimation formulas for the standard deviation and used in computing the z value which is then used to calculate the p-value. The specific null value (here 85% or 0.85) is the one that maximizes that Type I error probability. Any other value below 85% would give a lower p-value and thus potentially produce an inaccurate test result if used by itself. Hence, pi = 0.85 suffices for the null hypothesis. I think what you're trying to ask here is why the video didn't do a one-sided test. The difference in that case would really be in the ALTERNATIVE hypothesis (not the null). I think you're arguing for an alternative of H_a: pi > 85% rather than "not equal to 85%".
Arguably, you're correct that it might be more appropriate here, as he's interested in whether the proportion exceeds 85%. But even if he did a one-sided test (in this basic case), he'd still be effectively constructing distributions to test a null hypothesis for a specific value, not a range. The null still wouldn't need to be written as
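The computation described in this reply can be sketched in a few lines. This is a minimal illustration, not the code from the video (which reportedly used R): the counts are hypothetical, `proportion_z_test` is a made-up helper name, and it uses the normal approximation with the standard error evaluated at the null value, as the reply describes.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def proportion_z_test(successes, n, null_value=0.85):
    """z statistic plus one-sided (greater) and two-sided p-values."""
    p_hat = successes / n
    # The single null value is plugged into the standard error,
    # whether the alternative is one-sided or two-sided.
    se = math.sqrt(null_value * (1 - null_value) / n)
    z = (p_hat - null_value) / se
    p_greater = 1 - normal_cdf(z)                 # H_a: pi > null_value
    p_two_sided = 2 * (1 - normal_cdf(abs(z)))    # H_a: pi != null_value
    return z, p_greater, p_two_sided

# Hypothetical data: 880 good reviews out of 1000.
z, p_greater, p_two_sided = proportion_z_test(880, 1000)
print(z, p_greater, p_two_sided)
```

Only the last step changes between the two alternatives. When the observed proportion is above the null value, the two-sided p-value is exactly twice the one-sided one.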
@@BobJones-rs1sd on the contrary, really appreciate you taking the time to answer that thoroughly. You are completely correct about the alternative hypothesis, that was wrong from me. The rest is just ignorance on my part, so really appreciate the explanation
@@BobJones-rs1sd I'm also thinking he left it as a two-tailed test because it was the default value for the test he did in R and didn't really think about it too much. Here, it worked out that the test rejected anyway, but yeah, doing a one-tailed test since that was the research question he was interested in might have been preferred. But it still works for this example.
@@lazerbungalow Not necessarily -- doing a one-tailed test assumes that all of the error / variation must be in one direction, and you need to provide some proof for that assumption. If you perform a one tailed test at p = .05 without such proof, you are really doing a test with p = .10. One-tailed versus 2-tailed is not a function of your hypotheses, it is a function of how error /variation happens in the world. An example of a good one-tailed test is change in kids' height from age 10 to age 12 -- we can pretty much assume that kids don't shrink from 10 to 12, and that the error/variation will only be how much they grow. But note we MUST assume no kid-shrinkage to do the 1-tailed test here.
@@otsoko66 I see what you're saying, but my initial feeling is to disagree. If he is only interested in whether it is "better," then if the sample ends up being at the end he's not interested in, he fails to reject the null. There is no type I error there. The error rate is still 0.05. If the two populations are the same, he will still only falsely reject 5% of the time. Now, you might possibly make the argument that it affects type II error if you are saying that the two populations could be significantly different in the opposite direction that he assumes. Because while that is not his research question, a rejection in that direction is valid science because it calls into question his basic ideas about his hypothesis. So in that case, if you feel it is a convincing argument to do a two-tailed test, then that's something to go on. But he is not really testing at 10% type I error rate, because in one direction he will fail to reject.
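The narrow claim in this reply, that a one-tailed test at the 5% level still rejects a true null about 5% of the time, can be checked by simulation. The sketch below assumes a true proportion exactly equal to the null value of 0.85, a hypothetical sample size of 500, and the normal-approximation z test. It says nothing about when a one-tailed test is the right choice in the first place.

```python
import math
import random

Z_ONE_SIDED_05 = 1.6449  # upper 5% point of the standard normal
Z_TWO_SIDED_05 = 1.9600  # upper 2.5% point of the standard normal

def simulate_rejection_rates(null_value=0.85, n=500, replications=4000, seed=0):
    """Fraction of samples drawn under the null that each test rejects."""
    rng = random.Random(seed)
    se = math.sqrt(null_value * (1 - null_value) / n)
    one_sided = 0
    two_sided = 0
    for _ in range(replications):
        # Draw n Bernoulli outcomes with the null value as the true proportion.
        successes = sum(rng.random() < null_value for _ in range(n))
        z = (successes / n - null_value) / se
        one_sided += z > Z_ONE_SIDED_05
        two_sided += abs(z) > Z_TWO_SIDED_05
    return one_sided / replications, two_sided / replications

print(simulate_rejection_rates())
```

Both rates should land near 0.05, up to simulation noise and the discreteness of the counts. Samples that fall far below the null value are simply not rejected by the one-tailed test.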
One option for avoiding Bayesian prior-rigging is to simply publish the calculation itself, lacking any prior. Then, the probability is simply some function of the prior, which you could graph or something to visualize it better
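For a yes/no hypothesis, one way to read this suggestion is to report the likelihood ratio and let readers supply their own prior. The sketch below is an assumption about what that could look like, not something shown in the video. The function names and the example likelihood ratio of 3 are made up.

```python
def posterior_probability(prior, likelihood_ratio):
    """P(H | data) from a prior P(H) and the ratio P(data | H) / P(data | not H)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

def posterior_curve(likelihood_ratio, steps=9):
    """Tabulate the posterior for a range of priors strictly between 0 and 1."""
    priors = [(i + 1) / (steps + 1) for i in range(steps)]
    return [(p, posterior_probability(p, likelihood_ratio)) for p in priors]

# A published likelihood ratio of 3 (hypothetical) maps each prior to a posterior.
for prior, posterior in posterior_curve(3.0):
    print(f"prior {prior:.1f} -> posterior {posterior:.3f}")
```

Plotting these pairs gives the graph the comment describes: the data fix the curve, and each reader picks a point on it.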
That seems a really interesting approach - as much for the psychology as for the mathematics. It might make it easier to soften a dogmatic prior into an introspection on the evidence/knowledge/belief that underlay the prior.
It is kind of what the frequentist interpretation of Bayesian methods is. They are “just a fancy estimate” for which you can prove things like asymptotic unbiasedness and normality, etc. Fun fact: apart from degenerate cases, Bayesian estimates are not unbiased.
The Bayesian view doesn't apply very well to statistical physics. There is a concept called the microcanonical ensemble, where we assume the frequency of each event to be equal. From this, we can calculate entropy, and from that one can calculate other physical quantities of interest like temperature. In the frequentist view, no problem arises, as the physical system's properties are independent of the observer's knowledge. Everyone agrees on the temperature of a box of gas. However, from the Bayesian point of view, if someone (say, through prior measurements) has extra knowledge about the system, they would not assign equal probabilities to each individual event, causing them to claim a different entropy, temperature, etc., which would not agree with our actual observations. I have seen some efforts to fix this; however, the Bayesian view is not as natural as this video makes it out to be.
@PrParadoxy I want to hear more about this! But I would argue that the Logical Bayesian approach has to do with the process of acquiring information on a real phenomenon and updating the status of knowledge of the observer. Its objects are propositions, not the physical system itself. On the other hand, for statistical physics and quantum mechanics, the probabilistic nature of the evolution of the system is a characteristic of the physical model of the system, has nothing to do with the observer (before measurement) and is objective in nature. While the system evolves, there's no update of knowledge to do for anyone. Moreover, when you build the concept of ensemble, one basically starts with the idea of taking an infinite amount of copies of the system. What this means is that when you go and do measurement of the system, when you want to check if your model is correct or not, the model itself is not deterministic but comes with probability distributions. But the Bayesian statistics would apply in the context of characterizing statistical properties of the system (P, V, T, S, U, whatever) given the experimental data with its own uncertainties and the model which is now non-deterministic.
Does "everyone agree"? Isn't that worse than subjective in that it is also collective? What about the ratio of matter to anti-matter in the universe, how does the apriori assumption work in that case? Wouldn't it be a good idea to consider adjusting the parameter, the theory, because the data doesn't fit well with the prior.
@@bjorntorlarsson Certain physical quantities are only meaningful in an absolute sense. Imagine if someone tells you the temperature of the boiling water in your kettle is in fact 0 kelvin. They reason that they know the microstate of each individual particle, so the amount of uncertainty, or the entropy if you will, that they have about the system is zero. It does not make sense, does it? It has nothing to do with parameterization, really.
@@bjorntorlarsson Firstly you choose your hypothesis H, which is a proposition like "This physical process can be described by this specific model which has these specific parameters". Then the prior gives a probability distribution for the parameters of that theory given all you already know, both about the physical process itself you wanna describe and the model you're trying to describe it with. Then through the likelihood what happens is that your prior knowledge about the parameters of that model is updated and you get a new probability distribution "a posteriori", that takes into account the new data. If you change the model, you consider a totally new prior associated to the new model. In the case of choosing the relative value of dark matter fraction or dark energy fraction, it's not a change in the model. If you want to be totally agnostic you choose a prior so that all the Omega_i sum to 1 but are then free to vary within the allowed range. If instead you've done multiple experiments already on the LambdaCDM model another choice could be to take as your prior the posterior given by the chain of former experiments. I don't see the issue here
Bayesan probability still has the same definition as frequentist probability!!! What you are showing is not a "definition" of probability, it is just Bayes' rule, which says NOTHING of P(A), only of P(B|A). The law of large numbers gives the definition of probability, regardless of what field of maths you study. I feel like this was really misrepresented in the video.
Your comment only shows one thing, you don't understand the Bayesian viewpoint. Back when I was doing my PhD I had a designated helper from the statistics department to coach me on methods. One day we started talking about Bayesian thinking (he was a frequentist). After trying to do some math with me being confused he stated (again as a frequentist): If I roll a die covering it with my hand the probability of it being a specific number from one to six is 1/6. As a frequentist, I say it IS one of the six (it's a physical entity) with said probability, a Bayesian will say we believe it to be something but it isn't until we discover more (remove the hand). This distinction makes very little sense for his example, but a huge difference for more advanced statistics (and all the natural sciences that depend on it). As a sidenote, there is more than one kind of math and they don't always agree. Look it up.
@@foresthobo1166 Whether or not you calculate the probability as a frequentist would, what you are trying to estimate through Bayesian thinking is how likely a specific outcome is to happen. If given the means, you can verify your result using the law of large numbers by running a bunch of experiments. That is probability, regardless of what method of dealing with it you subscribe to. I feel like this is not particularly debatable or hard to understand.
@@xenoduck3189 I’m a probability and statistics professor for physicists, and a little bit of a quantum information scientist. Sure enough, the probability of physical events should not depend on your interpretation of probability. That is, both a bayesian and a frequentist must agree that the probability of obtaining a particular outcome in a fair dice roll is 1/6. However, the frequentist states this due to previous experience with fair dice, whilst the bayesian does it due to complete lack of information about the result of the dice roll. Once the bayesian sees the results (obtains information about the system), they update their probability distribution to one where the observed result of the dice roll now has unit probability (a distribution with zero entropy, i.e. a state of complete knowledge). An instance which may help understand the difference in the interpretations is the following: for a frequentist, the question “what is the probability that God exists?” does not make any sense and cannot be answered since there is no way of performing trials for the existence of God; something that underlies the frequentist definition of probability. On the other hand, a bayesian may say there is a 50% chance that God exists, since the answer is binary (God either exists or not) and, in this case, a 50-50 probability distribution is the one which best describes the state of complete lack of knowledge (i.e. maximum entropy).
This is incorrect. It makes sense but it isn't correct when you start thinking about events that have eventually but single state. For example the sun WILL go supernova. 100% this will occur. But the sun WON'T go supernova today. 100% it will not occur today. So what happens is that the Law of Large Numbers struggles with events like this. The law tells you it WON'T happen (though we know it will) because each day as an experiment suggests it won't but it also somehow tells you it WILL happen because many stars have this fate. So how do you deduce the odds of the sun going supernova today? If you say it's zero which the LLN suggests and it happens then you were wrong but if you say it's one such as the LLN suggests and it doesn't happen you are also wrong. Sequenced odds that are not IID do not work with LLN. It's just completely incorrect for more complex systems.
I don't know why you need to say that Bayes' rule isn't the definition of probability? When he brings it up in the video, he is pretty clear in talking about P(A|B). He is talking about the philosophical interpretation between the two viewpoints. Bayesian thinking gives us a different way of looking at our error and at our assumptions.
I think I figured this out when I first came across the Monty Hall problem. You can set up all kinds of scenarios where different people can assign different probabilities to the same question because they have different information. I ended up thinking about probability intuitively as a "measure of ignorance".
To understand the beta distribution:
1. Imagine you are sending rockets to an alien planet to see what portion of the surface is covered in water.
2. You can send probes that hit the ground, are destroyed instantly, but send back whether they landed on water or land.
3. Let's say you send down a probe and it says it hit land.
4. If the entire planet were water, there is a 0% chance the probe would say land. If the planet were 100% land, there is a 100% chance of it happening. If you plot the percent of land on the planet on the x axis, and the relative probability that the probe says land on the y axis, you get a line from 0,0 to 1,1. You can imagine the opposite being true if it said water: a line from 0,1 to 1,0.
5. If you send another probe and it says water, then you can combine the two plots. Multiply the land plot by the water plot, because at each possible percent of land on the planet, the probabilities are being combined. You'll end up with a parabola.
6. Keep multiplying by the right plot each time a probe says water or land, and slowly you'll get a bell curve where the peak is at the ratio of land and water probes. This is what the beta distribution is.
Except that this interpretation only works for positive integer parameters α, β, whereas the full scope of the distribution works for any positive real numbers α, β. An actual description of the beta distribution would require a bit more complicated “probe sampling” than described here-having trouble thinking up any alterations to the described example at 3a tho.
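The multiply-the-lines procedure above can be sketched on a grid of candidate land fractions. This is only an illustration of the comment's construction, with a hypothetical sequence of probe reports and a made-up function name. The resulting curve is proportional to a Beta(land + 1, water + 1) density in the land fraction.

```python
def probe_posterior(observations, grid_size=101):
    """Multiply one straight-line likelihood per probe over a grid of land fractions."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    curve = [1.0] * grid_size  # flat starting curve before any probe reports back
    for report in observations:
        for i, land_fraction in enumerate(grid):
            # "land" contributes the line from (0,0) to (1,1);
            # "water" contributes the line from (0,1) to (1,0).
            curve[i] *= land_fraction if report == "land" else (1 - land_fraction)
    total = sum(curve)
    return grid, [value / total for value in curve]

# Hypothetical probe reports: 6 land, 2 water.
grid, curve = probe_posterior(["land"] * 6 + ["water"] * 2)
peak = grid[max(range(len(curve)), key=curve.__getitem__)]
print(peak)
```

The peak sits at the observed share of land reports, and the curve narrows as more probes are multiplied in.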
I think that for any simple problem like this, enough data will make up for any “wrong” prior. But I’m not sure what it means to have a wrong prior in the first place
(Talking as a complete amateur here) As I understand it, having a "wrong prior" is no less fatal than making certain assumptions that aren't applicable to your circumstances in a frequentist context (given that I strongly believe there's no such thing as "simply" using a frequentist approach... or "simply" using statistics in general😅). Both will lead to the same outcome of misinterpreted or straight up wrong numbers, so I believe this is more of a "pick your poison" situation.
If that’s the case, then I’ll need to rely on someone to call me out on a bad prior. If results are going to be published, they have to be vetted. Why risk someone misusing frequentist statistics when we can force them to express their beliefs via the prior
Most of the time if you have no good guess you actually use non informative priors, which is equivalent to what you suggest. But, even with that, some people would argue that the interpretation of the findings is different, and they would be right
Apart from maybe being easier to understand for some people, I don't get what the bayesian approach adds. For problems with a very small frequency or sample size your prior is the thing that's going to influence the outcome the most so you are essentially guessing where the frequentist would say "no idea". Doesn't sound like a huge improvement to me. Edit: I guess if you HAVE to make a decision, saying "it's between 0 and 100" is better than saying "it's 50 but I'm probably wrong"
There is an excellent 3blue1brown video that explains a situation where the Bayesian approach is very impactful: health screening tests. Imagine you have an extremely specific test for a rare disease (let’s say the “true” probability of having it is ~1/10,000 people), something that only gives a false positive 0.1% of the time, and a false negative 1% of the time as well. That’s a great test, right? We should give it to everyone to screen for this disease! What’s the harm, right? With such a low error (from the frequentist perspective), most people who test positive will have the disease and be able to be treated. Well, hold on though, is that assumption true? Imagine giving this test to 1,000,000 people. Since I’ve defined this disease to have an actual objective rate of 1/10,000, about 100 people in this group actually have the disease, with 99 of them being caught by the test and 1 missed. On the other hand, since the test has a false positive rate of 0.1%, about 1000 people have been given a false positive result. That means on a test with extremely low type I and II errors, your chance of actually having the disease if you get a positive on the test is only about 9% (99 true positives out of roughly 1,100 positives)! That’s incredibly unintuitive from a frequentist perspective, and how one would even go about getting that number and justifying it isn’t really clear. Bayesian statistics, however, bake all of these assumptions into the calculation, so they can be interrogated and updated. That 1/10,000 number was something that I just magically knew in this example, but a Bayesian statistician can get a similar prior probability estimate from any number of sources. The really important part of this is that it demonstrates the need for multiple screening techniques, because they represent multiple times that the probability that you are positive for a specific disease is updated.
This is why, for instance, it is no longer recommended that all women get mammograms after a certain age unless there is some other indication that boosts the probability that they have breast cancer: there were too many false positives, and false positives are not free. They cause stress and anxiety for patients, they cost additional healthcare resources, and they dilute the pool of patients who actually need care with patients who only have been told they might need care.
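The screening arithmetic in this comment is a direct application of Bayes' rule, and can be checked with a short function. The numbers are the ones stated in the comment. The second call, which feeds the first result back in as the new prior, assumes a second test with the same error rates whose errors are independent of the first, which is an idealisation.

```python
def positive_predictive_value(prevalence, false_positive_rate, false_negative_rate):
    """P(disease | positive test) via Bayes' rule."""
    sensitivity = 1 - false_negative_rate
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Prevalence 1 in 10,000, 0.1% false positives, 1% false negatives.
after_one_test = positive_predictive_value(1 / 10_000, 0.001, 0.01)
print(after_one_test)

# A second positive result, treating the first posterior as the new prior
# (assumes the two tests err independently).
after_two_tests = positive_predictive_value(after_one_test, 0.001, 0.01)
print(after_two_tests)
```

A single positive leaves the probability of disease at roughly 9%, while a second independent positive pushes it to about 99%. That jump is the argument for follow-up screening made above.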
Surely more people have visited the cafe than have left a review? So repeating the experiment and collecting another ~1k reviews from those who simply hadn't written theirs down isn't all that far fetched. Impractical, yes, but entirely possible. The idea of 'repetition' breaks down much more when we think of data we can't really sensibly resample/remeasure (like a country's annual GDP or the employment rate).
The fact that more people have visited than left a review could be already a bias for the statistics because maybe for some weird reason people writing reviews are already prejudiced in their judgement all in the same way. I guess, this could be modeled in a Bayesian approach but a frequentist just has the plain numbers and cannot take anything else into account.
To name individuals "frequentists" or "bayesians" is probably one of the most misleading things one can do when actually trying to explain this in a helpful way.
Excellent video!! I’m almost done with Bernoulli’s Fallacy myself. I do want to add that, for what I’ll call “reasonable” priors, the choice of prior doesn’t matter in the long run, as the data will dominate the posterior through the likelihood. Basically, with Bayesian statistics, we’ll find the truth if we just keep on collecting more data. Again, thanks for this great summary! I teach both a high school stats course and a high school Bayesian data science course, and this is the best short explanation of the difference I’ve seen. Congrats!
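The point that data eventually dominate any reasonable prior can be illustrated with the conjugate Beta-binomial update. The sketch assumes three hypothetical Beta priors and an observed good-review rate of 0.88 at every sample size. None of these numbers come from the video.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Mean of the Beta(prior_a + successes, prior_b + failures) posterior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Hypothetical priors: flat, optimistic about the cafe, pessimistic about it.
PRIORS = {"flat": (1, 1), "optimistic": (20, 5), "pessimistic": (2, 20)}

def spread_of_posterior_means(n, observed_rate=0.88):
    """Largest gap between posterior means across the priors, for n reviews."""
    successes = round(n * observed_rate)
    failures = n - successes
    means = [posterior_mean(a, b, successes, failures) for a, b in PRIORS.values()]
    return max(means) - min(means)

for n in (10, 100, 10_000):
    print(n, round(spread_of_posterior_means(n), 4))
```

With 10 reviews the priors disagree widely. With 10,000 the posterior means agree to within a fraction of a percentage point. This only covers priors that give every value of the proportion some weight, which is what "reasonable" is taken to mean here.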
@@simonpedley9729 So is this a chicken or the egg situation ? (which really doesn't make sense because eggs came first as other animals had them...like chicken's ancestors....so egg always came first!).
Ok but if you can just keep collecting more data, you can just do a frequentist analysis. If the priors eventually “drop out” of the calculation, all you’re left with is the experimental ratio.
That would seem to be a concession to the frequentist though? That is, the specific reason given for why Bayesian approaches are better is that frequentist assumptions either don't make sense (e.g. long-past or one-off events cannot be understood as having a "frequency") or refer to impossible actions (somehow collecting a brand-new, comparably-sized set and running the "experiment" again, when such data simply doesn't exist). If fixing the problem of a bad prior requires repeatedly collecting data, the Bayesian is now in exactly the same hot water as the frequentist: they both need do-overs that are impossible or nonsensical. Under those lights, the rationale seems to be in the frequentists' favor by parsimony: the Bayesian is embarked, they _must_ commit to a prior, but the frequentist does not. Instead, the frequentist commits to a particular risk of making a mistake. Now, I'll note that I'm a pretty firm frequentist who does not have a very positive view of Bayesian methods (not least because I find a lot of Bayesian boosters make some strident and excessive claims...), but I think the point still stands. If the Bayesians' problem with frequentist methods is that the latter requires imaginary repeats, why don't they also have a problem with the risk of bad priors for things where we're only able to update our beliefs a very small number of times.
@@ZekeRaiden To add to your final comment about bad priors...the Datta, Mukerjee, Ghosh and Sweeting (2000) paper shows that the error due to having the wrong prior, and the error due to not having enough data, are the same order, O(1/n).
Do you use Manim to make your visualizations ? I love how you work through the concepts and keep the canvas as clean as possible. Keep up the great work 🎉
I often look at this debate through the lens of physics models, where you can have one model that is simpler and often "good enough" in most scenarios, and another model that is much more complex and able to more accurately describe a larger number of scenarios. Examples being electron orbitals vs electron cloud, or newton vs einstein. Here, I consider the frequentist approach to be the "simpler, good enough" form and the bayesian approach to be the "complex, more accurate" form.
Excellent, concise description of the essential differences between the Bayesian & Frequentist philosophical perspectives with examples. The frequentist methodology as used today is all too often a hybrid mess of two distinct approaches. The physically separate frequentist approaches of Fisher and Neyman-Pearson have been mistakenly combined in a manner which neither Fisher nor Neyman would have approved. Bayesian findings tend to be more intuitive than frequentist results - so much so that frequentist analyses are often interpreted as in a Bayesian framework! For example, most consumers of statistical information will interpret a frequentist 95% confidence interval in the context of a bayesian 95% credible interval - as the latter is much more intuitive to understand!
Confusing a confidence interval with a credible interval is just a common error. We can run an experiment to calculate pi and might find a confidence interval of (3.1,3.2), or (2.9,3.1). I guess some people might say that our belief is that the probability pi is in (3.1,3.2) is 95%, but these people are wrong.
In my opinion, these two philosophies can be reconciled by thinking of frequentist statistics as just approaching the problem with a specific prior that is asking "am I X percent confident in posterior outcome A?"
Thanks for your video! I’m by no means a statistician, but I find Bayesian inference to be interesting and valuable in and of itself when you look at statistical learning. When you need to examine and theorize about the process of learning, viewing probability in terms of a belief updating process is extremely useful. So many people get stuck on the “Bayesian stats is subjective”, but if you’re looking at a machine learning model, the point is that over time it can learn and reduce its error over a training process using belief update rules. Is there a frequentist interpretation of machine learning?
> Here we are trying to understand what Mostra cafe's good rating is. We did a hypothesis t-test with 1 degree of freedom, since we evaluated only 1 Mostra cafe instance. Realise that we are not comparing Mostra cafe to other cafes. > But trying to understand the Good Rating for Mostra cafe in real life based on its rating in Google (which is the sample, not the population), using the frequentist approach. > We realise the Google rating for Mostra cafe is around 0.88 (and we are 95% confident the actual real percentage is between 0.859 & 0.899). But we don't really know the actual probability that the good rating is 0.88; we just know that the population mean is within 2 standard deviations of it.
The frequentist approach looks like a neutral, pragmatic approach to statistics, while the bayesian approach is a more flexible and adaptive one. I'm sure each has its own strengths in different situations. I believe the frequentist approach is good as a first value estimation when you know nothing about the data you're studying, while the bayesian approach allows you to get more precise results as your understanding gets better. I know what I'm saying isn't mathematically rigorous, but mathematics is always derived from "desired properties", and it very much looks like these approaches were developed for the desired properties they offer. If you have any objections, let me know; I'm very interested in learning more about what you think.
Great video. I always thought the frequentist approach was actually more subjective, because what is truly considered a relevant counterfactual is open to interpretation in many cases (in my view, the Bayesian is more up front about this). Alan Hájek has a lot of great work on this; in fact Hájek is worth reading in general. What may be interesting to note is that in my field (Electrical Engineering) I find most of us are Bayesians. It may be because there is an intuitive connection between Bayesian statistics and topics in information theory like entropy and mutual information. Still, there is a promise for a hybrid view: as Roderick Little suggests, “inferences under a particular model should be Bayesian, but model assessment can and should involve frequentist ideas". Also it is interesting to note that Clayton's book Bernoulli's Fallacy borrows quite a bit from ET Jaynes (though I disagree with Clayton on a few points). Jaynes was a great statistician but he was as hardcore a Bayesian as they come.
The idea of a repeated experiment showcases the inherent variability in the parameter estimate, i.e. the sampling distribution. A frequentist assumes that there could be a different data set borne by the same invisible data-generating process (law). Bayesians tend to jump onto data matrices as if those n=200 observations were the one and only realisation possible, without other hypothetical scenarios occurring, as if there were no Heisenberg principle or quantum uncertainty. The frequentist approach reflects the randomness of Nature and unobservability of hypothetical outcomes better: ‘it could have been otherwise’. Finally, Bayesians often make ridiculous distributional claims: ‘assuming the prior normal distribution, the posterior distribution of the linear regression slope estimator is precisely Student with n=198 degrees of freedom', whilst frequentists are much more careful about heteroskedasticity, calibration, coverage probability, and Bartlett correction, which are essential to control the false discovery rate: ‘there is some unknown law, but we can compute some functionals thereof regardless of the joint and marginal distributions, as long as enough finite moments exist for the WLLN and CLT to work’.
I viewed the data as binary, so it needed to come from a discrete distribution. The binomial is the most commonly used family for this, but nothing would stop me from other discrete distributions that also fit binary data. I would lose conjugacy, but there are tools for doing Bayesian things in that case
This video gives the impression that Bayes Theorem is exclusively part of Bayesian probability theory. It's basic set theory and applies whether you are a frequentist or Bayesian. Another issue is that the finite number of measurements applies across all of physics. You cannot, for example, calculate an instantaneous velocity. You can only measure position either side of a finite time interval - and calculate an average velocity. That does not, however, mean that you cannot use calculus in your physical model and cannot use the concept of an instantaneous velocity. Likewise, although you can only repeat an experiment a finite number of times, you can use mathematics to model an infinite number of experiments. We are free, therefore, to use the mathematics of infinite sequences in probability theory. It's not even necessary to believe that there is an absolute underlying probability: only that you can usefully model a scenario using the mathematical concepts of absolute probabilities and relative frequency as the limit of an infinite sequence of experiments. That doesn't need to be practically achievable in order to be a valid mathematical model. Otherwise, physics would have to rely solely on the mathematics of finite numbers! Finally, I don't agree that a Bayesian can believe that A has a 20% probability and not A a 50% probability. That would be absurd. The priors have to be consistent. In fact, both frequentists and Bayesians are essentially tied to the Kolmogorov axioms.
What is probability? One also needs to compare and contrast that with 'statistics', as either synonyms or distinctions, to help with discussion. The frequentist 'close enough for practical purposes' get-out isn't great from an engineering perspective either ('when will the bridge fall down?', 'tracking a radar blip', etc.). I feel that the Bayes formula starts as a 'complicated' (tricky to visualise) formula, and that P(A & B) = P(A|B).P(B) = P(B|A).P(A) is an easier starting point that is just as simple as frequentist counting with the same underlying assumptions (Belief: identical distribution and consistency of the independent events)...
Yes, I totally agree that for large sample sizes the two methods basically give the same answer. They are also compatible in the sense that Bayesians have their own interpretation of maximum likelihood, and Bayesian methods can be analysed in frequentist language. (In fact, it is more natural to understand the limit theorem mentioned in the video in frequentist terms, in my opinion.) Still, I want to make some remarks. Firstly, I’m sort of a pluralist. I don’t think probability stands for a single concept. Statements like “what is the probability that this previously unknown sonnet was written by Shakespeare” can be interpreted in a Bayesian way much more generally, while physical problems (see below) make more sense in the frequentist interpretation. Ultimately, there are many things that satisfy the Kolmogorov axioms and have nothing to do with randomness. (Say, the ratio of votes in an election.) It is possible to do probability theory without referring to randomness at all. There are cases when we do actually talk about frequencies in the world. Ergodicity is a good example. Saying things like “if I know the exact initial conditions, I can calculate the exact ratio of times the coin will land on heads” and therefore “probabilities are purely epistemic” kind of misses the point. I’m not interested in this very particular initial condition. I want to show that this behaviour, where roughly 1/2 of the coin tosses land on heads, is typical for most of the initial conditions. This 1/2 number is a property of the system, and it doesn’t describe the mental state of an idealised, rational observer. (With the obvious objection that of course observations themselves are model dependent, etc.) Lastly, all the popular interpretations have their own philosophical problems. I don’t know any interpretation that is not ultimately flawed under greater scrutiny. This is actually very typical when it comes to philosophical problems. (Think about all the different schools of ethics.)
I think I like the propensity interpretation of probability the most, but that is not perfect either.
Btw - my arguments with statistics do not mean they are useless. Rather, they are frequently and all too easily abused (intentionally or unintentionally). In every instance, the use of statistics as applied to real world analyses must be critically analyzed and scrutinized, starting with the assumptions, presumptions and desires and relative ignorance of those involved. Even when all of that is fair, statistics often goes wildly away from truth or reality. And researchers often fail to apply even the most basic critical analyses to the results. Is the population a single population? Is the population linearly, triangularly, normally, Poisson or otherwise distributed? Are there hidden variables? Is the data the result of stochastic events acting on stochastic events? Do the results violate sanity? Do the results suggest results outside the bounds of the analysis? Is the thesis or hypothesis that resulted in the data gathered biased in its own right? Etc...
I got lost. With the coffee shop exercise, "The probability it receives a 4 or 5 star review". Receive from whom? Does it mean it 'has' received by past customers or is it ideally about the coffee shop's track record at the end of time by its reviewers?
When I use a machine learning algorithm to predict stuff, is it the Bayesian way or the frequentist way? Or something between both? Or does it really depend on the data distribution or on the specific machine learning algorithm?
I think it depends on the model. For prediction, I don’t think the distinction matters all that much. I don’t work a lot with prediction, but this has been my experience. But for inference, it changes how you do statistics and interpret results.
14:20 LOL, always bugs me. "Consider multiple universes...." is both difficult to communicate and interpret. "Is our prediction right or not? Are we 95% sure this number is right?" seems much more meaningful
You actually don't have to consider multiple universes. Does an integral bug you as well? Also, the are we 95% sure thing would actually be, given our prior, we are 95% sure. Bayesians who champion a prior seem to forget about it sometimes :p
@@ucchi9829 I might be missing some details about integrals, but they've got straightforward definitions? Googling "confidence interval misinterpretation" brings up a lot of results. The issue I have, and maybe it's my own educational one, is that stats/plot libraries output this range and it's "wrong" to interpret it as a 95% probability that the metric of interest is in that range.
20:04 - "You can have strange priors, but you're going to have to justify them with evidence." But in that case, they're not really a subjective prior at all, are they? If they're properly evidence-based, then they're objective, surely? And in that case, the fundamental objective/subjective difference that you'd previously described is no longer there. An objective prior followed by an objective analysis gives an objective result, and a subjective prior followed by an objective analysis gives a result that is to some extent not evidence-based!
The priors are subjective, because they are still degrees of belief, not real properties of some object. People disagree about what is "properly evidence-based" and that is reflected in their priors. Two people with the same base evidence would also have the same subjective priors, but having the exact same evidence isn't really possible for humans. If you had literally all the information about everything, you could theoretically get "objective priors", but in that case you aren't dealing with probabilities anymore but just know the correct answer. An "objective prior" could really only be 0 or 1, because in a Bayesian sense "objective probabilities" don't exist.
@@happyduck1 Of course it's possible to have two humans with the exact same evidence. If you run some sort of experimental trial, the maths shouldn't change on the basis of which particular researcher analyses the dataset that you obtain from that trial. (When I say "shouldn't" I'm assuming that science "should" be objective and evidence-based, of course, but that should be taken as given: it's definitional to science!) I'm not a mathematician or a statistician as such, but I am an engineer with a background in research so it's not like I'm averse to the numerical and the analytical. I periodically try and understand what this supposed crucial rift is between frequentist and Bayesian stats and it always _appears_ to me to come back to two possibilities, as far as I understand them: a) there's no real difference at all. They're just two different ways of framing the same underlying ideas, which might be practically useful (based on how you expect data to come in over time, for example), but that does not mean there are any distinctions of principle between the two. This is the appearance I'm often left with, but my certainty is challenged when I hear statisticians insist that there really are important underlying distinctions of principle that really do make a real difference. b) there actually is a meaningful difference between the two, and that difference is around this notion of subjective belief in the priors. That is, there wouldn't be a difference if the priors were truly objective, but they don't have to be, and that is where the difference of principle creeps in. And if that really is the point of distinction between the two approaches, then that strikes me as nothing more than an attempt to "launder" subjective prior beliefs which cannot be stringently evidenced into a result which has the _appearance_ (since it came out of a statistical formula containing lots of actual evidence in the non-prior part) of objectivity. And that seems to me not to be science!
Not a statistician, but I do have a take on this... The Bayesian method relies on priors, which hamstrings the whole practical purpose of analysis. Instead of debating results, people debate priors. It just shifts the whole thing from one frying pan to the other. The simplistic frequentist approach you described is utterly naive. You completely missed the whole concept of random walk. In practice, the most reliable approach to probability is the non-naive version with a large dataset, or a large set of datasets. Random walk is critical to understand for the frequentist approach to make any sense. For example, imagine flipping a balanced coin 4 times (small example, easier to explain). The naive approach would assume that larger datasets tend towards 50% heads, but this doesn't make sense. The probabilities are: 0% heads -> 1/16; 25% heads -> 4/16; 50% heads -> 6/16; 75% heads -> 4/16; 100% heads -> 1/16. It's a bell curve centered at 50%. With large data sets, your chance of getting the expected 50% result is only around 6/16, but your chances of getting either 25% or 75% are 8/16... Which means the naive approach is more likely to give an inaccurate result! Random walk (results steering away) is a huge topic in itself and definitely needs to be accounted for to rely on the frequentist method.
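The 4-flip numbers above can be checked directly from the binomial formula, and the same calculation can be repeated for more flips. The sketch below is only an illustration: the function names and the 5-percentage-point tolerance are assumptions of mine, not anything from the video or the comment. It prints how likely the observed heads fraction is to land close to 50% for 4, 100 and 1000 flips of a fair coin.

```python
from math import comb


def heads_fraction_pmf(n, p=0.5):
    """Map each possible heads fraction k/n to its binomial probability."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def prob_within(n, tol, p=0.5):
    """Probability that the observed heads fraction is within tol of p."""
    total = 0.0
    for k in range(n + 1):
        # Compare counts rather than fractions; the 1e-9 guards against
        # floating-point rounding exactly at the boundary.
        if abs(k - n * p) <= tol * n + 1e-9:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


pmf4 = heads_fraction_pmf(4)
print(pmf4)  # the five outcomes listed in the comment (1/16, 4/16, 6/16, ...)

for n in (4, 100, 1000):
    print(n, prob_within(n, tol=0.05))
```

Hitting exactly 50% does get less likely as the number of flips grows, but the probability of landing near 50% goes up, which is the sense in which larger datasets "tend towards" the true value.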
How come we assume everything is Gaussian, or treat things as if they were? A lot of statistical tests rely on it, but it seems like the conditions for the test to be valid are almost never respected.
@@very-normal The CLT doesn't always hold though, especially if you're working with a distribution where higher moments diverge. In finance and physics this can happen more often.
@@Impatient_Ape My go-to example is the Cauchy distribution because it looks normal, but its expected value is undefined. It is also the distribution of the ratio of two independent zero-mean normal random variables, so it's actually easy to unknowingly make a model Cauchy if you start looking at ratios.
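A small simulation makes the point about ratios concrete. This is a sketch under my own assumptions (standard normal numerator and denominator, a fixed seed, and a made-up function name), not something taken from the video. It draws ratios of two independent standard normals and prints the running sample mean next to the running sample median.

```python
import random
import statistics


def cauchy_ratio_samples(n, seed=0):
    """Draw n ratios of two independent standard normal variables."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        denom = rng.gauss(0.0, 1.0)
        if denom != 0.0:  # skip the (practically impossible) division by zero
            out.append(rng.gauss(0.0, 1.0) / denom)
    return out


samples = cauchy_ratio_samples(100_000)

# The sample mean is not guaranteed to settle down as n grows;
# the sample median is a stable estimate of the centre.
for n in (100, 1_000, 10_000, 100_000):
    chunk = samples[:n]
    print(n, statistics.fmean(chunk), statistics.median(chunk))

# Heavy tails: a noticeable share of draws is larger than 10 in absolute
# value, which essentially never happens for a standard normal.
tail = sum(abs(x) > 10 for x in samples) / len(samples)
print("share with |x| > 10:", tail)
```

The median stays near zero at every sample size, while the mean can be dragged around by a handful of extreme draws.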
To me this seems like the frequentist approach starts with "the experiments are all we know", and therefore you can calculate the probability directly from the definition, while the Bayesian approach starts with some belief about what we expect, and we try to use not just the experiment data but also other knowledge we may have. Wouldn't the Bayesian approach with an uninformative prior then always reproduce the (correctly done) frequentist approach? The frequentist approach is based on the implicit assumption that every possibility is equally likely; with the Bayesian approach you don't necessarily have that assumption, though you may provide it explicitly.
It's very arguable whether the idea of probability itself is a fundamentally real thing. As a mini example, the digits of pi behave randomly by every single metric we know, yet they are deterministic and nothing random is happening. The ultimate goal of probability is modeling unknown outcomes, and that can be done in many ways. So there is no true right option; all we care about is how accurately we can predict things and how interpretable it is to us. (PS: in my eyes Bayesian feels more true to real life and my thinking.)
I'm not sure what you mean by "real" here. Casinos make real profits. A digit of pi, selected at random, (it is believed, but not proven) has an equal probability of being any number. Meanwhile, a digit of 50/99 (in base 10), selected at random, will be either 0 or 5 with equal probability. These things seem real to me.
@@weetabixharry I meant the sequence is random but deterministic; if you pick random digits you introduce other randomness. My thinking is that multiple things appear to us as something random, but if we knew the underlying dynamics we could often agree that probability theory is the wrong approach. Let's imagine an event I can only measure a single time, like "Alex immediately says yes if I ask him on a date today." The idea of doing repeated trials is not real unless I have access to parallel universes, and taking other variables into account to refine my guess, like comparing with other people I asked, gives confidence but doesn't fundamentally reflect Alex's choice then. Even if we measured every atom interaction in Alex's brain, we get into discussions of quantum and chaos theories. So even if our best models say the probability was 50%, we can't tangibly experience or measure that 50% since we only see one outcome.
@@seriousbusiness2293 I think I see roughly what you're saying... and it's uncomfortable to think about. I only feel relatively comfortable in the simple cases where the tests are repeatable and the "parallel universes" all behave the same. For example, I need 1000 dice all rolled in parallel to have the same statistical behavior as 1 die rolled 1000 times. And my dice have to have a *known* probability distribution (preferably, perfectly uniform) or I'm gonna panic.
@@weetabixharry haha 😂 I feel ya. Ya, in any case I'm sure that probability theory is an extremely good tool for reasoning and decision making and often close to some Truth. But as soon as we get philosophical about the fundamentals, there is room for doubt. I think it's comparable to the situation of going from Newton's theories to relativity theory. Having a fixed frame of reference makes the math easy and it works most of the time, but if you care about fundamentals and edge cases you need a relative model of physics. Thinking about dice and cards is more of a clean setup, like a Newtonian model that assumes each object has some absolute probability, making for an actually very good model. But converting any probability number into a tangible real world concept may not always work and may need a more nuanced idea of what that number means, like in relativity, where we found that two observers can disagree on a space or time measurement but that gets fixed if you talk about the new concept of space-time.
I take offense at that remark against statisticians on their incapacity for violence. I'd have you know, Sir, that statisticians are just as likely to commit violent crimes but have less probability of being caught because they know how not to become a statistic.
If anyone is interested in if there is an objective way to pick a prior probability distribution, you do it with something called "maximum entropy". And the entropy they refer to is the same one the physicists talk about.
I disagree. In the case of the p parameter of a Bernoulli, the maximum entropy prior would be the uniform distribution. That, however, depends on the coordinate system you choose, as opposed to other methods like the Jeffreys prior. Maximum entropy arguments in general rely on some assumption of a uniform distribution, even in physics. (Think about the whole combinatorial derivation with Stirling’s formula.) Ultimately, all models depend on assumptions, be they frequentist or Bayesian. There is no such thing as “purely letting the data speak for itself”.
@@danielkeliger5514 I agree that there is never a way to "let the data talk for itself". I think I misused the term "objective". There are reasons to use the maxent distribution to ensure you aren't adding any "hidden" assumptions to your analysis.
I noticed that my interpretation of the frequentist confidence interval is quite Bayesian, and I have seen this often in courses as well. What is your take on this @very-normal ?
Another advantage of Bayesian statistics is that the joint posterior allows for the calculation of the marginal distributions for the parameters and probability statements can be made regarding these parameters.
There is a mechanics analogue to this: Do you use classical mechanics or include relativistic effects? Depends. If classical is good enough you use that, because relativity reduces to classical mechanics for simple and slow systems. Frequentist or Bayesian? Same reasoning. Depends. If your problem is described well enough (or perfectly) by frequentist approaches you use that, otherwise Bayesian. Because why would you shoot yourself in the foot intentionally just to do it the more complicated way?
Interesting video. I'm a little bit surprised, though. I'm fairly confident (let's say 0.80) that the uninformative prior for the binomial distribution is a beta distribution with parameters alpha=beta=1/2. I'm using the Jeffreys prior. If there's something I'm missing, I'd like to know.
It doesn’t matter much in this context because there’s so much data that it dominates the posterior. From my perspective, the prior parameters can represent “past” successes and failures, and Beta(1,1) just says we saw only one of each. Having 0.5 of a success doesn’t make as much sense, but it still works in the end. In a paper, we might justify our priors slightly differently.
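To see how little the choice between Beta(1, 1) and Beta(1/2, 1/2) matters once there is a lot of data, here is a minimal sketch of the standard beta-binomial conjugate update. The review counts (820 successes, 180 failures) are made-up numbers for illustration, not the data used in the video.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: prior pseudo-counts plus observed counts."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 820 "good" reviews and 180 "bad" ones.
successes, failures = 820, 180

priors = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
}

posterior_means = {}
for name, (a0, b0) in priors.items():
    a, b = beta_posterior(a0, b0, successes, failures)
    posterior_means[name] = beta_mean(a, b)
    print(f"{name}: posterior Beta({a}, {b}), mean {posterior_means[name]:.4f}")
```

With a thousand observations the two posterior means differ only in the fourth decimal place, which is the "data dominates the posterior" point.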
@@very-normal, I concur that the alpha and beta parameters are directly linked to the numbers of successes and failures. Jeffreys priors are proportional to the square root of the determinant of Fisher's information matrix, so they cannot be as readily interpreted. If other methods for uninformative priors exist, I'm interested. Thanks, and thanks for the video!
Bayesianism is just superior. It allows for straightforward statistical connectives and gives us distributions rather than rigid numbers. It’s just a lot richer and might also lend itself more readily to generalizations of statistics once we understand them better (eg negative probabilities and so on)
I guess probability is derived from a geometric property of our microspace (like general relativity is derived from spacetime geometry). So the frequentist approach is more relevant.
The Bayesian camp drives artificial intelligence. It is a viable approach by the grace of Big Data. It is a double-edged sword. It can sometimes ferret out subtle patterns that humans would miss, but there is risk of conflating correlation and causation. The frequentist approach works best if you have a theoretically perfect coin with an exact 50-50 chance of heads or tails. The Bayesian approach works best if you CANNOT be sure in advance whether a coin is loaded or honest, but want to make the best estimate as to the outcome of the next throw, regardless of the uncertain coin status.
Causality and geometric inference for the win, with sometimes some Bayes. Frequency is only good to see what categories of things are trending in time. Nothing else. Correlation for real-world use cases doesn’t translate well outside of that.
Please, I'm begging you, distinguish between [1] Bayes' Theorem, which is the thing you talk about at ~4:45 and which frequentists fully endorse (since it follows from the axioms, the ordinary definition of conditional probability, and classical logic), and [2] Bayes' Rule, properly so-called, which is a claim about how degrees of belief or confidences should be updated in light of evidence -- namely, that belief update should be by way of conditioning on one's evidence, i.e. Pr_new(h) = Pr_old(h|e), where e is the new evidence.
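Written out side by side, the two statements being distinguished look like this. The notation follows the comment above (h for a hypothesis, e for the evidence); the condition Pr(e) > 0 is the usual requirement for the conditional probability to be defined.

```latex
% [1] Bayes' Theorem: an identity within a single probability function
\Pr(h \mid e) = \frac{\Pr(e \mid h)\,\Pr(h)}{\Pr(e)}, \qquad \Pr(e) > 0

% [2] Bayes' Rule (conditionalization): a norm relating two probability
%     functions, the one held before and the one held after learning e
\Pr\nolimits_{\text{new}}(h) = \Pr\nolimits_{\text{old}}(h \mid e)
```

The first line follows from the definition of conditional probability, so accepting it commits you to nothing about how beliefs should change; the second is the additional claim that Bayesians endorse.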
@@very-normal You say that like it's a bad thing. Surely a healthy science will have well-defined terminology that is designed to be broadly useful and not introduce unnecessary confusions. Right? Anyway, it's true that the labels are incidental, in a sense. They could be called Equation 1 and Equation 2 if you like. But they're also terms with a history, and they are part of the longstanding dispute that your video is ostensibly about. Given that context, it seems important to be extra careful about the terminology. Moreover, the *distinction* has practical implications, even if the labels don't. Frequentists can happily accept Bayes' Theorem and reject Bayes' Rule. In fact, some historically important defenders of a personalist interpretation of probability, e.g. Ramsey, have rejected Bayes' Rule (of conditionalization), but of course, they don't reject Bayes' Theorem. Being clear about what you take a Bayesian to be committed to is important for understanding the debate.
I like this kind of general framing of Bayes' Rule because, unless I'm going mad, the distinction only makes a difference if you're trying to get non-evidenced inputs into your results. The evidence for your prior belief and the additional evidence used to create your posterior belief could equally well be seen as one big single set of evidence. That is, if you have a prior probability belief based on your previous evidence from flipping a coin 20 times, and then you update it based on additional evidence from flipping it 20 more times, you might just as well call that total evidence a set of 40 coin flips. It won't matter if you analyse it in a single calculation including all 40 flips or as a stepwise process of expanding knowledge, 20 flips then 20 more. If your entire calculation is based on proper evidence, you can subdivide this total evidence however you like for stepwise calculation; it shouldn't make a difference to the result. So it seems to me that the only time it _would_ materially differ between the two framings (one big calculation, or lots of little stepwise calculations) is if your analysis _isn't_ actually based on proper evidence back through all of those constantly updating steps. If you are trying to sneak an initial non-evidenced belief into the analysis, obviously you can't do the alternative calculation, as you can't analyse one big total 'dataset' which is actually a mixture of both data and non-data. (The obvious follow-up question then is why are you trying to!)
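The 20-then-20 versus all-40 claim is easy to check for the coin-flip case with a beta prior. The sketch below is a hedged illustration: the flips are simulated with a fixed seed and the helper name `update` is my own. It shows that stepwise and single-shot updating agree, and that the starting prior is the only non-data input.

```python
import random


def update(prior, flips):
    """Beta-binomial update: add observed heads and tails to the prior counts."""
    alpha, beta = prior
    heads = sum(flips)
    return alpha + heads, beta + len(flips) - heads


# Hypothetical data: two batches of 20 flips of a fair coin (1 = heads).
rng = random.Random(1)
first_20 = [int(rng.random() < 0.5) for _ in range(20)]
next_20 = [int(rng.random() < 0.5) for _ in range(20)]

prior = (1, 1)  # starting point before any flips are seen

sequential = update(update(prior, first_20), next_20)  # 20 flips, then 20 more
batch = update(prior, first_20 + next_20)              # all 40 flips at once
print(sequential, batch)

# A different starting prior gives a different posterior from the same 40 flips.
other = update((5, 1), first_20 + next_20)
print(other)
```

How the data is split makes no difference; the result changes only when the initial prior does, which is where the disagreement in this thread sits.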
When I use reviews, I just go through the two-, three- and four-star reviews until I find a few that are worded and written in the way that I write and speak and think. Basically I'm looking for someone who has the same personality as me and trying to judge that through the way they leave comments, which I think is actually probably a pretty robust method given the way I speak and write. Anyhow, I make a choice based on those few reviews alone, because I don't really care what somebody thinks about something if we have literally nothing in common: what determines if something is good or bad to that person is not going to resemble what determines if something's good or bad for me.
During this video I just started to hate the frequentist approach; they just simplify everything as if it's all independent. Bayesians give a guess and can iteratively get to the right probability by Bayesian updates, taking into account all the complex stuff the world offers, while with the frequentist approach you need to run a lot of trials.
Does constant bayesian updating also not require a lot of experimentation and trials? Not defending frequentism, but your reasoning doesn't make sense.
@@therealjezzyc6209 That's alright. I use whatever, never thought there was a beef. But in the real world averages are fine. You cannot expect to inspect every little event or data point or record one by one in detail; hence generalization beats specialization.
@@AkshayKumar-vd5wn Averages aren't exactly fine in the real world though, because not all distributions have finite expectation and variance. Depends on your domain. For example, the ratio of two independent zero-mean normal variables is Cauchy, whose expected value is undefined. This means that if you build a model which ends up requiring a ratio of two samples, then you might not have any convergence in your sample means at all. You will need to use extreme value measurements rather than expected values, and estimate the median instead. This actually happens a lot in finance and other complicated modeling because you are working with heavy-tailed distributions, so outliers actually occur quite frequently, enough to throw off your samples. Although this is just me being pedantic, I'm sure you get the point, and a lot of things end up being normally distributed (but a lot of things also don't). Typically averages are only good as long as the central limit theorem holds, and from the frequentist's perspective you cannot know whether your distribution has finite variance or expectation before performing your trials. Which means you might never converge to your desired probabilities and be wasting your time. Idk what you meant in your last paragraph about inspecting everything at once though.
@@therealjezzyc6209 Yeah, they are in some sense two sides of the same coin, so a lot of things are in common. In machine learning we love Bayesian updates, and I might be biased by my field of study. But I feel that's the right approach to problems.
So, in short: 1) Governor et al show gross incompetence and broadcast private information of their employees. 2) Governor et al misuse their power in order to cover up their mistakes and silence the witnesses with threats and false accusations. 3) Governor et al attempt to influence the legal process they falsely instigated in order to get at an innocent journalist that did them a favor. 4) After being publicly proven wrong, the governor et al persist in their defamation and malicious prosecution of the journalist. ... Hold on. Doesn't this exact playbook resemble the actions of a certain yellow gorilla? It seems the societal rot does spread from the top down.
I never understood what's going on with this "choice of significance level" stuff. In social sciences it's often 5%, in particle physics it's 0.000something. Doesn't it imply that there is a third choice? To take an example from our unfortunate days: A soldier has to either move now, or stay put. Wouldn't a 49% significance level decision be better than the alternative?
Astrophysics is by the way the only instance where I've seen clever people draw conclusions based on data which in their diagrams have an error bar that is taller than the Y-axis. So there's physics, and physics. There's stuff in space that we don't know much about, for obvious reasons.
It does imply that there is a spectrum of choices. The significance level is just a measure of how rigorous you should be. When looking at a particular situation, you are probably not seeing every variable inside that system (the system is huge and complex), leading to bigger errors and higher variation in each observation of the impact of each independent variable on the dependent variable.
So for more controlled environments, and when trying to prove theories and turn them essentially into laws or verified characteristics, you need to be more certain. Therefore, there is a higher level of strictness (confidence level). In financial forecasting it is normal to have higher randomness associated with a bigger and more complex system, at smaller time frames especially, which leads to accepting lower levels of confidence in forecasts.
@@calloftrading It would be nice if there was a way to quantify which confidence level to use. Taking it from the other way, and simply accepting an outcome together with its confidence level, whatever it is, isn't popular. It's looked down upon. But if one has to make a choice, as things are in reality, then the confidence level seems to me to be as much a relevant parameter as are the expected value and the spread measure. I don't quite get why the confidence level should be somehow picked first, and only then the rest of the parameters be evaluated given a binary within or not of such an arbitrary significance. Isn't, btw, all of this olden Gaussian way obsolete now that machine learning fits patterns on big data without considering stuff that was once invented only because it made data analysis simple and practical given the limits of ancient tools?
Frequentist view prohibits all notions of epistemology. So it fundamentally has no meaningful way to talk about evidence or partial knowledge. It's the reason why meta reviews are phrased so awkwardly, compared to something like civil court cases ("judged by the weight of the evidence").
A lot of it is hammers vs wrenches. There are plenty of cases where subjective Bayesian isn't appropriate at all. If a drug company did a clinical trial, and proved that their drug works, based on an analysis that involved their own subjective prior which assumed that the drug works, would you believe them? If someone is trying to prove that climate change affects x, and they use their own prior which assumes that climate change affects x, would you believe them? These examples illustrate that objectivity is sometimes really important (where objectivity means: reducing arbitrary decisions as much as possible...clearly nothing can be completely objective). On the other hand, there are plenty of situations where you should be including subjective prior information. There is also a whole field of statistics which is frequentist Bayesian methods, which to some extent takes the best of both worlds. It uses Bayesian methods, but has the objectivity of frequentism. The real problem in statistics is over-use of maxlik, which is neither frequentist nor Bayesian.
That’s fair, but to clear something up: priors in clinical trials are often done with past studies in mind and with input from field experts, they’re not often made purely from the beliefs and feelings of a sole statistician
You could set the first parameter higher to reflect this. You could choose one to have a particular prior mean to reflect your thoughts on how rare/common the reviews are
"C.I does n't tell us if it contains the true value of PI or not, you can only know that if you repeated the experiment multiple times then most of them will. " Can you explain this statement I didn't get it.
The definition of confidence is the proportion of intervals that contain the true parameter value. Different experiment repetitions will produce different datasets, so the ends of the intervals will change depending on the data. In the same way that choosing a 5% level means you only get a type-I error in 5% of experiments, the confidence interval will contain/cover the value of the true parameter in 95% of experiments. There’s no guarantee that you know the one you calculated actually contains it or not
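The "95% of experiments" reading can be checked by simulation. The sketch below makes several assumptions that are mine rather than the video's: a true proportion of 0.8, 200 observations per experiment, and the simple Wald (normal-approximation) interval. It repeats the experiment many times and counts how often the interval covers the true value.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(true_p, n, repetitions, seed=0):
    """Share of repeated experiments whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repetitions


rate = coverage(true_p=0.8, n=200, repetitions=2000)
print("share of intervals covering the true value:", rate)
```

The printed share should land close to 0.95, but any single interval either contains 0.8 or it does not, and in a real analysis there is no way to tell which from the interval alone.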
The problem with a lot of the current faddish enthusiasm for Bayesian analysis is that some people are pretending to have very specific, numerical priors that are OBVIOUSLY just pulled out of thin air, at which point, it is unclear what point there is to hearing out the rest of their alleged "analysis".
@very-normal It's no doubt not a fad amongst actual statisticians but it seems to have become a mostly rhetorical gimmick in other fields, including debates over the historicity of religious figures, of all ridiculous things.
I don't really know this very well, I just remember having read it somewhere so maybe it's completely wrong, but I thought the common "uninformative" choice of the beta distribution is alpha=beta= as small as possible? IIRC the theoretically optimal choice is alpha=beta=1/2 (I'm sure you'll eventually talk about that) but I've seen people argue it really should be alpha=beta=epsilon so like 1/10 or even 1/100 basically. It's impossible to set both parameters to 0 but just in principle I could have the *effect* of that, I think, by fixing my initial prior as "a beta distribution with alpha=beta=0" without worrying about the issues with that, and then just following the regular update rules and go from there, right? It's like a truly limiting-case uninformative prior I think? Or is there a good reason not to do this?
That’s a great question. In my experience, I’ve only seen Beta(1, 1), but most of my experience is in clinical trials, so maybe customs are different elsewhere? My understanding is that your initial prior parameters also influence how much the data will influence the shape of the posterior. Parameters 1 and 1 suggest you know absolutely nothing with discrete trials. But parameters 100 and 100 still look uniform but suggest you had 200 trials that went both ways the same amount of times. Data will influence the shape of the former more than the latter. Not a complete answer but I hope it helps a little bit
@@very-normal Reading up a bit about it now: alpha=beta=1 is the Bayes-Laplace prior, and alpha=beta=1/2 is the Jeffreys prior, which comes from a specific proof: this choice is invariant under reparameterization, i.e. (more or less) proportional to the square root of the Fisher information. That's where my suggestion about alpha=beta=1/2 came from. There is also Kerman's "Neutral" prior, alpha=beta=1/3, and the limiting case, Haldane's prior (alpha=beta=0). The higher alpha and beta are, the more the prior influences the posterior, so in that sense, if you want literally no influence on the posterior, you really ought to go with Haldane's. In that case, the posterior mean equals the maximum likelihood estimate, but there are also plenty of people arguing against that choice. For very small datasets the "uniform" choice alpha=beta=1 can be a pretty strong bias, but of course if you have LOADS of data it's gonna be fine.
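To make the comparison of these priors concrete, here is a minimal sketch using the standard conjugate update, where a Beta(alpha, beta) prior plus s successes and f failures gives a Beta(alpha + s, beta + f) posterior. The two datasets (3 successes and 1 failure, then 3000 and 1000) are hypothetical numbers picked only to contrast a tiny sample with a large one.

```python
# Named Beta(alpha, beta) priors listed in the comment above.
PRIORS = {
    "Haldane": (0.0, 0.0),
    "Kerman": (1 / 3, 1 / 3),
    "Jeffreys": (0.5, 0.5),
    "Bayes-Laplace": (1.0, 1.0),
}

def posterior_mean(prior, successes, failures):
    """Mean of the Beta(alpha + successes, beta + failures) posterior."""
    alpha, beta = prior
    return (alpha + successes) / (alpha + beta + successes + failures)

# Hypothetical data: same 75% success rate, very different sample sizes.
for label, (s, f) in {"small": (3, 1), "large": (3000, 1000)}.items():
    for name, prior in PRIORS.items():
        print(f"{label:5s} {name:14s} {posterior_mean(prior, s, f):.4f}")
```

With the small sample, the Haldane row reproduces the sample proportion 0.75 and the other priors pull the mean toward 0.5; with the large sample the four rows are nearly identical.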
That’s interesting, I hadn’t heard of these before. It definitely highlights the fact that choosing a good prior isn’t trivial, something I chose not to include in the video
I think the problem with setting both parameters to zero is that you're not "skeptical" of the data. Suppose you find a restaurant that has a single, positive review. Would you consider that to probably be a better restaurant than one where 990 people leave positive reviews and only 10 leave negative reviews? Ultimately it depends on how likely you consider any proportion of positive reviews to be. Personally, I'd say that parameters of beta=0.5 and alpha=2 work pretty well in this case. Ideally, you would find the exact rating distribution of any coffee place and use that. Also keep in mind that alpha=beta=epsilon means that you think it's either zero or one, with no middle ground. It means you don't expect the value to be a probability but merely a true/false with some accepted error.
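The restaurant comparison can be worked through numerically. The sketch below uses the posterior mean of a Beta prior updated with positive and negative review counts; the review counts and the alpha=2, beta=0.5 prior come from the comment above, while ranking by posterior mean is an assumption made here for simplicity. With zero negative reviews the Haldane posterior is improper, so that row simply reports the sample proportion.

```python
def posterior_mean(alpha, beta, positive, negative):
    """Posterior mean of the positive-review rate under a Beta(alpha, beta) prior."""
    return (alpha + positive) / (alpha + beta + positive + negative)

# (positive, negative) review counts from the comment above.
restaurants = {"single review": (1, 0), "thousand reviews": (990, 10)}

# alpha=beta=0 is the limiting Haldane case; (2, 0.5) is the commenter's suggestion.
priors = {"Haldane (0, 0)": (0.0, 0.0), "suggested (2, 0.5)": (2.0, 0.5)}

scores = {
    prior_name: {
        name: posterior_mean(a, b, pos, neg)
        for name, (pos, neg) in restaurants.items()
    }
    for prior_name, (a, b) in priors.items()
}

for prior_name, by_restaurant in scores.items():
    best = max(by_restaurant, key=by_restaurant.get)
    print(prior_name, by_restaurant, "-> ranks first:", best)
```

Under the zero-parameter prior the single-review restaurant scores 1.0 and ranks first; under the more skeptical prior the restaurant with a thousand reviews comes out ahead.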
there is a looong section on Wikipedia about the Beta Distribution titled Bayesian Inference where it compares a bunch of choices for uninformative prior and quotes a bunch of works by different people. Most often it seems that the Jeffreys prior is favored by theorists, at least as presented on that page
I have an earlier video about it, but I think my better explanation is in my “biggest prize in statistics” video. It’s in the first chapter on Bradley Efron
Bayesian statistics reminds me of Kalman filters to a certain degree. It also seems to me that frequentist statistics is the limit of Bayesian statistics as you gather more data points.
The Kalman filter is a direct application of Bayes' rule. In fact, there is evidence suggesting that Laplace may have applied a similar approach in his calculations of planetary orbits.
@@WeirdPatagonia Using non-informative priors is very different to keeping only the likelihood function. This is especially obvious when you condition your posterior on small samples.
@@xavierlarochelle2742 In rigor, you are right; in practice, it depends on the size as you say. I haven't encountered a difference yet, but it is also true that most of my analyses are with medium/big datasets. Thanks for your comment
Couldn't the null hypothesis be an inequality? It would have been more logical to ask whether mu > 0.85. You have an MLE of 0.88 and with a t-test you get your p-value and confidence interval, but instead of the p-value you could get 1-p_value to get a probability similar to the one from the Bayesian side? I know the Student distribution of the test statistic is very different from the Bayesian posterior but I would make this kind of bad leap in reasoning intuitively. ^^ There isn't a test statistic for inequalities?
You could use a composite null hypothesis actually! You’d end up with a one-sided test. I’m aware of other tools for composite null hypotheses, but they’re usually outside the scope of what most statistics users would be familiar with
I have trouble when you say "you can have strange priors, but you're gonna need to justify them with evidence". There is no rigorous method of assessing whether verbal statements such as "I have a heavy prejudice against cafes like mostra" produce valid or invalid priors. If we cannot have rigor in determining the validity of priors presented in a Bayesian analysis, then we are no longer considering logic and are instead considering rhetoric and argumentation, which the frequentists are very right to point out as being a major flaw.
Is this voice AI. Really good quality, if you could provide a bit more info about how you synthesize this voice, I would be happy to share with my university team who uses AI narrators
nah that’s just my voice with some post-production lol. I just looked up YT videos on how to do it, here’s one that I’ve used: th-cam.com/video/6R1Hr2f_rCQ/w-d-xo.htmlsi=iEWKAo8DYuj-axth
10:30 and it all goes to 💩 if some unknown number of reviews are fake... You might take it to the extreme, the unknown was a black swan event. That's just your excuse. You chose to turn a blind eye, and hope no major factor was overlooked. And that's human behaviour. I will take statistics seriously when the introduction, basic entry example solves for conscious agents. It works for dead matter, but not if there is consciousness.
I still don't understand why the Bayesian method is not susceptible to manipulation and subjectivity. You claim that even if I arbitrarily choose the initial probability, it only makes sense if it is supported by evidence. But where does that evidence come from? From the frequentist method, right? Because if it's from the Bayesian method, then I'm stuck in a circular argument... am I not?
If a past study uses a frequentist method to analyze the data, then a new prior should be formed to reflect what that finding found. For example if a past study found the probability to be 70%, then my new study should probably make the prior on and around 70% more likely. If past studies use a Bayesian analysis, then it’s even easier. The posterior from the past study becomes the prior in the new study. The past data helps inform the prior, not so much the method was frequentist or Bayesian. You’re right that it can be hard and arbitrary to choose a prior, but that’s not a reason to abandon the method in the first place. Classic frequentist methods don’t work well with smaller sample size, yet people are taught to do it anyway
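The point that a past posterior becomes the new prior can be illustrated with the Beta-binomial conjugate update. This is a sketch under assumed numbers: a flat Beta(1, 1) starting prior, a hypothetical past study with 70 successes out of 100 (echoing the 70% in the reply above), and a hypothetical new study with 12 successes out of 20.

```python
def update(prior, successes, failures):
    """Conjugate update: Beta(a, b) prior -> Beta(a + successes, b + failures)."""
    a, b = prior
    return (a + successes, b + failures)

def mean(beta_params):
    a, b = beta_params
    return a / (a + b)

flat_prior = (1.0, 1.0)

# Past study (hypothetical counts): its posterior is reused as the next prior.
after_past_study = update(flat_prior, successes=70, failures=30)
after_new_study = update(after_past_study, successes=12, failures=8)

# Analysing all the data in one go from the flat prior gives the same answer.
pooled = update(flat_prior, successes=82, failures=38)

print("after past study:", after_past_study, round(mean(after_past_study), 3))
print("after new study: ", after_new_study, round(mean(after_new_study), 3))
print("pooled analysis: ", pooled, round(mean(pooled), 3))
```

Chaining the two updates and pooling the data agree exactly, which is why carrying a posterior forward is a consistent way of using past evidence.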
th-cam.com/video/mZBwsm6B280/w-d-xo.html This is a video on Bertrand's paradox. As a physicist, I am not surprised that Jaynes was a physicist. In the video, each of the methods he describes, leading to different probabilities, could correspond to an experiment... a different experiment. This is of course important background information.
One major issue with frequentist statistics is that it only considers the total count of events and not their more detailed order. It would consider a coin that did 1000 heads in a row and then 1000 tails to have the same behavior as a regular coin even though that is clearly wrong.
As usual for a Bayesian video, there is much bias towards complexification. First, the test should be one-sided: you requested at least 85%, so please have the courtesy to do the correct one. That divides the p-value by 2 from scratch. Then, you do not need a confidence interval at all, you have the p-value. What the test tells you is that from a sample of 1074 people, there was a probability of 0.27% to get the data you got if anyone was putting 4 or 5 stars _less_ than 85% of the time (by the way, this is how you got the 99.7% "that only Bayesian gives you", supposedly....). This is the frequentist approach, and it deals with facts and makes two assumptions: independence of choices from users and validity of the CLT. Then from that p-value, you can do what you want, you are not even obliged to do anything, because so far you only collected data and did maths. Once the computation is done, you can _finally_ go philosophical and decide you do not live in a universe where you got unlucky to be in the 0.27%. There is no binomial, no beta, no prior, no "I don't have an idea of my prior, so I'll use uniform distribution but I will call it Beta(1,1)", no some god of philosophy told me that "no idea" meant the existence of a uniform distribution in the realm of ideas, etc... Frequentists work with facts and try, at least when they're not psychologists or marketers, to be rigorous, not forgetting the assumptions they made. They use stats to falsify theories, and they don't put probabilities on theories, which remain true or false. Bayesians do decision making, using a tool that always works, always getting an answer whatever the question they had. It is very good for investors who want to use some maths and have a magical tool that allows them to propose a strategy with some appearance of seriousness, and it will work whenever they were lucky with their priors.
But at the end of the day, either your posterior "probability" depends a lot on your priors, and you only put a number on your feelings, or it doesn't and you didn't need to go Bayesian. Frequentists don't deal with philosophy. Bayesians do and must.
It's a shame this isn't higher but that's probably to be expected in a channel/comment section so heavily biased to one approach. The number one suggested video to follow this is literally called "the better way to do statistics" His entire interpretation of the "frequentist perspective" was purposefully limited and he tried to divorce it from reality and naturally occurring events. I'd go as far and argue that his interpretation of how to report a confidence interval was bordering on incorrect. It can, and should be, phrased practically identically to the way he talked about credibility intervals later. The entire point is that you can't know something perfectly to arbitrary confidence and the estimation of true probability can only be refined. A confidence interval is the way of quantifying this spread of uncertainty. He even contradicted himself on the definition of "repeated experiment". First he defines experiments as events that produce individual data points and then he's purposely obtuse and redefines repeating the experiment to gather another 1074 reviews. Really should have partnered with someone else to present the other side. An entire video with straw men is boring
@very-normal At least in my language, that's the name we call it in school: for big samples, we get closer to the true value. "Big number then true." It's kind of a meme. en.m.wikipedia.org/wiki/Law_of_large_numbers Unfortunately, it has a more rigorous definition. Nvm, you mentioned it; I was a frequentist as a joke by accident.
Another interesting example is the number of permutations of n distinct elements, such that none of them stays in its original position. The answer happens to be the closest integer to n!/e.
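The n!/e claim is easy to check for small n. The sketch below counts such permutations (derangements) with the standard recurrence D(n) = (n - 1)(D(n - 1) + D(n - 2)) and compares the result with n!/e rounded to the nearest integer; the function name and the range of n tested are choices made here for illustration.

```python
import math

def derangements(n):
    """Number of permutations of n distinct items that leave no item in place."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    two_back, one_back = 1, 0  # D(0), D(1)
    for k in range(2, n + 1):
        # Recurrence: D(k) = (k - 1) * (D(k - 1) + D(k - 2))
        two_back, one_back = one_back, (k - 1) * (one_back + two_back)
    return one_back

for n in range(1, 11):
    exact = derangements(n)
    approx = math.factorial(n) / math.e
    print(n, exact, round(approx, 3))
```

For every n from 1 to 10 the exact count matches the nearest integer to n!/e.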
You are offered a pair of loaded dice with an assertion of their 'loading'. Can you believe them, and how much should you pay to test them before buying. Should you start by assuming the dice are unweighted (and the sale is a confidence trick), or that the dice are weighted as offered. PS the con artist (?) did a single dice throw, to show you, before stating the weighting...
At the end of one Bayesian statistics lecture, the professor ended with approximately this summary:
"Frequentist statistics gives a mathematically rigorous answer to questions no one asked. Bayesian statistics tells you what you want to know, based on assumptions no one believes."
The reason why frequentist answers to questions no one asked are still somewhat useful is that under some assumptions they are a reasonable approximation of the Bayesian answers to the questions you are interested in.
@@sophigenitor examples?
@@Critical-Smoke The easiest examples are confidence intervals. What people actually want are Bayesian credible intervals. And in most practical examples, the full Bayesian treatment with flat or uninformative priors will result in credible intervals that are indistinguishable from the Frequentist confidence intervals. I have seen constructed examples where this wasn't the case, but that was caused by boundary effects of impossible parameter values.
I tend to lean toward the Bayesian approach for two reasons: It tends to be easier to build up complicated models using conditional latent variables, and the prior distribution gives a way to incorporate expert knowledge about a subject. I've worked with many subject matter experts who don't have a firm grasp of statistics, but are very knowledgeable about their own corner of the world. Having the ability to take what amounts to "vibe based reasoning" from them and quantify it using an informative prior distribution gives a lot more power than just using a flat prior.
calibration
Apparently you also tend to lack a fundamental understanding of these approaches
@@Bamawagoner tell us then
It seems like having factions that say "Pi is determined by geometry: the ratio of a circle's circumference to its diameter" and "Pi is determined by calculus: the limit of an infinite sum (pick your favourite)".
and then some people in one of the factions can’t stand it that the other one says it differently, I won’t say who
So off of an hour of TH-cam math, I’m assuming the geometry side is frequentist ideals while the calculus side is Bayesian ideals. And the geometry side will be upset at the calculus side over how to interpret it. Let me know if I’m right or wrong.
I always interpret frequentist statistics as "static statistics" and Bayesians statistics as "dynamic statistics," which works well in my field of study, robotics!
robotic statistics!
That's a really neat way of putting it
I told my Asian parents that I was Bayesian.
They disowned me.
Dude, worst time possible for me to read that comment. I just found out that my Asian crush (who reminds me of my Asian ex) at statistics classes in college has a boyfriend. Everything you wrote gave me PTSD.
ya both should've calculated the probabilities of those events happening
That's because you mispronounced "Bayesian". It's "bay-zee-uhn," not "bay-zhun."
@@nunkatsu PTSD over a crush. Is that normal?
@@xinpingdonohoe3978 Probably for the statistics crowd.
Idealists will not stop to debate which approach is more valid, pragmatists will just use one or the other depending which one fits best in any given situation.
@@sumdumbmick I mean, it isn't based on dogma. That's why the guy in your story failed, no?
One of the advantages of the Bayesian approach is it feels more "natural" to incorporate non-quantitative evidence into your calculations. For example it's pretty easy in a frequentist analysis to calculate the odds of rolling a 6 on a d6 if you have rolled it a hundred times and can see the distribution of prior outcomes (fair or loaded). If instead you tell them we have no prior data but there's double sided tape on the 1 side, you can easily "swag" that prior with a Bayesian approach and get better results. I've never actually seen a calculable advantage to either view but if you start fudging some numbers using a frequentist approach it just feels like you are doing something wrong... I don't actually think there is a difference, or if there is I clearly don't understand it.
Question about the null hypothesis you selected as I’ve had this issue come up as well. Why do you select H0) pi = 85%? If you want to make a decision on whether the coffee shop is good or bad, shouldn’t it make more sense to assume pi
Excuse the long explanation, but I'm trying to correct a few potential fundamental misconceptions in your question first.
First, the null hypothesis by convention is typically assumed to equal a specific value, though there are some textbooks and sources that use notation like you're suggesting for one-sided tests. The reason for the equals sign even in one-sided tests is because the actual computation of a p-value for a hypothesis test kind of requires this. (In more advanced statistics, there are ways of defining things to get around this and construct a p-value that accounts for a range of null values, but that complexity isn't necessary for this video.) The reality is that the test done in this video is basically trying to compute the p-value using a null hypothesis value that maximizes the probability of a Type I error. If you do a one-sided test, that null value will still be the one that has the equals sign, and any values "to the other side" of the null (in this case, less than 85%) would have a smaller chance of producing a Type I error. The test we're interested in produces a p-value for that maximizing case.
To put it more simply from a math standpoint: whether you're doing a one-sided or two-sided test, you still need a specific null value (not a range) to plug into the estimation formulas for the standard deviation and used in computing the z value which is then used to calculate the p-value. The specific null value (here 85% or 0.85) is the one that maximizes that Type I error probability. Any other value below 85% would give a lower p-value and thus potentially produce an inaccurate test result if used by itself. Hence, pi = 0.85 suffices for the null hypothesis.
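The computation described above can be sketched as a one-sided z-test for a proportion. The inputs are assumptions taken from figures quoted in this thread (a sample proportion of about 0.88 from 1074 reviews), so the exact p-value will differ from the one in the video if the real counts differ; the helper names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sided_p_value(p_hat, n, p_null):
    """P(sample proportion >= p_hat) if the true proportion were p_null."""
    # The standard error uses the null value, which is why one specific
    # number (not a range) has to be plugged in.
    standard_error = math.sqrt(p_null * (1.0 - p_null) / n)
    z = (p_hat - p_null) / standard_error
    return 1.0 - normal_cdf(z)

P_HAT, N = 0.88, 1074  # assumed from the thread; 0.88 is a rounded value

p_values = {p_null: one_sided_p_value(P_HAT, N, p_null) for p_null in (0.85, 0.84, 0.83)}
for p_null, p in p_values.items():
    print(f"null = {p_null:.2f}  one-sided p-value = {p:.5f}")
```

The p-value is largest at the boundary value 0.85 and shrinks for null values further below it, so rejecting at 0.85 also rejects every smaller null value.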
I think what you're trying to ask here is why the video didn't do a one-sided test. The difference in that case would really be in the ALTERNATIVE hypothesis (not the null). I think you're arguing for an alternative of H_a: pi >85% rather than "not equal to 85%". Arguably, you're correct that it might be more appropriate here, as he's interested in whether the proportion exceeds 85%. But even if he did a one-sided test (in this basic case), he'd still be effectively constructing distributions to test a null hypothesis for a specific value, not a range. The null still wouldn't need to be written as
@@BobJones-rs1sd on the contrary, really appreciate you taking the time to answer that thoroughly. You are completely correct about the alternative hypothesis, that was wrong from me. The rest is just ignorance on my part, so really appreciate the explanation
@@BobJones-rs1sd I'm also thinking he left it as a two-tailed test because it was the default value for the test he did in R and didn't really think about it too much. Here, it worked out that the test rejected anyway, but yeah, doing a one-tailed test since that was the research question he was interested in might have been preferred. But it still works for this example.
@@lazerbungalow Not necessarily -- doing a one-tailed test assumes that all of the error / variation must be in one direction, and you need to provide some proof for that assumption. If you perform a one tailed test at p = .05 without such proof, you are really doing a test with p = .10. One-tailed versus 2-tailed is not a function of your hypotheses, it is a function of how error /variation happens in the world. An example of a good one-tailed test is change in kids' height from age 10 to age 12 -- we can pretty much assume that kids don't shrink from 10 to 12, and that the error/variation will only be how much they grow. But note we MUST assume no kid-shrinkage to do the 1-tailed test here.
@@otsoko66 I see what you're saying, but my initial feeling is to disagree. If he is only interested in whether it is "better," then if the sample ends up being at the end he's not interested in, he fails to reject the null. There is no type I error there. The error rate is still 0.05. If the two populations are the same, he will still only falsely reject 5% of the time.
Now, you might possibly make the argument that it affects type II error if you are saying that the two populations could be significantly different in the opposite direction that he assumes. Because while that is not his research question, a rejection in that direction is valid science because it calls into question his basic ideas about his hypothesis. So in that case, if you feel it is a convincing argument to do a two-tailed test, then that's something to go on.
But he is not really testing at 10% type I error rate, because in one direction he will fail to reject.
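The disagreement about the error rate of a one-tailed test can be checked by simulation. The sketch below generates data with the null exactly true and counts how often a one-sided and a two-sided z-test, each at a nominal 5% level, reject. The null value 0.85 and sample size 1074 are borrowed from the thread; the number of repetitions, the seed and the normal-approximation test are assumptions made for this illustration.

```python
import math
import random

Z_ONE_SIDED = 1.6449  # upper 5% point of the standard normal
Z_TWO_SIDED = 1.9600  # upper 2.5% point of the standard normal

def z_statistic(successes, n, p_null):
    standard_error = math.sqrt(p_null * (1 - p_null) / n)
    return (successes / n - p_null) / standard_error

def rejection_rates(p_null=0.85, n=1074, repetitions=4000, seed=1):
    """Share of null-true experiments rejected by each test."""
    rng = random.Random(seed)
    one_sided = two_sided = 0
    for _ in range(repetitions):
        successes = sum(rng.random() < p_null for _ in range(n))
        z = z_statistic(successes, n, p_null)
        one_sided += z > Z_ONE_SIDED        # rejects only when the sample is high
        two_sided += abs(z) > Z_TWO_SIDED   # rejects in either direction
    return one_sided / repetitions, two_sided / repetitions

one_sided_rate, two_sided_rate = rejection_rates()
print(f"one-sided false rejections: {one_sided_rate:.3f}")
print(f"two-sided false rejections: {two_sided_rate:.3f}")
```

Both rates come out near 5%: the one-sided test spends its whole error budget in one tail, and samples that land in the other tail simply fail to reject.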
One option for avoiding Bayesian prior-rigging is to simply publish the calculation itself, lacking any prior. Then, the probability is simply some function of the prior, which you could graph or something to visualize it better
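One way to read this suggestion is to report the conclusion over a family of priors rather than for a single one. The sketch below does that for Beta priors indexed by a prior mean and a prior weight (a pseudo-count), and computes the posterior probability that the proportion exceeds 0.85 by numerical integration on a grid. The split of 945 positive and 129 negative reviews is an assumption (roughly 88% of 1074), and restricting attention to Beta priors is a simplification; a fully general prior cannot be indexed by two numbers.

```python
import math

SUCCESSES, FAILURES = 945, 129  # assumed split of the 1074 reviews

def prob_above(threshold, a, b, grid_size=20000):
    """P(pi > threshold) under a Beta(a, b) density, by midpoint integration."""
    xs = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_density = [(a - 1) * math.log(x) + (b - 1) * math.log(1 - x) for x in xs]
    peak = max(log_density)  # subtract the peak to avoid overflow
    weights = [math.exp(v - peak) for v in log_density]
    above = sum(w for x, w in zip(xs, weights) if x > threshold)
    return above / sum(weights)

def posterior_prob(prior_mean, prior_weight, threshold=0.85):
    """Posterior P(pi > threshold) for a Beta prior with the given mean and weight."""
    a = prior_mean * prior_weight + SUCCESSES
    b = (1 - prior_mean) * prior_weight + FAILURES
    return prob_above(threshold, a, b)

for weight in (2, 200):
    for prior_mean in (0.5, 0.7, 0.9):
        p = posterior_prob(prior_mean, weight)
        print(f"prior weight {weight:3d}, prior mean {prior_mean:.1f}: P(pi > 0.85) = {p:.4f}")
```

A table or plot of this kind lets readers look up the conclusion under their own prior: with a light prior the answer barely moves, while a heavy prior centred at 0.5 changes it substantially.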
That seems a really interesting approach - as much for the psychology as for the mathematics. It might make it easier to soften a dogmatic prior into an introspection on the evidence/knowledge/belief that underlay the prior.
It is kind of what the frequentist interpretation of Bayesian methods is. They are “just a fancy estimate” for which you can prove things like asymptotic unbiasedness and normality, etc. Fun fact: apart from degenerate cases, Bayesian estimates are not unbiased.
The prior is a distribution though? How are you going to graph a functional?
Exactly, it is very common to have close to no idea about the prior and somehow be able to state something useful in this way.
The Bayesian view doesn't apply very well to statistical physics. There is a concept called the microcanonical ensemble, where we assume the frequency of each event to be equal. From this, we can calculate entropy, and from that one can calculate other physical quantities of interest like temperature. In the frequentist view, no problem arises, as the physical system's properties are independent of the observer's knowledge. Everyone agrees on the temperature of a box of gas. However, in the Bayesian point of view, if someone (say by prior measurements) has extra knowledge about the system, they would not assign equal probabilities to each individual event, causing them to claim a different entropy, temperature, etc., which would not agree with our actual observation. I have seen some effort to fix this; however, the Bayesian view is not as natural as this video makes it out to be.
yeah that’s fair, I see what you mean. Physics is a totally different world than the biostatistics world I’m used to
@PrParadoxy I want to hear more about this! But I would argue that the Logical Bayesian approach has to do with the process of acquiring information on a real phenomenon and updating the status of knowledge of the observer. Its objects are propositions, not the physical system itself. On the other hand, for statistical physics and quantum mechanics, the probabilistic nature of the evolution of the system is a characteristic of the physical model of the system, has nothing to do with the observer (before measurement) and is objective in nature. While the system evolves, there's no update of knowledge to do for anyone. Moreover, when you build the concept of ensemble, one basically starts with the idea of taking an infinite amount of copies of the system. What this means is that when you go and do a measurement of the system, when you want to check if your model is correct or not, the model itself is not deterministic but comes with probability distributions. But Bayesian statistics would apply in the context of characterizing statistical properties of the system (P, V, T, S, U, whatever) given the experimental data with its own uncertainties and the model, which is now non-deterministic.
Does "everyone agree"? Isn't that worse than subjective in that it is also collective? What about the ratio of matter to anti-matter in the universe, how does the apriori assumption work in that case? Wouldn't it be a good idea to consider adjusting the parameter, the theory, because the data doesn't fit well with the prior.
@@bjorntorlarsson Certain physical quantities are only meaningful in an absolute sense. Imagine if someone tells you the temperature of the boiling water in your kettle is in fact 0 kelvin. They reason that they know the microstate of each individual particle, so the amount of uncertainty, or the entropy if you will, that they have about the system is zero. It does not make sense, does it? It has nothing to do with parameterization, really.
@@bjorntorlarsson Firstly you choose your hypothesis H, which is a proposition like "This physical process can be described by this specific model which has these specific parameters". Then the prior gives a probability distribution for the parameters of that theory given all you already know, both about the physical process itself you wanna describe and the model you're trying to describe it with. Then through the likelihood what happens is that your prior knowledge about the parameters of that model is updated and you get a new probability distribution "a posteriori", that takes into account the new data. If you change the model, you consider a totally new prior associated to the new model. In the case of choosing the relative value of dark matter fraction or dark energy fraction, it's not a change in the model. If you want to be totally agnostic you choose a prior so that all the Omega_i sum to 1 but are then free to vary within the allowed range. If instead you've done multiple experiments already on the LambdaCDM model another choice could be to take as your prior the posterior given by the chain of former experiments. I don't see the issue here
Man I just love your videos, keep it up!
Bayesian probability still has the same definition as frequentist probability!!! What you are showing is not a "definition" of probability, it is just Bayes' rule, which says NOTHING of P(A), only of P(B|A). The law of large numbers gives the definition of probability, regardless of what field of maths you study. I feel like this was really misrepresented in the video.
Your comment only shows one thing, you don't understand the Bayesian viewpoint.
Back when I was doing my PhD I had a designated helper from the statistics department to coach me on methods. One day we started talking about Bayesian thinking (he was a frequentist). After trying to do some math with me being confused, he stated (again as a frequentist): If I roll a die covering it with my hand, the probability of it being a specific number from one to six is 1/6. As a frequentist, I say it IS one of the six (it's a physical entity) with said probability; a Bayesian will say we believe it to be something but it isn't until we discover more (remove the hand).
This distinction makes very little sense for his example, but a huge difference for more advanced statistics. (and all the natural sciences that depend on it)
As a sidenote, there is more than one kind of math and they don't always agree. Look it up.
@@foresthobo1166 Whether or not you calculate the probability as a frequentist would, what you are trying to estimate through Bayesian thinking is how likely a specific outcome is to happen. If given the means, you can verify your result using the law of large numbers by running a bunch of experiments. That is probability, regardless of what method of dealing with it you subscribe to. I feel like this is not particularly debatable or hard to understand.
@@xenoduck3189 I’m a probability and statistics professor for physicists, and a little bit of a quantum information scientist. Sure enough, the probability of physical events should not depend on your interpretation of probability. That is, both a Bayesian and a frequentist must agree that the probability of obtaining a particular outcome in a fair die roll is 1/6. However, the frequentist states this due to previous experience with fair dice, whilst the Bayesian does it due to complete lack of information about the result of the die roll. Once the Bayesian sees the result (obtains information about the system), they update their probability distribution to one where the observed result of the die roll now has unit probability (a distribution with zero entropy, i.e. a state of complete knowledge).
An instance which may help understand the difference in the interpretations is the following: for a frequentist, the question “what is the probability that God exists?” does not make any sense and cannot be answered since there is no way of performing trials for the existence of God; something that underlies the frequentist definition of probability. On the other hand, a bayesian may say there is a 50% chance that God exists, since the answer is binary (God either exists or not) and, in this case, a 50-50 probability distribution is the one which best describes the state of complete lack of knowledge (i.e. maximum entropy).
This is incorrect. It makes sense but it isn't correct when you start thinking about events that have eventually but single state.
For example the sun WILL go supernova. 100% this will occur. But the sun WON'T go supernova today. 100% it will not occur today.
So what happens is that the Law of Large Numbers struggles with events like this. The law tells you it WON'T happen (though we know it will) because each day as an experiment suggests it won't but it also somehow tells you it WILL happen because many stars have this fate.
So how do you deduce the odds of the sun going supernova today? If you say it's zero which the LLN suggests and it happens then you were wrong but if you say it's one such as the LLN suggests and it doesn't happen you are also wrong.
Sequenced odds that are not IID do not work with LLN. It's just completely incorrect for more complex systems.
I don't know why you need to say that Bayes' rule isn't the definition of probability? When he brings it up in the video, he is pretty clear in talking about P(A|B). He is talking about the philosophical interpretation between the two viewpoints. Bayesian thinking gives us a different way of looking at our error and at our assumptions.
I think I figured this out when I first came across the Monty Hall problem. You can set up all kinds of scenarios where different people can assign different probabilities to the same question because they have different information. I ended up thinking about probability intuitively as a "measure of ignorance".
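The idea that different information leads to different probabilities for the same doors can be simulated. In the sketch below, the contestant knows the host's rule and always switches, while a hypothetical latecomer sees only two closed doors and picks one at random. The standard host behaviour (always opens a goat door that is not the contestant's pick) is assumed.

```python
import random

def play(rng):
    """One round; returns (switcher_wins, latecomer_wins)."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a door that hides a goat and is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    other = next(d for d in doors if d != pick and d != opened)
    # The latecomer does not know which closed door was the original pick.
    latecomer_pick = rng.choice([pick, other])
    return other == car, latecomer_pick == car

def win_rates(rounds=20000, seed=2):
    rng = random.Random(seed)
    switch_wins = latecomer_wins = 0
    for _ in range(rounds):
        switched, late = play(rng)
        switch_wins += switched
        latecomer_wins += late
    return switch_wins / rounds, latecomer_wins / rounds

switch_rate, latecomer_rate = win_rates()
print(f"informed contestant who switches: {switch_rate:.3f}")
print(f"latecomer guessing between two doors: {latecomer_rate:.3f}")
```

The informed contestant wins about two thirds of the time and the latecomer about half, even though both are looking at the same two closed doors.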
I like the channel, subscribed, keep up the good work.
great point at the end about needing to pre identify the prior _distribution_
(and hence how fast or slow the data will pull toward 'truthiness')
As someone who lives near that Mostra coffee in San Diego, I recommend it!
To understand the beta distribution:
1. Imagine you are sending rockets to an alien planet to see what portion of the surface is covered in water.
2. You can send probes that hit the ground, are destroyed instantly, but send back whether they landed on water or land.
3. Let's say you send down a probe and it says it hit land.
4. If the entire planet were water, there is a 0% chance the probe would say land. If the planet were 100% land, there is a 100% chance of it happening. If you plot the percent of land on the planet on the x axis, and the relative probability that the probe says land on the y axis, you get a line from 0,0 to 1,1. You can imagine the opposite being true if it said water: a line from 0,1 to 1,0
5. If you send another probe and it says water, then you can combine two plots. multiply the land plot by the water plot because at each possible percent of land on the planet, the probabilities are being combined. You'll end up with a parabola.
6. Keep multiplying the right plot if the probe says water or land, slowly you'll get a bell curve where the peak is at the ratio of land and water probes. This is what the beta distribution is.
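A minimal sketch of the probe steps above, in Python with only the standard library. The function names (`probe_curve`, `beta_pdf`) and the 7 land / 3 water probe counts are made up for illustration. Under that setup, the normalized product of the straight-line plots should match a beta density with α = land probes + 1 and β = water probes + 1, peaking at the observed land ratio.

```python
import math

def probe_curve(n_land, n_water, grid_size=1001):
    """Multiply one straight-line plot per probe, then rescale to area 1."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    # Each land probe contributes the line y = x; each water probe y = 1 - x.
    ys = [x ** n_land * (1 - x) ** n_water for x in xs]
    # Trapezoid rule for the area under the unnormalized curve.
    step = xs[1] - xs[0]
    area = sum((ys[i] + ys[i + 1]) / 2 * step for i in range(grid_size - 1))
    return xs, [y / area for y in ys]

def beta_pdf(x, a, b):
    """Textbook beta density, written out with the gamma function."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# Hypothetical mission: 7 probes report land, 3 report water.
xs, ys = probe_curve(n_land=7, n_water=3)
peak_x = xs[ys.index(max(ys))]
max_gap = max(abs(y - beta_pdf(x, 8, 4)) for x, y in zip(xs, ys))
print(f"peak at {peak_x:.3f}, largest gap to Beta(8, 4): {max_gap:.2e}")
```

This only covers whole-number probe counts, so it only reaches integer α and β.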
Except that this interpretation only works for positive integer parameters α, β, whereas the full scope of the distribution works for any positive real numbers α, β. An actual description of the beta distribution would require a bit more complicated “probe sampling” than described here; I'm having trouble thinking up any alterations to the described example at 3a tho.
Why would you risk having a wrong prior when you can simply use the frequentist approach? Genuine question
I think that for any simple problem like this, enough data will make up for any “wrong” prior.
But I’m not sure what it means to have a wrong prior in the first place
(Talking as a complete amateur here) As I understand it, having a "wrong prior" is no less fatal than making certain assumptions that aren't applicable to your circumstances in a frequentist context (given that I strongly believe there's no such thing as "simply" using a frequentist approach... or "simply" using statistics in general😅). Both will lead to the same outcome of misinterpreted or straight up wrong numbers, so I believe this is more of a "pick your poison" situation.
@@very-normal If you believe to have prior knowledge that does not match reality, I would call that a wrong or bad prior, resulting in bad results.
If that’s the case, then I’ll need to rely on someone to call me out on a bad prior. If results are going to be published, they have to be vetted. Why risk someone misusing frequentist statistics when we can force them to express their beliefs via the prior
Most of the time if you have no good guess you actually use non informative priors, which is equivalent to what you suggest. But, even with that, some people would argue that the interpretation of the findings is different, and they would be right
Very clearly and skillfully explained.Maybe one day you will do a series on the logic of science material?
Apart from maybe being easier to understand for some people, I don't get what the bayesian approach adds.
For problems with a very small frequency or sample size your prior is the thing that's going to influence the outcome the most, so you are essentially guessing where the frequentist would say "no idea".
Doesn't sound like a huge improvement to me.
Edit: I guess if you HAVE to make a decision, saying "it's between 0 and 100" is better than saying "it's 50 but I'm probably wrong"
with something like statistics, the value in being easier to understand can’t be overstated
There is an excellent 3blue1brown video that explains a situation where the Bayesian approach is very impactful: health screening tests. Imagine you have an extremely specific test for a rare disease (let’s say the “true” probability of having it is ~1/10,000 people), something that only gives a false positive 0.1% of the time, and a false negative 1% of the time as well. That’s a great test, right? We should give it to everyone to screen for this disease! What’s the harm, right? With such a low error (from the frequentist perspective), most people who test positive will have the disease and be able to be treated.
Well, hold on though, is that assumption true? Imagine giving this test to 1,000,000 people. Since I’ve defined this disease to have an actual objective rate of 1/10,000, about 100 people in this group actually have the disease, with 99 of them being caught by the test and 1 missed. On the other hand, since the test has a false positive rate of 0.1%, about 1,000 healthy people have been given a false positive result.
That means on a test with extremely low type I and II errors, your chance of actually having the disease if you get a positive on the test is only about 10%!
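For anyone who wants to check the arithmetic, here is a small Python sketch of the same screening calculation via Bayes' rule. The function name `positive_predictive_value` is just an illustrative choice; the three rates are the ones given in the example (1/10,000 prevalence, 0.1% false positives, 1% false negatives).

```python
def positive_predictive_value(prevalence, false_positive_rate, false_negative_rate):
    """P(disease | positive test), computed with Bayes' rule."""
    # Fraction of the population that is sick AND tests positive.
    true_positives = prevalence * (1 - false_negative_rate)
    # Fraction of the population that is healthy AND tests positive.
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

ppv = positive_predictive_value(
    prevalence=1 / 10_000,
    false_positive_rate=0.001,
    false_negative_rate=0.01,
)
print(f"Chance of disease given a positive test: {ppv:.1%}")
```

With these inputs the result comes out at roughly 9%, in line with the "about 10%" figure: 99 true positives against about 1,000 false positives.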
That’s incredibly unintuitive from a frequentist perspective, and how one would even go about getting that number and justifying it isn’t really clear. Bayesian statistics, however, bake all of these assumptions into the calculation, so they can be interrogated and updated. That 1/10,000 number was something that I just magically knew in this example, but a Bayesian statistician can get a similar prior probability estimate from any number of sources.
The really important part of this is that it demonstrates the need for multiple screening techniques, because they represent multiple times that the probability that you are positive for a specific disease is updated. This is why, for instance, it is no longer recommended that all women get mammograms after a certain age unless there is some other indication that boosts the probability that they have breast cancer: there were too many false positives, and false positives are not free. They cause stress and anxiety for patients, they cost additional healthcare resources, and they dilute the pool of patients who actually need care with patients who only have been told they might need care.
Surely more people have visited the cafe than have left a review? So repeating the experiment and collecting another ~1k reviews from those who simply hadn't written theirs down isn't all that far fetched. Impractical, yes, but entirely possible. The idea of 'repetition' breaks down much more when we think of data we can't really sensibly resample/remeasure (like a country's annual GDP or the employment rate).
The fact that more people have visited than left a review could be already a bias for the statistics because maybe for some weird reason people writing reviews are already prejudiced in their judgement all in the same way. I guess, this could be modeled in a Bayesian approach but a frequentist just has the plain numbers and cannot take anything else into account.
To name individuals "frequentists" or "bayesians" is probably one of the most misleading things one can do when actually trying to explain this in a helpful way.
Excellent video!! I’m almost done with Bernoulli’s Fallacy myself.
I do want to add that, for what I’ll call “reasonable” priors, the choice of prior doesn’t matter in the long run, as the data will dominate the posterior through the likelihood.
Basically, with Bayesian statistics, we’ll find the truth if we just keep on collecting more data.
Again, thanks for this great summary! I teach both a high school stats course and a high school Bayesian data science course, and this is the best short explanation of the difference I’ve seen. Congrats!
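A rough Python sketch of the "data will dominate the prior" point, assuming a beta prior on a binomial success rate (the conjugate setup). The two priors and the three data sets are hypothetical numbers chosen only to show the trend; none of them come from the video.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Two hypothetical priors: a flat one and a skeptical one centered near 20%.
FLAT = (1, 1)
SKEPTICAL = (2, 8)

# Hypothetical data sets with roughly 88-90% good reviews, growing in size.
DATASETS = [(9, 10), (88, 100), (880, 1000)]

def prior_gap(successes, trials):
    """How far apart the two priors leave the posterior means."""
    flat = posterior_mean(*FLAT, successes, trials)
    skeptical = posterior_mean(*SKEPTICAL, successes, trials)
    return abs(flat - skeptical)

gaps = [prior_gap(k, n) for k, n in DATASETS]
for (k, n), gap in zip(DATASETS, gaps):
    print(f"{k}/{n} good reviews: posterior means differ by {gap:.4f}")
```

The gap between the two posterior means shrinks as the sample grows, which is the sense in which a "reasonable" prior stops mattering.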
But there are many fields of science where you can't collect more data (pretty much the whole of environmental science). So then priors are critical.
@@simonpedley9729 So is this a chicken or the egg situation ? (which really doesn't make sense because eggs came first as other animals had them...like chicken's ancestors....so egg always came first!).
Ok but if you can just keep collecting more data, you can just do a frequentist analysis. If the priors eventually “drop out” of the calculation, all you’re left with is the experimental ratio.
That would seem to be a concession to the frequentist though? That is, the specific reason given for why Bayesian approaches are better is that frequentist assumptions either don't make sense (e.g. long-past or one-off events cannot be understood as having a "frequency") or refer to impossible actions (somehow collecting a brand-new, comparably-sized set and running the "experiment" again, when such data simply doesn't exist). If fixing the problem of a bad prior requires repeatedly collecting data, the Bayesian is now in exactly the same hot water as the frequentist: they both need do-overs that are impossible or nonsensical. Under those lights, the rationale seems to be in the frequentists' favor by parsimony: the Bayesian is embarked, they _must_ commit to a prior, but the frequentist does not. Instead, the frequentist commits to a particular risk of making a mistake.
Now, I'll note that I'm a pretty firm frequentist who does not have a very positive view of Bayesian methods (not least because I find a lot of Bayesian boosters make some strident and excessive claims...), but I think the point still stands. If the Bayesians' problem with frequentist methods is that the latter requires imaginary repeats, why don't they also have a problem with the risk of bad priors for things where we're only able to update our beliefs a very small number of times.
@@ZekeRaiden To add to your final comment about bad priors...the Datta, Mukerjee, Ghosh and Sweeting (2000) paper shows that the error due to having the wrong prior, and the error due to not having enough data, are the same order, O(1/n).
Do you use Manim to make your visualizations ? I love how you work through the concepts and keep the canvas as clean as possible. Keep up the great work 🎉
Yee i am a manim novice
@@very-normal Where to learn manim from your bayesian prior ?
I’m self taught from reading documentation but I’m aware of tutorial videos on TH-cam
I often look at this debate through the lens of physics models, where you can have one model that is simpler and often "good enough" in most scenarios, and another model that is much more complex and able to more accurately describe a larger number of scenarios. Examples being electron orbitals vs electron cloud, or newton vs einstein. Here, I consider the frequentist approach to be the "simpler, good enough" form and the bayesian approach to be the "complex, more accurate" form.
Excellent, concise description of the essential differences between the Bayesian & Frequentist philosophical perspectives with examples. The frequentist methodology as used today is all too often a hybrid mess of two distinct approaches. The physically separate frequentist approaches of Fisher and Neyman-Pearson have been mistakenly combined in a manner which neither Fisher nor Neyman would have approved. Bayesian findings tend to be more intuitive than frequentist results - so much so that frequentist analyses are often interpreted as if in a Bayesian framework! For example, most consumers of statistical information will interpret a frequentist 95% confidence interval in the context of a Bayesian 95% credible interval - as the latter is much more intuitive to understand!
Confusing a confidence interval with a credible interval is just a common error. We can run an experiment to calculate pi and might find a confidence interval of (3.1,3.2), or (2.9,3.1). I guess some people might say that our belief is that the probability pi is in (3.1,3.2) is 95%, but these people are wrong.
You just gained a subscriber love your content 🌿🌿🌿
Did anyone else notice that the “prejudiced” prior distribution toward the end is not a valid probability density function?
manim has trouble drawing beta distributions
In my opinion, these two philosophies can be reconciled by thinking of frequentist statistics as just approaching the problem with a specific prior that is asking "am I X percent confident in posterior outcome A?"
Thanks for your video! I’m by no means a statistician, but I find Bayesian inference to be interesting and valuable in and of itself when you look at statistical learning. When you need to examine and theorize about the process of learning, viewing probability in terms of a belief updating process is extremely useful. So many people get stuck on the “Bayesian stats is subjective”, but if you’re looking at a machine learning model, the point is that over time it can learn and reduce its error over a training process using belief update rules. Is there a frequentist interpretation of machine learning?
> Here we are trying to understand what Mostra cafe's good rating is. We did a hypothesis t-test with 1 degrees of freedom. Since we evaluated only 1 Mostra cafe instance. Realise that we are not comparing Mostra cafe to other cafe's.
> But trying to understand the Good Rating for Mostra cafe in real life based on its Rating in Google.(Which is the sample, not the population)
Using the frequentist approach.
>We realise the google rating for Mostra cafe is around 0.88 (and we are 95% CI the actual real percentage is between 0.859 & 0.899). But we dont really know the actual probability the good rating is at 0.88, We just know that the population mean is 2 standard deviations around it.
The frequentist approach looks like a neutral, pragmatic approach to statistics, while the Bayesian approach is more flexible and adaptive. I'm sure each has its own strengths in different situations. I believe the frequentist approach is good as a first value estimation when you know nothing about the data you're studying, while the Bayesian approach allows you to get more precise results as your understanding gets better. I know what I'm saying isn't mathematically rigorous, but mathematics is always derived from "desired properties", and it very much looks like these approaches were developed for the desired properties they offer.
If you have any objections, let me know I'm very interested in learning more about what you think.
Great video. I always thought the frequentist approach was actually more subjective, because what is truly considered a relevant counterfactual is open to interpretation in many cases (In my view, the Bayesian is more up front about this). Alan Hájek has a lot of great work on this; in fact, Hájek is worth reading in general. What may be interesting to note is that in my field (Electrical Engineering) I find most of us are Bayesians. It may be because there is an intuitive connection between Bayesian statistics and topics in information theory like entropy and mutual information.
Still, there is a promise for a hybrid view: as Roderick Little suggests, “inferences under a particular model should be Bayesian, but model assessment can and should involve frequentist ideas". Also it is interesting to note that Clayton's book Bernoulli's Fallacy borrows quite a bit from ET Jaynes (though I disagree with Clayton on a few points). Jaynes was a great statistician but he was as hardcore a Bayesian as they come.
The idea of a repeated experiment showcases the inherent variability in the parameter estimate, i.e. the sampling distribution. A frequentist assumes that there could be a different data set borne by the same invisible data-generating process (law). Bayesians tend to jump onto data matrices as if those n=200 observations were the one and only realisation possible, without other hypothetical scenarios occurring, as if there were no Heisenberg principle or quantum uncertainty. The frequentist approach reflects the randomness of Nature and unobservability of hypothetical outcomes better: ‘it could have been otherwise’. Finally, Bayesians often make ridiculous distributional claims: ‘assuming the prior normal distribution, the posterior distribution of the linear regression slope estimator is precisely Student with n=198 degrees of freedom', whilst frequentists are much more careful about heteroskedasticity, calibration, coverage probability, and Bartlett correction, which are essential to control the false discovery rate: ‘there is some unknown law, but we can compute some functionals thereof regardless of the joint and marginal distributions, as long as enough finite moments exist for the WLLN and CLT to work’.
Finally, a Frequentist defense.
There’s a lot of Bayesian bullshit using parametrised distributions in their posterior
I have hated the law of large numbers ever since I had the misfortune of learning about it in high school
How would you explain the choice of the likelihood distribution chosen?
I viewed the data as binary, so it needed to come from a discrete distribution. The binomial is the most commonly used family for this, but nothing would stop me from other discrete distributions that also fit binary data. I would lose conjugacy, but there are tools for doing Bayesian things in that case
New found love for statistics....Thank you so much!!
One of the best ad segues I've ever heard. 💀
This video gives the impression that Bayes Theorem is exclusively part of Bayesian probability theory. It's basic set theory and applies whether you are a frequentist or Bayesian. Another issue is that the finite number of measurements applies across all of physics. You cannot, for example, calculate an instantaneous velocity. You can only measure position either side of a finite time interval - and calculate an average velocity. That does not, however, mean that you cannot use calculus in your physical model and cannot use the concept of an instantaneous velocity. Likewise, although you can only repeat an experiment a finite number of times, you can use mathematics to model an infinite number of experiments. We are free, therefore, to use the mathematics of infinite sequences in probability theory. It's not even necessary to believe that there is an absolute underlying probability: only that you can usefully model a scenario using the mathematical concepts of absolute probabilities and relative frequency as the limit of an infinite sequence of experiments. That doesn't need to be practically achievable in order to be a valid mathematical model. Otherwise, physics would have to rely solely on the mathematics of finite numbers!
Finally, I don't agree that a Bayesian can believe that A has a 20% probability and not A a 50% probability. That would be absurd. The priors have to be consistent. In fact, both frequentists and Bayesians are essentially tied to the Kolmogorov axioms.
What is probability? : one also needs to compare and contrast that with 'statistics' as either synonyms or distinctions to help with discussion.
The frequentist 'close enough for practical purposes' get out also isn't great from an engineering perspective either ('when will the bridge fall down?', 'tracking a radar blip', etc.).
I feel that the Bayes formula starts as a 'complicated' (tricky to visualise) formula, and that P(A & B) = P(A|B).P(B) = P(B|A).P(A) is an easier starting point that is just as simple as frequentist counting with the same underlying assumptions (Belief: identically and consistency of the independent events)...
Yes, I totally agree that for large sample sizes the two methods basically give the same answer. They are also compatible in the sense that Bayesians have their own interpretation of maximum likelihood, and Bayesian methods can be analysed in frequentist language. (In fact, it is more natural to understand the limit theorem mentioned in the video in frequentist terms, in my opinion.)
Still, I want to make some remarks.
Firstly, I’m sort of a pluralist. I don’t think probability stands for a single concept. Statements like “what is the probability that this previously unknown sonnet was written by Shakespeare” can be interpreted in a Bayesian way much more generally, while physical problems (see below) make more sense in the frequentist interpretation. Ultimately, there are many things that satisfy the Kolmogorov axioms and have nothing to do with randomness. (Say, the ratio of votes in an election.) It is possible to do probability theory without referring to randomness at all.
There are cases when we do actually talk about frequencies in the world. Ergodicity is a good example. Saying things like “if I know the exact initial conditions, I can calculate the exact ratio of times the coin will land on heads” and therefore “probabilities are purely epistemic” kind of misses the point. I’m not interested in this very particular initial condition. I want to show that this behaviour, where roughly 1/2 of the coin tosses land on heads, is typical for most of the initial conditions. This 1/2 number is a property of the system, and it doesn’t describe the mental state of an idealised, rational observer. (With the obvious objection that of course observations themselves are model dependent, etc.)
Lastly, all the popular interpretations have their own philosophical problems. I don’t know of any interpretation that is not ultimately flawed under greater scrutiny. This is actually very typical when it comes to philosophical problems. (Think about all the different schools of ethics.) I think I like the propensity interpretation of probability the most, but that is not perfect either.
That Brilliant joke was, well, brilliant
Btw - my arguments with statistics do not mean they are useless. Rather, they are frequently and all too easily abused (intentionally or unintentionally). In every instance, the use of statistics as applied to real world analyses must be critically analyzed and scrutinized, starting with the assumptions, presumptions and desires and relative ignorance of those involved. Even when all of that is fair, statistics often goes wildly away from truth or reality.
And researchers often fail to apply even the most basic critical analyses to the results.
Is the population a single population? Is the population linearly, triangularly, normally, Poisson or otherwise distributed? Are there hidden variables? Is the data the result of stochastic events acting on stochastic events? Do the results violate sanity? Do the results suggest results outside the bounds of the analysis? Is the thesis or hypothesis that resulted in the data gathered biased in its own right? Etc...
I got lost. With the coffee shop exercise, "The probability it receives a 4 or 5 star review". Receive from whom? Does it mean it 'has' received by past customers or is it ideally about the coffee shop's track record at the end of time by its reviewers?
Very educational video! 🎉😊
When I use a machine learning algorithm to predict stuff, is it the Bayesian way or the frequentist way? Or something in between, or does it really depend on the data distribution or on the specific machine learning algorithm?
I think it depends on the model. For prediction, I don’t think the distinction matters all that much. I don’t work a lot with prediction but this has been my experience
But for inference, it changes how you do statistics and interpret results
14:20 LOL, always bugs me. "Consider multiple universes...." is both difficult to communicate and interpret. "Is our prediction right or not? Are we 95% sure this number is right?" seems much more meaningful
Bootstrap
You actually don't have to consider multiple universes. Does an integral bug you as well? Also, the are we 95% sure thing would actually be, given our prior, we are 95% sure. Bayesians who champion a prior seem to forget about it sometimes :p
@@ucchi9829 I might be missing some details about integrals, but they've got straightforward definitions? Googling "confidence interval misinterpretation" brings up a lot of results. The issue I have, and maybe it's my own educational one, is that stats/plot libraries output this range and it's "wrong" to interpret it as 95% probability that the metric of interest is in that range.
20:04 - "You can have strange priors, but you're going to have to justify them with evidence."
But in that case, they're not really a subjective prior at all, are they? If they're properly evidence-based, then they're objective, surely? And in that case, the fundamental objective/subjective difference that you'd previously described is no longer there.
An objective prior followed by an objective analysis gives an objective result, and a subjective prior followed by an objective analysis gives a result that is to some extent not evidence-based!
The priors are subjective, because they are still degrees of belief, not real properties of some object. People disagree what is "properly evidence-based" and that is reflected in their priors. Two people with the same base evidence would also have the same subjective priors, but having the exact same evidence isn't really possible for humans.
If you had literally all the information about everything you could theoretically get "objective priors" but in that case you aren't dealing with probabilities anymore but just know the correct answer. An "objective prior" could really only be 0 or 1, because in a Bayesian sense "objective probabilities" don't exist.
@@happyduck1 Of course it's possible to have two humans with the exact same evidence. If you run some sort of experimental trial, the maths shouldn't change on the basis of which particular researcher analyses the dataset that you obtain from that trial. (When I say "shouldn't" I'm assuming of course that science "should" be objective and evidence-based, of course, but that should be taken as given: it's definitional to science!)
I'm not a mathematician or a statistician as such, but I am an engineer with a background in research so it's not like I'm averse to the numerical and the analytical. I periodically try and understand what this supposed crucial rift is between frequentist and Bayesian stats and it always _appears_ to me to come back to two possibilities, as far as I understand them:
a) there's no real difference at all. They're just two different ways of framing the same underlying ideas, which might be practically useful (based on how you expect data to come in over time, for example), but does not mean there's any distinctions of principle between the two. This is the appearance I'm often left with, but my certainty is challenged when I hear statisticians insist that there really are important underlying distinctions of principle that really do make a real difference.
b) there actually is a meaningful difference between the two, and that difference is around this notion of subjective belief in the priors. That there wouldn't be a difference if the priors were truly objective, but they don't have to be, and that is where the difference of principle creeps in. And if that really is the point of distinction between the two approaches, then that strikes me as nothing more than an attempt to "launder" subjective prior beliefs which cannot be stringently evidenced into a result which has the _appearance_ (since it came out of a statistical formula containing lots of actual evidence in the non-prior part) of objectivity. And that seems to me not to be science!
Not a statistician, but I do have a take on this...
The Bayesian method relies on priors which hamstrings the whole practical purpose of analysis. Instead of debating results, people instead debate priors. It just shifts the whole thing from one frying pan to the other.
The simplistic frequentist approach you described is utterly naive. You completely missed the whole concept of random walk.
In practice, the most reliable approach to probability is the non-naive version with a large dataset, or a large set of datasets.
Random walk is critical to understand for the frequentist approach to make any sense.
For example, imagine flipping a balanced coin 4 times (small example, easier to explain)
The naive approach would assume that larger datasets tend towards 50% heads, but this doesn't make sense.
The probabilities are:
0% heads -> 1/16
25% heads -> 4/16
50% heads -> 6/16
75% heads -> 4/16
100% heads -> 1/16
It's a bell curve centered at 50%. With large data sets, your chance of getting the expected 50% result is only around 6/16, but your chances of getting either 25% or 75% is 8/16... Which means the naive approach is more likely to give an inaccurate result!
Random Walk (results steering away) is a huge topic in itself and definitely needs to be accounted for to rely on the frequentist method.
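The 4-flip table above can be reproduced with the binomial formula. The Python sketch below does that, and also evaluates a hypothetical run of 1,000 flips for comparison; the helper names (`heads_pmf`, `prob_within`) and the 5-percentage-point window are assumptions added here, not something from the comment.

```python
import math

def heads_pmf(n_flips, n_heads):
    """Probability of exactly n_heads heads in n_flips fair coin flips."""
    return math.comb(n_flips, n_heads) / 2 ** n_flips

def prob_within(n_flips, pct_points):
    """Probability that the share of heads lands within pct_points of 50%."""
    return sum(
        heads_pmf(n_flips, k)
        for k in range(n_flips + 1)
        # Integer form of |k / n_flips - 0.5| <= pct_points / 100.
        if abs(100 * k - 50 * n_flips) <= pct_points * n_flips
    )

four_flip_table = [heads_pmf(4, k) for k in range(5)]
print("4 flips:", four_flip_table)
print("4 flips, within 5 points of 50%:", prob_within(4, 5))
print("1000 flips, exactly 50%:", heads_pmf(1000, 500))
print("1000 flips, within 5 points of 50%:", prob_within(1000, 5))
```

Under these assumptions, the chance of hitting exactly 50% gets smaller as the number of flips grows, while the chance of landing close to 50% gets larger.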
how would it accounting for it help us understand frequentist methods any better
How come we assume everything is a Gaussian or treat things as if they were? A lot of statistical tests rely on it, but it seems like the conditions for the test to be valid are never fully respected.
central limit theorem
@@very-normal The CLT doesn't always hold though, especially if you're working with a distribution where higher moments diverge. In finance and physics this can happen more often.
@@therealjezzyc6209 Levy distributions, for instance.
@@Impatient_Ape my go-to example is the Cauchy distribution because it looks normal, but its expected value is undefined. It is also the ratio of two independent zero-mean normal random variables, so it's actually easy to unknowingly make a model Cauchy if you start looking at ratios.
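A quick simulation sketch of the ratio point, in Python with the standard library only. It assumes two independent standard normal draws per ratio; the function name `ratio_of_normals`, the seed, and the sample sizes are arbitrary choices for illustration. For a standard Cauchy, the quartiles sit at -1 and +1, so the sample quartiles give a stable check even though the running mean is not expected to settle.

```python
import random
import statistics

def ratio_of_normals(n, seed=0):
    """Draw n ratios of two independent standard normal variables."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(n)]

samples = ratio_of_normals(100_000)

# Quartiles are well behaved: a standard Cauchy has them at -1, 0 and +1.
q1, median, q3 = statistics.quantiles(samples, n=4)
print(f"quartiles: {q1:.3f}, {median:.3f}, {q3:.3f}")

# The running mean is the part that misbehaves: a few extreme ratios can
# move it at any sample size, so it is printed without an expected value.
for n in (100, 1_000, 10_000, 100_000):
    print(f"mean of first {n} ratios: {statistics.fmean(samples[:n]):.3f}")
```

Rerunning with different seeds is a simple way to see how much the means move around compared with the quartiles.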
Have you heard of non-parametric statistics?
To me this seems like the frequentist approach starts with "the experiments is all we know" and therefore you can calculate the probability directly from definition, while bayesian starts with some belief about what we expect and we try to use not just the experiment data but also other knowledge we may have.
Wouldn't then bayesian approach with uninformative prior always reproduce (correctly done) frequentist approach? The frequentist approach is based on the implicit assumption that every possibility is equally likely, with bayesian you don't necessarily have that assumption, you may provide it explicitly though.
what do you mean by every possibility
This may be a dumb response. I think 'every possibility' means the support of the random parameter/variable in question.
It's very arguable if the idea of probability itself is a fundamentally real thing. Like as a mini example the digits of pi behave randomly by every single metric we know, yet they are deterministic and nothing random is happening.
The ultimate goal of probability is modeling unknown outcomes and that can be done in many ways.
So there is no true right option, all we care for is how accurate we can predict things and how interpretable it is to us.
(ps in my eyes Bayesian feels more true to real life and my thinking)
I'm not sure what you mean by "real" here. Casinos make real profits. A digit of pi, selected at random, (it is believed, but not proven) has an equal probability of being any number. Meanwhile, a digit of 50/99 (in base 10), selected at random, will be either 0 or 5 with equal probability. These things seem real to me.
@@weetabixharry I meant the sequence is random but deterministic, if you pick random digits you introduce other randomness. My thinking is multiple things appear to us as something random but if we knew the underlying dynamics we could often agree that probability theory is the wrong approach. Lets imagine an event i can only measure a single time like "Alex immediately says yes if ask him on a date today.", the idea of doing repeated trials is not real unless i have access to parallel universes, and taking other variables into account to refine my guess like comparing with other people i asked gives confidence but doesn't fundamentally reflect Alex choice then. Even if we measured every atom interaction in Alex brain we get into discussions of quantum and chaos theories. So even if our best models say the probability was 50% we cant tangibly experience or measure that 50% since we only see one outcome.
@@seriousbusiness2293 I think I see roughly what you're saying... and it's uncomfortable to think about. I only feel relatively comfortable in the simple cases where the tests are repeatable and the "parallel universes" all behave the same. For example, I need 1000 dice all rolled in parallel to have the same statistical behavior as 1 die rolled 1000 times. And my dice have to have a *known* probability distribution (preferably, perfectly uniform) or I'm gonna panic.
@@weetabixharry haha 😂 i feel ya. Ya in any case im sure that probability theory is an extremely good tool for reasoning and decision making and often close to some Truth. But as soon as we get philosophical about the fundamentals then there is room for doubt.
I think it's comparable to going from Newton's theories to relativity. Having a fixed frame of reference makes the math easy and it works most of the time, but if you care about fundamentals and edge cases you need a relativistic model of physics.
Thinking about dice and cards is more a clean setup like a Newton model that assumes each object has some absolute probability making for an actually very good model. But converting any probability number into a tangible real world concept may not always work and may need a more nuanced idea of what that number means, like in relativity we found that two observers can disagree on a space or time measurement but that gets fixed if you talk about the new concept of space-time.
I would love to have another example with fewer data points, where the results differ much more.
You got my butt with that ad transition 😅
gottem
ok, hope no one's brain was fried😂 19:42
I take offense at that remark against statisticians on their incapacity for violence.
I'd have you know, Sir, that statisticians are just as likely to commit violent crimes but have less probability of being caught because they know how not to become a statistic.
I didn't understand why in the comparison when he mentioned bootstrap he didn't mention or do a one sided freq test
what would doing a one-sided test have changed
If anyone is interested in if there is an objective way to pick a prior probability distribution, you do it with something called "maximum entropy".
And the entropy they refer to is the same one the physicists talk about.
I disagree. In the case of the p parameter of a Bernoulli, that would be the uniform distribution. That, however, depends on the coordinate system you choose, as opposed to other methods like the Jeffreys prior. Maximum entropy arguments in general rely on some assumption of a uniform distribution, even in physics. (Think about the whole combinatorial derivation with Stirling’s formula.)
Ultimately, all models depend on assumptions, be they frequentist or Bayesian. There is no such thing as “purely letting the data talk for itself”.
@@danielkeliger5514 I agree that there is never a way to "let the data talk for itself". I think i misused the term "objective". There are reasons to use the maxent distribution to ensure you aren't adding any "hidden" assumptions to your analysis.
@@Skeleman I totally agree that uninformative priors are great tools for mitigating subjectivity. I just don’t believe in logical positivism :)
I noticed that my interpretation of the frequentist confidence interval is quite Bayesian, and I have seen this often in courses as well. What is your take on this @very-normal ?
These courses are mistaking the frequentist interpretation for the Bayesian interpretation
i know i am probably focusing on the wrong thing, but shouldn't the cafe example be using a one tailed test? 😖
it could be, but it wouldn’t really change the results of the test
@@very-normal it wouldn't indeed. thanks for the wonderful and interesting video 🙏
Another advantage of Bayesian statistics is that the joint posterior allows for the calculation of the marginal distributions for the parameters and probability statements can be made regarding these parameters.
What program did you use to do research
I’m not quite sure what you mean, but I do use Obsidian to collect and organize all my research in general
There is a mechanics analogue to this:
Do you use classical mechanics or include relativistic effects? Depends. If classical is good enough you use that, because relativity reduces to classical mechanics for simple and slow systems.
Frequentist or Bayesian? Same reasoning. Depends. If your problem is described well enough (or perfectly) by frequentist approaches you use that, otherwise Bayesian. Because why would you shoot yourself in the foot intentionally just to do it the more complicated way?
Interesting video. I'm a little bit surprised, though. I'm fairly confident (let's say 0.80) that the uninformative prior for the binomial distribution is a beta distribution with parameters alpha=beta=1/2. I'm using the Jeffreys prior. If there's something I'm missing, I'd like to know.
It doesn’t matter much in this context because there’s so much data that it dominates the posterior.
From my perspective, the prior parameters can represent “past” successes and failures, and Beta(1,1) just says we saw only one of both. Having 0.5 of a success doesn’t make as much sense, but it still works in the end. In a paper, we might justify our priors slightly differently
@@very-normal, I concur that the alpha and beta parameters are directly linked to the numbers of successes and failures. Jeffreys priors are proportional to the square root of the determinant of Fisher's information matrix, so they cannot be as readily interpreted. If other methods for uninformative priors exist, I'm interested. Thanks, and thanks for the video!
Bayesianism is just superior. It allows for straightforward statistical connectives and gives us distributions rather than rigid numbers. It’s just a lot richer and might also lend itself more readily to generalizations of statistics once we understand them better (eg negative probabilities and so on)
I guess probability is derived from a geometric property of our microspace (like general relativity is derived from spacetime geometry). So the frequentist approach is more relevant.
“…For some reason.” 0:37 At least you’re honest about your bias right off the bat.
I thought probability distributions were green’s functions
The Bayesian camp drives artificial intelligence. It is a viable approach by the grace of Big Data. It is a double-edged sword. It can sometimes ferret out subtle patterns that humans would miss, but there is risk of conflating correlation and causation.
The frequentist approach works best if you have a theoretically perfect coin with an exact 50-50 chance of heads or tails. The Bayesian approach works best if you CANNOT be sure in advance whether a coin is loaded or honest, but want to make the best estimate as to the outcome of the next throw, regardless of the uncertain coin status.
Causality and geometric inference for the win, with sometimes some Bayes. Frequency is only good to see what categories of things are trending in time. Nothing else. Correlation for real-world use cases doesn’t translate well outside of that.
Please, I'm begging you, distinguish between [1] Bayes' Theorem, which is the thing you talk about at ~4:45 and which frequentists fully endorse (since it follows from the axioms, the ordinary definition of conditional probability, and classical logic), and [2] Bayes' Rule, properly so-called, which is a claim about how degrees of belief or confidences should be updated in light of evidence -- namely, that belief update should be by way of conditioning on one's evidence, i.e. Pr_new(h) = Pr_old(h|e), where e is the new evidence.
semantics
@@very-normal You say that like it's a bad thing. Surely a healthy science will have well-defined terminology that is designed to be broadly useful and not introduce unnecessary confusions. Right? Anyway, it's true that the labels are incidental, in a sense. They could be called Equation 1 and Equation 2 if you like. But they're also terms with a history, and they are part of the longstanding dispute that your video is ostensibly about. Given that context, it seems important to be extra careful about the terminology. Moreover, the *distinction* has practical implications, even if the labels don't. Frequentists can happily accept Bayes' Theorem and reject Bayes' Rule. In fact, some historically important defenders of a personalist interpretation of probability, e.g. Ramsey, have rejected Bayes' Rule (of conditionalization), but of course, they don't reject Bayes Theorem. Being clear about what you take a Bayesian to be committed to is important for understanding the debate.
@@very-normal I want to be charitable and say you're just trolling.
I like this kind of general framing of Bayes Rule because, unless I'm going mad, the distinction only makes a difference if you're trying to get non-evidenced inputs into your results.
The evidence for your prior belief and the additional evidence used to create your posterior belief could equally well be seen as one big single set of evidence.
That is, if you have a prior probability belief based on your previous evidence from flipping a coin 20 times, and then you update it based on additional evidence from flipping it 20 more times, you might just as well call that total evidence a set of 40 coin flips. It won't matter if you analyse it in a single calculation including all 40 flips or as a stepwise process of expanding knowledge, 20 flips then 20 more.
If your entire calculation is based on proper evidence, you can subdivide this total evidence however you like for stepwise calculation; it shouldn't make a difference to the result.
So it seems to me that the only time it _would_ materially differ between the two framings (one big calculation, or lots of little stepwise calculations) is if your analysis _isn't_ actually based on proper evidence back through all of those constantly updating steps. If you are trying to sneak an initial non-evidenced belief into the analysis, obviously you can't do the alternative calculation, as you can't analyse one big total 'dataset' which is actually a mixture of both data and non-data. (The obvious follow-up question then is why you are trying to!)
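The 20-then-20 versus 40-at-once point can be checked with a small sketch. It assumes a conjugate Beta prior on the coin's heads probability and uses invented flip counts; neither the prior nor the counts come from the comment above.

```python
from fractions import Fraction

def update(prior, heads, tails):
    """Conjugate Beta-Binomial update: add the observed counts to the Beta parameters."""
    alpha, beta = prior
    return (alpha + heads, beta + tails)

# Hypothetical data: two batches of 20 flips each (counts invented for illustration).
first_batch = (12, 8)    # 12 heads, 8 tails
second_batch = (9, 11)   # 9 heads, 11 tails

prior = (Fraction(1), Fraction(1))  # Beta(1, 1), an assumed starting prior

# Stepwise: 20 flips, then 20 more, reusing the first posterior as the next prior.
stepwise = update(update(prior, *first_batch), *second_batch)

# All at once: treat the evidence as a single set of 40 flips.
total_heads = first_batch[0] + second_batch[0]
total_tails = first_batch[1] + second_batch[1]
batch = update(prior, total_heads, total_tails)

posterior_mean = batch[0] / (batch[0] + batch[1])
print(stepwise == batch, stepwise, float(posterior_mean))
```

Both routes end at the same Beta parameters, so the only place the two framings can differ is in whatever went into the starting prior.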
When I use reviews I just go through the two, three and four star reviews until I find a few that are worded and written in the way that I write and speak and think. Basically I'm looking for someone who has the same personality as me and trying to judge that through the way they leave comments, which I think is actually probably a pretty robust method given the way I speak and write.
Anyhow, I make a choice based on those few reviews alone, because I don't really care what somebody thinks about something if we have literally nothing in common: what determines if something is good or bad to that person is not going to resemble what determines if something's good or bad for me.
what would you do if none of the reviews talk like you
During this video I just started to hate the frequentist approach; they just simplify everything as if it's all independent. Bayesians give a guess and can iteratively get to the right probability by Bayesian updates, taking into account all the complex stuff the world offers. While with the frequentist approach you need to take a lot of trials.
A lot of trials leads to an average of outcomes, which can be analyzed better than a focused analysis done one time.
Does constant bayesian updating also not require a lot of experimentation and trials? Not defending frequentism, but your reasoning doesn't make sense.
@@therealjezzyc6209 That's alright.
I use whatever, never thought there was a beef.
But in real world averages are fine.
You cannot expect to inspect every little event or data or records one by one in its details;
Hence generalizations beat specialization.
@@AkshayKumar-vd5wn averages aren't exactly fine in the real world though, because not all distributions have finite expectation and variance. It depends on your domain. For example, the ratio of two independent zero-mean normal variables is Cauchy, whose expected value is undefined. This means that if you build a model which ends up requiring a ratio of two samples, then your sample means might not converge at all. You will need to use extreme value measurements rather than expected values, and estimate the median instead. This actually happens a lot in finance and other complicated modeling because you are working with heavy-tailed distributions, so outliers occur frequently enough to throw off your samples. Although this is just me being pedantic, I'm sure you get the point, and a lot of things end up being normally distributed (but a lot of things don't). Typically averages are only good as long as the central limit theorem holds, and from the frequentist perspective you cannot know whether your distribution has finite variance or expectation before performing your trials. Which means you might never converge to your desired probabilities and be wasting your time.
idk what you meant in your last paragraph about inspecting everything at once though.
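A rough simulation of the ratio-of-normals point, as a sketch only: it assumes two independent standard normals, and the seed and sample sizes are arbitrary choices made for illustration.

```python
import random
import statistics

def ratio_of_normals(rng):
    """Ratio of two independent standard normals, which follows a standard Cauchy."""
    return rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0)

rng = random.Random(0)  # arbitrary seed so the run is repeatable
samples = [ratio_of_normals(rng) for _ in range(100_000)]

# Print the mean and the median of the first n samples for growing n.
# The median settles near 0; the mean has no such guarantee for a Cauchy.
for n in (100, 1_000, 10_000, 100_000):
    subset = samples[:n]
    print(n, statistics.fmean(subset), statistics.median(subset))

sample_median = statistics.median(samples)
```

The design choice here is to track the median alongside the mean, since the median is the location estimate that stays well behaved for this distribution.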
@@therealjezzyc6209 Yeah, in some sense they are two sides of the same coin, so a lot of things are in common. In machine learning we love Bayesian updates, and I might be biased by my field of study. But I feel that's the right approach to problems.
So, in short:
1) Governor et al show gross incompetence and broadcast private information of their employees.
2) Governor et al misuse their power in order to cover up their mistakes and silence the witnesses with threats and false accusations.
3) Governor et al attempt to influence the legal process they falsely instigated in order to get at an innocent journalist that did them a favor.
4) After being publicly proven wrong, the governor et al persist in their defamation and malicious prosecution of the journalist.
... Hold on. Doesn't this exact playbook resemble the actions of a certain yellow gorilla? It seems the societal rot does spread from the top down.
I never understood what's going on with this "choice of significance level" stuff. In social sciences it's often 5%, in particle physics it's 0.000something. Doesn't it imply that there is a third choice? To take an example from our unfortunate days: A soldier has to either move now, or stay put. Wouldn't a 49% significance level decision be better than the alternative?
Astrophysics is by the way the only instance where I've seen clever people draw conclusions based on data which in their diagrams have an error bar that is taller than the Y-axis. So there's physics, and physics. There's stuff in space that we don't know much about, for obvious reasons.
It does imply that there is a spectrum of choices. The significance level is just a measure of how rigorous you should be. When looking at a particular situation, you are probably not seeing every variable inside that system (the system is huge and complex), leading to bigger errors and higher variation in the observed impact of each independent variable on the dependent variable.
So for more controlled environments, and when trying to prove theorems and turn them into laws or verified characteristics, you need to be more certain. Therefore, there is a higher level of strictness (confidence level).
In financial forecasting it is normal to have higher randomness associated with a bigger and more complex system, especially at smaller time frames, which leads to accepting lower levels of confidence in forecasts.
@@calloftrading It would be nice if there was a way to quantify which confidence level to use. Taking it from the other way, and simply accepting an outcome together with its confidence level, whatever it is, isn't popular. It's looked down upon.
But if one has to make a choice, as things are in reality, then the confidence level seems to me to be as relevant a parameter as the expected value and the measure of spread. I don't quite get why the confidence level should somehow be picked first, and only then the rest of the parameters be evaluated, given a binary within-or-not of such an arbitrary significance level.
Isn't, btw, all of this olden Gaussian way obsolete now that machine learning fits patterns on big data without considering stuff that was once invented only because it made data analysis simple and practical given the limits of ancient tools?
@@bjorntorlarsson You can always fall back on the p-value, which tells you the significance level at which the result would stop being significant.
Frequentist view prohibits all notions of epistemology. So it fundamentally has no meaningful way to talk about evidence or partial knowledge.
It's the reason why meta reviews are phrased so awkwardly, compared to something like civil court cases ("judged by the weight of the evidence").
A lot of it is hammers vs wrenches. There are plenty of cases where subjective Bayesian isn't appropriate at all. If a drug company did a clinical trial, and proved that their drug works, based on an analysis that involved their own subjective prior which assumed that the drug works, would you believe them? If someone is trying to prove that climate change affects x, and they use their own prior which assumes that climate change affects x, would you believe them? These examples illustrate that objectivity is sometimes really important (where objectivity means: reducing arbitrary decisions as much as possible... clearly nothing can be completely objective). On the other hand, there are plenty of situations where you should be including subjective prior information.
There is also a whole field of statistics which is frequentist Bayesian methods, which to some extent takes the best of both worlds. It uses Bayesian methods, but has the objectivity of frequentism.
The real problem in statistics is over-use of maxlik, which is neither frequentist nor Bayesian.
That’s fair, but to clear something up: priors in clinical trials are often done with past studies in mind and with input from field experts, they’re not often made purely from the beliefs and feelings of a sole statistician
@@very-normal yes…there’s a big philosophical distinction between subjective priors and priors from previous studies
Is Han "Never tell me the odds" Solo a Bayesian?
Your Hypothesis should be one-sided in classical statistics p>=.85
What prior could you use to account for the fact that the tails are probably heavy. I.e. 3 star reviews are a lot rarer than 1 and 5 star reviews?
You could set the first parameter higher to reflect this. You could choose one to have a particular prior mean to reflect your thoughts on how rare/common the reviews are
"A CI doesn't tell us if it contains the true value of pi or not; you can only know that if you repeated the experiment multiple times then most of them will." Can you explain this statement? I didn't get it.
The definition of confidence is the proportion of intervals that contain the true parameter value. Different experiment repetitions will produce different datasets, so the ends of the intervals will change depending on the data.
In the same way that choosing a 5% level means you only get a type-I error in 5% of experiments, the confidence interval will contain/cover the value of the true parameter in 95% of experiments. There’s no guarantee that you know the one you calculated actually contains it or not
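The repeated-experiment reading of confidence can be sketched with a simulation. The sample size 1074 is the one mentioned in this thread; the "true" proportion 0.88, the Wald interval formula, the seed and the number of repetitions are all assumptions made for illustration.

```python
import math
import random

def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald form)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p, n, repetitions, seed=0):
    """Fraction of simulated experiments whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One simulated experiment: n independent yes/no reviews.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repetitions

rate = coverage(true_p=0.88, n=1074, repetitions=2000)
print(rate)
```

Each simulated experiment produces a different interval; the 95% describes how often the procedure covers the true value across repetitions, not whether any single interval does.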
The problem with a lot of the current faddish enthusiasm for Bayesian analysis is that some people are pretending to have very specific, numerical priors that are OBVIOUSLY just pulled out of thin air, at which point, it is unclear what point there is to hearing out the rest of their alleged "analysis".
i was not aware bayesian analysis was a fad lol
@very-normal It's no doubt not a fad amongst actual statisticians but it seems to have become a mostly rhetorical gimmick in other fields, including debates over the historicity of religious figures, of all ridiculous things.
The problem with Bayesianism is the assumption that the data will conform to these parametric distributions, in the real world this is never the case.
i think that’s a general problem for statistical models
Bayesian propaganda 😂
propaganda for great posteriors
I don't really know this very well, I just remember having read it somewhere so maybe it's completely wrong, but I thought the common "uninformative" choice for the beta distribution is alpha=beta= as small as possible?
IIRC the theoretically optimal choice is alpha=beta=1/2 (I'm sure you'll eventually talk about that) but I've seen people argue it really should be alpha=beta=epsilon, so like 1/10 or even 1/100 basically.
It's impossible to set both parameters to 0 but just in principle I could have the *effect* of that, I think, by fixing my initial prior as "a beta distribution with alpha=beta=0" without worrying about the issues with that, and then just following the regular update rules and go from there, right? It's like a truly limiting-case uninformative prior I think?
Or is there a good reason not to do this?
That’s a great question. In my experience, I’ve only seen Beta(1, 1), but most of my experience is in clinical trials, so maybe customs are different elsewhere?
My understanding is that your initial prior parameters also influence how much the data will influence the shape of the posterior. Parameters 1 and 1 suggest you know absolutely nothing with discrete trials. But parameters 100 and 100 still look uniform but suggest you had 200 trials that went both ways the same amount of times. Data will influence the shape of the former more than the latter.
Not a complete answer but I hope it helps a little bit
@@very-normal reading up a bit about it now: alpha=beta=1 is the Bayes-Laplace prior, and alpha=beta=1/2 is the Jeffreys prior, which comes from a specific construction: the prior is taken proportional to the square root of the Fisher information, which makes it invariant under reparameterization. That's where my suggestion about alpha=beta=1/2 came from.
There also is Kerman's "Neutral" prior alpha=beta=1/3 and the limiting case, Haldane's prior (alpha=beta=0)
The higher alpha and beta are, the more the prior influences the posterior, so in that sense, if you want literally no influence on the posterior, you really ought to go with Haldane's. In that case, the posterior mean equals the maximum likelihood estimate, but there are also plenty of people arguing against that choice.
For very small datasets the "uniform" choice alpha=beta=1 can be a pretty strong bias, but of course if you have LOADS of data it's gonna be fine.
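To make the comparison of these priors concrete, here is a small sketch of the posterior mean under each one. The review counts are hypothetical, chosen only to contrast a tiny dataset with a large one; note that the Haldane posterior is only proper when both counts are positive.

```python
from fractions import Fraction

# Prior (alpha, beta) pairs named in the comment above.
PRIORS = {
    "Haldane (0, 0)": (Fraction(0), Fraction(0)),
    "Kerman (1/3, 1/3)": (Fraction(1, 3), Fraction(1, 3)),
    "Jeffreys (1/2, 1/2)": (Fraction(1, 2), Fraction(1, 2)),
    "Bayes-Laplace (1, 1)": (Fraction(1), Fraction(1)),
}

def posterior_mean(alpha, beta, successes, failures):
    """Mean of the Beta(alpha + successes, beta + failures) posterior."""
    return (alpha + successes) / (alpha + beta + successes + failures)

def compare(successes, failures):
    """Posterior mean under every prior in PRIORS for the same data."""
    return {
        name: posterior_mean(a, b, successes, failures)
        for name, (a, b) in PRIORS.items()
    }

small = compare(successes=3, failures=1)       # hypothetical tiny dataset
large = compare(successes=900, failures=100)   # hypothetical large dataset

for label, result in (("n=4", small), ("n=1000", large)):
    for name, mean in result.items():
        print(label, name, round(float(mean), 4))
```

With four observations the choice of prior visibly moves the estimate; with a thousand, the four posterior means differ by less than 0.001.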
That’s interesting, I hadn’t heard of these before. It definitely highlights the fact that choosing a good prior isn’t trivial, something I chose not to include in the video
I think the problem with setting both parameters to zero is that you're not "skeptical" of the data.
Suppose you find a restaurant that has a single, positive review. Would you consider that to probably be a better restaurant than one where 990 people leave positive reviews and only 10 leave negative reviews?
Ultimately it depends on how likely you consider any proportion of positive reviews to be. Personally, I'd say that parameters of beta=0.5 and alpha=2 work pretty well in this case. Ideally, you would find the exact rating distribution of any coffee place and use that.
Also keep in mind that alpha=beta=epsilon means that you think it's either zero or one, with no middle ground. It means you don't expect the value to be a probability but merely a true/false with some accepted error.
There is a looong section on Wikipedia about the Beta distribution titled Bayesian Inference, where it compares a bunch of choices for an uninformative prior and quotes a bunch of works by different people. Most often it seems that the Jeffreys prior is favored by theorists, at least as presented on that page.
Can you talk more about bootstrapping?
I have an earlier video about it, but I think my better explanation is in my “biggest prize in statistics” video. It’s in the first chapter on Bradley Efron
The basis of the science crisis
Bayesian statistics reminds me of Kalman filters to a certain degree. It also seems to me that frequentist statistics is the limit of Bayesian statistics as you gather more data points.
Or that frequentist is bayesian statistics with non informative priors (keeping only with the likelihood function)
The Kalman filter is a direct application of Bayes' rule. In fact, there is evidence suggesting that Laplace may have applied a similar approach in his calculations of planetary orbits.
@@WeirdPatagonia Using non-informative priors is very different to keeping only the likelihood function. This is especially obvious when you condition your posterior on small samples.
@@xavierlarochelle2742 Strictly speaking, you are right; in practice, it depends on the sample size, as you say. I haven't encountered a difference yet, but it is also true that most of my analyses are with medium/big datasets. Thanks for your comment
Couldn't the null hypothesis be an inequality? It would have been more logical to ask whether mu > 0.85. You have an MLE of 0.88, and with a t-test you get your p-value and confidence interval, but instead of the p-value you could take 1 - p_value to get a probability similar to the one from the Bayesian side? I know the Student distribution of the test statistic is very different from the Bayesian posterior, but I would make this kind of bad leap in reasoning intuitively. ^^ Isn't there a test statistic for inequalities?
Sorry it's pi not mu, need to practice my greek alphabet.
And it should probably be 1 - p_value/2 for the symmetry.
You could use a composite null hypothesis actually! You’d end up with a one-sided test. I’m aware of other tools for composite null hypotheses, but they’re usually outside the scope of what most statistics users would be familiar with
I have trouble when you say "you can have strange priors, but you're gonna need to justify them with evidence". There is no rigorous method of assessing whether verbal statements such as "I have a heavy prejudice against cafes like mostra" produce valid or invalid priors. If we cannot have rigor in determining the validity of priors presented in a Bayesian analysis, then we are no longer considering logic and are instead considering rhetoric and argumentation, which the frequentists are very right to point out as being a major flaw.
Is this voice AI? Really good quality. If you could provide a bit more info about how you synthesize this voice, I would be happy to share it with my university team, who use AI narrators.
nah that’s just my voice with some post-production lol.
I’ve just looked up YT videos on how to do it, here’s one that I’ve used: th-cam.com/video/6R1Hr2f_rCQ/w-d-xo.htmlsi=iEWKAo8DYuj-axth
10:30 and it all goes to 💩 if some unknown number of reviews are fake...
You might take it to the extreme and say the unknown was a black swan event. That's just your excuse. You chose to turn a blind eye and hope no major factor was overlooked. And that's human behaviour.
I will take statistics seriously when the introduction, basic entry example solves for conscious agents.
It works for dead matter, but not if there is consciousness.
bro chill it’s just coffee reviews
I still don't understand why the Bayesian method is not susceptible to manipulation and subjectivity. You claim that even if I arbitrarily choose the initial probability, it only makes sense if it is supported by evidence. But where does that evidence come from? From the frequentist method, right? Because if it's from the Bayesian method, then I'm stuck in a circular argument... am I not?
If a past study uses a frequentist method to analyze the data, then a new prior should be formed to reflect what that study found. For example, if a past study found the probability to be 70%, then my new study should probably use a prior that makes values on and around 70% more likely.
If past studies use a Bayesian analysis, then it’s even easier. The posterior from the past study becomes the prior in the new study.
The past data helps inform the prior, not so much whether the method was frequentist or Bayesian. You’re right that it can be hard and arbitrary to choose a prior, but that’s not a reason to abandon the method in the first place. Classic frequentist methods don’t work well with smaller sample sizes, yet people are taught to use them anyway
th-cam.com/video/mZBwsm6B280/w-d-xo.html
This is a video on Bertrand's paradox. As a physicist, I am not surprised that Jaynes was a physicist. In the video, each of the methods he describes, leading to different probabilities, could correspond to an experiment... a different experiment. This is of course important background information.
One major issue with frequentist statistics is that it only considers the total count of events and not their more detailed order. It would consider a coin that did 1000 heads in a row and then 1000 tails to have the same behavior as a regular coin even though that is clearly wrong.
As usual for a bayesian video, there is much bias towards complexification.
First, the test should be one-sided: you requested at least 85%, so please have the courtesy to do the correct one. That divides the p-value by 2 from scratch. Then, you do not need a confidence interval at all; you have the p-value. What the test tells you is that from a sample of 1074 people, there was a probability of 0.27% to get the data you got if people were giving 4 or 5 stars _less_ than 85% of the time (by the way, this is how you got the 99.7% "that only Bayesian gives you", supposedly...). This is the frequentist approach, and it deals with facts and makes two assumptions: independence of the users' choices and validity of the CLT. Then from that p-value, you can do what you want; you are not even obliged to do anything, because so far you have only collected data and done maths. Once the computation is done, you can _finally_ go philosophical and decide you do not live in a universe where you got unlucky enough to be in the 0.27%.
There is no binomial, no beta, no prior, no "I don't have an idea of my prior, so I'll use uniform distribution but I will call it Beta(1,1)", no some god of philosophy told me that "no idea" meant the existence of a uniform distribution in the realm of ideas, etc...
Frequentists work with facts and try, at least when they're not psychologists or marketers, to be rigorous, not forgetting the assumptions they made. They use stats to falsify theories, and they don't put probabilities on theories, which remain true or false. Bayesians do decision making, using a tool that always works, always getting an answer whatever the question was. It is very good for investors who want to use some maths and have a magical tool that allows them to propose a strategy with some appearance of seriousness, and it will work whenever they were lucky with their priors. But at the end of the day, either your posterior "probability" depends a lot on your priors, and you only put a number on your feelings, or it doesn't and you didn't need to go Bayesian. Frequentists don't deal with philosophy. Bayesians do and must.
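The halving of the p-value described above can be sketched with a large-sample z-test for a proportion. The sample size 1074 and the threshold 0.85 come from this thread; the success count 945 is a hypothetical stand-in for a proportion near 0.88, so the printed p-values should not be read as the video's numbers.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def proportion_z_test(successes, n, p0):
    """Large-sample z-test of H0: p = p0, using the standard error under H0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    one_sided = normal_sf(z)            # alternative: p > p0
    two_sided = 2 * normal_sf(abs(z))   # alternative: p != p0
    return z, one_sided, two_sided

# successes=945 is an assumed count, not a figure from the video.
z, one_sided, two_sided = proportion_z_test(successes=945, n=1074, p0=0.85)
print(z, one_sided, two_sided)
```

When the observed proportion lies above the threshold, the one-sided p-value is exactly half of the two-sided one, which is the factor of 2 the comment refers to.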
wow
It's a shame this isn't higher but that's probably to be expected in a channel/comment section so heavily biased to one approach. The number one suggested video to follow this is literally called "the better way to do statistics"
His entire interpretation of the "frequentist perspective" was purposefully limited and he tried to divorce it from reality and naturally occurring events. I'd go as far and argue that his interpretation of how to report a confidence interval was bordering on incorrect. It can, and should be, phrased practically identically to the way he talked about credibility intervals later. The entire point is that you can't know something perfectly to arbitrary confidence and the estimation of true probability can only be refined. A confidence interval is the way of quantifying this spread of uncertainty. He even contradicted himself on the definition of "repeated experiment". First he defines experiments as events that produce individual data points and then he's purposely obtuse and redefines repeating the experiment to gather another 1074 reviews.
Really should have partnered with someone else to present the other side. An entire video with straw men is boring
feel free to make that video with more correct frequentist teachings, more good statistics videos wouldn’t hurt on TH-cam
@@very-normal I think that might hurt my current content algorithm stuff yknow ;)
I haven't seen the video, but the solution is easy, just apply the law of big numbers
I’m going to start calling it that from now on
@very-normal at least in my language, that's the name we call it in school, that for big samples, we get closer to the true value
"Big number then true"
It's kind of a meme
en.m.wikipedia.org/wiki/Law_of_large_numbers
Unfortunately, it has a more rigorous definition
Nvm, you mentioned it, i was a frequentist as a joke by accident
Is that really the law of large numbers? Or just the definition of convergence in probability?
Doesn’t the law of large numbers say that the sample mean converges in probability to its expected value?
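A quick sketch of what that statement looks like in practice, assuming independent yes/no outcomes with a success probability of 0.85 (an illustrative value) and an arbitrary seed:

```python
import random

def running_means(p, checkpoints, seed=0):
    """Sample mean of Bernoulli(p) draws, recorded at the given sample sizes."""
    rng = random.Random(seed)
    total = 0
    means = {}
    for i in range(1, max(checkpoints) + 1):
        total += rng.random() < p   # True counts as 1, False as 0
        if i in checkpoints:
            means[i] = total / i
    return means

means = running_means(p=0.85, checkpoints={10, 100, 1_000, 10_000, 100_000})
for n in sorted(means):
    print(n, means[n])
```

The sample mean wanders at small n and tightens around the expected value as n grows, which is the convergence the law of large numbers describes.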
Another interesting example is the number of permutations of n distinct elements, such that none of them stays in its original position. The answer happens to be the closest integer to n!/e.
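The n!/e claim is easy to check numerically. This sketch counts such permutations (derangements) with the standard recurrence D(n) = (n-1)(D(n-1) + D(n-2)) and compares against the rounded value; the function name is just an illustrative choice, and the check starts at n = 1 because the rounding rule does not hold for n = 0.

```python
import math

def derangements(n):
    """Count permutations of n items in which no item keeps its original position."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    prev2, prev1 = 1, 0  # D(0), D(1)
    for k in range(2, n + 1):
        prev2, prev1 = prev1, (k - 1) * (prev1 + prev2)
    return prev1

for n in range(1, 11):
    print(n, derangements(n), round(math.factorial(n) / math.e))
```

The floating-point division is only reliable for modest n; for large n the exact recurrence is the safer way to get the count.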
You are offered a pair of loaded dice with an assertion of their 'loading'. Can you believe them, and how much should you pay to test them before buying. Should you start by assuming the dice are unweighted (and the sale is a confidence trick), or that the dice are weighted as offered.
PS the con artist (?) did a single dice throw, to show you, before stating the weighting...
i don’t take dice from strangers