Logistic Regression in R, Clearly Explained!!!!

  • Published on 2 Jan 2025

Comments •

  • @statquest
    @statquest 3 years ago +28

    Here's the link to the code: github.com/StatQuest/logistic_regression_demo/blob/master/logistic_regression_demo.R
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @falaksingla6242
      @falaksingla6242 2 years ago

      Hi Josh,
      Love your content. It has helped me learn a lot and grow. You are doing awesome work. Please continue to do so.
      I wanted to support you, but unfortunately your PayPal link seems to be broken. Please update it.

  • @MuctaruKabba
    @MuctaruKabba 4 years ago +42

    Your videos never disappoint, Sir. I have gone through many of them and think you've earned the right to brand the phrase: "clearly explained" because your explanations are indeed very clear. I am building a better explanation of statistics thanks to you. I appreciate you and hope you continue to pass on the knowledge.

    • @statquest
      @statquest 4 years ago +4

      Wow, thanks!

    • @zhansayabauyrzhanova2492
      @zhansayabauyrzhanova2492 3 months ago

      I don't understand why you used both categorical for logistic regression??
      7:00

  • @holeman1
    @holeman1 3 years ago +28

    This 89-year-old guy says BAM!! So clearly explained, indeed. DOUBLE-BAM!!!!

    • @statquest
      @statquest 3 years ago +3

      BAM!!! And thank you for your support!!!!

  • @emilyblythe7708
    @emilyblythe7708 6 years ago +91

    where have you been my whole thesis! thank you!!

    • @statquest
      @statquest 6 years ago +9

      Hooray! I'm glad to help! :)

    • @amandacampos3037
      @amandacampos3037 4 years ago +1

      I feel the same!! hah

  • @wei2674
    @wei2674 4 years ago +26

    Thank you so much Josh for all these videos! I got an A+ in most of my stat courses quite a few years ago when I was doing my MSc in Biostat, but it took me quite some time to come up with a better understanding of a few concepts. You just summarized and presented these ideas and more in a few minutes! You are a genius and on top of that, you are so kind to share all this work with everyone for free! With my limited vocabulary, all I can say is THANK YOU! It makes me feel the world is a beautiful place with beautiful minds and souls. I love your song “hello”, it reminds me of the day I met my daughter and brought happy tears to my eyes :)

    • @statquest
      @statquest 4 years ago +1

      Thank you so much!!! I'm really glad you like my videos and my music. :)

  • @chasti5754
    @chasti5754 3 years ago +13

    I just wish one day all this information actually stays and sticks to my mind... thank you though! Your videos are amazing!

    • @statquest
      @statquest 3 years ago +1

      Thanks for watching!

  • @alexandergeorgiev2631
    @alexandergeorgiev2631 4 years ago +2

    You are an absolute life saver. My data science paper is due in two days and now I have my pretty log graph and I understand this better. DOUBLE BAM!!!!!

  • @SurrenderPink
    @SurrenderPink 4 years ago +5

    Josh, it’s Saturday morning here and I’m enjoying a cup of Bam! learning R from the best teacher on the planet. I’m so grateful and appreciative of your efforts to share your considerable talents with us!

    • @statquest
      @statquest 4 years ago

      Thank you very much! :)

  • @meniz4659
    @meniz4659 4 years ago +16

    You will surely be in my thesis acknowledgments. Thank you for making our lives relatively easier but truly more intelligible. BAAAAAM!!

    • @statquest
      @statquest 4 years ago

      Thanks so much! :)

  • @nathanielchristian7027
    @nathanielchristian7027 5 years ago +3

    Your simple English explanation of the meaning of "Intercept" in the output from 8:30 to 8:38 of this video was something I could not find after searching for 2 hours. Thank you!

    • @statquest
      @statquest 5 years ago +2

      Awesome!!! Now that you have that concept down, a lot of other stuff in statistics should make more sense. (At least I hope!) :)

  • @solalstenou6474
    @solalstenou6474 6 years ago +1

    What is great with your video is that even if I forgot my headphone I am able to follow the video in the computer room full of other students! Thank you so so so much !!!! From University of Bordeaux

    • @statquest
      @statquest 6 years ago

      Solal Sténou Merci!! :)

  • @marielledelcarmencaballero5017
    @marielledelcarmencaballero5017 2 years ago +1

    Your videos are great! It's also so nice of you that you take the time to reply to so many of the comments here!

  • @daviddevega4433
    @daviddevega4433 4 years ago +2

    Thank you very much for all of this. You have saved me from failing my exams. Amazing quality channel. Unbelievable, the low number of likes. A very appreciated channel, at least for me. Thanks again.

    • @statquest
      @statquest 4 years ago

      Wow, thanks!

  • @wei2674
    @wei2674 4 years ago +8

    Both my husband and I learned so much from your video. (Inspired by the top comment:) whenever you come to Toronto, let us know for some free accommodation in our neighborhood, surrounded by Asian restaurants and bubble tea (North York Centre)!
    Thx again!
    Xin

    • @statquest
      @statquest 4 years ago +2

      Hooray!!! That would be awesome. I will dream of the day I can visit you in Toronto. :)

  • @wa5561
    @wa5561 2 years ago +1

    Thank you for saving my study. Not gonna lie, this video made me cry. I was about to drop out because of statistics, but this saved my project.

  • @BruceWayne-oc7dn
    @BruceWayne-oc7dn 3 years ago +1

    It's 1:11 AM and what I am doing is DOUBLE BAM. Thank you for this awesome video. You are a hero.

  • @dodgecarlincila879
    @dodgecarlincila879 3 years ago +3

    I was just here for the logistic regression but bam!! I would be watching all of your videos. As a ds learner using r, double bam!!!, your videos will surely help big time! Bambambam! 👌😅
    Thank you. 🙂

    • @statquest
      @statquest 3 years ago

      Awesome! Thank you!

  • @alhaque7556
    @alhaque7556 2 years ago +1

    Thank you so much! I've a stat project to do in R with logistic Regression and this simplified the coding portion so much!

  • @Fsp01
    @Fsp01 3 years ago +2

    Doing a masters program on analytics and this video made more sense than all the lectures combined on logistic regression. thank you

  • @565-FENRIR
    @565-FENRIR 2 years ago +2

    I really enjoyed the clear way you explained this topic to us. Many thanks for the teaching!!!

    • @statquest
      @statquest 2 years ago

      Thank you very much!!!

  • @japhethernandezvaquero204
    @japhethernandezvaquero204 4 years ago +3

    Nice channel to land on! Happiest discovery of my 2020! Great job!

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @zahraab1027
    @zahraab1027 4 years ago +5

    "one last shameless self promotion" got me 😂😂😂.....that's why I love your videos, u make learning stats fun

    • @statquest
      @statquest 4 years ago +1

      Hooray! Thank you! :)

  • @i8thelastmoa360
    @i8thelastmoa360 5 years ago +5

    Your videos cover everything in my course and I wish I found you sooner! So much detail and clear explaining in such little time

  • @burrohq
    @burrohq 3 years ago +1

    You sir deserve a promotion 👏 thanks for this incredibly helpful video

    • @statquest
      @statquest 3 years ago

      Thank you! :)

  • @nl7247
    @nl7247 1 year ago +1

    Thanks for also showing how to wrangle data and explore missing data in a simple helpful way ❤

    • @statquest
      @statquest 1 year ago +1

      My pleasure 😊

    • @zhansayabauyrzhanova2492
      @zhansayabauyrzhanova2492 3 months ago

      I don't understand why you used both categorical for logistic regression??
      7:00

    • @nl7247
      @nl7247 3 months ago

      The outcome is dichotomous.

  • @maheshkumar-vv5fp
    @maheshkumar-vv5fp 4 years ago +2

    good looking white background...
    graphs are beautiful...
    whatever you say, you write it on screen....
    your sound and sound system, very good..
    the way you explain things, CLEARLY EXPLAINS everything..
    and loved that music part and BAM!!!
    and here, i have something to say about your work..
    and that is VERY BIG BAM !!!... good luck.. keep growing..

    • @statquest
      @statquest 4 years ago

      Thank you very much! :)

  • @chrischukwu2956
    @chrischukwu2956 4 years ago +3

    You are an amazing teacher. God bless you!

    • @statquest
      @statquest 4 years ago

      Thank you! 😃

  • @Eldad_2.0
    @Eldad_2.0 4 years ago +4

    Great job bro.
    Gratitude for your help. You also have a place to stay if you come to Uganda (Africa).

    • @statquest
      @statquest 4 years ago

      Thank you very much!!! :)

  • @farhadwaseel9981
    @farhadwaseel9981 5 years ago +6

    I recommend all the videos by StatQuest with Josh Starmer. Thank you for your good explanations.

    • @statquest
      @statquest 5 years ago

      Thank you very much! :)

  • @yashilagovender5134
    @yashilagovender5134 3 years ago +1

    Thank you so much for this video! I've been suffering with the coding for my project but this really helped. You're a star!

  • @goodsuggestionbutno6783
    @goodsuggestionbutno6783 3 years ago

    Hoooray! We made it to the end of an exciting journey through logistic regression! Hope you have a nice day, and thank you for explaining the output for logistic regression in R, which really can't be understood thoroughly without watching all the logistic + odds videos!

    • @statquest
      @statquest 3 years ago

      Yep, that is correct. That's why I made all those other videos first - the output is jam packed with stuff.

  • @yutassmilehealsme6572
    @yutassmilehealsme6572 4 years ago +2

    THANK YOU! somehow I couldn't find any websites explaining this

    • @statquest
      @statquest 4 years ago

      Glad you found it.

  • @mariyapak428
    @mariyapak428 3 years ago +1

    Josh, joining all the folks here in thanking you! I have a question: around minute 9:05 you talk about the odds of being unhealthy for a female. How do we know that these are the odds of being unhealthy vs being healthy? I feel I am floating when it comes to the intercept, reference categories, and baseline categories. Thanks a lot!

    • @statquest
      @statquest 3 years ago +1

      R orders factors ("healthy" vs "unhealthy") in alphabetical order. So that means "healthy" is first, and the default, and "unhealthy" is the difference from that. Likewise, "sexF" and "sexM" are ordered alphabetically, so "sexF" is the default value and "sexM" is the difference from that.
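
A minimal sketch of the ordering described in the reply above. The vector is a toy example, not the video's dataset, and `relevel()` is shown only as one base-R way to pick a different reference level; nothing here says the video's script does that.

```r
# Toy outcome; the values are made up for illustration.
hd <- factor(c("unhealthy", "healthy", "unhealthy", "healthy"))

# Levels are sorted alphabetically by default, so "healthy" is the reference.
levels(hd)

# relevel() chooses a different reference level without changing the data.
hd_flipped <- relevel(hd, ref = "unhealthy")
levels(hd_flipped)
```

With a two-level factor as the response, `glm(..., family = "binomial")` treats the first level as the "failure" and models the odds of the second, which is why the reference level matters when reading the coefficients.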

  • @temjim
    @temjim 4 years ago +5

    Hi, Josh. I cannot thank you enough for these videos... Would also be good to have a similar video in Python..

    • @statquest
      @statquest 4 years ago +1

      Great suggestion!

    • @aishwaryadas3681
      @aishwaryadas3681 2 years ago

      @statquest where's the Python video, sir?

  • @jives.
    @jives. 3 years ago +1

    lets goooo StatQuest

  • @Mel22Brasil
    @Mel22Brasil 4 years ago +2

    It must be so much fun working with you! Thank you for this tutorial. =)

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @ricardot4722
    @ricardot4722 5 years ago +2

    I am impressed, you are talented, thanks for your sharing your knowledge.

    • @statquest
      @statquest 5 years ago

      Thank you! :)

  • @LoizidesGeorge
    @LoizidesGeorge 5 years ago +22

    So helpful, thanks!
    Whenever you come to Cyprus let me know for a few free accommodations in our mountainous region, Marathasa!
    Thx again!
    Γ

    • @statquest
      @statquest 5 years ago +7

      Wow! That sounds awesome!!!

    • @LoizidesGeorge
      @LoizidesGeorge 5 years ago +3

      @statquest
      oh yes!
      I owe you a lot - you saved me so many hours!
      Γ

  • @ThinkwithLex
    @ThinkwithLex 1 year ago +2

    A small request, you have done a lot already, a big thank you for that. Is it possible to make a video on Logistic regression in Python ?

    • @statquest
      @statquest 1 year ago +1

      I'll keep that in mind.

    • @ThinkwithLex
      @ThinkwithLex 1 year ago +1

      @statquest thank you so much

  • @critiquessanscomplaisance8353
    @critiquessanscomplaisance8353 5 years ago +3

    I won't forget you in the acknowledgments sir haha!!! Great job!

    • @statquest
      @statquest 5 years ago

      Thank you very much! :)

  • @raghavendral882
    @raghavendral882 5 years ago +2

    BAM, spot on, thanks for this video.. my journey with logistic regression and R has started.

  • @hajer3335
    @hajer3335 6 years ago +2

    Thank you so much for this effort, I really appreciate it.
    We need a StatQuest on three topics:
    1- Chi-square test,
    2- The Hosmer-Lemeshow goodness of fit test for logistic regression.
    And 3- Iteratively reweighted least squares (IRLS) by using Newton's method.
    If you don't mind :) of course.
    Can you tell us the title of the next video?!

    • @statquest
      @statquest 6 years ago +1

      The Chi-Square test is on the list. I've looked into the Hosmer-Lemeshow fit... Can you tell me what you think about the limitations? Specifically those mentioned in the Wikipedia article about it? en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test#Limitations_and_alternatives
      And iteratively reweighted least squares is also on the list. However, up next are some basic statistics videos and then videos on lasso, ridge, and elastic-net regression.

    • @hajer3335
      @hajer3335 6 years ago +1

      The Hosmer-Lemeshow statistic was introduced to avoid a problem with the Pearson chi-squared statistic: when observations are grouped by the values of the x variables, the Pearson chi-squared goodness of fit test cannot be readily applied if there are only one or a few observations for each possible value of an x variable, or for each possible combination of values of x variables.
      (A sufficiently large sample is assumed. If a chi-squared test is conducted on a smaller sample, it will yield an inaccurate inference.)
      So in the Hosmer-Lemeshow statistic, the observations are grouped by expected probability. But there is very little guidance on selecting the number of subgroups. The number of subgroups, g, is usually calculated using the formula g > P + 1. For example, if you had 12 covariates in your model, then g > 13. How much bigger than 13 g should be is essentially left up to you. Small values for g give the test less opportunity to find mis-specifications. Larger values mean that the number of items in each subgroup may be too small to find differences between observed and expected values. Sometimes changing g by very small amounts (e.g. by 1 or 2) can result in wild changes in p-values. As such, the selection for g is often confusing and arbitrary. Also, it doesn’t take overfitting into account and tends to have low power. For these reasons, the Hosmer-Lemeshow test is no longer recommended.
      Am I right? Is that enough reason to stop using the HL test?
      I have another question. (Overfitting happens when your sample size is too small. If you put enough predictor variables in your regression model, you will nearly always get a model that looks significant.
      While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test samples or the overall population. The model’s p-values, R-Squared and regression coefficients can all be misleading. Basically, you’re asking too much from a small set of data.)
      If I have a small sample, is there any problem with using maximum likelihood to fit the model and McFadden's pseudo-R squared? Is there any rule for choosing the sample size for a regression?
      Sorry for so many questions, it is my first year in biostatistics. :)

    • @statquest
      @statquest 6 years ago +1

      These are all great questions. You are correct about the HL test and you are correct about overfitting. There are, however, lots of tricks you can use to compensate for overfitting (lasso regression, ridge regression, elastic net regression etc.)
      One way to test to see if you have a model that is "overfit" is to use cross validation.
      As for a minimum number of samples for logistic regression - people often say "10 samples per level of each discrete variable". It's a general rule of thumb and it doesn't always apply. However, again you can use cross validation to verify if you have enough samples or not. Cross validation is a very practical tool!
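
For readers who want to try the cross validation mentioned above, here is a rough base-R sketch of k-fold cross validation for a logistic regression. The simulated data, the choice of 5 folds, the 0.5 cutoff, and all variable names are assumptions made for illustration; none of them come from the video.

```r
# Simulated data, for illustration only.
set.seed(42)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))
dat <- data.frame(x = x, y = y)

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label for each row

acc <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]                # fit on k - 1 folds
  test  <- dat[fold == i, ]                # evaluate on the held-out fold
  fit <- glm(y ~ x, data = train, family = "binomial")
  p <- predict(fit, newdata = test, type = "response")
  acc[i] <- mean((p > 0.5) == (test$y == 1))
}

mean(acc)   # average held-out accuracy across the folds
```

If the held-out accuracy is much worse than the accuracy on the training rows, that gap is one sign the model may be overfit.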

    • @hajer3335
      @hajer3335 6 years ago +1

      Thank you, Mr Josh, for answering me, I need to study more about Cross-validation.

    • @hajer3335
      @hajer3335 6 years ago +1

      Sorry l have more than one account 🙈🙊

  • @vidyaammu1687
    @vidyaammu1687 3 years ago

    Thanks for the video. Your video made it look so simple. I request you to upload a video on how to get risk ratios in a multiple logistic regression model.

    • @statquest
      @statquest 3 years ago

      I'll keep that in mind.

  • @AOLFlyersNewsletters
    @AOLFlyersNewsletters 4 years ago +1

    Thanks Josh - you are our saviour!

  • @tansutazegul8297
    @tansutazegul8297 2 years ago +1

    incredibly brilliant tutorial!

  • @riteshpatel1984
    @riteshpatel1984 5 years ago +3

    Hi Josh, thanks for your videos, they are very easy to understand. I really appreciate your efforts. I believe I speak for many:
    because of you, many people are able to understand with the utmost clarity, and you cover all the small details with super ease. Keep up the noble work. Cheers 👍
    Would it be possible for you to put up a video on model evaluation, i.e. determining the cutoff and model performance?
    Thanks

    • @statquest
      @statquest 5 years ago +1

      Thank you! :)

  • @kingfisher65
    @kingfisher65 1 year ago +2

    amazing. thank you man!

    • @statquest
      @statquest 1 year ago

      Thanks!

    • @familians
      @familians 1 year ago

      You may like this video too:
      Another great video about logistic regression in JMP
      th-cam.com/video/9yN_yjGAJZE/w-d-xo.htmlsi=jUwEZUDobBudE8AE

  • @internalmedicine9982
    @internalmedicine9982 4 months ago +1

    Thanks for an excellent video. As usual.

    • @statquest
      @statquest 4 months ago

      Thanks again!

  • @RajeshSahu-ey8kw
    @RajeshSahu-ey8kw 4 years ago +1

    You are a genius... and your teaching style too... hurray!!!! and Bamm!!!!

    • @statquest
      @statquest 4 years ago +1

      Wow, thank you!

  • @bellahuang8522
    @bellahuang8522 3 years ago +1

    me binge watching Josh's videos before midterm... anyone else? lmao

    • @statquest
      @statquest 3 years ago

      Good luck! :)

  • @kevinanderson170
    @kevinanderson170 3 years ago +1

    This is great stuff as I am just learning R; so pardon a very basic question: Why does "sex" need to be a factor vs number here?

    • @statquest
      @statquest 3 years ago +1

      Since the values are 0 and 1, it probably doesn't matter. However, to be safe, it's probably a good idea to make all categorical variables, regardless of their values, factors.
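
A small sketch of why a 0/1 column gives the same fit either way. The data are simulated and the names `sex_num`, `fit_num`, and `fit_fac` are hypothetical; the point is only that a two-valued numeric column and the matching two-level factor produce the same design matrix.

```r
# Simulated data; names and values are illustrative only.
set.seed(1)
sex_num <- rbinom(200, 1, 0.5)                      # 0/1 stored as numbers
hd <- rbinom(200, 1, plogis(-1 + 1.2 * sex_num))

fit_num <- glm(hd ~ sex_num, family = "binomial")          # numeric predictor
fit_fac <- glm(hd ~ factor(sex_num), family = "binomial")  # factor predictor

# Same estimates; only the coefficient name differs.
unname(coef(fit_num))
unname(coef(fit_fac))
```

With three or more codes the two encodings stop being equivalent: a numeric column gets a single slope, while a factor gets one coefficient per non-reference level. That is a good reason to convert categorical columns to factors as a habit.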

  • @woopwoopsoupsoup678
    @woopwoopsoupsoup678 2 years ago +1

    This man is a legend

  • @Actanonverba01
    @Actanonverba01 5 years ago +1

    At 11:30, the video states, "Since we are not estimating the variance from the data (and instead deriving it from the mean) it is possible that the variance is UNDERESTIMATED." Q. How can we say that we are UNDER-estimating the value of the variance? BTW, awesome vids, music man! ;)

    • @statquest
      @statquest 5 years ago +1

      That's a great question. Here's a (hopefully) useful discussion on the topic: newonlinecourses.science.psu.edu/stat504/node/162/

    • @Actanonverba01
      @Actanonverba01 5 years ago

      @statquest
      It's 1am but let me see if I got this straight...
      Due to the nature of discrete functions (like logistic functions) they do not always vary smoothly. With discrete functions it is possible to see variances (and their corresponding probabilities) differ from (in our case) the proposed logistic model. In other words, it is conceivable to have a leptokurtic or platykurtic distribution.
      It is possible to see probabilities which differ from the expected probabilities due to the fact that the "real" model may be different and/or the samples may not be i.i.d. As it happens, the Bernoulli distribution tends toward the platykurtic.
      ...It's just the wrong dang model sometimes...

    • @statquest
      @statquest 5 years ago +1

      It's actually a little simpler than that. With binomial data (like logistic regression) we estimate the mean value = number of positive responses / total number of responses. Once we have the mean value estimated, we use that, and that alone, to calculate the variance. In other words, once we have calculated the mean, we do not need the data anymore to calculate the variance. This is in contrast to linear regression (or a lot of other things) where we estimate the mean with the data and then use the data again to calculate how it varies around the estimated mean. Thus, there is a possibility that with Logistic Regression (and other "generalized linear models") we did not correctly estimate the variance since the data were not involved in that calculation. If we over estimate the variance, that just makes the calculations more conservative and, generally speaking, that's not a problem. However, if we underestimate the variance, then that means we're more likely to say things are significantly different even if they are not, and that's no good. So the dispersion parameter takes care of that.
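
To make the "variance derived from the mean" idea concrete, here is a sketch under assumptions. It uses made-up grouped counts (successes out of 20 trials per group) rather than the video's 0/1 data, because with grouped counts the gap between the implied variance and the observed variance is easy to see. The `quasibinomial` family is one base-R way to estimate a dispersion parameter.

```r
# Made-up grouped data: successes out of 20 trials in each of 8 groups.
successes <- c(2, 18, 5, 16, 3, 17, 9, 12)
trials <- rep(20, 8)

p_hat <- sum(successes) / sum(trials)              # estimated mean proportion
var_from_mean <- trials[1] * p_hat * (1 - p_hat)   # count variance implied by the mean alone
var_observed <- var(successes)                     # variance actually seen in the counts

# quasibinomial keeps the same mean model but estimates a dispersion
# parameter from the data instead of fixing it at 1.
fit <- glm(cbind(successes, trials - successes) ~ 1, family = quasibinomial)
summary(fit)$dispersion
```

Here the observed variance is far larger than the one implied by the mean, so the estimated dispersion comes out well above 1, which is the underestimation case the reply warns about.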

    • @Actanonverba01
      @Actanonverba01 5 years ago +1

      @statquest Cheers,

  • @at4652
    @at4652 6 years ago +5

    Great tutorials. I started with your PCA video and have since been hooked on the other videos. Could I request a video on the various types of probability distributions and when to use them?

    • @statquest
      @statquest 6 years ago +2

      Those are all in the works. I wish I could work 2 or 4 times faster than I can. I've wanted to cover the major probability distributions for over a year, but got sucked down a machine learning path and now feel spread pretty thin. However, these will happen eventually! :)

    • @TimothyChenAllen
      @TimothyChenAllen 6 years ago +1

      StatQuest with Josh Starmer could you make a video on how to work 2 to 4 times faster? :-)

    • @statquest
      @statquest 6 years ago +1

      As soon as I figure that out, I'll make a video on it! ;)

    • @weilianglim1764
      @weilianglim1764 6 years ago

      BAM!!!

  • @williamstan1780
    @williamstan1780 2 years ago

    I like the way you presented the information in a manner that is easily understood.
    I have 2 questions:
    1. While doing the xtabs, what do we need to do if we find that the count for healthy, unhealthy, or both under cp3 is 0 or very small (video clip at 6:04)?
    2. At 15:55 of the video clip, you mentioned using cross validation to get a better idea of how well it might perform with new data. Do you have a separate video specifically on that topic?
    Many thanks
    Williams

    • @statquest
      @statquest 2 years ago +1

      1) Unfortunately I don't understand what you're asking in this question. However, I think you are asking what do we do when one level from a categorical variable does not have strong preference for healthy or unhealthy or doesn't have much data to begin with. It really depends. You can just try it and see what happens, but you might also try removing the variable and see if that improves predictions.
      2) I have a video on cross validation here: th-cam.com/video/fSytzGwwBVw/w-d-xo.html

    • @williamstan1780
      @williamstan1780 2 years ago

      @statquest
      Thanks for your prompt reply.
      Let me clarify my first question. In the video clip at 6:24, you mention that only 4 patients represent level 1 of the restecg category.
      Why can only 4 cause a problem? Is it because it is too few compared with the others (level 0 and level 2)? How do I know exactly that it is causing a problem when I do the analysis? And if it does cause a problem, how do I go about fixing it? Can just removing level 1 solve the problem?
      Thanks for your help

    • @statquest
      @statquest 2 years ago

      @williamstan1780 When you don't have much data supporting a specific category, then chances are it will have a lot of variance - in other words, further samples may be very different from the ones in the original dataset. You can test this with cross validation (use some of the data to fit the model, use the rest to see how well it performs). If things are no good, you can remove the variable, or try to lump categories together.
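
One way the "lump categories together" suggestion could look in base R. The counts are invented and the merged level name `"1or2"` is a hypothetical label; which levels make sense to merge is a subject-matter decision this sketch does not settle.

```r
# Made-up factor with a sparse level "1".
restecg <- factor(c(rep("0", 10), rep("1", 2), rep("2", 9)))
table(restecg)

# Assigning a named list to levels() merges old levels under new names.
lumped <- restecg
levels(lumped) <- list("0" = "0", "1or2" = c("1", "2"))
table(lumped)
```

After lumping, the model can be refit with the merged factor and compared against the original using cross validation.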

    • @williamstan1780
      @williamstan1780 2 years ago +1

      @statquest thanks Josh ..: appreciated

  • @sofiaalfonso9883
    @sofiaalfonso9883 3 years ago +1

    Sir, you are a savior

  • @KayYesYouTuber
    @KayYesYouTuber 4 years ago +1

    Your videos are awesome. Thank you very much.

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @ca177
    @ca177 4 years ago +2

    YOU RAWK !! Awesome explains on ML concepts..

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @geetikapanda7152
    @geetikapanda7152 4 years ago +1

    The more I watch your videos, the more I wish I had a teacher like you in my school days..
    Do we have a video on the chi-square test?

    • @statquest
      @statquest 4 years ago

      Not yet. :( But one day we will.

  • @mueezwaq
    @mueezwaq 1 year ago +1

    Hi Josh
    Firstly many thanks for your videos on this topic. I have noticed very odd and conflicting results between R and SPSS with regards to entering factors (with more than 1 level) into a logistic regression model. SPSS produces a simplified output containing an odds ratio with 95% CI and p-value, for each individual variable entered into a logistic regression model (rather than the factor levels, as displayed in R).
    In R - I have not found a good way to do this. I have used the logistic.display command as well as exp() to get odds ratios, but they do not provide an overall value like in SPSS (instead, listing these for each individual level within the factor).
    Do you have any idea why SPSS and R handle logistic regression differently like this? All I would like is a similar output to SPSS - where I get a single odds ratio, 95% CI and p-value for each individual factor variable entered.

    • @statquest
      @statquest 1 year ago +1

      Unfortunately I've never used SPSS so I'm not really familiar with the problem you are having. That said, perhaps this will help: stats.stackexchange.com/questions/543540/different-output-for-logistic-regression-between-r-and-spss-how-to-get-correct
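
A hedged sketch of two base-R pieces that may get close to what the question asks for. A factor with more than two levels has one odds ratio per non-reference level, so this sketch assumes the "single value per variable" wanted is an overall p-value for the factor. Whether that matches the SPSS output is an assumption, not something confirmed in this thread. The data and the names `cp`, `or_table`, and `overall` are made up.

```r
# Simulated data with a four-level factor predictor.
set.seed(3)
n <- 400
cp <- factor(sample(1:4, n, replace = TRUE))
eta <- c(-1, -0.5, 0, 1)[as.integer(cp)]
hd <- rbinom(n, 1, plogis(eta))
fit <- glm(hd ~ cp, family = "binomial")

# Odds ratios with Wald 95% intervals, one row per coefficient.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
or_table

# One likelihood-ratio test for the factor as a whole.
overall <- drop1(fit, test = "Chisq")
overall
```

`drop1()` compares the full model with the model that omits `cp` entirely, so its p-value describes the whole variable rather than any single level.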

  • @guhanathanprathish9704
    @guhanathanprathish9704 4 years ago

    Kindly do logistic regression in Python from scratch.. Your way of teaching and explanation is amazing.. keep rockzz❤️

    • @statquest
      @statquest 4 years ago

      I'll keep that in mind.

  • @paulshannon9708
    @paulshannon9708 5 years ago

    You really are wonderful for explaining this in a way morons like me can understand, this is so incredibly helpful. Thank you so much!

  • @sitendurocks
    @sitendurocks 4 years ago

    At the end, where you make the graph, you could have used the augment function from the broom package to create the data frame of fitted and actual values.

  • @christelleleitzingerphd7491
    @christelleleitzingerphd7491 4 years ago +1

    Awesome! Thank you so much! Please could you do a video about conditional logistic regression like clogit in R with result interpretation and how it works when using adjusted parameters.

    • @statquest
      @statquest 4 years ago +1

      I'll keep that in mind.

  • @YuTubering
    @YuTubering 4 years ago +1

    At 3:04, why do you have to convert the column of Strings to an Integer before a factor? Why not just convert it straight to a factor?

    • @statquest
      @statquest 4 years ago +1

      I should have been more clear. Unless you tell R not to treat strings like factors, it does that by default. So, at that point "ca" is already a factor. The problem is that "?" is considered one of the levels of that factor. We can drop that level, or we can convert it to integers and then back to a factor. For some reason I did it the latter way. It would have been better to just drop the "?" level.
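
A sketch of the "just drop the `?` level" route mentioned above, on a made-up column. One thing that has changed since that reply: from R 4.0.0 onward, `read.csv()` no longer converts strings to factors by default, so in a current R session the column would arrive as character unless `stringsAsFactors = TRUE` is set.

```r
# Made-up column where "?" marks a missing value.
ca <- factor(c("0.0", "1.0", "?", "3.0", "0.0"))
levels(ca)              # "?" shows up as one of the levels

ca[ca == "?"] <- NA     # mark the placeholder as missing
ca <- droplevels(ca)    # discard the now-unused "?" level
levels(ca)
```

This keeps the original level labels intact, whereas a round trip through integers replaces them with level codes.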

    • @YuTubering
      @YuTubering 4 years ago +1

      @statquest thanks!

  • @mutuamutunga
    @mutuamutunga 4 years ago +2

    This has been extremely helpful. Thank you!

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @fahmiidris4499
    @fahmiidris4499 4 years ago +2

    super dangg! Good explanation, bro!

    • @statquest
      @statquest 4 years ago

      Thank you! :)

  • @sean893
    @sean893 4 years ago

    @StatQuest with Josh Starmer, after plotting the logistic regression graph at 16:47, how do I make a confusion matrix?
    I searched Stack Overflow and tried various confusion matrix code, but no luck: I get errors and couldn't make the matrix.

    • @statquest
      @statquest 4 years ago +1

      You have to first define a threshold for classification. The default is p > 0.5 gives you one classification and p
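
The reply above is cut off in this copy, so what follows is only a sketch of the general idea it starts to describe: pick a threshold, classify, then cross-tabulate predictions against the truth. The 0.5 cutoff, the simulated data, and every variable name are assumptions.

```r
# Simulated data, for illustration only.
set.seed(7)
x <- rnorm(200)
hd <- factor(ifelse(rbinom(200, 1, plogis(x)) == 1, "Unhealthy", "Healthy"))
fit <- glm(hd ~ x, family = "binomial")

# Predicted probability of the second factor level ("Unhealthy").
p <- predict(fit, type = "response")

# Classify with a 0.5 threshold, keeping both levels even if one is unused.
pred <- factor(ifelse(p > 0.5, "Unhealthy", "Healthy"), levels = levels(hd))

# Rows are predictions, columns are the actual labels.
conf_mat <- table(predicted = pred, actual = hd)
conf_mat
```

Fixing the levels of `pred` to match `hd` avoids a common source of errors: a table that is not 2 x 2 because the model never predicted one of the classes.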

  • @Gypsy_Danger_TMC
    @Gypsy_Danger_TMC 2 years ago

    OK.. I haven't even watched the video yet but it looks like exactly what I need

    • @statquest
      @statquest 2 years ago +1

      I hope so! :)

    • @Gypsy_Danger_TMC
      @Gypsy_Danger_TMC 2 years ago

      @statquest I'm trying to use a logistic regression model on a set of binary events, each with a different probability of happening... and I have no idea what I'm doing haha... so I'm loading up on coffee and I'm going to start your videos soon

    • @statquest
      @statquest 2 years ago

      @Gypsy_Danger_TMC Good luck! :)

  • @yutassmilehealsme6572
    @yutassmilehealsme6572 4 years ago

    Hi, can we use a chi squared test first on (in this example), heart disease and sex then use a glm model on those 2 variables? I tried it with my data and the p values were different although both were significant. Afterwards I modelled all the other independent variables to check for confounding. I found some of them were significant along with the same variable in the chi test I did.
    In 13:47, since the output only says sexM, does that mean the P value 0.002503 only accounts for males?

    • @statquest
      @statquest 4 years ago

      The chi-square test and the Wald test (used for logistic regression) are related but fundamentally different tests, so it doesn't surprise me that you got different p-values. But because both tests are intended to work with the same data, it doesn't surprise me that they were both significant.
      As for "sexM" at 13:47, that means that both labels, "male" or "female", are useful for predicting heart disease. The reason it is called "sexM" is that "males" get a '1' for that variable, meaning that it is true that they are males, and females get '0', meaning that it is not true that they are males. For more details on how design matrices work, see: th-cam.com/video/CqLGvwi-5Pc/w-d-xo.html
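
The 0/1 coding described above can be inspected directly with `model.matrix()`. The four-element vector here is a toy example, not the video's data.

```r
# Toy factor; "F" sorts first, so it becomes the reference level.
sex <- factor(c("F", "M", "M", "F"))

# The design matrix has an intercept column and a 0/1 "sexM" indicator.
mm <- model.matrix(~ sex)
mm
```

Females get 0 in the `sexM` column and males get 1, so the single `sexM` coefficient measures the difference between the two groups rather than describing males alone.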

    • @yutassmilehealsme6572
      @yutassmilehealsme6572 4 ปีที่แล้ว +1

      @@statquest Thanks for the prompt reply! This vid saved my grade for my stats unit.

    • @statquest
      @statquest  4 ปีที่แล้ว

      @@yutassmilehealsme6572 BAM!

  • @danielromero-alvarez5392
    @danielromero-alvarez5392 4 ปีที่แล้ว +1

    you are just the best! Thanks for doing this!

    • @statquest
      @statquest  4 ปีที่แล้ว +1

      Thank you! :)

  • @N0o0x0e0r
    @N0o0x0e0r 6 ปีที่แล้ว +1

    This channel has helped me a lot understanding statistics! Could you please make a video explaining the linear mixed model too?

    • @statquest
      @statquest  6 ปีที่แล้ว

      Yes! However, it might be a while before I get to it.

  • @timding6241
    @timding6241 2 ปีที่แล้ว

    Hi Josh, thank you very much for your teaching, I have really learned a lot over the past few days.
    May I ask a (stupid) question... At 8:45, you mention that you predict heart disease for a female patient. However, how can I tell whether that means healthy or unhealthy? I remember you set the hd variable to have 2 levels: healthy and unhealthy. How do I know which level we are predicting?
    In the github page, you tell us that: "## The intercept is the log(odds) a female will be unhealthy. This is because female is the first factor in "sex" (the factors are ordered,
    ## alphabetically by default,"female", "male")".... Does this alphabetical order also apply to "healthy" and "unhealthy"? I am not sure why the intercept is not predicting "healthy" female patient because "H" (i.e., healthy) comes before "U" (i.e., unhealthy). This part really confuses me. Thank you!

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      At 3:17 we code "healthy" as 0 and "unhealthy" as 1. Thus, the "base" or default is "Healthy" (since 0 comes before 1) and we are predicting the log(odds) that someone is different from the base.
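A small sketch of how that coding plays out in R, assuming a made-up 0/1 vector called `hd_raw` (hypothetical, only for illustration). With a factor response, `glm()` with a binomial family treats the first level as the baseline and models the probability of the other level.

```r
# Hypothetical 0/1 outcome, recoded to labels as described in the reply above
hd_raw <- c(0, 1, 1, 0, 1, 0)
hd <- factor(ifelse(hd_raw == 0, "Healthy", "Unhealthy"))

# "Healthy" sorts first, so it is the baseline level
levels(hd)

# The integer codes behind the factor recover the original 0/1 coding,
# which is why the model describes the log(odds) of being "Unhealthy"
as.integer(hd) - 1

# To model the other outcome instead, change the reference level explicitly
hd_flipped <- relevel(hd, ref = "Unhealthy")
levels(hd_flipped)
```

So the alphabetical order does apply to "Healthy" and "Unhealthy": it makes "Healthy" the baseline, not the outcome being predicted.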

    • @timding6241
      @timding6241 2 ปีที่แล้ว +1

      @@statquest Thank you so much for your help... It is very clear!!!

  • @afiapriscilla8276
    @afiapriscilla8276 ปีที่แล้ว +1

    Is it needed to turn all the variables into a factor before the regression analysis?

    • @statquest
      @statquest  ปีที่แล้ว +1

      All of the categorical variables need to be converted to factors.

    • @afiapriscilla8276
      @afiapriscilla8276 ปีที่แล้ว

      @@statquest Thank you very much. What do you classify as categorical?

    • @statquest
      @statquest  ปีที่แล้ว

      @@afiapriscilla8276 Variables that represent discrete categories. Like "favorite color=Blue" or "Red"
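As a rough illustration of the advice in this thread, here is a sketch that converts only the categorical columns to factors. The `patients` data frame and its column names are hypothetical.

```r
# Hypothetical data frame with mixed column types
patients <- data.frame(
  age   = c(63, 41, 56),             # a true number: leave it numeric
  sex   = c(1, 0, 1),                # numeric codes that stand for categories
  color = c("Blue", "Red", "Blue"),  # text labels that stand for categories
  stringsAsFactors = FALSE
)

# Convert the coded column, giving the codes readable labels
patients$sex <- factor(patients$sex, levels = c(0, 1), labels = c("F", "M"))

# Convert the text column
patients$color <- as.factor(patients$color)

# Check the result: age stays numeric, sex and color are now factors
str(patients)
```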

  • @dchristiadi85
    @dchristiadi85 5 ปีที่แล้ว

    Hi Josh,
    Firstly, forgive my ignorance. Can I refer to the null and proposed LL results at 14:12? You mentioned that to pull out the log-likelihoods, we need to divide the scores by -2. Can you please elaborate on where the -2 comes from?
    Additionally, at 14:46, do you use 1-pchisq to get the upper tail? If my assumption is wrong, can you please explain the 1-pchisq part?
    Thank you heaps

    • @statquest
      @statquest  5 ปีที่แล้ว

      Your first question is answered in my video on Saturated Models and Deviance: th-cam.com/video/9T0wlKdew6I/w-d-xo.html
      For your second question, the answer is "you are correct!". :)
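A sketch of both steps on a stand-in model. It uses R's built-in `mtcars` data (an assumption made here so the example runs on its own; it is not the data from the video). For a 0/1 response the deviance reported by `glm()` equals -2 times the log-likelihood, which is why dividing by -2 gets the log-likelihood back.

```r
# Stand-in model on built-in data (not the heart disease data)
model <- glm(am ~ wt, data = mtcars, family = "binomial")

# Deviance = -2 * log-likelihood for a 0/1 response, so divide by -2
ll_null     <- model$null.deviance / -2
ll_proposed <- model$deviance / -2

# McFadden's pseudo R-squared
pseudo_r2 <- (ll_null - ll_proposed) / ll_null

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- 2 * (ll_proposed - ll_null)
df <- length(model$coefficients) - 1

# 1 - pchisq() gives the upper tail of the chi-square distribution
p_value <- 1 - pchisq(lr_stat, df = df)

# The same upper tail, requested directly
p_value_alt <- pchisq(lr_stat, df = df, lower.tail = FALSE)
```

`pchisq()` returns the lower tail by default, so subtracting it from 1 (or passing `lower.tail = FALSE`) gives the probability of a statistic at least this large.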

  • @Han-ve8uh
    @Han-ve8uh 4 ปีที่แล้ว

    Is there a video explaining the 3 points on right of slide at 11:30? I watched all of the linear/logistic videos and my understanding is these regressions fit a line to predict data given new x. I don't understand what has this got to do with estimating mean/var. Is estimating mean/var part of the fitting process and important to creating the model?

    • @statquest
      @statquest  4 ปีที่แล้ว

      Unfortunately this is not something that I talk about in other videos because it is rarely adjusted for logistic regression.

  • @tvvt005
    @tvvt005 6 วันที่ผ่านมา

    14:37 Is this the effect of all those variables on the response, or of just the significant ones? When building a linear equation out of this, we only need to include the variables with low p-values on the RHS and can omit the others, since they do not have as much impact, right?
    16:36 Is it possible to avoid getting a different number of rows for the fitted values vs. the response? It keeps appearing like that for me.

    • @statquest
      @statquest  5 วันที่ผ่านมา

      1) It's the effect of all of the variables.
      2) I'm not sure I understand your second question. Are you asking about the call to ggplot()?
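One possible cause of a row-count mismatch, offered only as a guess since the question does not say what the data look like: rows with missing values are dropped when the model is fit, so there are fewer fitted values than rows. The data frame `d` below is made up to show the effect.

```r
# Hypothetical data with one missing predictor value (row 3)
d <- data.frame(
  y = c(0, 0, 1, 1, 0, 1, 0, 1),
  x = c(1.2, 2.3, NA, 3.8, 2.9, 1.9, 3.1, 4.0)
)

# By default, rows with NA are silently omitted before fitting
fit_omit <- glm(y ~ x, data = d, family = "binomial")
nrow(d)                   # number of rows in the data
length(fitted(fit_omit))  # one fewer, because row 3 was dropped

# na.exclude also drops the row for fitting, but fitted() pads the result
# with NA so that it lines up with the original rows
fit_excl <- glm(y ~ x, data = d, family = "binomial", na.action = na.exclude)
length(fitted(fit_excl))
```

Removing the rows with missing values before fitting, as the video does, avoids the mismatch as well.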

    • @tvvt005
      @tvvt005 5 วันที่ผ่านมา

      @statquest Thank you so much. It's clear now. Is it possible for a model to have a low pseudo R² but high accuracy?

    • @statquest
      @statquest  5 วันที่ผ่านมา

      @@tvvt005 I would find that odd.

    • @tvvt005
      @tvvt005 5 วันที่ผ่านมา

      @@statquest I see… thank you.., I’ll go try to find where I went wrong… 😓

  • @marcelomurilloquesada8400
    @marcelomurilloquesada8400 5 ปีที่แล้ว +1

    Hi, I really like your videos, every topic is as clear as water after watching it. I've watched this one and also the three videos about logistic regression's details. If you want to go further in this topic, you could do a video explaining emmeans package for R. Many people, including me, would understand post hoc tests for glm using emmeans, if someone like you explained it. Thank you!

  • @davila1906
    @davila1906 4 ปีที่แล้ว +1

    so incredibly helpful and well done. Thank you so much!!

    • @statquest
      @statquest  4 ปีที่แล้ว

      Thank you! :)

  • @skandagurunathanr4795
    @skandagurunathanr4795 5 ปีที่แล้ว

    Great salute! If you can, please post a video on all machine learning models with a large dataset example implementation in r with clear intuition and mathematics statistics behind it. Thanks.

  • @andreatulli356
    @andreatulli356 4 ปีที่แล้ว +1

    Great video!!! Thank you so much!

  • @thomasdrissi
    @thomasdrissi 2 ปีที่แล้ว

    Hi Josh,
    Thanks for the really helpful video! Referring to the clip at 15:20. Whilst I know plotting a Predicted Y for a range of X values (say in a simple univariate logistic regression) we would expect to see that S shape. But for a multiple variable regression (as in yours) should the index of probabilities when ranked and plotted as you've done here always have to be in that S Logistic Shape?
    I am getting more of an exponential curve between 0 and 1, and can't tell if this means I have done something wrong/have something wrong with my model?

    • @statquest
      @statquest  2 ปีที่แล้ว

      Hmm...I'm not sure. Your graph should definitely taper off as the predicted probabilities get closer to 1, but how visible this tapering is might depend on how many data points you plot.
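A sketch of the ranked-probability plot on simulated data, in base R rather than ggplot2. Everything here (the seed, the predictors `x1` and `x2`, the coefficients) is an assumption made for illustration. The plot shows the sorted predicted probabilities against their rank, so its shape depends on how the predictions are spread between 0 and 1 and need not look like a textbook S.

```r
set.seed(42)

# Simulated data with two predictors (hypothetical)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1 - 1 * x2))

model <- glm(y ~ x1 + x2, family = "binomial")

# Pair each predicted probability with the observed outcome, then rank them
predicted <- data.frame(prob = fitted(model), y = y)
predicted <- predicted[order(predicted$prob), ]
predicted$rank <- seq_len(nrow(predicted))

# Predicted probability by rank, coloured by the observed outcome
plot(predicted$rank, predicted$prob,
     col = ifelse(predicted$y == 1, "turquoise", "salmon"),
     xlab = "Index", ylab = "Predicted probability")
```

If most predictions sit near 0, the curve can look closer to an exponential rise; that alone does not show the model is wrong.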

  • @MB-nc9rq
    @MB-nc9rq 3 ปีที่แล้ว +1

    Great video, thanks so much Josh! After the 4th minute you mention how to address the NA samples. Can you teach us the RANDOM FOREST method, if we don't want to get rid of our NA samples (e.g. in multivariate cases, where the rows include other useful info)? Thanks!

    • @statquest
      @statquest  3 ปีที่แล้ว

      I cover the random forest method in this video: th-cam.com/video/6EXPYzbfLCE/w-d-xo.html (the theory is here: th-cam.com/video/sQ870aTKqiM/w-d-xo.html )

  • @mohamedhijazi8460
    @mohamedhijazi8460 4 ปีที่แล้ว +2

    You're the man! thanks for everything!

    • @statquest
      @statquest  4 ปีที่แล้ว

      Thank you very much! :)

  • @iselacr5747
    @iselacr5747 3 ปีที่แล้ว

    Hi, I love the way you explain all these things! I have a couple of questions. I notice that it's necessary to establish a coding for the predictors. If these are dichotomous, for example, they are assigned 1 and 0 (male / female in the example), so:
    - How should we proceed with polytomous predictors?
    - What results of the model should be reported in a scientific article?
    Thank you in advance, and keep making great content!

    • @statquest
      @statquest  3 ปีที่แล้ว

      1) For all categorical data (with 2 or more classes), just make sure you are storing it in a factor.
      2) That depends on the journal. I would look at other articles in that journal to figure it out.
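For the first point, a brief sketch of what R does with a factor that has more than two classes. The `cp` values below are hypothetical chest pain labels, not the coding used in the video.

```r
# Hypothetical predictor with four classes
cp <- factor(c("typical", "atypical", "non-anginal",
               "typical", "asymptomatic", "atypical"))

nlevels(cp)

# A factor with k levels becomes k - 1 columns of 0/1 values;
# the first level (alphabetically) is the reference and gets no column
design <- model.matrix(~ cp)
colnames(design)
```

Each 0/1 column compares one class against the reference level, so no manual coding is needed once the variable is stored as a factor.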

  • @danieltrodler4340
    @danieltrodler4340 4 ปีที่แล้ว +1

    Great content and incredible value. Thank you so much

  • @namelessbecky
    @namelessbecky 4 หลายเดือนก่อน

    Can you explain to me why we convert the column into a factor at 2:39?

    • @statquest
      @statquest  4 หลายเดือนก่อน

      Because we're using that variable as a factor instead of a numeric value.

  • @mdhasibreza5161
    @mdhasibreza5161 3 ปีที่แล้ว

    All of your videos are great and fun to learn from! Could you please upload a tutorial on mediation analysis using STATA and R (using the mediation package)?

    • @statquest
      @statquest  3 ปีที่แล้ว

      I'll keep that in mind.

  • @da2015
    @da2015 5 ปีที่แล้ว +6

    These videos are so amazing!
    Do you have a suggestion for a book that explains Logistic Regression to newbies? The videos are super awesome, but extra references may help too. Hopefully you will write your own book soon!
    Thanks!

    • @shnibbydwhale
      @shnibbydwhale 4 ปีที่แล้ว +5

      I know this is probably 10 months too late, but the book “Introduction to Categorical Data Analysis” by Alan Agresti is a great book. Does a really good job explaining logistic regression and is pretty light on the math.

  • @joseluismanzanares3662
    @joseluismanzanares3662 5 ปีที่แล้ว

    Clear as water. Super BAM!!! Gracias por compartir

  • @danee593
    @danee593 5 ปีที่แล้ว +2

    Josh you are amazing, thank you!

  • @JinXing-j1l
    @JinXing-j1l 2 หลายเดือนก่อน +1

    The last graph deserves a quadruple BAM!!!!🤣🤣🤣🤣

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Yes!

  • @katere89
    @katere89 5 ปีที่แล้ว +6

    Hi Josh, thanks for this amazing tutorial. Would you be able to add something on interactions between predictors and on random effects? I am trying to run a mixed-model logistic regression with three-way interactions, but I'm not entirely sure how to deal with them. Thanks so much :)

  • @dvijeniya
    @dvijeniya 5 ปีที่แล้ว +1

    Thanks for the detailed and super easy explanation, Josh. I'd like to ask you, shouldn't we check the below items before regression?
    1. Would it be better to use the WOE or log(odds) of a variable rather than the raw variable (for example gender -> dummy)? If I'm not wrong, using a dummy variable in the model is not a good choice;
    2. Correlation between variables;
    3. Factor transformation ;
    4. PCA analysis.
    And, as a result, to calculate the probability from the log(odds), should we use the sigmoid function? I mean, to transform the log(odds) into a probability.
    Thanks in advance!

    • @statquest
      @statquest  5 ปีที่แล้ว +1

      If you want your model to be interpretable, in that you can look at the parameter values and make sense of them, then removing correlated variables is a good idea, and factor transformation and PCA can help with that. On the other hand, if you want to use your model to make the best predictions, then correlated variables are fine and can improve predictions.
      If you are interested in the details behind Logistic Regression, then check out these other StatQuests:
      General Overview: th-cam.com/video/yIYKR4sgzI8/w-d-xo.html
      Interpreting Coefficients: th-cam.com/video/vN5cNN2-HWE/w-d-xo.html
      Fitting the Model to Data with Maximum Likelihood: th-cam.com/video/BfKanl1aSG0/w-d-xo.html
      Calculating R-squared and its p-value: th-cam.com/video/xxFYro8QuXA/w-d-xo.html
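On the last part of the question above, a short sketch of converting log(odds) to probabilities with the sigmoid (logistic) function. The `log_odds` values are arbitrary examples.

```r
# Arbitrary example log(odds) values
log_odds <- c(-2, 0, 1.5)

# The sigmoid function written out by hand
prob_manual <- 1 / (1 + exp(-log_odds))

# plogis() is base R's built-in version of the same function
prob <- plogis(log_odds)

# qlogis() goes the other way, from probability back to log(odds)
back <- qlogis(prob)
```

A log(odds) of 0 corresponds to a probability of 0.5, negative values to probabilities below 0.5, and positive values to probabilities above 0.5.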

  • @baruchschwartz819
    @baruchschwartz819 4 ปีที่แล้ว

    at 6:00, would there be an easy way to program a loop so that R could provide you all those xtabs with one line of code?

    • @statquest
      @statquest  4 ปีที่แล้ว

      I'm sure there is. Maybe someone else will contribute the code.
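One way this could be done, sketched with `lapply()` and `reformulate()` from base R. The data frame `d` and its columns are made up; with the real data, `predictors` would list the categorical columns to tabulate against `hd`.

```r
# Hypothetical data with an outcome and two categorical predictors
d <- data.frame(
  hd  = factor(c("Healthy", "Unhealthy", "Healthy",
                 "Unhealthy", "Unhealthy", "Healthy")),
  sex = factor(c("F", "M", "M", "M", "F", "F")),
  cp  = factor(c(1, 4, 3, 4, 2, 1))
)

predictors <- c("sex", "cp")

# Build the formula ~ hd + <predictor> for each name and tabulate it
tables <- lapply(predictors, function(v) {
  xtabs(reformulate(c("hd", v)), data = d)
})
names(tables) <- predictors

tables
```

Each element of `tables` is the same cross-tabulation that a separate `xtabs(~ hd + sex, data = d)` call would give.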

  • @thomasbaker26
    @thomasbaker26 4 ปีที่แล้ว

    Excellent video, very clear and easy to follow! Do you have any videos that show how to do best subsets and cross validation with logistic regression on R? I know you have a video that explains the concept of cross validation but I am looking for a video like this that goes through it step-by-step for logistic regression on R. Same thing for how to run all possible models (best subsets) using logistic regression on R. I have found one by another youtuber for linear regression but not for logistic.

    • @statquest
      @statquest  4 ปีที่แล้ว

      Not yet. :(

    • @thomasbaker26
      @thomasbaker26 4 ปีที่แล้ว +1

      @@statquest Wow thank you for the quick reply! That's alright, if you do make any videos like that, I'll be among the first to watch them! :)

  • @arpitsrivastava7559
    @arpitsrivastava7559 6 ปีที่แล้ว +3

    This video is very helpful. Do you also have a video about multinomial logistic regression in R? It could be very helpful if you could post one.

    • @statquest
      @statquest  6 ปีที่แล้ว

      I'm glad you like the video. I don't have one on multinomial logistic regression, so I'll put it on the to-do list.

    • @joshuabudi4787
      @joshuabudi4787 4 ปีที่แล้ว

      @@statquest hello! did you ever make a video for this one? would love to check it out if you did, thanks so much for what you do!

    • @statquest
      @statquest  4 ปีที่แล้ว

      @@joshuabudi4787 Not yet. :(

  • @JRO_Lyrics
    @JRO_Lyrics 2 ปีที่แล้ว +1

    great
    work done here

  • @ericaleverson9430
    @ericaleverson9430 4 ปีที่แล้ว +1

    You are so good!! Thank you!

  • @wilfredoa.tovarhidalgo9385
    @wilfredoa.tovarhidalgo9385 2 ปีที่แล้ว +1

    Excellent!!!! Thank you very much.