Cross Validation: Data Science Concepts

  • Published Sep 30, 2024
  • All about the very widely used data science concept called cross validation.

Comments • 72

  • @MyGuitarDoctor • 1 year ago (+24)

    This is by far the best explanation on YouTube.

  • @geoffreyanderson4719 • 2 years ago (+11)

    Great video. It's a good start. Andrew Ng's courses on Coursera are the state of the art in explaining what to do; I will just summarize quickly.
    A serious problem with cross validation used too soon, as shown here on just two data partitions ("training" and "testing"), is that you are left with no unbiased data (new, that is, new to the model) on which to evaluate your final model afterward, so your estimate will be too optimistic.
    The second problem, and yes it is a problem for your development process, is that CV mixes concerns together: you are simultaneously improving bias error and variance error. You need a third partition of data, one which never participates in training, hyperparameter selection, or grid search, to provide the unbiased error estimate for your final model. The CV methodology also clouds the decisions you need to make next if your model still isn't doing well enough: should you work on improving your model's bias error, or its variance error? How would you know which one is the problem (hint: with CV alone, you don't)? Should you add regularization? Add more features? Get more labeled examples? Deepen the trees, or make them shallower, or add more units or layers to your DNN, or add dropout and batch normalization, and on and on. There may be too many knobs to turn, too many options, which can waste your time. Imagine collecting more labeled data when that was never the problem. This is why you may prefer to proceed more deliberately in the dev process.
    You may also begin to assess and address class-wise error disparities: does your model have higher error on only a few particular classes of Y but not others? It may be time to resample so the minority classes that give the current model a higher error rate are better represented, or to weight the loss function terms to balance the total loss.
    A better practical solution is a sequential methodology, where bias is reduced first, all by itself. Fit your model on a very small number of training examples at first. Your bias error is encouragingly low if the model can fit that tiny training set well (the training set alone, not the validation set). If the model can't fit the training set well yet, you need to do more work on bias error, and should not concern yourself with variance error yet.
    Bias error is reduced mainly by getting more and better data features (the "P" in this video), a higher-capacity model (more trees, more layers, more units), and a model architecture better suited to your dataset.
    Also, your dataset may be dirty and need improving; it's not just the algorithm that makes a machine learning model work well, so evaluate your dataset for problems systematically.
    Variance error, by contrast, is reduced mainly by getting more data examples (the "N" in this video) and more regularization (and, again, a model architecture better suited to your particular dataset).
    I don't explain it as well as Andrew Ng, of course; it takes more space than a comment to communicate it all well. Ng is by far the best practical explainer of this model-development process in the world. If you want to level up as a model developer, do yourself a big favor: take his Machine Learning course on Coursera (if you are cheap like me) or at Stanford (if you have money and can get there), or just look on YouTube for copies of his lectures on bias and variance. Second, Ng's Deep Learning specialization (on Coursera) takes these concepts to an even higher level of effectiveness while being super clearly explained. Third, Ng's specialization "Machine Learning for Production" covers the most cutting-edge data-focused approach to fixing bias and variance error. Honestly, his material is the best in existence on this topic. Additionally, Jeff Leek at Johns Hopkins does a great job of explaining the need for an unbiased test set in his school's Coursera Data Science specialization (though this should be seen only as a secondary source next to Ng).
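
A minimal sketch of the three-partition discipline described above, assuming scikit-learn and a synthetic stand-in dataset (the logistic regression, the C grid, and the 80/20 split are illustrative choices, not from the video): the test set is carved off first and scored exactly once, after all cross-validated tuning is done.

```python
# Hedged sketch: hold out a test set before any cross-validation, so an
# unbiased error estimate survives hyperparameter search.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Final test partition: never touches training or hyperparameter selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Cross-validated grid search runs only on the development portion.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0]}, cv=5)
search.fit(X_dev, y_dev)

print("CV accuracy (can be optimistic):", search.best_score_)
print("test accuracy (unbiased):       ", search.score(X_test, y_test))
```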

  • @starostadavid • 3 years ago (+18)

    This is how teachers should teach: a filled whiteboard and just explaining, jumping from one concept to another and comparing as you go, not slowly filling the whiteboard and wasting time, which feels unprepared. Good job!

  • @PF-vn4qz • 2 years ago (+3)

    Is there a reference detailing which cross-validation techniques are most recommended for time series data?
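
One commonly recommended option, for what it's worth, is an expanding-window scheme in which each fold trains only on the past and validates on the block that follows; scikit-learn ships it as TimeSeriesSplit. A minimal sketch on a 12-point stand-in series:

```python
# Expanding-window CV for time-ordered data: no fold ever trains on the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(12)  # stand-in for 12 time-ordered observations
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(t):
    print("train:", train_idx, "-> validate:", val_idx)
# train: [0 1 2]             -> validate: [3 4 5]
# train: [0 1 2 3 4 5]       -> validate: [6 7 8]
# train: [0 1 2 3 4 5 6 7 8] -> validate: [ 9 10 11]
```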

  • @communicationvast9949 • 2 years ago (+6)

    Great job explaining, my friend. Very easy to follow and understand when there is a good speaker and communicator. You are an excellent teacher.

  • @teegnas • 4 years ago (+7)

    Please make a similar video on bootstrap resampling, how it compares with cross-validation, and when to use which.

    • @marthalanaveen • 4 years ago (+5)

      Bootstrap is a very different sampling procedure from cross-validation. In CV, you make samples/subsets without replacement, meaning each observation will be included in only one of the samples; in bootstrapping, one observation may (and very likely will) be included in more than one sample (see the sketch after this thread).
      That said, if you train a model on 4 CV samples/subsets, your model will never have seen the observations that you test it on, giving you a better estimate of the variance of accuracy (or your metric of choice). You can't be sure of that when training with bootstrapped samples/subsets, since your model may have seen (or even memorised, for worse) the samples you test it on.
      Disclaimer: I am not as big an expert as ritvik, and I am talking in the context of the video.

    • @ritvikmath • 3 years ago (+3)

      great description! thanks for the reply :)
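
A minimal sketch of the sampling contrast described above, assuming NumPy and scikit-learn (the 10-point index array is purely illustrative): K-fold CV partitions the indices without replacement, while the bootstrap draws with replacement.

```python
# Without replacement (K-fold CV) vs. with replacement (bootstrap).
import numpy as np
from sklearn.model_selection import KFold

idx = np.arange(10)

# K-fold CV: each index lands in exactly one validation fold.
for _, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(idx):
    print("validation fold:", val_idx)

# Bootstrap: draws with replacement, so duplicates are likely and some
# indices are left out entirely (those form the "out-of-bag" set).
rng = np.random.default_rng(0)
print("bootstrap sample:", np.sort(rng.choice(idx, size=10, replace=True)))
```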

  • @ziaurrahmanutube • 4 years ago (+6)

    Very well explained! The only question I have: is there a proof, for us geeks, of why the variance is reduced?

    • @simonjorstedt8552 • 3 years ago (+3)

      I don't know if there is a proof (there are probably examples where cross validation doesn't work), but the idea is that the models' variances will "cancel out".
      While one model might predict larger values than the true distribution, another model is likely to predict smaller values. So when taking the average of the models, the variance cancels out. Ideally...
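
A toy simulation of that averaging intuition, under the simplifying assumption that each "model" is just the mean of its own independent noisy training draw (a stand-in, not the video's setup). With real CV folds the training sets overlap, so the reduction is smaller than this independent-draw ideal:

```python
# Toy check: variance of a single estimate vs. the average of five.
import numpy as np

rng = np.random.default_rng(42)
single, averaged = [], []
for _ in range(2000):
    # five "models", each fit (here: a sample mean) on its own noisy draw
    estimates = [rng.normal(0.0, 1.0, size=20).mean() for _ in range(5)]
    single.append(estimates[0])
    averaged.append(np.mean(estimates))

print("variance, single model   :", np.var(single))    # ~ 1/20 = 0.05
print("variance, 5-model average:", np.var(averaged))  # ~ 0.05/5 = 0.01
```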

  • @NickKravitz • 4 years ago (+4)

    Nice pen tricks. I am waiting for the ensemble model technique video.

  • @SiliconTechAnalytics • 3 months ago (+1)

    Not only do all the students get TESTED; they all also get TRAINED! Variance in accuracy significantly reduced!! Well done Ritvik!!!

  • @CodeEmporium • 4 years ago (+1)

    Nice work! Glad I found you

    • @ritvikmath • 3 years ago

      Awesome, thank you!

  • @isaacnewton1545 • 1 year ago

    Let's say we have 5 models, each with 3 hyperparameter settings, and the number of folds is 5. Does that mean we train 5 * 3 * 5 = 75 models and choose the best one among them?
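
That arithmetic checks out if "3 hyperparameters" means 3 candidate settings per model family: each family costs settings × folds fits, and the winning configuration is then usually refit once on all the training data. A minimal sketch for one family, assuming scikit-learn (the decision tree and its max_depth grid are illustrative):

```python
# 3 candidate settings x 5 folds = 15 fits for one model family;
# five such families would total 75 fits, plus one refit of the winner.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 8]}, cv=5, verbose=1)
search.fit(X, y)
# verbose output: "Fitting 5 folds for each of 3 candidates, totalling 15 fits"
```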

  • @ranggayogiswara5148 • 8 months ago

    What is the difference between this and ensemble learning?

  • @-0164- • 1 year ago (+1)

    I just cannot thank you enough for explaining ml concepts so well.

  • @sivakumarprasadchebiyyam9444 • 11 months ago

    Hi, it's a very good video. Could you please let me know if cross-validation is done on the training data or the total data?

  • @josht7238 • 1 year ago

    Thank you so much, very helpful!!

  • @tremaineification • 1 year ago

    How are you calculating accuracy in every model?

  • @jupiterhaha • 2 years ago

    This video should be titled K-Fold Cross-Validation! Not Cross Validation! This can be confusing to beginners!

    • @ammarparmr • 1 year ago

      Also, why did the instructor use all 1000 points in the cross-validation? Shouldn't he leave a test set aside for a final check?

  • @zuloo37 • 3 years ago (+2)

    I've often used cross-validation to get a better estimate of the true accuracy of a model on unseen data, to aid in model selection, but for the final model used in future predictions I will sometimes train on the entire available training data. How does that compare to using an ensemble of the CV models as the final model?

    • @GohOnLeeds • 9 months ago

      If you train on the entire available training data, how do you know how well it performs?
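
A hedged sketch of the workflow this thread describes, assuming scikit-learn (the random forest is an arbitrary stand-in): the cross-validation scores serve as the performance estimate, and the deployed model is then a single refit on everything rather than an ensemble of fold models.

```python
# Estimate generalization with CV, then train the deployed model on all data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5)  # "how well does this recipe do?"
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

model.fit(X, y)  # final model: one fit on everything, no ensemble of folds
```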

  • @trongnhantran7330 • 1 year ago

    This is helpful 😍😍😍

  • @gtalckmin • 3 years ago (+1)

    Hi @ritvikmath, I am unsure whether the best idea is to create a model ensemble as your final step. One could use cross-validation across different hyperparameters to get a global idea of the error associated with each hyperparameter (or combination of them). That said, once you have the best tuning parameters, you can either train and test on a validation set, or use the whole dataset to find the most appropriate coefficients for your model. Otherwise, you may be wasting data points.

    • @rubencardenes • 3 years ago

      I was going to comment something similar. Usually I would use cross validation not to come up with a better model but to compare choices (different hyperparameters or different algorithms), so that the comparison does not rely on whichever data happened to be randomly chosen.
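
A minimal sketch of CV used for comparison rather than ensembling, as both replies suggest, assuming scikit-learn (the two candidate classifiers are illustrative stand-ins): scoring the candidates on identical folds keeps the verdict from hinging on one lucky split.

```python
# Compare candidate algorithms on identical folds, then refit the winner.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)  # shared folds

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("tree  ", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=folds)
    print(name, "mean CV accuracy: %.3f" % scores.mean())
```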

  • @TerrenceShi • 3 years ago (+1)

    you explained this really well, thank you!

  • @gezahagnnegash9740 • 1 year ago

    Thanks for sharing, it's helpful for me!

    • @ritvikmath • 1 year ago

      Glad it was helpful!

  • @cleansquirrel2084 • 4 years ago (+2)

    Another beautiful video!

  • @taitai645 • 2 years ago

    Thanks for the video.
    Could we use CV to remove a trend in data stacked in blocks? Like block A for period A, block B for period B, ..., block X for period X, with a risk of trend?
    Please.

  • @ramoda13 • 1 year ago

    Nice video, thanks

  • @liz-hn1qm • 1 year ago

    Thanks a lot!!! You just got my sub and like!!

  • @al38261 • 1 year ago

    Really, a wonderfully clear explanation! Many thanks! Is there a video, or will there be one, about time series & cross-validation?

  • @ResilientFighter • 4 years ago (+2)

    Great job ritvik!

  • @onkarchothe6897 • 3 years ago

    Can you suggest a book on cross-validation and k-fold validation that contains examples with solutions?

  • @ginospazzino7498 • 3 years ago (+1)

    Thanks for the video! Can you also do a video on nested cross validation?

    • @ritvikmath • 3 years ago

      Great suggestion!

  • @davidhoopsfan • 3 years ago (+1)

    hey man, nice UCLA Volunteer Center shirt, I have the same one haha

    • @ritvikmath • 3 years ago (+1)

      Haha! Nice, after all these years still one of my fave shirts

  • @Guinhulol • 7 months ago

    Dude! Totally worth signing in to leave a like!

  • @gayatrisoni4525 • 2 years ago

    Very clear and excellent explanation. Thanks!🙂

  • @alinakapochka3993 • 1 year ago

    Thank you so much for this video! ☺

  • @hameddadgour • 2 years ago

    Great content! Thank you for sharing.

  • @nivethathayarupan4550 • 2 years ago

    That is a very nice explanation. Thanks a lot

  • @Dresseurdecode • 2 years ago

    You explain very well. Thank you

  • @Mara51029 • 6 months ago

    This is the best video for cross validation across YouTube ❤

    • @ritvikmath • 6 months ago

      So glad!

  • @fatemeh2222 • 1 year ago

    OMG exactly what I was looking for. Thanks!

    • @ritvikmath • 1 year ago (+1)

      Glad I could help!

  • @bennyuhoranishema4765 • 1 year ago

    Great Explanation!

  • @willd0047 • 2 years ago

    Underrated explanation

  • @almonddonut1818 • 2 years ago

    Thank you for this!

  • @abdullahibabatunde2825 • 3 years ago

    thanks, it is quite helpful

  • @rubisc • 1 year ago

    Great explanation with a simple example that's easy to digest

    • @ritvikmath • 1 year ago

      Glad it was helpful!

  • @brockpodgurski6144 • 3 years ago

    Excellent job man

  • @mahrokhebrahimi6863 • 3 years ago (+1)

    Bravo 👏

  • @best1games2studio3 • 2 years ago

    great video!!

  • @krishnachauhan2850 • 2 years ago

    You are awesome

  • @yogustavo12 • 3 years ago

    This was amazing!

  • @jefenovato4424 • 3 years ago

    Great video! Helped a lot!

  • @solomonbalogun7651 • 1 year ago

    Great job explaining this 👏

    • @ritvikmath • 1 year ago

      Glad it was helpful!

  • @brofessorsbooks3352 • 3 years ago

    I love you bro