Data Splitting using Cross Validation and Bootstrap in R
- Published Feb 7, 2025
- ☕If you would like to support the channel, consider buying me a coffee ☕: buymeacoffee.c...
For one-on-one tutoring/consultation services: guide-tree-sta...
I offer one-on-one tutoring/consultation services for many topics related to statistics/machine learning. You can also email me at statsguidetree@gmail.com
For R code and dataset: gist.github.co...
This video is an R tutorial on various data-splitting (i.e., model validation, data partitioning) methods using the caret package to estimate accuracy and error. I cover the following methods: test/train hold-out, leave-one-out cross-validation, k-fold cross-validation, repeated k-fold cross-validation, and the bootstrap 632 method. The dataset I use is the heart disease dataset. For a review of logistic regression models, please check out the video:
• Logistic Regression wi...
For formulas used to calculate the metrics provided in the output from the confusion matrix:
rdrr.io/cran/c...
Here is the r code:
##### Data splitting methods for model validation
##### Some methods reviewed:
# Test/train
# Leave one out cross-validation (LOO CV)
# k-fold cross-validation (k-fold CV)
# repeated k-fold cross-validation (repeated k-fold CV)
# bootstrap resampling the 632 method
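In caret, each of the resampling schemes listed above is selected through `trainControl()`. As a sketch (not the video's exact code; the `number` values are illustrative defaults), the controls might be set up like this:

```r
library(caret)

# Each resampling scheme is chosen via trainControl();
# these mirror the methods listed above.
ctrl_loocv  <- trainControl(method = "LOOCV")                  # leave-one-out CV
ctrl_kfold  <- trainControl(method = "cv", number = 10)        # 10-fold CV
ctrl_repeat <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 3)          # repeated 10-fold CV
ctrl_boot   <- trainControl(method = "boot632", number = 100)  # bootstrap 632
```

Any one of these control objects is then passed to `train()` via its `trControl` argument.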
##### When should these be used, and why are they important?
# Problems of overfitting adversely impact the
# generalizability of the model.
##### Differences between linear and logistic regression
#####################################################################################
# Load dataset for example
#####################################################################################
# Dataset of patients with heart failure
# find and load dataset downloaded from
# www.kaggle.com/andrewmvd/heart-failure-clinical-data
# (filename assumed from the Kaggle download)
heart <- read.csv("heart_failure_clinical_records_dataset.csv")
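As a self-contained illustration of the test/train hold-out approach (using the built-in `mtcars` data with its binary `am` column as a stand-in outcome, since the heart CSV is not bundled here), a split with `createDataPartition()` might look like:

```r
library(caret)
set.seed(123)

df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))  # binary outcome stand-in

# 70/30 stratified hold-out split
idx      <- createDataPartition(df$am, p = 0.7, list = FALSE)
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# Fit a logistic regression on the training set only
fit  <- train(am ~ mpg + wt + hp, data = train_df,
              method = "glm", family = binomial)

# Evaluate on the held-out test set
pred <- predict(fit, newdata = test_df)
confusionMatrix(pred, test_df$am)
```

The confusion matrix output reports accuracy, sensitivity, specificity, and kappa for the held-out observations.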
Very informative video.
I am trying to train an RF model with 40+ independent variables. I am currently using repeated k-fold CV with 3 repeats, and it is taking a lot of time. How can I reduce the model training time? I am afraid that if I use the bootstrap method it may take even longer.. 2-3 days!!
Any suggestions??
Thanks for your time; the video gives very complete information on model validation. I just have a doubt: I have seen that some people divide the data into training and test sets and then apply k-fold CV to the training set, while in your code the k-fold CV is applied to the complete dataset. Which is more correct, and why?
That is a really good question. Generally speaking you do not need to test/train split your data first before using k-fold CV. If your goal is to validate your model (i.e., evaluate the generalizability), you need to test it against data the model has not already seen -- and since you are already doing that with the k-fold CV you would not need to start it off with a test/train split.
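The approach described in the reply above, k-fold CV applied directly to the full dataset, can be sketched as follows (again using `mtcars` as a stand-in for the heart data):

```r
library(caret)
set.seed(123)

df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))

# 5-fold CV over the complete dataset; every row is held out exactly once
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(am ~ mpg + wt, data = df,
              method = "glm", family = binomial, trControl = ctrl)

fit$resample  # per-fold accuracy; their mean estimates out-of-sample performance
```

Because each fold's held-out rows are never seen by the model fit on the remaining folds, this already provides the "unseen data" evaluation that a separate hold-out split would give.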
I noticed that [method = "glm"] was used in the LOOCV method, but what if you have a nominal dependent variable [outcome of 0/1/2], how can we run LOOCV on that? Any help is appreciated.
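For a three-level nominal outcome, caret can fit a multinomial logistic model via `method = "multinom"` (from the nnet package) with the same LOOCV control. A sketch, using the built-in `iris` data's three-class `Species` as a stand-in outcome:

```r
library(caret)
set.seed(123)

# LOOCV with a multinomial model for a 3-level nominal outcome
ctrl <- trainControl(method = "LOOCV")
fit  <- train(Species ~ Sepal.Length + Petal.Length, data = iris,
              method = "multinom", trControl = ctrl,
              trace = FALSE)  # trace = FALSE silences nnet's iteration log

fit$results  # LOOCV accuracy and kappa
```

For a 0/1/2 outcome in your own data, first convert it with `factor()` so caret treats it as classification rather than regression.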