If you like the videos, please do consider subscribing. It helps keep me motivated to make awesome videos like this one. :)
Inside OHE, can you please explain why we use transform on X_valid with the encoder that we fit on X_train? Shouldn't it be a separate fit_transform on X_train and X_valid, since both of these are completely different data?
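For anyone wondering the same thing: the encoder has to learn its categories (and therefore its output columns) from the training data only, and then be reused unchanged on the validation data, otherwise the encoded matrices would not line up and validation information would leak into the preprocessing. A minimal sketch, assuming scikit-learn's OneHotEncoder and toy data rather than the notebook's frames:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"color": ["red", "green", "red"]})
X_valid = pd.DataFrame({"color": ["green", "blue"]})  # "blue" never appears in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)                    # learn the category -> column mapping from train only
train_enc = ohe.transform(X_train)  # columns: green, red
valid_enc = ohe.transform(X_valid)  # SAME columns; the unseen "blue" becomes all zeros

# A separate fit_transform on X_valid would learn a different set of columns,
# so the train and validation matrices would no longer line up, and the
# validation data would influence the preprocessing (leakage).
```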
Haha, it's embarrassing to admit, but it took me ONE WEEK to get through a 55-minute video. Good stuff, Abhi! I have a few concerns in some places, but I now need to go through the code WITHOUT your video for a clearer understanding, and may come back with questions/concerns.
Thanks for the encouragement that I am not the only one finding this difficult 😄
It's like learning to swim from Phelps. Thanks Abhishek
Thank you for the videos, @Abhishek Thakur! I think instead of looping to select object columns, you can use
X_train.select_dtypes(exclude=['object']), which returns the dataframe without the object-dtype columns. A bit easier to filter, I suppose.
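A small sketch of this suggestion, with hypothetical column names rather than the notebook's real ones:
```python
import pandas as pd

# hypothetical toy frame standing in for X_train
X_train = pd.DataFrame({
    "LotArea": [8450, 9600],      # numeric column
    "Street": ["Pave", "Grvl"],   # object (categorical) column
})

numeric_part = X_train.select_dtypes(exclude=["object"])                  # drop object columns
object_cols = X_train.select_dtypes(include=["object"]).columns.tolist()  # or keep only them
```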
I was thinking about the same thing!
You are my savior 🙏🏼. I was able to do the other exercises, but I was actually stuck on this categorical variables part. This helped me ✌🏼
Yes, I am also stuck on 83%. What should I do in the categorical exercise?
Thank you for a great video. I have learned a lot.
To the point explanation!!
The benefit of ordinal over label encoding is only that OrdinalEncoder can be used on 2D data, (n_samples, n_features), whereas with LabelEncoder you will need to loop through the n_features. Small difference.
Please note that this is not quite true. The two methods are only equivalent for models which do not care about the order of the categories, which is why it is discouraged to use LabelEncoder on input data.
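To make the difference concrete, here is a hedged sketch using scikit-learn's OrdinalEncoder and LabelEncoder on toy data (not the competition dataset):
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

X = pd.DataFrame({"quality": ["good", "bad", "good"],
                  "size": ["small", "large", "small"]})

# OrdinalEncoder handles the whole 2D frame at once: shape (n_samples, n_features)
X_ord = OrdinalEncoder().fit_transform(X)

# LabelEncoder is designed for a 1D target, so encoding features needs a loop
X_lbl = X.copy()
for col in X_lbl.columns:
    X_lbl[col] = LabelEncoder().fit_transform(X_lbl[col])

# Both assign arbitrary integer codes, so either is only appropriate for models
# that do not read meaning into the order of those codes.
```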
I recommend that everyone buy the "Approaching (Almost) Any Machine Learning Problem" book
Already have it
Do we have to learn which model accepts which type of encoding? I mean, as you said, we can use label encoding with decision trees, RF and GBM, but with logistic regression and linear regression you prefer binarization.
Also, one-hot encoding is a type of binarization too, but we still used it with Random Forest.
Hi, you said that in Random Forest there is no need to do one-hot encoding; just doing label encoding is fine. So is this true only for Random Forest, or for any tree-based models like XGBoost, GBM, AdaBoost, etc.?
Thanks.
yes. for all tree based models
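A rough sketch of that rule of thumb, using toy data and scikit-learn estimators (this is an illustration of the idea, not the notebook's code): integer codes for a tree ensemble, one-hot columns for a linear model.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({"city": ["a", "b", "a", "c"]})
y = [0, 1, 0, 1]

# Tree ensembles split on thresholds, so arbitrary integer codes are usually fine
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(OrdinalEncoder().fit_transform(X), y)

# Linear models treat an integer code as a magnitude, so binarized (one-hot) columns are safer
lr = LogisticRegression()
lr.fit(OneHotEncoder().fit_transform(X), y)
```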
Hello, I want to make a final model and then predict on the X_test data set. Please find time to make a video on this for some of us. It would be of great help too. Thanks for the content, though.
many videos in this series have this part. if you go through them, you will find what you are looking for.
@@abhishekkrthakur oooh.. okay. Thanks
Why are we creating a new label encoder in the loop for every column?
Can't we just define it once outside the loop?
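In case it helps others with the same question: each fitted LabelEncoder stores the class mapping for exactly one column, and refitting a single shared encoder on the next column would overwrite that mapping, so a fresh encoder per column (often kept in a dict) is the usual pattern. A minimal sketch with toy frames standing in for the notebook's X_train/X_valid:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy stand-ins for the notebook's X_train / X_valid object columns
X_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotShape": ["Reg", "IR1"]})
X_valid = pd.DataFrame({"Street": ["Pave", "Pave"], "LotShape": ["IR1", "Reg"]})

encoders = {}
for col in X_train.columns:
    enc = LabelEncoder()                        # a fresh encoder for each column...
    X_train[col] = enc.fit_transform(X_train[col])
    X_valid[col] = enc.transform(X_valid[col])
    encoders[col] = enc                         # ...so every column's mapping is preserved

# Reusing one shared encoder would refit it (and overwrite its classes_) on every
# column, so you could no longer transform validation/test data consistently.
```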
If the test set contains NaN values in the columns that were not dropped, it will give an error. How do we deal with this? Should we impute these values in the test set?
Hello Abhishek sir, I hope all is well with you. Just a quick question: in the Categorical Variables exercise, the missing columns in X_test are deleted based on the missing columns in the X dataset. However, I noticed that X_test contains some additional columns with missing values. Can you please advise how to handle this? My one-hot encoder is fit and transformed on X_train, and I cannot use it to transform X_test as this will lead to an error. Should I handle the missingness in X_test using axis=0 on the selected columns? Thanks
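Not an official answer to the two questions above, but a common way to handle NaNs and unseen values in the test set is to impute with statistics learned on the training data and let the encoder ignore unknown categories. A sketch with a hypothetical column name:
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# toy stand-in; "MSZoning" is just a hypothetical categorical column
X_train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
X_test = pd.DataFrame({"MSZoning": [np.nan, "FV"]})   # a NaN and a category unseen in training

# impute with statistics learned on the training data only, then apply them to the test set
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# handle_unknown="ignore": categories not seen during fit become all-zero rows
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train_imp)
X_test_enc = ohe.transform(X_test_imp)
```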
But what about the order of the ordinal encoder? It seems our code never defines whether "never" should be 1 or 2 or 100.
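Quick note on this: scikit-learn's OrdinalEncoder simply sorts the category strings by default, so the integer that "never" gets is arbitrary; if the order matters you can pass it explicitly via the categories parameter (toy example):
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"smoking": ["never", "sometimes", "often"]})

# Default behaviour: categories are sorted alphabetically, so the code each value
# gets has nothing to do with any real-world ordering.
OrdinalEncoder().fit(X).categories_   # [array(['never', 'often', 'sometimes'], dtype=object)]

# To control the order explicitly, pass it through the categories parameter:
oe = OrdinalEncoder(categories=[["never", "sometimes", "often"]])
oe.fit_transform(X)                   # never -> 0, sometimes -> 1, often -> 2
```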
What is the use of putting axis=1?
axis=0 => rows, axis=1 => columns
@@abhishekkrthakur Thank you sir
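For a concrete picture of that reply, with a toy dataframe:
```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.drop("b", axis=1)   # drops the COLUMN named "b"
df.drop(0, axis=0)     # drops the ROW with index label 0
```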
While working on an air quality data set, I get this error when I run ypred = model.predict(X_train):
'numpy.float64' object has no attribute 'predict'
What is my mistake?
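Hard to say without seeing the notebook, but that error means the name model is bound to a plain number rather than a fitted estimator at the point where predict is called; one common way this happens is accidentally reassigning model to a score. A hypothetical sketch with toy data:
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# toy numeric data standing in for the air-quality features and target
X_train = np.random.rand(100, 4)
y_train = np.random.rand(100)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

# Common cause of the reported error: overwriting "model" with a float, e.g.
#   model = mean_absolute_error(y_train, model.predict(X_train))
# After that line, model.predict(...) fails with
#   'numpy.float64' object has no attribute 'predict'

# Keep the score in its own variable instead, so the estimator stays intact:
mae = mean_absolute_error(y_train, model.predict(X_train))
ypred = model.predict(X_train)
```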
Why is the original notebook splitting the X so early on? Then you have to perform all functions TWICE, unnecessarily. Apply all functions/operations, then split: wouldn't this be better? Seriously asking if there is a valid reason for the split so early.
What is the free book link mentioned in the print comment? (13:50)
bit.ly/approachingml
@@abhishekkrthakur Thank you, I see it now in the description. Coming from a core Python background, the masking step ```s[s]``` looked very foreign. I had to spend a lot of time on my own to understand it. (20:30)
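For anyone else puzzled by that step: s[s] is just boolean masking of a pandas Series with itself, which keeps only the entries that are True. A toy example (not the exact notebook frame):
```python
import pandas as pd

X_train = pd.DataFrame({"LotArea": [8450, 9600],      # numeric
                        "Street": ["Pave", "Grvl"]})  # object

s = (X_train.dtypes == "object")   # boolean Series indexed by column name:
                                   #   LotArea    False
                                   #   Street      True

s[s]                               # masking a Series with itself keeps only the True rows
object_cols = list(s[s].index)     # -> ['Street']
```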
I am trying to run the notebook, but it says "Draft session starting" and gets stuck there. I updated my Chrome browser, restarted it, flipped accelerators while running, but nothing helped. Can you please suggest ways to overcome this issue?
you need to run a cell. if it still doesn't work after a few mins, contact kaggle support :)
@@abhishekkrthakur Thanks for the response. I tried running it as suggested, but it's still not working. I raised a help ticket with Kaggle. Thanks.
My man Abhishek to the rescue 🔥🔥🔥🔥
Sorry, dumb question: why is axis=1?
axis=1 refers to columns and axis=0 refers to rows
If categorical features have missing values in the test set, what will OneHotEncoder do?
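The exact behaviour depends on the scikit-learn version, so a safe, version-independent option is to turn missing values into an explicit category before encoding, applied identically to train and test. A sketch with a hypothetical column name:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# "Alley" is just a hypothetical categorical column with missing values
X_train = pd.DataFrame({"Alley": ["Grvl", "Pave", np.nan]})
X_test = pd.DataFrame({"Alley": [np.nan, "Grvl"]})

# Treat "missing" as its own category, applied the same way to train and test
X_train_f = X_train.fillna("missing")
X_test_f = X_test.fillna("missing")

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train_f)                 # learned categories now include the "missing" placeholder
X_test_enc = ohe.transform(X_test_f)
```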