Thank you! This explains things much more clearly than my textbook.
Thank you for your kind words.
Thanks. It would be helpful for beginners to explain why/the purpose of standardizing the features.
Yes, it will be helpful for beginners. Thank you for the feedback.
00:02 Standardization makes features look like standard normally distributed data with mean 0 and unit variance.
01:40 Applying standardization to specific integer and float variables.
03:22 Standardize variables using StandardScaler from the pre-processing library.
05:17 Using StandardScaler for data preprocessing
06:50 StandardScaler transforms data to standardized values
08:35 StandardScaler transforms data to have mean 0 and variance 1
10:12 StandardScaler transformation on test data and analysis of mean and variance.
11:56 Using StandardScaler for data standardization in Python
Interesting
Thanks for your effort. I really appreciate it.
I'm glad you liked it. You're welcome
Thank you for your teaching. I just don't understand what 'axis=0' means.
I'm glad you liked it. axis=0 means the operation is applied along the rows, collapsing them so you get one result per column, and axis=1 means it is applied along the columns, giving one result per row.
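A minimal NumPy sketch of the difference (toy array, just for illustration):

import numpy as np

a = np.array([[1, 2],
              [3, 4]])

print(np.mean(a, axis=0))  # [2. 3.] -> one mean per column (collapses the rows)
print(np.mean(a, axis=1))  # [1.5 3.5] -> one mean per row (collapses the columns)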
Thanks for sharing.
I want to ask: is there a way to manually calculate the numbers produced by StandardScaler processing?
Yes, you can calculate it manually using the z-score formula, or you can search for the StandardScaler formula in the sklearn official documentation.
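A quick sketch checking the manual z-score against StandardScaler (toy column, made up for illustration; note that sklearn uses the population standard deviation, ddof=0):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

manual = (x - x.mean()) / x.std(ddof=0)      # z-score by hand
sklearn_scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, sklearn_scaled))   # True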
Thank you, this helped a lot!
You're welcome!
For the test data, we should be re-using the scaler object resulting from fitting only the train data, right?
something like...
ss = StandardScaler()
ss.fit(X_train)
X_train_scaled = ss.transform(X_train)
X_test_scaled = ss.transform(X_test)
Yes, fit on the train data only.
THANKS A TON SIR!
You're welcome!
Very helpful tutorial, but I have a small problem. What should I do if df.shape() returns an error: 'tuple' object is not callable? Should I modify the data type?
Thanks. Look at the previous syntax or parentheses.
Only df.shape, no brackets; shape is an attribute, not a method.
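For anyone hitting the same error: shape holds a tuple, so df.shape() tries to call the tuple itself. A minimal demonstration:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.shape)   # (3, 2) -> attribute access, no parentheses
# df.shape()      # would raise TypeError: 'tuple' object is not callable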
@@susamay Ok
Is that to say that the means of the scaled displacement and weight columns are approximately zero?
Yes, the means of the scaled columns are approximately zero; the transformation makes each column look like a standard normal distribution.
Great video! Do you know if when we implement StandardScaler through the Pipeline we are doing it this way or if we are doing a fit_transform? How would it be done this way? Thanks
Yes, we can apply it through a Pipeline. There is a video on pipelines on my channel you can watch.
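To answer the fit/transform part of the question: during pipeline.fit the scaler effectively does a fit_transform on the training data, and during predict the test data is only transformed, so the train/test discipline is handled for you. A minimal sketch (toy data, made up for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),    # fitted on the training data only
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)           # scaler: fit_transform on X_train
print(pipe.score(X_test, y_test))    # scaler: transform only on X_test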
Do we ever need to standardize the dependent variable "y"?
Generally, no. Most models don't require the target to be scaled, although some (e.g., neural networks) can train better with a scaled target.
Hi sir, how can I calculate the standardized value from an initial value using the mean and scale? I want to apply it in my program on an MCU. Hoping for your answer. Thanks.
You can write a user-defined function that performs the same operation using the fitted scaler's mean_ and scale_ attributes.
@@StatsWire Is scale_ the standard deviation?
@@thaivuo2949 Yes, mean_ is the mean and scale_ is the standard deviation.
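A minimal sketch for the MCU use case (toy column, made up for illustration): fit in Python, export mean_ and scale_, then apply the same formula by hand on the device.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[2.0], [4.0], [6.0], [8.0]])

ss = StandardScaler().fit(X_train)
print(ss.mean_, ss.scale_)               # per-column mean and standard deviation

x_new = 5.0                              # value arriving on the microcontroller
z = (x_new - ss.mean_[0]) / ss.scale_[0]
print(z)                                 # matches ss.transform([[5.0]])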
Now that it's scaled, do you just train the model on this transformed data?
Yes
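A minimal sketch of that workflow (toy data, made up for illustration); the key point is that new data must pass through the same fitted scaler before prediction:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ss = StandardScaler().fit(X_train)                 # fit on train only
model = LinearRegression().fit(ss.transform(X_train), y_train)
print(model.score(ss.transform(X_test), y_test))   # scale test the same way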
Very nice tutorial
Thank you
Great Video! Do you know where I can get the data set?
Thank you. Here are the Jupyter notebook and dataset links:
Notebook : github.com/siddiquiamir/Python-Data-Preprocessing/blob/main/StandardScaler.ipynb
dataset: github.com/siddiquiamir/Python-Data-Preprocessing/blob/main/autompg.csv
@@StatsWire Please provide it in description if possible
@@element6101 Sure
In this tutorial, should we also transform the mpg and acceleration columns?
Yes, we can transform those two columns as well because they are numeric and on a different scale. Just for the purpose of demonstrating how to use StandardScaler I used a few columns only; otherwise you can transform the other numerical columns as well.
What is the meaning of the random_state parameter while splitting the data?
random_state makes the split reproducible: when you split the data randomly but want the same samples in every split rather than a new random split each time, you set it to a fixed value.
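A small sketch showing that the same random_state reproduces the same split:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)

a_train, a_test = train_test_split(X, random_state=42)
b_train, b_test = train_test_split(X, random_state=42)

print(np.array_equal(a_train, b_train))  # True: same seed, same samples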
It was a very helpful video, but why do we need to standardize the data?
The reason we standardize the data is that we have different variables on different scales. For example, age can be in the range 0-120, while salary can range from 1,000 to 10,000,000. Without scaling, the salary variable would carry more weight in the model and age less. We use standardization to bring all variables onto the same scale so that they carry comparable weight.
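A minimal sketch of that age/salary example (numbers made up for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 40, 60], "salary": [30000, 90000, 500000]})

scaled = StandardScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))
# Both columns now have mean 0 and comparable magnitudes,
# so salary no longer dominates just because of its units.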
@@StatsWire thank you for the explanation.keep doing the great work 👍
@@sheetalkumari8581 Thank you for your kind words Sheetal.
Just wanted to know how to get the mean easily.... Thanks
You can use NumPy to get the mean easily:
> import numpy as np
> np.mean(data)  # pass any array or list of numbers
But it will be better if you scale the test dataset with the training parameters:
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)
Yes
You didn't explain what exactly StandardScaler does behind the scenes; you just explained how to do it.
Okay, I will make a separate video if you want more detailed information on what happens behind the scenes. The formula of StandardScaler is (Xi - Xmean) / Xstd, so it adjusts the mean to 0 and the standard deviation to 1.
@@StatsWire thanks for clarification and quick response 👍
@@GridoWit You're welcome
Is this the same as z-score normalization?
Yes
@@StatsWire thanks sir ❤️
Hi sir, how do I normalize single-row data? Thanks in advance.
StandardScaler learns its statistics per column and then transforms every row using them. To standardize a single row, select that row by its row number and pass it to transform on the fitted scaler.
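A minimal sketch (toy data, made up for illustration); keep the row 2-D so the columns line up with the fitted statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

ss = StandardScaler().fit(X)  # statistics are learned per column

row = X[[1]]                  # row number 1, kept as shape (1, 2)
print(ss.transform(row))      # [[0. 0.]] -> this row sits at the column means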
03:53 How do you get suggestions while typing in Jupyter?
Press the Tab key after typing a few characters.
@@StatsWire thanks
But we don't fit on X_test, right??
Right, because that can lead to "data leakage".
Sir, do you know where I can find a free tutorial that teaches that?
May I know what you want to learn?
@@StatsWire sklearn in practice
@@svitirur1665 You can learn from the official documentation. Here is the link
scikit-learn.org/stable/
Bro, can you provide the data that you used?
Yes bro, and you can also find the Jupyter notebook on my GitHub page. Below are the links:
dataset: github.com/siddiquiamir/Python-Data-Preprocessing/blob/main/autompg.csv
Notebook: github.com/siddiquiamir/Python-Data-Preprocessing/blob/main/StandardScaler.ipynb
You have not shown how to transform it back.
We can get back to the original scale with a few more lines of code. Maybe in the next video I can show it. Thank you for the suggestion.
@@StatsWire thank you
You didn't show how to inverse the scaling!!
I forgot to add that in the video
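For anyone looking for it in the meantime: the fitted scaler has an inverse_transform method that maps scaled values back to the original units. A quick sketch (toy column, made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0]])

ss = StandardScaler()
X_scaled = ss.fit_transform(X)

X_back = ss.inverse_transform(X_scaled)  # undo the scaling
print(np.allclose(X, X_back))            # True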
StandardScaler is showing an error.
What is the error?
Why not transform X instead of transforming X_train and X_test separately??
Good question. This prevents information about the distribution of the test set from leaking into your model. If you fit the scaler on the full dataset (X) before splitting, information from the test set is used to transform the training set, which is then passed downstream.
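A side-by-side sketch (toy data, made up for illustration); only the second version keeps test-set information out of the fitted scaler:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Leaky: the scaler's mean/std are computed with the test rows included
X_all_scaled = StandardScaler().fit_transform(X)

# Correct: split first, fit the scaler on the training rows only
X_train, X_test = train_test_split(X, random_state=42)
ss = StandardScaler().fit(X_train)
X_train_scaled = ss.transform(X_train)
X_test_scaled = ss.transform(X_test)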
@@StatsWire thanks for the quick response!
@@jameswood7207 You're welcome