Wow! I have been trying every transformation under the sun for several of my variables for 2 straight weeks with no luck. This is like magic. Now I just might finish my PhD dissertation by the end of the summer after all. Many thanks!
When using this technique in research, it may help in the peer review process to cite the published article referenced at the end of the video: Templeton, G.F. 2011. "A Two-Step Approach for Transforming Continuous Variables to Normal: Implications and Recommendations for IS Research," Communications of the AIS, Vol. 28, Article 4.
Thanks a lot Gary. I had been struggling to normalize my skewed data, but when I used the two steps that you explain clearly in your paper and video, my data became normal - confirmed by Kolmogorov-Smirnov and Shapiro-Wilk tests. Very helpful video!
People often ask why their sample size is reduced by 1 when using this technique. This happens because of the first step: the fractional ranks range from 1/n to 1. All values must be a fraction strictly below 1 for step 2 to work, so it skips over the 1 (associated with the biggest value).

To fix this, replace the fractional rank of 1 with 1-(1/n) before applying step 2. For example, if you start with a sample of 1,000, the Two-Step will likely result in a sample of 999. To use the missing record, you'll need to find it (it's the "1" value resulting from the first step among all cases). Replace the 1 with 1-(1/1000), or 1-.001, or .999.

This won't change results much, but will ensure every case is used. Of course, I'd put a small note in the paper about any transformation step needed.
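For anyone scripting this rather than clicking through SPSS: below is a minimal sketch of the two steps plus the 1-(1/n) fix just described. This is a Python/NumPy/SciPy rendering, not something from the video (which uses SPSS's Rank Cases and IDF.NORMAL), so treat the helper name and defaults as assumptions.

import numpy as np
from scipy.stats import rankdata, norm

def two_step(x, mean=None, sd=None):
    # Step 1: fractional ranks in {1/n, 2/n, ..., 1}; ties get averaged ranks.
    x = np.asarray(x, dtype=float)
    n = len(x)
    frac = rankdata(x) / n
    # The fix described above: replace the rank of 1 with 1-(1/n) so the
    # largest case is not lost in Step 2.
    frac[frac == 1.0] = 1.0 - 1.0 / n
    # Step 2: inverse normal CDF. Defaults impose the original series mean
    # and SD (normalized original units); pass mean=0, sd=1 for z-scores.
    if mean is None:
        mean, sd = x.mean(), x.std(ddof=1)
    return norm.ppf(frac, loc=mean, scale=sd)

Run column by column, this also automates the .999 replacement asked about in the next comment.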
Is there an automated way to do this replacement? I have over one hundred columns of ranked data, and each one has a value of "1.00". Do I manually need to go into each column, find the 1.00, and change it to .999?
Thank you so much... this video is very informative. My data were not normally distributed, but after watching this video I applied this procedure and now my data are normal.
Thanks a lot, Dr. Templeton. It is really helpful. I used the mentioned process and found it useful not only for me but also for my entire department.
You saved my day, thanks a lot. I think if you want to obtain the mean and the standard deviation you need to process your data before applying this method. You are going to be cited in my thesis!!!!!
My question is: where did you get the value for the second question mark (?) and the third question mark (?)? You didn't show where you got the series mean and the standard deviation that you copied and pasted from Notepad. I would really appreciate your help in this regard.
IDF.NORMAL(RDistanc,?,?)
@@norwegianresearchtraininginsti you can find it in his paper, it says:
To accomplish Step 2 in Excel, use the NORMINV() function, having the following syntax:
NORMINV(Step 1 result, imposed mean, imposed standard deviation)
Where,
Step 1 result = the result of Step 1, which must be in probability form
Imposed mean = mean of the variable resulting from the transformation (!!)
Imposed standard deviation = standard deviation of the resulting variable (!!)
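If you work in Python rather than Excel or SPSS (an assumption; the paper and video cover only the latter two), SciPy's norm.ppf is the same inverse normal function:

from scipy.stats import norm

# NORMINV(step1_result, imposed_mean, imposed_sd) in Excel and
# IDF.NORMAL(step1_result, mean, sd) in SPSS correspond to:
print(norm.ppf(0.75, loc=50.0, scale=10.0))  # ~56.74 for a hypothetical mean 50, SD 10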
I was really struggling to work out how to make my data normally distributed in order to do my analysis, and this video has saved me. Thank you so much for taking the time to share this method with us, and to answer our queries! I really appreciate it :)
I guess the interpretation does not change much because of this transformation, i.e. standard deviations and means stay the same, while kurtosis and skewness significantly improve. Also, this technique solves the problem of apparent outliers (values that actually are not outliers). Thanks a lot for such a great solution!
Thank you. This saved my life. I had been struggling with numerous transformations, but they did not work. This one worked for me. I also used the 1-(1/n) fix. You get a citation from me.
Hello Dr. Templeton, I am a PhD student and have found that the technique mentioned greatly improves the skewness and kurtosis values. However, the data is still not normally distributed. I have also tried log10 and loge transformations. Is there anything else that I can use? I don't have the option of dichotomizing the data. Please may I have a copy of your article for further details?
Hi Kamalpreet, I got the same problem too. I applied Dr. Templeton's technique but my data is still skewed. Did you figure out how to transform the data into a normal distribution? I want to perform a 3-way ANOVA, so I can't just use the KW test. I'd appreciate it if you could get back to me, thanks :)
Hi, I'm also interested in normalizing a variable. I have used ln(x), log10(x), 1/x, sqrt(x), and this method, but nothing works. I have heard about the Johnson transformation method. I haven't tried it yet, but it is said to work almost always, since it finds an optimal function that normalizes your data. Let me try it and I will tell you. If somebody knows how to use this method in SPSS, please share the info =)
Thanks Gary for the wonderful video and the article. I always have trouble when normalising data, since transformations like log don't usually work.. but this is great.. simply wonderful. Thank you again.
Gary, I ran the procedure several times in both SPSS and Excel using the same data set. Apparently, the outputs are inconsistent. I'm not sure what might cause the difference. I double-checked the formulas as described in your paper.

Here are the Excel formulas:
To get the percent rank: =IF(B4="","",IF(PERCENTRANK(B$2:B$50,B4)=1,0.9999,IF(PERCENTRANK(B$2:B$50,B4)=0,0.0001,PERCENTRANK(B$2:B$50,B4))))
To get the inverse of the cumulative normal distribution: =IF(B115="","",NORMINV(B115,0,1))

Running the data set with outliers replaced by the mean versus the original data produces some significant changes. So replacing outliers with means doesn't look like a reliable method to apply. Now I am thinking of Winsorizing my original data. Do you have any recommendation so I don't miss a single outlier?

My data is both hugely negatively skewed and has outliers. That makes it hard to figure out the best approach. I am thinking about robust statistics as well, given my data. Any thoughts on that? Huge thanks.
Great video Gary, thank you. Just like below, I have some of my variables giving an "out of range" error:
>At least one of the arguments to the IDF.NORMAL function is out of range. The
>first argument (probability) must be positive and less than one. The third
>argument must be positive. The result has been set to the system-missing
>value.
Why is this the case?
I did all the steps several times and used my own data series' mean & standard deviation, his numbers, and 0,1 - but every time this error appears:
>At least one of the arguments to the IDF.NORMAL function is out of range. The
>first argument (probability) must be positive and less than one. The third
>argument must be positive. The result has been set to the system-missing
>value.
What should I do? When I used 0,1 and my own numbers, a series of new data appeared as normalized data despite this error, but I don't know whether it is reliable or not...
Did you use three arguments? The syntax for the second step requires 1) the result of step 1 (this is in fractional, or probability, form), 2) the mean, 3) the standard deviation.
@@gftempleton thanks for your answer. Yes, I did. For both data series this error came up, but new columns were also added to my SPSS worksheet! I'm going to use them, but considering these errors I don't know how reliable they are...
Hi Mr. Templeton. I have 8 groups of data to analyze. Some of them are normal and some of them are not. Should I apply your method to all of the groups in order to compare them in a one-way ANOVA? Please please help...
Thank you so much for this! I'm currently doing my dissertation and the non-normal data kind of shot me in the foot for the proposed analytical methods. Much appreciated!
Thanks Gary. But every time I run the transformation, this error appears:
>At least one of the arguments to the IDF.NORMAL function is out of range. The
>first argument (probability) must be positive and less than one. The third
>argument must be positive. The result has been set to the system-missing
>value.
What should I do? Where do I copy the mean and the standard deviation from? Thanks in advance.
Step 2 will not work on 0's or 1's. If the problem is a 0, convert it to 1/n as an estimate. If the problem is a 1, convert it to 1-(1/n) as an estimate.
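In script form, both boundary fixes can be applied at once by clamping the Step 1 output into the open interval (0, 1). A sketch assuming NumPy (not from the video):

import numpy as np

def clamp_fractional_ranks(frac):
    # Keep the inverse-normal step from ever seeing 0 or 1:
    # 0 becomes 1/n and 1 becomes 1-(1/n), as suggested above.
    frac = np.asarray(frac, dtype=float)
    n = len(frac)
    return np.clip(frac, 1.0 / n, 1.0 - 1.0 / n)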
@@gftempleton Hi Gary, thanks for replying. I meant the mean and the standard deviation we have to put into the function at 1:38. What should we put in: the original mean and SD, or the mean and SD after the step 1 transformation :)?
Not original units, but normalized units.
Example: if you transform assets to normal and put it in an equation, it is interpreted as normalized assets.
It's the same as with any transformation.
Thanks Gary. One question is about the Shapiro test. Once I transformed my variables, all improved in terms of skewness and kurtosis. However, the Shapiro-Wilk test still shows a non-normal distribution (p < .05).
Feel free to try other transformations (e.g., natural log, power transformations, truncating, winsorizing). However, that is time consuming. If you can find a statistical package that offers Box-Cox, which tests many different power options, that may be a good use of time. However, reviewers of your work could also tolerate that you attempted a normality transformation that improved the situation. Worst case, you'll have to use non-parametric procedures (which, coincidentally, utilize transformations - usually ranking).
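SciPy is one package with the Box-Cox search mentioned here; a sketch, with the caveat that Box-Cox requires strictly positive data (the data below is made up):

import numpy as np
from scipy.stats import boxcox

x = np.random.default_rng(0).exponential(scale=2.0, size=500)  # hypothetical skewed, positive data
transformed, fitted_lambda = boxcox(x)  # maximum-likelihood search over the power options
print(f"best lambda: {fitted_lambda:.3f}")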
Thank you very much for such an instructional article and for the follow-up video. I've been able to normalize my data following your method but still have a doubt.

I have a non-normally distributed metric variable "SOM" which has values for the years 2002 and 2012 (a nominal variable named SAMPLE takes the value '1' for 2002 and '2' for 2012 samples).

I'm now able to 'globally' normalize my SOM variable with the Two-Step Transformation, BUT when I then do ANALYZE --> DESCRIPTIVE STATISTICS --> EXPLORE with a split file or a 'Factor List' by the variable SAMPLE and re-run the normality tests, I find that only the 2012 samples are normally distributed, with the 2002 samples not normally distributed, and I don't know how to resolve this. I'm describing this particular case, but I also need to split my variable SOM even further (i.e. "date collected AND soil type", or "date collected AND soil type AND cultivation system", etc.). Or is this a non-issue because the 'global' variable SOM is now normally distributed?

I'm having this issue recursively and simply cannot find the answer. If you find the time to enlighten me on the issue, it will help a lot. Thanks anyway for such a great transformation.
Once we have normality, how can I run a regression with the original data (taking into account the normalized data) so that I can use it in my predictive model?
Hello, I followed the steps above, and one of the fractional rank values was 1, which became missing data (no value shown) after the transformation. I don't know how to solve this problem 😅 Looking forward to your reply!
Thanks for the video and the reference, Prof Templeton. When computing the fractional rank of some of my variables I end up having a value = 1 (the highest value on that variable), which then creates an "out of range" error in the IDF.NORMAL function, as the range of values it accepts is 0 to less than 1. This does not happen with all variables, just some. Any hints as to why this happens and how to address it in this transformation to normality would be appreciated. Thanks
Hi Fidel Vila, I think I've worked it out. I think there must be some rounding errors, which means that the probability (first argument) ends up being interpreted as out of range (it has to be within 0 and 1). I'm not sure how this happens (perhaps the calculations for the mean and SD need to be to more significant places, but I've tried this and it doesn't remove the error), but I have worked out a fix, which is a bit of a "fudge":

Say you have a variable X to be normalised, with mean MEAN and standard deviation SD. Let's suppose you have computed the fractional rank and made a variable RX. You then do the following:
>COMPUTE X_norm=IDF.NORMAL(RX/1.001,MEAN,SD).
>EXECUTE.
Dividing RX by 1.001 ensures that the variable is kept within the allowed range. (Although I repeat: I am not sure why it is interpreted as out of range - as far as I can see, my variables all fall within 0 and 1, so it must be to do with rounding errors in the mean and SD calculations.) Hope this helps!
Thanks a lot for saving my day. But you did not mention initially that I would need the 1-(1/n) replacement before the final inversion. Thanks all the same.
Hi Gary. First, thanks for your informative video. I was dealing with a few very non-normal distributions, and this method worked wonderfully in normalizing the data. That said, I have one question for you, the answer to which I cannot seem to figure out. Namely, in the video description, you note that, "This approach retains the original series mean and standard deviation to improve the interpretation of results." However, I have not found this to be the case. Although the means and SDs of the transformed variables are quite similar to the original series' means and SDs, they are not perfectly retained. At least, this was true in cases where a value of 1 was generated after completing the first (fractional rank-order) step, even when I used the formula you mentioned in your response to a comment below to replace the 1 (i.e., replace the 1 with 1-(1/n)). Any clarification here would be much appreciated. Thanks again for your informative video.
Another reason they aren't exactly the original mean and standard deviation is inflated frequencies (stacks of the same value) that are some distance from the mean. If there were no "same values" in the dataset, the resulting mean and standard deviation would be exactly the mean and standard deviation of the original set. The approach "tries" to do that at least.
1-(1/n) is a close approximation that keeps researchers from losing a record. Consider it part of the procedure - just like the first two steps. You may be right that it may cause the mean and standard deviation parameters to vary slightly. I don't think it would affect interpretations much. Sample size is a bigger issue in a lot of cases. This should be up to the researcher to decide.
Hello. My question is: I have a series whose distribution does not follow the normal distribution. I tried the logarithmic transformation in EViews, but the p-value of the Jarque-Bera test is always lower than 0.05. So what transformation should I do in EViews?
Eviews has each step. The first is a fractional rank (rank represented in proportions) and the second is a normal inverse function. It appears "Normal (Gaussian)" is showing you this here: www.eviews.com/help/helpintro.html#page/content/mathapp-Statistical_Distribution_Functions.html
Wow, thanks Gary. This is a great method. I used log10 and square root transformations to normalize the distribution of my data. None of them worked, and my data was still negatively skewed. I used this two-step transformation and it worked great.

At first, I had a hard time finding the corresponding series mean and SD. After looking at the paper referenced, I found out that you have two choices. You can either put 0 as the mean and 1 as the SD for the arguments of the function (e.g. IDF.NORMAL(a new ranked series,0,1)), or put the mean and SD of the original series to maintain the units of the data. When I put in the original mean and SD of the series, it worked just fine and the data looked normal. However, when I replaced the parameters with 0 and 1, SPSS returned no value. Not sure why.

I also noticed that my sample size didn't reduce by 1 after the transformation. My sample size is small (50). Are you supposed to lose one case after the transformation? Am I doing something wrong in not seeing this result?
So you went from standardized normal (0 and 1 parameters) to normalized (original mean and sd) and back to standardized? Did you save variables with the same names? That may be the problem. I've used the technique en masse and have never heard of that. It seems like you need to make sure you use unique variable names.

Low sample size would help retain all records. I don't know what the threshold is. Do you know what the fix is if it does become a problem? Replace the max value (1) in the results of Step 1 with 1-(1/n).
Given the problems with SPSS, I actually ended up using the Excel formula you provided in the paper, which I found much easier for data with many variables (33 in my case). It worked just fine, and all my variables except one are normally distributed now (that non-normal variable looked normal to me, but the Shapiro-Wilk test was significant after all). I was able to retain my whole sample (50) using the Excel formula. What do you think about that?

I also went ahead and ran outlier testing with a g value of 2.2. To my surprise, most variables have at least two outliers. That's surprising, as I replaced all outliers with the mean in the original data before running the transformation in Excel. A g factor of 2.2 is usually considered pretty large, retaining almost all data. It's very interesting to see such a pattern. But I still trust this data with outliers better than my original non-normally distributed data, unless I am missing something significant here.
I perform the Two-Step in Excel often - there is no difference as far as I can tell - the same two steps are readily available.

I would never replace an outlier with the mean! You're changing the data and may be suppressing or hiding results. If your data is sufficiently normal, don't worry about outliers. And your data doesn't have to be perfectly normal. If it is terribly non-normal and you use non-parametrics, your data isn't normal anyway. For example, if you use Spearman's rank, the data is transformed to uniform (not normal).
Thanks for taking the time to provide the great pointers. The scale of my data is continuous. I don't remember where I saw it, but replacing outliers with the mean was described as a reliable method.

Unfortunately, my data is pretty skewed, with a skewness value of .9 or so. My sample size is also pretty small (45), so not losing even one case is important.

What I will do is put all the outliers back in their place and run everything again. That's a pain, but I am curious to see the difference.
Thanks Gary. How do I interpret coefficients after converting the dependent variable using the IDF.NORMAL function? For example, if the independent variable increases by one unit, how does it affect the dependent variable? Thanks.
Assuming you transform using the series mean and standard deviation, interpret exactly as you would the original units. I would note that you normalized the original units. Alternatively, you can transform using mean=0 and sd=1 and interpret in standardized normal units.
Thanks, Gary, what you have done is very helpful. Are you aware of any critical peer reviews or papers out there regarding this method of yours? Thanks for the answer.
This is an amazing method. I'm wondering if there's added value in winsorizing or otherwise capping variables before the transformation. Some of the clinical variables I have contain a case or two with extreme outliers and are also non-normal. Using the means and standard deviations for these variables seems a little weird to me, because the Ms and SDs before winsorizing don't seem within the range of values usually seen in my patient population. If I winsorize before transforming, the Ms and SDs seem a little more representative... Am I completely off here?
Thank you very much for the wonderful method. May I ask a question please? I have data including 2 groups in 1 variable. When I use this method, should I split the file to separate the groups? When I compare the mean difference with a t-test, the result changes. For example, I have walking speed for 1,500 persons; this data includes persons with heart disease (N=245) and persons without heart disease. When I fix to a normal distribution, can I fix it in one pass (N=1,500), or should I split the file into heart-disease and non-heart-disease groups? Neither group's data is normally distributed. I look forward to hearing from you soon. Thank you very much.
I'm not 100% sure I understand your question. Assuming you will be standardizing the data along with the Two-Step (mean=0, sd=1): if you want to test the difference between two groups, I would stack them all together and transform one time. If you use the original series mean and standard deviation, you'd have to compute the values for both groups, then normalize separately.
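To make the two options concrete, here is a sketch assuming pandas and the hypothetical two_step helper from earlier in the thread; the column names and data are made up:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "speed": rng.lognormal(0.0, 0.5, 1500),             # hypothetical walking speeds
    "group": rng.choice(["heart", "non-heart"], 1500),  # hypothetical group labels
})

# Option 1: standardize (mean=0, sd=1) and transform all cases together.
df["speed_std"] = two_step(df["speed"], mean=0, sd=1)

# Option 2: keep original units, normalizing each group separately.
df["speed_norm"] = df.groupby("group")["speed"].transform(lambda s: two_step(s))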
Dear Gary, can you please tell me what the implications of using this technique on a Likert scale are? For instance, I have used a Likert scale in which 1 is strongly disagree and 7 is strongly agree. Does it invert the relationship, or what? Thanks
If you are asking about units, just say the results are in normalized units. Of course, you would explain in the methods that you used the Two-Step. Models using original and transformed data (e.g., natural log or Two-Step) are separate models (i.e., different error terms) and should be interpreted differently (this is not so obvious to some, but models are commonly treated distinctly). Regarding the method, step 1 is simply a fractional rank and step 2 is the inverse normal function applied to the results of step 1.
Very useful information, but I'm getting lost where you copy the mean and standard deviation, so I'm stuck. Kindly help: where do I copy the mean and the standard deviation from? You only mentioned that you copied them from your Notepad, but what about me - where do I copy them from? I'm stuck, someone help ASAP please.
Thank you for the video and your paper! I used the two-step method for my non-normal data, and all of it turned into normal distributions! The only remaining concern is whether I am allowed to use this method for my data, which are drawn from 4- and 5-point Likert scales??! I read in your article that this method is mostly for higher numbers of levels (up to 100)! I would appreciate it if you could tell me whether I can use this method for 4- and 5-point Likert scales or not! Thanks in advance!
Thanks Gary, it is just awesome. I have one inquiry: how can I transform back from step 2 to the original data? For example, after I did the two steps I got a mean of (-0.007). So how am I going to report that?!
Would using this method after the fact be considered a linear-linear regression, a log-log regression, or what? Also, would transforming the variables back post-processing to, say, a scale from one to ten be considered good practice in terms of easing interpretation?
Hi, thanks for the share. I tried the method and it works to normalize the dataset. However, why is the sample size reduced after the procedure? For example, why did the sample size reduce from 6843 to 6842 above? Would that affect the conclusion?
Please give us a reasonable answer as to why the sample size decreases after the two-step process. Why does the missing value appear? How can we address this problem in a research paper? Thank you in advance.
I saw this video before and it was helpful. I followed your steps and posted the results to you in my first comment. I posted the results on your friend James Gaskin's channel and he recommended this video to me. Please help, I don't know what to do :(
Q1: I have 3 dependent variables. Two of them are within the normal skewness range (+1 to -1) and the kurtosis range (+3 to -3), but the third dependent variable is not within the normal range of skewness or kurtosis. I want to transform that variable with a square root transform to run parametric tests. So the question is: can I transform only that one variable and run parametric tests, or should I transform all three variables before testing? Should I transform all three variables even though two of them are already normally distributed? Will it create problems to transform only the one non-normal variable?
Q2: Can I infer and interpret normality on the basis of skewness and kurtosis only, rather than going for the Shapiro-Wilk test?
The technique will transform any variable toward normality, except for binary variables. That being said, there will be a variety of results depending on the situation. I have experienced positive, yet less beneficial results when applying the two-step to Likert scale data. While ORIGINAL data based on Likert-based items are often significantly non-normal, summing the items usually results in fairly good normality. Because non-normality isn't usually a terrible problem with Likert-based data, transforming toward normal doesn't help that much. On the other hand, transforming highly continuous data, especially ratios, will often yield tremendous downstream benefits in scientific testing. Good luck!
Why do we need to do 2 steps? Can't I just use the fractional rank? For example, my BMI variable was skewed and we wanted to run a GEE with it. Can't I just use the fractional rank as my new BMI?
Thank you for this helpful video. I was wondering... what is the use of the bootstrap option in Amos? Is it better to perform a transformation, or is the bootstrap sufficient when performing a CFA?
+Miriam Roussel This would make a good subject for a research paper. I believe bootstrapping has its own weaknesses, as does any transformation. I prefer using normalized, real data.
Hi, I used the same method on all three of my non-normal variables and it turned out fine for only two. Are there any assumptions or characteristics of the data needed to successfully transform to normality? I tried the other transformations (log10, etc.) too, but I still can't get a normal distribution for that particular variable. Do you have any recommendations that I can try for normality? TQ
Yishan - for the one variable that won't transform to normal, the problem may be the characteristics of the original distribution. If you don't have many levels (e.g., binary=2 levels) or if there is an "inflated frequency" (e.g., stacks of zeroes or other values), then it won't transform to normal regardless of what method you use. I would do the best you can, tell the reviewers what you did, and hope for the best.
Thanks for the clarification. That variable measures participants' depressive scores, and there are 120 cases with a zero value out of 1000+ cases. I'm trying to retain all data, as I think a zero score can be meaningful in the research. By using the two-step approach, I was able to reduce the skewness from 1.312 to 0.184 (std error = .076), which yields a Z value within the range of +-2.58. Despite the 'non-normal' shape of the histogram and the significant p value of the Kolmogorov-Smirnov test, can I still assume that the data has met the assumption of normality (based on the Z value) and proceed with the transformed variable?
Hi, thanks for this video. I do have a clarification question. I'm not sure what to use as the second and third arguments in the IDF.NORMAL function. In your 2011 paper, I read that, by default, one can use 0 and 1, respectively, for the desired mean and standard deviation, but in the video you mention a series' mean and standard deviation. Is it the mean and standard deviation of the (fractional-rank) transformed variable?
@@nicolasvanderlinden8569 so should we use the numbers that he uses instead of our own mean and standard deviation? What about 0 & 1? I'm so confused...
Hi, I just have one doubt: if we want to convert back to the absolute value, how can we do that? For example, I have a regression model and I converted the dependent variable; now I want to see what the absolute value of y will be.
It depends on the units you use. There are two basic uses of the Two-Step: 1) convert to standardized units (use mean=0 and sd=1 in the second step) or 2) convert to normalized original units (use original series mean and sd). So, the interpretation depends on usage. If you use the second step, there is no reason to convert back as you are in the original units (just normalized).
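In code, the two usages differ only in the parameters handed to the second step. A sketch reusing the hypothetical two_step helper from earlier (the income series is made up):

import numpy as np

income = np.random.default_rng(2).lognormal(10.0, 1.0, 200)  # hypothetical skewed series

income_std = two_step(income, mean=0, sd=1)  # usage 1: standardized units (z-scores)
income_norm = two_step(income)               # usage 2: normalized original units;
                                             # nothing to convert back afterwards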
Just one more question to be sure: so after applying this method for transformation, the regression equation remains the same, i.e. y = b0 + b1*x1 + b2*x2, and it doesn't change the way it does when we do a log transformation?
Reverting back isn't necessary if you transform using the original series mean and standard deviation. You are already in the original units. Also, remember that using the exponential function to revert back to original units from logged units is problematic when some original values are negative. In that case, the natural log would produce missing values. To avoid this, researchers will shift the values so none of them are negative, then do the natural log transformation. This means reverting back using the exponential is useless unless the preconditioning is reversed appropriately. The natural log has many flaws and is inferior to the Two-Step in achieving normality and achieving significant results. See: Templeton, G.F. and Burney, L. 2017. "Using a Two-Step Transformation to Address Non-Normality from a Business Value of Information Technology Perspective," Journal of Information Systems, Vol. 31, No. 2, pp. 149-164.
Thank you so much for this very informative video. I have one question. I used this method to transform my metabolome data so it would be normally distributed. Then I performed several analyses, including a linear regression model. How would I interpret the beta from this analysis? Or is there a way to transform this beta back to its previous "value"? I hope to hear from you!
It is in "normalized" units. Just report your transformation and consider the model with transformed units to be separate from not transformed. Regarding transforming back to previous value, I'm not an advocate of that since the model with transformed units satisfies test assumptions differently than original data. The only reason to do the transformation is to satisfy test assumptions. So, I don't agree you can transit back and forth between the two models. They are different. Perhaps you can report both. I don't believe in doing it with log transformed data either.
@@gftempleton Thank you very much for your quick response. As I look back to one of your previous answers, can I then assume that I still have the same units after using the transformation with series mean and SD? So my estimate of say 1.55 is 1.55 mmol/l (for example). No need to transform it back? Previous answer: "It depends on the units you use. There are two basic uses of the Two-Step: 1) convert to standardized units (use mean=0 and sd=1 in the second step) or 2) convert to normalized original units (use original series mean and sd). So, the interpretation depends on usage. If you use the second step, there is no reason to convert back as you are in the original units (just normalized)."
@@joellevergroesen1050 Not exactly the same units. Instead of "X," it would be "Normalized X." Transiting back has its issues, mainly that the distributions (or error distributions in multivariate methods) differ, and therefore the two measures sit in different contexts. It is better, in my opinion, to view two models with different variables as uniquely different.
Yes, it is a great explanation, but I have tried it on my own data and am unable to get normality. I've used log10 and sqrt; the results are still the same - small changes, but no change in normality. What should I do? Please advise. TQ
How do we get the values back from the transformed data? I.e., after I perform the transformation, I run a regression using the normalized values; after getting the results, I need to know how to recover the actual data from the transformed data. For a logarithmic transform we use the base to get the value back - how do we do it here?
Wow, nice technique! Thanks! I have one question. Can I use this on data that is already normal? I am using a paired-samples t-test, and I think this should be applied to both datasets for an equal comparison. And one more: is this technique suitable for observational data? It seems so perfect. Thanks!
Thanks a ton for this! This was of great help! I had a quick question, and I would be really grateful if you could help me out. I tried this on 7 of my variables. 4 of them got transformed, but 3 of them still haven't. Does this mean that they cannot be transformed, or is there another way to do this? The sample size is 50, and I am using the Q-Q plot and Shapiro-Wilk to test for normality. Thanks in advance; any help would be greatly appreciated. Thanks again!
Hello, I have a problem I need help with. Is removing outliers from a variable more than once considered manipulating or changing the data? I have loans to the public: the mean is .17093, st.dev. .955838, skewness 7.571, kurtosis 61.436. Most of the cases of this loan variable are outliers. After several rounds of ranking and replacing the missing values with the mean, I reached this output: mean .2970, st.dev. .22582, skewness 2.301, kurtosis 3.885, and it ends up positively skewed. I don't know what to do: shall I keep it this way, or take the very first version? Or do I have to continue, knowing that the 5th, 10th, 25th, 50th, and 75th percentiles all end up with the same number, .2072? And I still have to do the regression. Please help :(
Thank you so much, Professor Gary, it's an amazing video, but I want to ask a question to be sure before using this method in my paper. I fitted a gamma model and extracted the residuals, and I want to use these residuals, but they are not normally distributed. If I follow the steps in this video and change the values of the data, is it applicable to use them in my paper as the same residuals?
Hello, I have 4 non-normal variables in my dataset. Do I need to perform these steps individually for each of the 4 non-normal variables, or is there another method?
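The two steps are applied one variable at a time, but that is easy to automate. A sketch assuming pandas and the hypothetical two_step helper from earlier, which also covers the earlier question about fixing the 1.00 in a hundred columns at once (column names and data are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({f"var{i}": rng.exponential(2.0, 300) for i in range(4)})  # hypothetical skewed columns

normalized = df.apply(two_step)  # one fractional-rank + inverse-normal pass per column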
This two-step process has three major issues. One: by using the 'Function group,' 'Inverse DF,' and data from Notepad such as the 'series mean and SD,' you transformed the data. Can we use 0 and 1 as the series mean and SD, as you claim in your paper? Two: after transforming, you will reduce the sample size by 1. That means if you have five... Three: we have another major issue with this transformation. You can now do statistical tests on the transformed data, and here is the big problem: reporting the mean and standard deviation in the 'transformed unit' is not the purpose of almost any research. How do you back-transform your results after this 'two-step transformation' to explain them in terms of the original data? If you cannot back-transform the results, this is not acceptable in research. Please explain how to deal with all three issues, thank you.
The three items you list are in no way "major issues": 1) this simply standardizes the data; another option is to use the original series mean and standard deviation. 2) There is a simple fix - impute the missing value using 1-(1/n). 3) This is an issue with every transformation, including the natural log and probabilities; the Two-Step is the only approach that allows the researcher to use the original mean and standard deviation as arguments so the result emulates the original units.
First of all, thank you very much for this approach; it saves me a lot of time and effort. My question is: I have a dependent variable measuring "click intention," which can range from 0 to 100. After normalizing the data, however, I get 3 negative results and 2 above 100. Is it acceptable to keep it this way? Thank you very much!
How do I write an equation using the IDF.NORMAL function? For log I can use: Returns = a + b0 log(Beta) + b1 log(Leverage). How do I write the equation using this function?
@@chetanasanghavi1576 You can also look in articles that published using the procedure: scholar.google.com/scholar?oi=bibs&hl=en&cites=6281913823923514896
For anyone who doesn't know: you use the series mean and standard deviation that he uses in the video. IT WORKS AND HE SAVED MY LIFE!
Thank you so much!
You're welcome, Charlotte!
Hi Gary! Should I transform my variables like this for conducting a Principal Component Analysis in order to form an index? Thank you for your video!
Thank you so much... it's 2021 now and your video saved my life!
@@veronicawong9023 where are you from?
Gary, your tutorial just saved my day. I'd been struggling with different transformation techniques. Seeing yours just brightened my day. Thanks!!!
+Oscar Onam That's great, Oscar. Good luck.
Sometimes the KS and SW tests refute the normality hypothesis even though the skewness and kurtosis values are OK.
Hi, I would like to ask: how many times am I allowed to normalize the same data? Thanks in advance!
In my opinion, there are no rules as long as you report exactly what you have done. Let the reviewers or advisors help you.
Hi Mr Arturo,
Do you have any idea how we can transform the data back from this form when we want to report the results?
Thank you
Did you find out the answer to this question of yours? Please share it with me. He mentioned he copied them from Notepad - that is what I heard.
@@juliabachmann639 I will check his paper; I am interested in that method.
@@juliabachmann639 I think it's the mean of the variable that has been transformed. You can see these values in the histogram chart at 2:33.
@@doancongthanh93 ahh okay, I see. thank you- that's very helpful!!
Ellie King, where did he get the second and third values? (,?,?)
@@riderho1 My understanding (and what I used) is the second and third values are the mean and SD of the variable that you are transforming.
What a great solution! Thank you very much Gary for your help!
This video has great value. Thank you so much, Gary for saving my day
Wow, so easy. Loved your way of explaining - simple and to the point.
Hi, you didn't show where you got the series mean and the standard deviation that you copied and pasted from Notepad.
@@tsedesiree Same here. I hope you've already found the answer and will share it with us all. Looking forward to it; thank you in advance.
What mean and standard deviation are you using? It is not clear in the video.
I'm glad it helped, Smruti.
Awesome to hear that it worked for you, Smruti. Good luck on your research.
Loved your video!
Blessings and Love,
Dashama
Thanks Gary for this absolutely great video.
Much appreciated, Mr. Gary - it works perfectly well!
+mahirwe anthony
Great and good luck!
Thank you for your effort.
I would like to know how to achieve normality for several variables at once (not one by one).
Thanks once again.
Gary, thank you so much. This is awesome. All other methods failed for my work. I really appreciate this and will of course cite you :)
I'm glad it worked for you, Aisha. Good luck on your research.
@@gftempleton Thank you so much, but a lot of people are asking about the mean and SD mystery XD
@@alice-nckucsielee8265 I'm not sure I understand your question. Units are interpreted as "normalized x." I hope that helps.
Thank you Gary, your tutorial is very clear and helpful.
Thanks, Francisco!
Thanks Gary Templeton for this informative video. After doing the two steps, how can we interpret the output of the regression analysis?
Hi Gary. I followed all the steps but I got warning 4940: at least one of the arguments to the IDF.NORMAL function is out of range! Would you know why?
Hi Gary!
Thanks for this video.
Where did you get the value for the MEAN and STANDARD DEVIATION?
Two options: 1) the original variable mean and standard deviation or 2) 0 for mean, 1 for standard deviation (z-scores).
Monica Alas Hi Monica, would you mind sharing how you obtained the predictive model? And the criteria taken into consideration, like the correlation matrix, etc.?
I got that answer in a previous comment! Thanks!
Hi Mr. Templeton, thank you for your transformation-to-normality method. What should we call this transformation method?
I too would like to know how to correct this problem, as this results in missingness in the data that I would like to avoid.
@@l.briant3537 Thank you SO SO much for this! I was genuinely despairing about the method not working, and your solution worked perfectly!!!
Thank you so much for this video!!! You saved my life! Thank you! Thanks again!
Glad to help - good luck!
You just saved me. Thank you!
Hello
My question is, i have a serie whose distribution does not follow the normal distribution, I tried the logarithmic transformation on Eviews but the p-value of jarque-bera is always lower than 0.05
So what transformation to do in Eviews?
Eviews has each step. The first is a fractional rank (rank represented in proportions) and the second is a normal inverse function. It appears "Normal (Gaussian)" is showing you this here:
www.eviews.com/help/helpintro.html#page/content/mathapp-Statistical_Distribution_Functions.html
Wow, Thanks Gary. This is a great method. I used Log10 and square root transformations to normalized the distribution of my data. None of them worked, and my data was still negatively skewed. I used this two-step transformation and it worked great. At first, I have a hard time finding the corresponding mean of the series and SD. After looking at the paper referenced, I found out that you have two choices. You can either put 0 as a mean and 1 as a SD for the arguments of the function (e.g. IDF.NORMAL(a new ranked series,0,1) or put the mean and SD of the original series to maintain the unit of data.
When I put in the original mean and SD of the series, it worked just fine and the data looked normal. However, when I replaced the parameters with 0 and 1, SPSS returned no value. Not sure why.
I also noticed that my sample size didn't reduce by 1 after the transformation. My sample size is small (50). Are you supposed to lose one case after the transformation? Am I doing something wrong since I'm not seeing this result?
So you went from standardized normal (0 and 1 parameters) to normalized (original mean and SD) and back to standardized? Did you save the variables with the same names? That may be the problem. I've used the technique en masse and have never heard of that. It seems like you need to make sure you use unique variable names.
A low sample size would help retain all records. I don't know what the threshold is. Do you know what the fix is if it does become a problem? Replace the max value (1) in the results of Step 1 with 1-(1/n).
Given the problems with SPSS, I actually ended up using the Excel formula you provided in the paper, which I found much easier for data with many variables (33 in my case). It worked just fine, and all my variables except one are normally distributed now (that non-normal variable looked normal to me, but the Shapiro-Wilk test was significant after all). I was able to retain my whole sample (50) using the Excel formula. What do you think about that?
I also went ahead and ran outlier testing with a g value of 2.2. To my surprise, most variables have at least two outliers. That's surprising, as I replaced all outliers with the mean in the original data before running the transformation in Excel. A g factor of 2.2 is usually considered large enough to retain almost all data. It's very interesting to see such a pattern. But I still trust this data with outliers more than my original non-normally distributed data, unless I am missing something significant here.
I perform the Two-Step in Excel often - there is no difference as far as I can tell - the same two steps are readily available.
I would never replace an outlier with the mean! You're changing the data and may be suppressing or hiding results. If your data is sufficiently normal, don't worry about outliers. And, your data doesn't have to be perfectly normal. If it is terribly non-normal and you use non-parametrics, your data isn't normal anyway. For example, if you use Spearman's rank, the data is transformed to uniform (not normal).
Thanks for taking the time and providing the great pointers. The scale of my data is continuous. I don't remember where I saw it, but replacing outliers with the mean was described as a reliable method.
Unfortunately, my data is pretty skewed, with a skewness value of about .9. My sample size is also pretty small (45), so not losing even one case is important.
What I would do is put all the outliers back in their place and run everything again. That's a pain, but I am curious to see the difference.
Gary,
I ran the procedure several times in both SPSS and Excel using the same data set. Apparently, the outputs are inconsistent. I'm not sure what might cause the difference (a possible explanation is sketched after this comment). I also double-checked the formula as described in your paper.
Here is the Excel formula. To get the percent rank:
=IF(B4="","",IF(PERCENTRANK(B$2:B$50,B4)=1,0.9999,IF(PERCENTRANK(B$2:B$50,B4)=0,0.0001,PERCENTRANK(B$2:B$50,B4))))
To get the inverse of the cumulative normal distribution:
=IF(B115="","",NORMINV(B115,0,1))
Running the data set with outliers replaced by the mean versus running the original data produces some significant changes. So, replacing outliers with means doesn't look like a reliable method to apply.
Now I am thinking of Winsorizing my original data. Do you have any recommendation for that, so as not to lose a single outlier?
My data is both hugely negatively skewed and has outliers. That makes it hard to figure out the best approach. I am also thinking about robust statistics, given my data. Any thoughts on that?
Huge thanks.
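For what it's worth, one possible source of the SPSS/Excel inconsistency reported above (a conjecture on my part, not something confirmed in this thread) is that the two programs compute Step 1 differently: Excel's PERCENTRANK gives roughly (r-1)/(n-1) for unique values, while SPSS's fractional rank is r/n. A quick Python comparison:

import numpy as np
from scipy.stats import rankdata

x = np.array([3.0, 1.0, 4.0, 1.5, 9.0])  # toy data
r = rankdata(x)                           # average ranks, 1..n
n = len(x)
spss_style = r / n                        # SPSS-style fractional rank, in (0, 1]
excel_style = (r - 1) / (n - 1)           # PERCENTRANK.INC-style (unique values), in [0, 1]
print(spss_style)    # -> 0.6, 0.2, 0.8, 0.4, 1.0
print(excel_style)   # -> 0.5, 0.0, 0.75, 0.25, 1.0

Because the Step 1 proportions differ, the Step 2 inverse-normal values will differ slightly as well.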
Where do you get the series mean and the standard deviation from? Please, can anyone help!
Calculate the mean (average) and standard deviation from the original data. Use those in the second step if you want to approximate original units.
Thanks Gary. How do I interpret coefficients after transforming the dependent variable using the IDF.NORMAL function? For example, if the independent variable increases by one unit, how does that affect the dependent variable?
Thanks,
Assuming you transform using the series mean and standard deviation, interpret exactly the same as you would original units. I would note that you normalized the original units.
Alternatively, you can transform using mean=0 and sd=1 and interpret in standardized normal units.
Thanks, Gary - what you have done is very helpful. Are you aware of any critical peer reviews or papers out there regarding this method of yours?
Thanks for the answer.
This paper has been peer reviewed. It will be published in print in early August:
aaajournals.org/doi/abs/10.2308/isys-51510?code=aaan-site
Thank you! Very useful and clearly explained.
Thank you for sharing this video. Can I ask a question: how is the first step related to the second step?
THANK YOU, Mr. Gary
Do we transform only the dependent variable, or all the variables of our model?
There is no rule when you are trying to satisfy the assumptions of the test. Just be sure to report all procedures.
@@gftempleton Thank you
Dr Templeton - where do the mean and SD come from? Do you get them from the original (non-normal) data?
Yes - both the mean and SD come from original data.
This is an amazing method. I'm wondering if there's added value in winsorizing or otherwise capping variables before the transformation. Some of the clinical variables I have include a case or two with extreme outliers and are also non-normal. Using the means and standard deviations for these variables seems a little weird to me because the Ms and SDs before winsorizing don't seem within the range of values usually seen in my patient population. If I winsorize before transforming, the Ms and SDs seem a little more representative... Am I completely off here?
Thank you very much for this wonderful method. May I ask a question, please? I have data including 2 groups in 1 variable. When I use this method, should I split the file to separate the groups? When I compare the mean difference with a t-test, the result changes. For example, I have walking speed for 1,500 persons, including heart-disease patients (N=245) and non-heart-disease persons. When I transform to a normal distribution, can I do it in one pass (N=1,500), or should I split the file into heart-disease and non-heart-disease groups? Neither group's data is normally distributed. I look forward to hearing from you soon. Thank you very much.
I'm not 100% sure I understand your question. Assuming you will be standardizing the data along with the Two-Step (mean=0, sd=1)... if you want to test the difference between two groups, I would stack them all together and transform one time.
If you use the original series mean and standard deviation, you'd have to compute the values for both groups, then normalize each group separately.
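If it helps, here is a rough pandas sketch of both options, reusing the two_step function sketched earlier in the thread (the data and column names are hypothetical, made up for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'heart_disease': rng.integers(0, 2, 1500),   # hypothetical group flag
    'speed': rng.exponential(1.2, 1500),         # skewed walking speeds (made up)
})
# Option 1: stack both groups and transform once (standardized units)
df['speed_pooled'] = two_step(df['speed'], use_original_units=False)
# Option 2: transform each group separately, keeping each group's own mean/SD
df['speed_by_group'] = df.groupby('heart_disease')['speed'].transform(
    lambda s: two_step(s))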
@@gftempleton Thank you very much for your suggestion.
Very helpful, thank you!
Dear Gary,
Can you please tell me what the implications of using this technique on a Likert scale are? For instance, I have used a Likert scale on which 1 is strongly disagree and 7 is strongly agree. Does it invert the relation, or what? Thanks
Is this not for Likert scales? How do I transform Likert-scale data? Please help.
If we need to describe this procedure in the data analysis or results, what exactly should we mention, along with the reference?
If you are asking about units, just say the results are in normalized units. Of course, you would explain in the methods that you used the Two-Step. Models using original and transformed (e.g., natural log or Two-Step) variables are separate models (i.e., they have different error terms) and should be interpreted differently (this is not so obvious to some, but models are commonly treated distinctly).
Regarding the method, step 1 is simply a fractional rank and step 2 is the inverse normal function applied to the results of step 1.
Gary Templeton THANK YOU SO MUCH, indeed.
Thank you so much for your easy and helpful explanation! You really saved my life (and thesis, which are the same thing right now) :P
Thank you for posting this. Which method is best for normalizing data? And what if all methods (log, ln, sqrt, trunc) fail to normalize my data?
Very useful information, but I'm getting lost where you copy the mean and standard deviation, and as a result I'm stuck. Kindly help: where do I copy the mean and the standard deviation from? You only mentioned that you copied them from your notepad, but what about me - where do I copy them from? I'm stuck, someone help ASAP please.
It's from the original data
My question is the same as Shah's from 4 years ago: where did you get the series mean and standard deviation values you included?
From the original data... the mean and SD.
Thank you. And could you confirm that the series mean and std. dev. are those of the original variable (i.e., the market cap)?
Thank you for the video and your paper! I used the two-step method on my non-normal data, and it all turned into normal distributions! The only remaining concern is whether I am allowed to use this method for my data, which are based on 4- and 5-point Likert scales. I read in your article that this method is mostly for variables with more levels (up to 100)! I would appreciate it if you could tell me whether I can use this method for 4- and 5-point Likert scales or not! Thanks in advance!
Thanks Gary
It is just awesome. I have one inquiry: how can I transform back from step 2 to the original data? For example, after I did the two steps, I got a mean of -0.007. So how am I going to report that?!
Would using this method after the fact be considered linear-linear regression, log-log, or what? Also, would transforming the variables post-processing back to, say, a scale from one to ten be considered good practice in terms of easing interpretation?
Hi, thanks for the share. I tried the method, and it works to normalize the dataset. However, why is the sample size reduced after the procedure? For example, why did the sample size reduce from 6843 to 6842 above? Would that affect the conclusion?
Please give us a reasonable answer for why the sample size decreases after the two-step process. Why does a missing value appear? How can we address this problem in a research paper? Thank you in advance.
I saw this video before and it was helpful. I followed your steps, and I posted the results to you in the first comment.
I posted the result on your friend James Gaskin's channel and he recommended this video to me. Please help, I don't know what to do :(
Q no. 1:
I have 3 dependent variables. Two of them are within the normal range for skewness (+1 to -1) and kurtosis (+3 to -3), but the third is not in the normal range for either. I want to transform that variable with a square root transform to run parametric tests. So the question is: can I transform that one variable only and run the parametric tests, or should I transform all three variables before testing? Should I transform all three even though two of them are already normally distributed? Will it create problems to transform only the one non-normal variable?
Q no. 2:
Can I infer and interpret my data's normality on the basis of skewness and kurtosis only, rather than going for the Shapiro-Wilk test?
Thanks Gary.. Very much appreciated
Thank you so much for the method. I cannot normalize my data. Can I have the data set you used?
Thanks Gary,
This video is about transforming continuous variables toward normality. Can I use this technique for Likert-scale variables?
Regards
The technique will transform any variable toward normality, except for binary variables. That being said, there will be a variety of results depending on the situation. I have experienced positive, yet less beneficial, results when applying the Two-Step to Likert-scale data. While ORIGINAL data based on Likert items are often significantly non-normal, summing the items usually results in fairly good normality (see the toy sketch after this reply). Because non-normality isn't usually a terrible problem with Likert-based data, transforming toward normal doesn't help that much.
On the other hand, transforming highly continuous data, especially ratios, will often yield tremendous downstream benefits in scientific testing.
Good luck!
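To illustrate the summing point with a toy example (my own sketch; it assumes independent, uniformly distributed item responses, which real Likert data won't be):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
items = rng.integers(1, 8, size=(1000, 6))   # six hypothetical 7-point Likert items
single = items[:, 0]                         # one raw item: flat, far from normal
total = items.sum(axis=1)                    # summed scale: closer to bell-shaped
print(kurtosis(single), kurtosis(total))     # excess kurtosis moves toward 0 for the sum

The sum looks more normal simply because of the central limit theorem, which is consistent with the point above that summed Likert scales are usually not terribly non-normal to begin with.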
You are so nice and responsive, Mr. Gary :) So can you please help with how we can transform non-normal Likert-scale data to normal?
Great technique, indeed! Thanks a lot. Is that the mean and std deviation of the original variable?
That is an option, Maurad. The other basic option is mean=0, sd=1 to make the variable standard-normal.
Gary Templeton Thank you.
Why do we need to do 2 steps? Can't I just use the fractional rank? For example, my BMI variable was skewed and we wanted to do a GEE with that. Can't I just use the fractional rank as my new BMI?
Can these steps be used after taking ordinal questions and converting them to scale in SPSS?
Thank you for this helpful video.
I was wondering... what is the use of the bootstrap option in Amos? Is it better to perform a transformation, or is bootstrapping sufficient when performing a CFA?
+Miriam Roussel This would make the subject of a good research paper. I believe bootstrapping has its own weaknesses, as does any transformation. I prefer using normalized, real data.
Hi, I used the same method on all three of my non-normal variables and it turned out fine for only two. Are there any assumptions or characteristics of the data needed to successfully transform to normality? I tried the other ways of transforming (log10, etc.) too, but I still can't get a normal distribution for that particular variable. Do you have any recommendations that I can try for normality? TQ
Yishan - for the one variable that won't transform to normal, the problem may be the characteristics of the original distribution. If you don't have many levels (e.g., binary=2 levels) or if there is an "inflated frequency" (e.g., stacks of zeroes or other values), then it won't transform to normal regardless of what method you use. I would do the best you can, tell the reviewers what you did, and hope for the best.
Thanks for the clarification. That variable measures participants' depressive scores, and there are 120 cases with a zero value out of 1,000+ cases. I'm trying to retain all the data, as I think a zero score can be meaningful in the research. By using the two-step approach, I was able to reduce the skewness from 1.312 to 0.184 (std. error = .076), which gives a Z value within the range of +-2.58. Despite the 'non-normal' shape of the histogram and the significant p-value of the Kolmogorov-Smirnov test, can I still assume that the data has met the assumption of normality (based on the Z value) and proceed with the transformed variable?
Hi, thanks for this video. I do have a clarification question. I'm not sure what to use as the second and third arguments in the IDF.NORMAL function. In your 2011 paper, I read that, by default, one can use 0 and 1, respectively, for the desired mean and standard deviation, but in the video you mention a series' mean and standard deviation. Is it the mean and standard deviation of the (fractional-rank) transformed variable?
OK. Got it. I found the answer in the video at 2'34.
@@nicolasvanderlinden8569 So should we use the numbers he uses instead of our own mean and standard deviation? What about 0 & 1? I'm so confused...
Hi,
I just have one doubt: if we want to convert back to absolute values, how can we do that? For example, I have a regression model and I transformed the dependent variable; now I want to see what the absolute value of y will be.
It depends on the units you use. There are two basic uses of the Two-Step: 1) convert to standardized units (use mean=0 and sd=1 in the second step) or 2) convert to normalized original units (use original series mean and sd). So, the interpretation depends on usage. If you use the second step, there is no reason to convert back as you are in the original units (just normalized).
Gary Templeton thanks a lot
Just one more question to be sure: after applying this method for the transformation, the regression equation remains the same, i.e., y = b0 + b1*x1 + b2*x2, and it doesn't change the way it does when we do a log transformation?
Reverting back isn't necessary if you transform using the original series mean and standard deviation. You are already in the original units.
Also, remember that using the exponential function to revert back to original units from logged units is problematic when some original values are negative. In that case, the natural log would produce missing values. To avoid this, researchers will shift the values so none of them are negative, then do the natural log transformation. This means reverting back using the exponential is useless, unless the preconditioning is reversed appropriately. The natural log has many flaws and is inferior to the Two-Step in achieving normality and achieving significant results. See:
Templeton, G.F. and Burney, L. 2017. "Using a Two-Step Transformation to Address Non-Normality from a Business Value of Information Technology Perspective," Journal of Information Systems, Vol. 31, No. 2, pp. 149-164.
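A toy sketch of that pitfall (my own illustration, not from the papers): applying exp() to shifted-log values does not recover the original data unless the shift is undone as well.

import numpy as np

x = np.array([-5.0, -1.0, 0.0, 2.0, 10.0])   # toy data with negative values
shift = 1 - x.min()                          # precondition: shift so all values > 0
logged = np.log(x + shift)                   # the natural log is now defined everywhere
back = np.exp(logged) - shift                # exp() alone would be wrong; undo the shift too
assert np.allclose(back, x)                  # only then do we recover the original data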
I am really thankful for your response. Thanks a lot.
Thank you so much for this very informative video. I have one question. I used this method to transform my metabolome data so it would be normally distributed. Then I performed several analyses, including a linear regression model. How would I interpret the Beta from this analysis? Or is there a way to transform this Beta back to its previous "value". I hope to hear from you!
It is in "normalized" units. Just report your transformation and consider the model with transformed units to be separate from not transformed.
Regarding transforming back to the previous value, I'm not an advocate of that, since the model with transformed units satisfies test assumptions differently than the original data. The only reason to do the transformation is to satisfy test assumptions. So, I don't agree that you can move back and forth between the two models. They are different. Perhaps you can report both. I don't believe in doing it with log-transformed data either.
@@gftempleton Thank you very much for your quick response. Looking back at one of your previous answers, can I then assume that I still have the same units after using the transformation with the series mean and SD? So my estimate of, say, 1.55 is 1.55 mmol/l (for example). No need to transform it back?
Previous answer: "It depends on the units you use. There are two basic uses of the Two-Step: 1) convert to standardized units (use mean=0 and sd=1 in the second step) or 2) convert to normalized original units (use original series mean and sd). So, the interpretation depends on usage. If you use the second step, there is no reason to convert back as you are in the original units (just normalized)."
@@joellevergroesen1050 Not exactly the same units. Instead of "X," it would be "Normalized X." Moving back has its issues, mainly that the original and transformed variables (or the error distributions in multivariate methods) have different distributions and therefore sit in different contexts. It is better, in my opinion, to view two models with different variables as being uniquely different.
Yes, it is a great explanation, but I have tried it on my own data and am unable to get normality. I've used log10 and sqrt; the results are still the same - slight changes, but no change in normality. What should I do? Please advise. TQ
This is great!!
How do we get the values back from the transformed data? I.e., after I perform the transformation, I run the regression using the normalized values; after getting the result, I need to know how to get the actual data back from the transformed data. For a logarithmic transform we use the base to get the values back - how do we do it here?
Thank you soooooooo much! Particularly for the decent reference. I couldn't download the article, though - are there any other sites I can get it from?
Can you perform this method in Stata, please? Thanks
Where did you get the value for the second question mark (?) and the third question mark (?)?
The first ? is the mean; the second is the standard deviation.
Great one!!!
Kindly discuss, respected sir, how to choose the values for the 2nd ? and the 3rd ?.
Wow, nice technique! Thanks! I have one question. Can I use this on data that is already normal? I am using a paired-sample t-test, and I think the transformation should be applied to both variables for an equal comparison. And one more: is this technique suitable for observational data? Because it seems so perfect. Thanks!
Thanks a ton for this! This was of great help! I had a quick question, and I would be really grateful if you could help me out with it.
I tried this on 7 of my variables. 4 of them were transformed successfully, but 3 of them still aren't normal. Does this mean that they cannot be transformed, or is there another way to do this? The sample size is 50, and I am using the Q-Q plot and Shapiro-Wilk to test for normality.
Thanks in advance - any help would be greatly appreciated. Thanks again!
Hello, I have a problem I need help with. Is removing outliers from a variable more than once considered manipulating or changing the data? I have loans to the public with mean .17093, st. dev. .955838, skewness 7.571, and kurtosis 61.436; most of the cases of this loan variable are outliers. After several rounds of ranking and replacing the missing values with the mean, I reached this output: mean .2970, st. dev. .22582, skewness 2.301, kurtosis 3.885, and it ends up positively skewed.
I don't know what to do. Should I keep it this way, go back to the very first version, or do I have to continue, knowing that the 5th, 10th, 25th, 50th, and 75th percentiles all end up with the same number, .2072? And I still have to do the regression. Please help :(
Very helpful video!
Thank you so much, Professor Gary. It's an amazing video, but I want to ask a question to be sure before using this method in my paper. I fitted a gamma model and extracted the residuals, and I want to use those residuals, but they are not normally distributed. If I follow the steps in this video, which changes the values of the data, is it acceptable to use the transformed values in my paper as if they were the same residuals?
Why not? Use it as you would any other transformation.
Hello,
I have 4 non-normal variables in my dataset. Do I need to perform these steps individually for each of the 4 non-normal variables, or is there another method?
This two-step process has three major issues.
One: by using the 'Function group,' 'Inverse DF,' and data from Notepad such as the series mean and SD, you transformed the data. Can we use 0 and 1 as the series mean and SD, as you claim in your paper?
Two: after transforming, you reduce the sample size by 1. That means if you have, say, five hundred cases, you end up with 499.
Three: we have another major issue with this transformation. You can now do statistical tests on the transformed data, and here is the big problem: reporting the mean and standard deviation in the 'transformed units' is not the purpose of almost any research.
How do you back-transform your results after this 'two-step transformation' to explain them in terms of the original data? If you cannot back-transform the results, this is not acceptable in research. Please explain how to deal with all three of these issues. Thank you.
The three items you list are in no way "major issues": 1) using 0 and 1 simply standardizes the data; another option is to use the original series mean and standard deviation. 2) There is a simple fix - impute the missing value using 1-(1/n). 3) This is an issue with every transformation, including the natural log and probabilities; the Two-Step is the only approach that allows the researcher to use the original mean and standard deviation as arguments so that the result will emulate the original units.
First of all, thank you very much for this approach - it saves me a lot of time and effort.
My question is: I have a dependent variable measuring "click intention," which can range from 0 to 100. After normalizing the data, however, I get 3 negative values and 2 above 100. Is it acceptable to keep it this way?
Thank you very much!
How do I write an equation using the IDF normal function? For log I can use: Returns = a + b0 log(Beta) + b1 log(Leverage). How do I write the equation using this function?
I personally use "TS" as in...
TS(Beta)
TS(Leverage)
@@gftempleton ok thank you
@@chetanasanghavi1576 You can also look in articles that published using the procedure:
scholar.google.com/scholar?oi=bibs&hl=en&cites=6281913823923514896