Which normality test to use
Earlier versions of Prism offered only the Kolmogorov-Smirnov test. We still offer this test for consistency but no longer recommend it.
It computes a P value from a single value: the largest discrepancy between the cumulative distribution of the data and a cumulative Gaussian distribution. This is not a very sensitive way to assess normality, and we now agree with this statement [1]: "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." Note that both this test and the Anderson-Darling test compare the actual and ideal cumulative distributions.
The distinction is that Anderson-Darling considers the discrepancies at all parts of the curve, whereas Kolmogorov-Smirnov looks only at the largest discrepancy. The Kolmogorov-Smirnov method as originally published assumes that you know the mean and SD of the overall population (perhaps from prior work). When analyzing data, you rarely know the overall population mean and SD; you only know the mean and SD of your sample. To compute the P value, therefore, Prism uses the Dallal and Wilkinson approximation to Lilliefors' method [3].
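As a rough illustration of the difference (an R sketch of my own, not Prism's implementation; it assumes the add-on package nortest is installed for the Lilliefors and Anderson-Darling tests):

    # Compare normality tests on a clearly non-normal (exponential) sample.
    library(nortest)

    set.seed(1)
    x <- rexp(50)

    lillie.test(x)  # Kolmogorov-Smirnov with the Lilliefors correction for an estimated mean and SD
    ad.test(x)      # Anderson-Darling, which weighs discrepancies across the whole distribution

    # Naive KS test against a normal distribution whose mean and SD were estimated
    # from the same sample; the resulting P value is unreliable for exactly the
    # reason described above.
    ks.test(x, "pnorm", mean(x), sd(x))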
When testing for normality, we are mainly interested in the Tests of Normality table and the Normal Q-Q plots, which are the numerical and graphical methods, respectively, for assessing whether the data are normally distributed.
The Tests of Normality table reports the results from two well-known tests of normality, namely the Kolmogorov-Smirnov test and the Shapiro-Wilk test.
For this reason, we will use the Shapiro-Wilk test as our numerical means of assessing normality. We can see from this table that for the "Beginner", "Intermediate" and "Advanced" levels of the Course Group, the dependent variable, "Time", was normally distributed. How do we know this? If the Sig. value of the Shapiro-Wilk test is greater than 0.05, the data are normally distributed; if it is below 0.05, the data deviate significantly from a normal distribution. If you need to use skewness and kurtosis values to determine normality, rather than the Shapiro-Wilk test, you will find these in our enhanced testing for normality guide.
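The same checks can be reproduced outside SPSS. A small R sketch might look as follows (the data are simulated here, and the group names merely mirror the example above):

    set.seed(2)
    course <- rep(c("Beginner", "Intermediate", "Advanced"), each = 20)
    time   <- rnorm(60, mean = 25, sd = 4)          # hypothetical "Time" values

    # Shapiro-Wilk test within each course group; a Sig. (p) value above 0.05
    # gives no evidence against normality, below 0.05 indicates a departure.
    tapply(time, course, shapiro.test)

    # Normal Q-Q plot for one group as the graphical counterpart
    qqnorm(time[course == "Beginner"]); qqline(time[course == "Beginner"])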
The two-stage procedure might be considered incorrect from a formal perspective; nevertheless, in the investigated examples, this procedure seemed to satisfactorily maintain the nominal significance level and had acceptable power properties. Statistical tests have become more and more important in medical research [ 1 — 3 ], but many publications have been reported to contain serious statistical errors [ 4 — 10 ].
In this regard, violation of distributional assumptions has been identified as one of the most common problems: according to Olsen [ 9 ], a frequent error is to use statistical tests that assume a normal distribution on data that are actually skewed. With small samples, Neville et al. observed similar problems, as did Strasak et al. Probably one of the most popular research questions is whether two independent samples differ from each other.
This question is commonly addressed with the two-sample Student t test. The test assumes independent sampling from normal distributions with equal variance. If the assumptions are violated, the test statistic T is compared with the wrong reference distribution, which may result in a deviation of the actual Type I error from the nominal significance level [ 12 , 13 ], in a loss of power relative to other tests developed for similar problems [ 14 ], or both.
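For reference, the statistic T referred to here is the usual pooled two-sample t statistic (standard textbook form, not quoted from this article):

    T = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}},
    \qquad
    s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},

and, when the assumptions hold, T is compared with a t distribution with n_1 + n_2 - 2 degrees of freedom.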
In medical research, normally distributed data are the exception rather than the rule [ 15 , 16 ]. In such situations, the use of parametric methods is discouraged, and nonparametric tests (also referred to as distribution-free tests), such as the two-sample Mann-Whitney U test, are recommended instead [ 11 , 17 ].
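In R, for example, the two competing tests are applied as follows (a hypothetical sketch with simulated, skewed data; the group names and settings are mine):

    set.seed(3)
    grp1 <- rexp(30)          # skewed data, as often seen in medical research
    grp2 <- rexp(30)

    t.test(grp1, grp2, var.equal = TRUE)   # parametric: assumes normality and equal variances
    wilcox.test(grp1, grp2)                # nonparametric: two-sample Mann-Whitney U test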
Guidelines for contributions to medical journals emphasize the importance of distributional assumptions [ 18 , 19 ]. Sometimes, special recommendations are provided. When addressing the question of how to compare changes from baseline in randomized clinical trials if data do not follow a normal distribution, Vickers, for example, concluded that such data are best analyzed with analysis of covariance [ 20 ]. In clinical trials, a detailed description of the statistical analysis is mandatory [ 21 ].
This description requires good knowledge about the clinical endpoints, which is often limited. Researchers, therefore, tend to specify alternative statistical procedures in case the underlying assumptions are not satisfied (e.g., a nonparametric test in place of the planned parametric one).
For the t test, Livingston [ 23 ] presented a list of conditions that must be considered (e.g., normality of the data and equality of variances). Consequently, some researchers routinely check whether their data fulfill the assumptions and change the analysis method if they do not (for a review, see [ 24 ]).
In a preliminary test, a specific assumption is checked; the outcome of the pretest then determines which method should be used for assessing the main hypothesis [ 25 — 28 ].
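For the two-sample comparison discussed below, such a two-stage procedure could be sketched in R as follows (an illustration of the idea under my own assumptions, including the function name and the 5% pretest level; it is not code from any of the cited studies):

    # Preliminary Shapiro-Wilk tests on both samples decide between the
    # two-sample t test and the Mann-Whitney U test.
    two_stage_test <- function(x, y, alpha_pre = 0.05) {
      normal_x <- shapiro.test(x)$p.value > alpha_pre
      normal_y <- shapiro.test(y)$p.value > alpha_pre
      if (normal_x && normal_y) {
        t.test(x, y, var.equal = TRUE)   # both pretests passed: parametric main test
      } else {
        wilcox.test(x, y)                # normality rejected in either sample: nonparametric main test
      }
    }

    set.seed(4)
    two_stage_test(rnorm(25, mean = 10), rnorm(25, mean = 12))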
For the paired t test, Freidlin et al. suggested a comparable two-stage approach. Such a two-stage procedure (Additional file 1) appears logical, and goodness-of-fit tests for normality are frequently reported in articles [ 32 — 35 ]. Some authors, however, have recently warned against preliminary testing [ 24 , 36 — 45 ]. First of all, theoretical drawbacks exist with regard to the preliminary testing of assumptions.
The basic difficulty of a typical pretest is that the desired result is often the acceptance of the null hypothesis. In practice, the conclusion about the validity of, for example, the normality assumption is then implicit rather than explicit: Because insufficient evidence exists to reject normality, normality will be considered true.
Further critiques of preliminary testing focused on the fact that assumptions refer to characteristics of populations and not to characteristics of samples. In particular, small to moderate sample sizes do not guarantee matching of the sample distribution with the population distribution; Altman [ 11 ], for example, illustrates this point graphically.
Second, some preliminary tests are accompanied by their own underlying assumptions, raising the question of whether these assumptions also need to be examined. In addition, even if the preliminary test indicates that the tested assumption does not hold, the actual test of interest may still be robust to violations of this assumption. Finally, preliminary tests are usually applied to the same data as the subsequent test, which may result in uncontrolled error rates. For the one-sample t test, Schucany and Ng [ 41 ] conducted a simulation study of the consequences of the two-stage selection procedure including a preliminary test for normality.
Data were sampled from normal, uniform, exponential, and Cauchy populations. For exponentially distributed data, the conditional Type I error rate of the main test turned out to be strikingly above the nominal significance level and even increased with sample size. In a related study, Zimmerman concluded that choosing the pooled or the separate-variance version of the t test solely on the basis of an inspection of the sample data neither maintains the significance level nor protects the power of the procedure.
Rasch et al. reached similar conclusions. Interestingly, none of the studies cited above explicitly addressed the unconditional error rates of the two-stage procedure as a whole. The studies rather focused on the conditional error rates, that is, the Type I and Type II errors of the single arms of the two-stage procedure.
As in Schucany and Ng [ 41 ], the tests to be applied were chosen depending on the results of the preliminary Shapiro-Wilk tests for normality of the two samples involved. We thereby obtained an estimate of the conditional Type I error rates for samples that were classified as normal although the underlying populations were in fact non-normal, and vice versa. This probability reflects the error rate researchers may face with respect to the main hypothesis if they mistakenly believe the normality assumption to be satisfied or violated.
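A conditional error rate of this kind can be approximated by simulation. The following R sketch uses settings of my own (n = 15 per group, a 5% level everywhere, exponential data) and is not the authors' code:

    # Among exponential samples that pass both Shapiro-Wilk pretests (i.e., are
    # wrongly classified as normal), how often does the pooled t test reject a true null?
    set.seed(5)
    n <- 15; alpha <- 0.05; n_sim <- 20000
    reject <- 0; passed <- 0
    for (i in seq_len(n_sim)) {
      x <- rexp(n); y <- rexp(n)                     # identical populations: H0 is true
      if (shapiro.test(x)$p.value > alpha &&
          shapiro.test(y)$p.value > alpha) {
        passed <- passed + 1
        reject <- reject + (t.test(x, y, var.equal = TRUE)$p.value < alpha)
      }
    }
    reject / passed   # conditional Type I error rate of the t test, given a passed pretest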
If, in addition, the power of the preliminary Shapiro-Wilk test is taken into account, the potential impact of the entire two-stage procedure on the overall Type I error rate and power can be directly estimated.
In our simulation study, equally sized samples for two groups were drawn from three different distributions, covering a variety of shapes of data encountered in clinical research. Two selection strategies were examined for the main test to be applied. The difference between the two strategies is that, in Strategy I, the Shapiro-Wilk test for normality is conducted separately on the raw data from each sample, whereas in Strategy II, the preliminary test is applied only once, i.e., to the pooled residuals from both samples.
All simulations were carried out with the statistical language R. The conditional Type I error rates (left arm of the decision tree in Additional file 1) were then estimated as the number of significant t tests divided by the number of simulated sample pairs falling into that arm. Finally, pairs of samples were generated from exponential, uniform, and normal distributions to assess the unconditional Type I error of the entire two-stage procedure.
The Type I error rate of the entire two-stage procedure was estimated as the number of significant tests (t or U) divided by the total number of simulated pairs. Strategy I, in which the raw data of each sample are screened separately, was motivated by the well-known assumption that the two-sample t test requires the data within each of the two groups to be sampled from normally distributed populations. Table 1 (left) summarizes the estimated conditional Type I error probabilities of the standard two-sample t test, i.e., conditional on both samples having passed the preliminary Shapiro-Wilk tests.
Figure 1 additionally plots the corresponding estimates for underlying distributions that were either (A) exponential, (B) uniform, or (C) normal. As can be seen from Table 1 and Figure 1, the unconditional two-sample t test (i.e., the t test applied irrespective of any pretest) stayed close to the nominal significance level. In contrast, the observed conditional Type I error rates differed from the nominal significance level. If the underlying distribution was uniform, the conditional Type I error rates declined below the nominal level, particularly as samples became larger and preliminary significance levels increased (Figure 1 B).
For normally distributed populations, conditional and unconditional Type I error rates roughly followed the nominal significance level (Figure 1 C). (Figure caption: samples of equal size from the (A) exponential, (B) uniform, and (C) normal distribution.) For the nonparametric test, the estimated conditional Type I error probabilities are summarized in Table 1 (right): for exponential samples, only a negligible tendency towards conservative decisions was observed, but samples from the uniform distribution and, to a lesser extent, samples from the normal distribution proved problematic.
In contrast to the pattern observed for the conditional t test, however, the nominal significance level was mostly violated for small samples and for numerically low significance levels of the pretest.
The two-sample t test is a special case of a linear model that assumes independent normally distributed errors. Therefore, the normality assumption can be examined through residuals instead of raw data. In linear models, residuals are defined as differences between observed and expected values.
In the two-sample comparison, the expected value for a measurement corresponds to the mean of the sample from which it derived, so that the residual simplifies to the difference between the observed value and the sample mean. In regression modeling, the assumption of normality is often checked by the plotting of residuals after parameter estimation.
However, this order may be reversed, and formal tests of normality based on residuals may be carried out. In Strategy II, one single Shapiro-Wilk test was applied to the collapsed set of residuals from both samples; thus, in contrast to Strategy I, only one pretest for normality had to be passed. In the simulations, if the underlying distribution was normal, the preliminary Shapiro-Wilk test for normality of the residuals did not affect the Type I error probability of the subsequent two-sample t test.
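Strategy II can be illustrated with a short R sketch (my own reading of the procedure with simulated data and an assumed 5% pretest level, not the authors' code): the residuals of both samples are pooled and screened with a single Shapiro-Wilk test.

    set.seed(7)
    x <- rnorm(20, mean = 5); y <- rnorm(20, mean = 6)

    # Residual = observed value minus the mean of its own sample
    res_pooled <- c(x - mean(x), y - mean(y))

    if (shapiro.test(res_pooled)$p.value > 0.05) {
      t.test(x, y, var.equal = TRUE)   # residuals compatible with normality: t test
    } else {
      wilcox.test(x, y)                # otherwise: Mann-Whitney U test
    }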
For the two other distributions, the results were strikingly different. For samples from the exponential distribution, conditional Type I error rates were much larger than the nominal significance level (Figure 2 A). The conditional Type I error rate increased again with growing sample size and increasing preliminary significance level of the Shapiro-Wilk test. Biased decisions within the two arms of the decision tree in Additional file 1 are mainly a matter of theoretical interest, whereas the unconditional Type I error and power of the two-stage procedure reflect how the algorithm works in practice.
Therefore, we directly assessed the practical consequences of the entire two-stage procedure with respect to the overall (unconditional) Type I error. This evaluation was additionally motivated by the anticipation that, although the observed conditional Type I error rates of both the main parametric test and the nonparametric test were seriously altered by screening for normality, these constellations will rarely occur in practice because the Shapiro-Wilk test is very powerful in large samples.
Again, pairs of samples were generated from exponential, uniform, and normal distributions. Table 3 outlines the estimated unconditional Type I error rates. In line with this expectation, the results show that the two-stage procedure as a whole can be considered robust with respect to the unconditional Type I error rate. This holds true for all three distributions considered, irrespective of the strategy chosen for the preliminary test.
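As an illustration of how such an unconditional rate can be estimated, a self-contained R sketch for Strategy I (with assumed settings of my own, n = 15 per group and exponential data, not the paper's exact configuration) is:

    set.seed(6)
    n <- 15; alpha <- 0.05; n_sim <- 10000
    reject <- replicate(n_sim, {
      x <- rexp(n); y <- rexp(n)                      # identical populations: H0 true
      p <- if (shapiro.test(x)$p.value > alpha &&
               shapiro.test(y)$p.value > alpha) {
        t.test(x, y, var.equal = TRUE)$p.value        # pretests passed: t test
      } else {
        wilcox.test(x, y)$p.value                     # otherwise: Mann-Whitney U test
      }
      p < alpha
    })
    mean(reject)   # unconditional Type I error of the two-stage procedure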
Because the two-stage procedure seemed to keep the nominal significance level, we additionally investigated the corresponding statistical power. To this end, pairs of samples were drawn from unit-variance normal distributions whose means differed by a fixed shift. Similar results were observed for shifted uniform distributions and for exponential distributions with different rate parameters: in both cases, the overall power of the two-stage procedure seemed to lie in between the power estimated for the unconditional t test and that estimated for the U test.
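The corresponding power comparison can be sketched in the same framework (illustrative settings of my own, e.g. a mean shift of 0.5 with n = 20 per group, not the values used in the paper):

    set.seed(8)
    n <- 20; alpha <- 0.05; n_sim <- 5000; shift <- 0.5
    res <- replicate(n_sim, {
      x <- rnorm(n, mean = 0); y <- rnorm(n, mean = shift)     # unit-variance normals
      p_t <- t.test(x, y, var.equal = TRUE)$p.value             # unconditional t test
      p_u <- wilcox.test(x, y)$p.value                          # unconditional U test
      p_2 <- if (shapiro.test(x)$p.value > alpha &&
                 shapiro.test(y)$p.value > alpha) p_t else p_u  # two-stage procedure
      c(t = p_t < alpha, u = p_u < alpha, two_stage = p_2 < alpha)
    })
    rowMeans(res)   # estimated power of each approach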
The appropriateness of a statistical test that depends on underlying distributional assumptions is generally not a problem if the population distribution is known in advance. If the assumption of normality is known to be wrong, a nonparametric test may be used that does not require normally distributed data. Difficulties arise if the population distribution is unknown, which, unfortunately, is the most common scenario in medical research. Many statistical textbooks and articles state that assumptions should be checked before conducting statistical tests, and that tests should be chosen depending on whether the assumptions are met.
Various options for testing assumptions are easily available and sometimes even automatically generated within the standard output of statistical software (e.g., the Tests of Normality table produced by SPSS, as described above).