Tests of Significance (2024)

Once sample data has been gathered through an observational study or experiment, statistical inference allows analysts to assess evidence in favor of some claim about the population from which the sample has been drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance.

Every test of significance begins with a null hypothesis H0. H0 represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write H0: there is no difference between the two drugs on average.

The alternative hypothesis, Ha, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write Ha: the two drugs have different effects, on average. The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In this case we would write Ha: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject Ha", or even "accept Ha".

If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H0 in favor of Ha. Rejecting the null hypothesis, on the other hand, suggests that the alternative hypothesis may be true.

(Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

Hypotheses are always stated in terms of a population parameter, such as the mean μ. An alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a parameter is simply not equal to the value given by the null hypothesis -- the direction does not matter.

Hypotheses for a one-sided test for a population mean take the following form:
H0: μ = k
Ha: μ > k
or
H0: μ = k
Ha: μ < k.

Hypotheses for a two-sided test for a population mean take the following form:
H0: μ = k
Ha: μ ≠ k.

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

Example

Suppose a test has been given to all high school students in a certain state. The mean test score for the entire state is 70, with standard deviation equal to 10. Members of the school board suspect that female students have a higher mean score on the test than male students, because the mean score x̄ from a random sample of 64 female students is equal to 73. Does this provide strong evidence that the overall mean for female students is higher?

The null hypothesis H0 claims that there is no difference between the mean score for female students and the mean for the entire population, so that μ = 70. The alternative hypothesis claims that the mean for female students is higher than the entire student population mean, so that μ > 70.

Significance Tests for Unknown Mean and Known Standard Deviation

Once null and alternative hypotheses have been formulated for a particular claim, the next step is to compute a test statistic. For claims about a population mean from a population with a normal distribution, or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), if the standard deviation σ is known, the appropriate significance test is known as the z-test, where the test statistic is defined as z = (x̄ - μ0)/(σ/sqrt(n)).

The test statistic follows the standard normal distribution (with mean = 0 and standard deviation = 1). The test statistic z is used to compute the P-value for the standard normal distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis. Given the null hypothesis that the population mean μ is equal to a given value μ0, the P-values for testing H0 against each of the possible alternative hypotheses are:
P(Z > z) for Ha: μ > μ0
P(Z < z) for Ha: μ < μ0
2P(Z > |z|) for Ha: μ ≠ μ0.

The probability is doubled for the two-sided test, since the two-sided alternative hypothesis considers the possibility of observing extreme values on either tail of the normal distribution.
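These three P-value rules can be sketched as a small Python helper (a hypothetical function, not part of any library, using only the standard library's normal distribution):

```python
from statistics import NormalDist

def z_test_p_value(z: float, alternative: str) -> float:
    """P-value for a z statistic under H0: mu = mu0.

    alternative: "greater" (Ha: mu > mu0), "less" (Ha: mu < mu0),
    or "two-sided" (Ha: mu != mu0).
    """
    phi = NormalDist().cdf  # standard normal CDF
    if alternative == "greater":
        return 1 - phi(z)             # P(Z > z)
    if alternative == "less":
        return phi(z)                 # P(Z < z)
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(z)))  # 2P(Z > |z|)
    raise ValueError(f"unknown alternative: {alternative}")
```

For instance, `z_test_p_value(2.4, "greater")` returns about 0.0082, the table value used for the test score example in this section.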

Example

In the test score example above, where the sample mean equals 73 and the population standard deviation is equal to 10, the test statistic is computed as follows:
z = (73 - 70)/(10/sqrt(64)) = 3/1.25 = 2.4. Since this is a one-sided test, the P-value is equal to the probability of observing a value greater than 2.4 in the standard normal distribution, or P(Z > 2.4) = 1 - P(Z < 2.4) = 1 - 0.9918 = 0.0082. The P-value is less than 0.01, indicating that it is highly unlikely that these results would be observed under the null hypothesis. The school board can confidently reject H0 given this result, although they cannot conclude any additional information about the mean of the distribution.

Significance Levels

The significance level α for a given hypothesis test is a value for which a P-value less than or equal to α is considered statistically significant. Typical values for α are 0.1, 0.05, and 0.01. These values correspond to the probability of observing such an extreme value by chance. In the test score example above, the P-value is 0.0082, so the probability of observing such a value by chance is less than 0.01, and the result is significant at the 0.01 level.

In a one-sided test, α corresponds to the critical value z* such that P(Z > z*) = α. For example, if the desired significance level for a result is 0.05, the corresponding value for z must be greater than or equal to z* = 1.645 (or less than or equal to -1.645 for a one-sided alternative claiming that the mean is less than the null hypothesis). For a two-sided test, we are interested in the probability that 2P(Z > z*) = α, so the critical value z* corresponds to the α/2 significance level. To achieve a significance level of 0.05 for a two-sided test, the absolute value of the test statistic (|z|) must be greater than or equal to the critical value 1.96 (which corresponds to the level 0.025 for a one-sided test).
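The critical values quoted here can be recovered from the inverse normal CDF; a quick sketch using Python's standard library:

```python
from statistics import NormalDist

alpha = 0.05
z_one_sided = NormalDist().inv_cdf(1 - alpha)      # z* with P(Z > z*) = alpha
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # z* with P(Z > z*) = alpha/2
print(round(z_one_sided, 3), round(z_two_sided, 2))  # 1.645 and 1.96
```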

Another interpretation of the significance level α, based in decision theory, is that α corresponds to the value for which one chooses to reject or accept the null hypothesis H0. In the above example, the value 0.0082 would result in rejection of the null hypothesis at the 0.01 level. Rejecting the null hypothesis when it is in fact true is known in decision theory as a Type I error. The probability of a Type I error is equal to the significance level α, and the probability of not rejecting the null hypothesis when it is in fact true (a correct decision) is equal to 1 - α. To minimize the probability of Type I error, the significance level is generally chosen to be small.

Example

Of all of the individuals who develop a certain rash, suppose the mean recovery time for individuals who do not use any form of treatment is 30 days with standard deviation equal to 8. A pharmaceutical company manufacturing a certain cream wishes to determine whether the cream shortens, extends, or has no effect on the recovery time. The company chooses a random sample of 100 individuals who have used the cream, and determines that the mean recovery time for these individuals was 28.5 days. Does the cream have any effect?

Since the pharmaceutical company is interested in any difference from the mean recovery time for all individuals, the alternative hypothesis Ha is two-sided: μ ≠ 30. The test statistic is calculated to be z = (28.5 - 30)/(8/sqrt(100)) = -1.5/0.8 = -1.875. The P-value for this statistic is 2P(Z > 1.875) = 2(1 - P(Z < 1.875)) = 2(1 - 0.9693) = 2(0.0307) = 0.0614. This is not significant at the 0.05 level, although it is significant at the 0.1 level.
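As a rough check, the same two-sided calculation can be done in a few lines of Python (standard library only). Note that using the exact normal CDF rather than a printed table shifts the P-value slightly in the third decimal place:

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n = 30, 28.5, 8, 100
z = (xbar - mu0) / (sigma / sqrt(n))    # -1.5/0.8 = -1.875
p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P-value, about 0.061
print(round(z, 3), round(p, 3))
```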

Decision theory is also concerned with a second error possible in significance testing, known as Type II error. In contrast to a Type I error, a Type II error is the error made when the null hypothesis is incorrectly accepted. The probability of correctly rejecting the null hypothesis when it is false, the complement of the Type II error, is known as the power of a test. Formally defined, the power of a test is the probability that a fixed level α significance test will reject the null hypothesis H0 when a particular alternative value of the parameter is true.

Example

In the test score example, for a fixed significance level of 0.10, suppose the school board wishes to be able to reject the null hypothesis (that the mean = 70) if the mean for female students is in fact 72. To determine the power of the test against this alternative, first note that the critical value for rejecting the null hypothesis is z* = 1.282. The calculated value for z will be greater than 1.282 whenever (x̄ - 70)/(1.25) > 1.282, or x̄ > 71.6. The probability of rejecting the null hypothesis (mean = 70) given that the alternative hypothesis (mean = 72) is true is calculated by:
P(x̄ > 71.6 | μ = 72)
= P((x̄ - 72)/(1.25) > (71.6 - 72)/1.25)
= P(Z > -0.32) = 1 - P(Z < -0.32) = 1 - 0.3745 = 0.6255.
The power is about 0.60, indicating that although the test is more likely than not to reject the null hypothesis for this value, the probability of a Type II error is high.
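The power calculation above can be sketched directly in Python (standard library only); the small differences in the last digits come from using the unrounded critical value:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
mu0, mu_alt, sigma, n, alpha = 70, 72, 10, 64, 0.10
se = sigma / sqrt(n)                           # 1.25
z_star = nd.inv_cdf(1 - alpha)                 # critical value, about 1.282
xbar_crit = mu0 + z_star * se                  # reject H0 when xbar exceeds ~71.6
power = 1 - nd.cdf((xbar_crit - mu_alt) / se)  # P(xbar > crit | mu = 72), about 0.62
print(round(xbar_crit, 1), round(power, 2))
```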

Significance Tests for Unknown Mean and Unknown Standard Deviation

In most practical research, the standard deviation for the population of interest is not known. In this case, the standard deviation σ is replaced by the estimated standard deviation s; the resulting estimate of the standard deviation of the sample mean, s/sqrt(n), is known as the standard error. Since the standard error is an estimate for the true value of the standard deviation, the distribution of the sample mean x̄ is no longer normal with mean μ and standard deviation σ/sqrt(n). Instead, the standardized sample mean (x̄ - μ)/(s/sqrt(n)) follows the t distribution. The t distribution is also described by its degrees of freedom. For a sample of size n, the t distribution will have n-1 degrees of freedom. The notation for a t distribution with k degrees of freedom is t(k). As the sample size n increases, the t distribution becomes closer to the normal distribution, since the estimate s approaches the true standard deviation σ for large n.

For claims about a population mean from a population with a normal distribution, or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), with unknown standard deviation, the appropriate significance test is known as the t-test, where the test statistic is defined as t = (x̄ - μ0)/(s/sqrt(n)).

The test statistic follows the t distribution with n-1 degrees of freedom. The test statistic t is used to compute the P-value for the t distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis.

Example

The dataset "Normal Body Temperature, Gender, and Heart Rate" contains 130 observations of body temperature, along with the gender of each individual and his or her heart rate. Using the MINITAB "DESCRIBE" command provides the following information:

Descriptive Statistics

Variable      N      Mean    Median   Tr Mean    StDev   SE Mean
TEMP        130    98.249    98.300    98.253    0.733     0.064

Variable     Min       Max        Q1        Q3
TEMP      96.300   100.800    97.800    98.700
Since the normal body temperature is generally assumed to be 98.6 degrees Fahrenheit, one can use the data to test the following one-sided hypothesis:


H0: μ = 98.6 vs
Ha: μ < 98.6.

The t test statistic is equal to (98.249 - 98.6)/0.064 = -0.351/0.064 = -5.48, and P(t < -5.48) = P(t > 5.48). The t distribution with 129 degrees of freedom may be approximated by the t distribution with 100 degrees of freedom (found in Table E in Moore and McCabe), where P(t > 5.48) is less than 0.0005. This result is significant at the 0.01 level and beyond, indicating that the null hypothesis can be rejected with confidence.
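Assuming SciPy is available, the same one-sided t-test can be reproduced from the summary statistics alone; the last digit of the t statistic differs slightly from the hand calculation because the standard error is not rounded here:

```python
from math import sqrt
from scipy.stats import t as t_dist

n, xbar, s, mu0 = 130, 98.249, 0.733, 98.6
se = s / sqrt(n)                   # standard error of the mean, about 0.064
t_stat = (xbar - mu0) / se         # about -5.46
p = t_dist.cdf(t_stat, df=n - 1)   # one-sided P(t < t_stat) under t(129)
print(round(t_stat, 2), p)
```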

To perform this t-test in MINITAB, the "TTEST" command with the "ALTERNATIVE" subcommand may be applied as follows:

MTB > ttest mu = 98.6 c1;
SUBC > alt = -1.

T-Test of the Mean

Test of mu = 98.6000 vs mu < 98.6000

Variable      N       Mean     StDev   SE Mean        T        P
TEMP        130    98.2492    0.7332    0.0643    -5.45   0.0000
These results represent the exact calculations for the t(129) distribution.

Data source: Data presented in Mackowiak, P.A., Wasserman, S.S., and Levine, M.M. (1992), "A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich," Journal of the American Medical Association, 268, 1578-1580. Dataset available through the JSE Dataset Archive.

Matched Pairs

In many experiments, one wishes to compare measurements from two populations. This is common in medical studies involving control groups, for example, as well as in studies requiring before-and-after measurements. Such studies have a matched pairs design, where the difference between the two measurements in each pair is the parameter of interest.

Analysis of data from a matched pairs experiment compares the two measurements by subtracting one from the other and basing test hypotheses upon the differences. Usually, the null hypothesis H0 assumes that the mean of these differences is equal to 0, while the alternative hypothesis Ha claims that the mean of the differences is not equal to zero (the alternative hypothesis may be one- or two-sided, depending on the experiment). Using the differences between the paired measurements as single observations, the standard t procedures with n-1 degrees of freedom are followed as above.

Example

In the "Helium Football" experiment, a punter was given two footballs to kick, one filled with air and the other filled with helium. The punter was unaware of the difference between the balls, and was asked to kick each ball 39 times. The balls were alternated for each kick, so each of the 39 trials contains one measurement for the air-filled ball and one measurement for the helium-filled ball. Given that the conditions (leg fatigue, etc.) were basically the same for each kick within a trial, a matched pairs analysis of the trials is appropriate. Is there evidence that the helium-filled ball improved the kicker's performance?

In MINITAB, subtracting the air-filled measurement from the helium-filled measurement for each trial and applying the "DESCRIBE" command to the resulting differences gives the following results:

Descriptive Statistics

Variable       N    Mean   Median   Tr Mean   StDev   SE Mean
Hel. - Air    39    0.46     1.00      0.40    6.87      1.10

Variable       Min     Max      Q1     Q3
Hel. - Air  -14.00   17.00   -2.00   4.00
Using MINITAB to perform a t-test of the null hypothesis H0: μ = 0 vs Ha: μ > 0 gives the following analysis:
T-Test of the Mean

Test of mu = 0.00 vs mu > 0.00

Variable      N    Mean   StDev   SE Mean      T      P
Hel. - A     39    0.46    6.87      1.10   0.42   0.34
The P-value of 0.34 indicates that this result is not significant at any acceptable level. A 95% confidence interval for the mean difference, based on the t distribution with 38 degrees of freedom, is (-1.76, 2.69), computed using the MINITAB "TINTERVAL" command.
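Assuming SciPy is available, the t statistic, one-sided P-value, and 95% confidence interval can be approximated from the summary statistics alone (the last digits differ slightly from MINITAB's output, which works from the unrounded data):

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics for the helium-minus-air differences
n, dbar, s = 39, 0.46, 6.87
se = s / sqrt(n)                           # about 1.10
t_stat = dbar / se                         # H0: mu = 0, so t = dbar/se, about 0.42
p_one_sided = t_dist.sf(t_stat, df=n - 1)  # P(t > t_stat) for Ha: mu > 0, about 0.34
tcrit = t_dist.ppf(0.975, df=n - 1)        # multiplier for a 95% CI
ci = (dbar - tcrit * se, dbar + tcrit * se)
print(round(t_stat, 2), round(p_one_sided, 2), [round(x, 2) for x in ci])
```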

Data source: Lafferty, M.B. (1993), "OSU scientists get a kick out of sports controversy,"The Columbus Dispatch (November 21, 1993), B7. Dataset available through the Statlib Data and Story Library (DASL).

The Sign Test

Another method of analysis for matched pairs data is a distribution-free test known as the sign test. This test does not require any normality assumptions about the data, and simply involves counting the number of positive differences between the matched pairs and relating these to a binomial distribution. The reasoning behind the sign test is that if there is no true difference, then the probability of observing an increase in each pair is equal to the probability of observing a decrease in each pair: p = 1/2. Assuming each pair is independent, the number of positive differences under the null hypothesis follows the distribution B(n, 1/2), where n is the number of pairs where some difference is observed.

To perform a sign test on matched pairs data, take the difference between the two measurements in each pair and count the number of non-zero differences n. Of these, count the number of positive differences X. Determine the probability of observing at least X positive differences for a B(n, 1/2) distribution, and use this probability as a P-value for the null hypothesis.

Example

In the "Helium Football" example above, 2 of the 39 trials recorded no difference between kicks for the air-filled and helium-filled balls. Of the remaining 37 trials, 20 recorded a positive difference between the two kicks. Under the null hypothesis, p = 1/2, so the differences would follow the B(37, 1/2) distribution. The probability of observing 20 or more positive differences is P(X ≥ 20) = 1 - P(X ≤ 19) = 1 - 0.6286 = 0.3714. This value indicates that there is not strong evidence against the null hypothesis, as observed previously with the t-test.
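The exact binomial tail probability can be computed directly with Python's standard library; a minimal sketch:

```python
from math import comb

n, x = 37, 20  # 37 non-zero differences, 20 of them positive
# Exact upper tail P(X >= 20) under B(37, 1/2)
p = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n
print(round(p, 4))  # about 0.37, matching the table-based calculation
```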
