“P < 0.05” Might Not Mean What You Think: American Statistical Association Clarifies P Values

In 2011, the U.S. Supreme Court unanimously ruled in Matrixx Initiatives Inc. v. Siracusano that investors could sue a drug company for failing to report adverse drug effects, even though the reports of those effects were not statistically significant.

Describing the case in the April 2, 2011, issue of the Wall Street Journal, Carl Bialik wrote, “A group of mathematicians has been trying for years to have a core statistical concept debunked. Now the Supreme Court might have done it for them.” That conclusion may have been overly optimistic, since misguided use of the P value continued unabated. However, in 2014 concerns about misinterpretation and misuse of P values led the American Statistical Association (ASA) Board to convene a panel of statisticians and experts from a variety of disciplines to draft a policy statement on the use of P values and hypothesis testing. After a year of discussion, the ASA published a consensus statement in 2016 in The American Statistician (doi:10.1080/00031305.2016.1154108).

The statement consists of six principles, in nontechnical language, on the proper interpretation of P values, hypothesis testing, science and policy decision-making, and the necessity for full reporting and transparency of research studies. However, getting such a diverse group to agree on a short, clear statement took longer and was more contentious than expected. Participants wrote supplementary commentaries, available online with the published statement.

The panel discussed many misconceptions about P values. Test your knowledge: Which of the following is true?

(a) The P value is the probability that the null hypothesis is true.
(b) The P value is the probability that the data were produced by random chance alone.
(c) A P value of .05 means the result has only a 5% chance of being wrong.
(d) The P value measures the size or importance of the observed effect.
(e) None of the above.

If you answered “none of the above,” you may understand this slippery concept better than many researchers. The ASA panel defined the P value as “the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”
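
To make that definition concrete, here is a minimal Python sketch; the data are invented for illustration, and a permutation test stands in for the “specified statistical model” (here, a model under which group labels are exchangeable):

```python
# Permutation-test sketch of the ASA definition: estimate the
# probability, under a model in which group labels are exchangeable
# (i.e., no true difference), that the chosen summary statistic (the
# absolute mean difference) is equal to or more extreme than the
# observed value. Measurements are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([4.1, 5.3, 6.0, 4.8, 5.5])
group_b = np.array([5.9, 6.4, 7.1, 6.6, 5.8])

observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                      # relabel under the null model
    diff = abs(pooled[:5].mean() - pooled[5:].mean())
    if diff >= observed:                     # "equal to or more extreme"
        extreme += 1

print(f"permutation P value: {extreme / n_perm:.4f}")
```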

Why is the exact definition so important? Many authors use statistical software that presumably is based on the correct definition. “It’s very easy for researchers to get papers published and survive based on knowledge of what statistical packages are out there but not necessarily how to avoid the problems that statistical packages can create for you if you don’t understand their appropriate use,” said Barnett S. Kramer, M.D., M.P.H., JNCI’s former editor in chief and now director of the National Cancer Institute’s Division of Cancer Prevention. (Kramer was not on the ASA panel.)

Part of the problem lies in how people interpret P values. According to the ASA statement, “A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other.” Valuable information may be lost because researchers may not pursue “insignificant” results. Conversely, small effects with “significant” P values may be biologically or clinically unimportant. At best, such practices may slow scientific progress and waste resources. At worst, they may cause grievous harm when adverse effects go unreported. The Supreme Court case involved the drug Zicam, which caused permanent loss of the sense of smell in some users. Another drug, rofecoxib (Vioxx), was taken off the market because of adverse cardiovascular effects. The drug companies involved did not report those adverse effects because of lack of statistical significance in the original drug tests (Rev. Soc. Econ. 2016;74:83–97; doi:10.1080/00346764.2016.1150730).
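
The flip side, a “significant” but trivial effect, is easy to simulate. In this toy Python example (all numbers synthetic), a 0.1-unit difference, far too small to matter clinically in most settings, still produces a P value far below .05 once the samples are large enough:

```python
# Toy simulation: with large samples, a clinically trivial effect
# still yields a tiny P value. Statistical significance is not the
# same as practical importance. All values are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000
control = rng.normal(loc=120.0, scale=15.0, size=n)   # e.g., blood pressure
treated = rng.normal(loc=119.9, scale=15.0, size=n)   # true effect: 0.1 unit

t, p = stats.ttest_ind(treated, control)
print(f"mean difference: {treated.mean() - control.mean():.3f} units")
print(f"P value: {p:.2e}")   # far below 0.05 despite a negligible effect
```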

ASA panelists encouraged using alternative methods “that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.” However, any method can be used invalidly. “If success is defined based on passing some magic threshold, biases may continue to exert their influence regardless of whether the threshold is defined by a P value, Bayes factor, false-discovery rate, or anything else,” wrote panelist John Ioannidis, Ph.D., professor of medicine and of health research and policy at Stanford University School of Medicine in Stanford, Calif.
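
As one small sketch of “estimation over testing” (the data are invented, and this is just one of the approaches the panel listed), the following Python snippet reports a mean difference with a 95% confidence interval instead of a bare P value:

```python
# Report an effect estimate with a 95% confidence interval (Welch,
# unequal variances) rather than only a significance verdict.
# The measurements are invented for illustration.
import numpy as np
from scipy import stats

group_a = np.array([4.1, 5.3, 6.0, 4.8, 5.5])
group_b = np.array([5.9, 6.4, 7.1, 6.6, 5.8])

diff = group_b.mean() - group_a.mean()
va = group_a.var(ddof=1) / len(group_a)
vb = group_b.var(ddof=1) / len(group_b)
se = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1)
                       + vb ** 2 / (len(group_b) - 1))
t_crit = stats.t.ppf(0.975, df)

print(f"difference: {diff:.2f}, "
      f"95% CI: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```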

Some panelists argued that the P value per se is not the problem and that it has its proper uses. A P value can sometimes be “more informative than an interval”—such as when “the predictor of interest is a multicategorical variable,” said Clarice Weinberg, Ph.D., who was not on the panel. “While it is true that P values are imperfect measures of the extent of evidence against the null hypothesis, confidence intervals have a host of problems of their own,” said Weinberg, deputy chief of the Biostatistics and Computational Biology Branch and a principal investigator at the National Institute of Environmental Health Sciences in Research Triangle Park, N.C.
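
Weinberg’s multicategorical example can be made concrete with a toy one-way ANOVA in Python (data invented): a single omnibus P value summarizes the overall association for a three-category predictor, whereas interval estimates come one per comparison:

```python
# One-way ANOVA sketch: one omnibus P value for a predictor with three
# categories, in place of several pairwise intervals. Data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
low = rng.normal(5.0, 1.0, size=20)
mid = rng.normal(5.4, 1.0, size=20)
high = rng.normal(6.1, 1.0, size=20)

f_stat, p = stats.f_oneway(low, mid, high)
print(f"omnibus P value for the categorical predictor: {p:.4f}")
```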

Beyond simple misinterpretation of the P value and the associated loss of information, authors routinely, whether consciously or unconsciously, engage in data dredging (aka fishing or P-hacking) and selective reporting. “Any statistical technique can be misused and it can be manipulated especially after you see the data generated from the study,” Kramer said. “You can fish through a sea of data and find one positive finding and then convince yourself that even before you started your study that would have been the key hypothesis and it has a lot of plausibility to the investigator.”
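
How easy is such fishing? A toy simulation (entirely synthetic data and a hypothetical design) makes the point: when 20 truly null outcomes are tested per study, most studies still turn up at least one “significant” result:

```python
# Simulate data dredging: each "study" tests 20 outcomes for which no
# real effect exists. The expected share of studies with at least one
# P < 0.05 is roughly 1 - 0.95**20, about 64%. All data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_outcomes, n_per_group = 1_000, 20, 30

fishing_hits = 0
for _ in range(n_studies):
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_outcomes)
    ]
    if min(p_values) < 0.05:
        fishing_hits += 1

print(f"studies with at least one 'significant' finding: "
      f"{fishing_hits / n_studies:.0%}")
```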

In response to those practices and concerns about replicability in science, some journals have banned the P value and inferential statistics. Others, such as JNCI, require confidence intervals and effect sizes, which “convey what a P value does not: the magnitude and relative importance of an effect,” wrote panel member Regina Nuzzo, Ph.D., professor of mathematics and computer sciences at Gallaudet University in Washington, D.C. (Nature 2014;506:150–2).

How can practice improve? Panel members emphasized the need for full reporting and transparency by authors as well as changes in statistics education. In his commentary, Don Berry, Ph.D., professor of biostatistics at the University of Texas M.D. Anderson Cancer Center in Houston, urged researchers to report every aspect of the study. “The specifics of data collection and curation and even your intentions and motivation are critical for inference. What have you not told the statistician? Have you deleted some data points or experimental units, possibly because they seemed to be outliers?” he wrote.

Kramer advised researchers to “consult a statistician when writing a grant application rather than after the study is finished; limit the number of hypotheses to be tested to a realistic number that doesn’t increase the false discovery rate; be conservative in interpreting the data; don’t consider P = 0.05 as a magic number; and whenever possible, provide confidence intervals.” He also suggested, “Webinars and symposia on this issue will be useful to clinical scientists and bench researchers because they’re often not trained in these principles.” As the ASA statement concludes, “No single index should substitute for scientific reasoning.”
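
On limiting the false discovery rate in particular, one standard tool (not specifically endorsed in the article, but widely used) is the Benjamini-Hochberg procedure; the brief Python sketch below applies it to a set of invented P values via statsmodels:

```python
# Benjamini-Hochberg adjustment: control the expected false discovery
# rate across several tested hypotheses. The raw P values are invented.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method="fdr_bh")
for p_raw, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"raw P = {p_raw:.3f}  adjusted P = {p_adj:.3f}  reject: {r}")
```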
