Is the P value as reliable as we believe? This article examines the meaning and limitations of this ubiquitous measure of significance
“The null hypothesis is never proved or established, but is possibly disproved.”
– Ronald A Fisher
Statistics is a versatile field that emerged from 17th-century political science and state governance. In time it evolved into a vast and sophisticated discipline, yet it has remained a powerful, widely used instrument to analyse, summarise and communicate key scientific findings from numerous other academic spheres.
Statistical tools have become indispensable to research, and with good reason. Structuring intangible or abstract associations into numerical or mathematical descriptions allows reproducibility, the central tenet of research.
The simple beauty of the quantitative approach shines through in cases when intuition or common sense may mislead us. Sandeep Pulla, a PhD student studying forest dynamics at IISc, uses the classic example of the birthday problem to illustrate this: what are the odds of two people in a group of 23 sharing a birthday? Answer: roughly a 50-50 chance. Based on everyday logic, this seems absurdly high, but combinatorics doesn’t lie.
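The birthday figure is easy to check. The sketch below assumes birthdays are independent and spread evenly over 365 days (no leap years, no seasonal variation); the function name is illustrative.

```python
def prob_shared_birthday(n_people: int, days: int = 365) -> float:
    """Probability that at least two of n_people share a birthday.

    Assumes independent birthdays, uniformly distributed over `days` days.
    """
    p_all_distinct = 1.0
    for k in range(n_people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for group_size in (10, 23, 50):
    print(group_size, round(prob_shared_birthday(group_size), 3))
```

Under these assumptions, 23 is the smallest group for which the probability edges past one half.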
Statistics is especially critical when we investigate complex phenomena because rarely can we bring all sources of variation under control. As we gather various observations, it becomes important to clarify to what degree our numerical descriptions reflect the real world.
P value: the gold standard?
The P value appears to provide answers to the questions “Is the pattern observed in the study real or is it part of natural chance variation?”, and “How reliable is this result?”.
The concept of the P value – or calculated probability – was introduced by Ronald A Fisher, a pioneer in statistics and biology, as an informal index of discrepancy between the data and the null hypothesis. Fisher was considered a genius and often an outsider, who formed lifelong feuds and braved incredible hardships. He invented many important statistical techniques and formalised several others, which remain in use even today. He also travelled widely, forming close associations with great minds of the day, including PC Mahalanobis and RC Bose of the Indian Statistical Institute.
Fisher did not intend for the P value to be a definitive measure of data reliability. Instead, he devised it to judge if a scenario was worth investigating, worth a second look. The lower the P value, the less likely it is that a result at least as extreme as the one observed would arise by chance alone in the absence of a true effect.
Convention dictates that a threshold value of P is chosen, below which the null hypothesis may be rejected and the results considered “significant”. This threshold value of P is widely accepted to be 0.05.
Fisher’s comments on the P value in his seminal work Statistical Methods for Research Workers, published in 1925, reveal the elusive nature of this threshold: “The value for which P =.05, or 1 in 20, is 1.96 or nearly 2 ; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not . . . Small effects would still escape notice if the data were insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.”
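Fisher’s pairing of P = .05 with a deviation of 1.96 can be checked directly. The sketch below assumes a test statistic that follows a standard normal distribution and a two-sided test; the function names are illustrative and not taken from Fisher’s text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic z."""
    return 2.0 * STANDARD_NORMAL.cdf(-abs(z))


def critical_z(alpha: float) -> float:
    """Deviation (in standard deviations) at which the two-sided P equals alpha."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


print(f"P at 1.96 standard deviations: {two_sided_p(1.96):.4f}")
print(f"Deviation at P = 0.05:         {critical_z(0.05):.3f}")
```

Rounding 1.96 up to 2, as Fisher suggests for convenience, gives a slightly smaller P than 0.05.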
The P value is not an objective statement of statistical significance, and certainly not of biological [or real world] significance, argues Kavita Isvaran, an assistant professor at the Centre for Ecological Sciences.
She points out that the dichotomy dictated by P value significance is a false one. “A P value of even 0.045 can make us happy and encourage us to build large narratives while P=0.055 can make us unhappy with the ‘weak trend’ and give up on it, when both those values must be treated with similar uncertainty.”
Perhaps the most common misconception is to consider the P value as the probability that the null hypothesis is true. The conventional view of significance is deeply entrenched today. Pulla clarifies, “Statistics is useful, but aiming for significance is what leads to problems because then we enter the world of human cognitive biases, and that is a grey area of seeing what we want to see in our data or as patterns around us, and misunderstanding statistical techniques.”
The slippery nature of the P value has been largely overlooked by the research community, contributing to the reproducibility crisis in science. A survey by Nature shows that more than 70% of researchers had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own. Much time and many resources have thus been exhausted in pursuing false leads.
What is the bottom line, then?
“In real world situations, the statistical null hypothesis (for example, that two means are exactly the same or that a relationship is exactly equal to zero) is rarely true. Our task is therefore to estimate the thing we are interested in with high precision,” says Isvaran. She argues that if, instead, we focus on P values and whether an effect is present or not, we may ignore an important biological effect just because we get a P value that falls on the wrong side of our cut-off. The low precision, she says, could result from a small sample size. And conversely, with large enough sample sizes, we run the risk of getting small P values and establishing the statistical significance of trivial effects.
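Isvaran’s point about sample size can be made concrete with a small calculation. The sketch below assumes a comparison of two equally sized groups with a known, common standard deviation (a z-test, simpler than the t-test one would normally use); the effect sizes and sample sizes are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_group_p(diff_in_sd: float, n_per_group: int) -> float:
    """Two-sided P value for an observed difference between two group means.

    diff_in_sd is the observed difference in units of the (assumed known)
    common standard deviation; n_per_group is the size of each group.
    """
    standard_error = sqrt(2.0 / n_per_group)
    z = diff_in_sd / standard_error
    return 2.0 * NormalDist().cdf(-abs(z))


# Hypothetical numbers: a sizeable effect seen in a small study ...
p_small_study = two_group_p(diff_in_sd=0.5, n_per_group=10)
# ... and a trivial effect seen in a very large one.
p_large_study = two_group_p(diff_in_sd=0.02, n_per_group=100_000)

print(f"effect 0.5 SD,  n = 10 per group:      P = {p_small_study:.3f}")
print(f"effect 0.02 SD, n = 100,000 per group: P = {p_large_study:.2g}")
```

Under these assumptions the half-standard-deviation effect falls well short of the 0.05 cut-off, while an effect twenty-five times smaller clears it easily. The P value alone says nothing about which of the two matters biologically.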
Mishandling of statistics is usually just that – the outcome of unintentional laxity or ignorance. Persistent misguided application of the P value has led to what is now infamously known as P-hacking, also called data dredging, significance questing, selective inference, double-dipping and researcher degrees of freedom.
Several months or years can pass before faulty results are identified, often due to a general culture of uncritical approaches to handling data. An example of brazen ambition to yield results that have just the right value of significance is that of Diederik Stapel, a Dutch social psychologist. While his is a rare case of intentional misconduct, it highlights the acceptance of significant results at face value. Stapel knew that the effect he was looking for had to be small in order to be believable; psychology experiments rarely yield significant results. He proceeded to work backwards and fabricate data that would yield the required distributions.
The reality of our times is that most scientists follow ethical practices and contribute to advancements in small increments. The problem is actually caused by taking decisions freely without accounting for them: conducting analyses midway through experiments to evaluate their potential success, recording multiple response variables not part of the original experimental design and incorporating them at a later stage, deciding whether to include or drop outliers after analysis, modifying treatment groups post-analysis, including or excluding covariates post-analysis, and terminating data exploration if an analysis yields a significant P value.
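The first item on that list, checking for significance midway and stopping when it appears, can be explored with a simulation. The sketch below is an illustration under stated assumptions, not an analysis from the article: the data are pure noise (normal, mean 0, known standard deviation 1), the test is a two-sided z-test, and the researcher peeks after every 10 observations up to 100. All names and settings are hypothetical.

```python
import random
from statistics import NormalDist


def z_test_p(sample):
    """Two-sided P value for H0: mean = 0, assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2.0 * NormalDist().cdf(-abs(z))


def false_positive_rates(n_sims=2000, n_max=100, step=10, alpha=0.05, seed=1):
    """Return (rate with one planned test, rate when peeking every `step`)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        # No true effect: every "significant" result is a false positive.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        if z_test_p(data) < alpha:
            fixed_hits += 1
        # Peeking: declare success at the first interim look with P < alpha.
        looks = range(step, n_max + 1, step)
        if any(z_test_p(data[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"one planned test at n = 100: {fixed_rate:.3f}")
print(f"peeking every 10 points:     {peeking_rate:.3f}")
```

With a single planned test the false positive rate should sit near the nominal 5%. With repeated peeking it is noticeably higher, even though each individual test looks legitimate.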
Isvaran points out that such missteps are easy to take. “If I observe new and seemingly important features while I’m collecting planned data, I would, of course, record these observations.” Armed with data, planned and unplanned, a researcher would explore what was at hand. “However, it is well known that when one explores patterns in the data, one runs the risk of finding spurious relationships. So it is important to separate careful and targeted analyses from when one explores a large number of possible patterns.”
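The risk of spurious relationships grows quickly with the number of patterns explored. As a rough sketch, assuming the tests are independent and that no real effect exists in any of them, the chance of at least one “significant” result is:

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one P < alpha among n_tests independent tests
    when the null hypothesis is true for every one of them."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(f"{n_tests:>3} tests: {chance_of_false_positive(n_tests):.2f}")
```

Under these assumptions, 20 exploratory tests already carry about a 64% chance of turning up at least one spurious finding. Real tests on the same dataset are rarely independent, so the exact figure will differ, but the direction of the problem is the same.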
According to Pulla, an interesting result in a field of scientific inquiry is one that is novel, unexpected, contrary to the current state of the field and, therefore, likely to be accepted for publication in high-impact journals. These very characteristics necessitate further scrutiny of such results. However, it is often the case that fantastic findings are accepted and built upon before replication studies can verify them. Pulla points out that the editors and reviewers of scientific journals may themselves not be fully aware of the nuances of statistical tests.
So how common is bias from P-hacking?
A study found that while P-hacking is widespread, it does not drastically alter conclusions drawn from meta-analyses, analyses that combine data from several studies.
The use of the term P-hacking, and its synonyms, affects how the issue is perceived in the scientific world. These terms suggest that researchers persistently explore their data until the desired result is observable and significant. However, there is a glitch in the critique against P-hacking: terms like ‘data fishing’ suggest that those who use P values unwisely are, in some sense, mindful cheats. And when researchers believe they lack the intent to manipulate data, they may forgo necessary caution when preparing for and performing data analysis.
To better describe the issue, the term garden of forking paths was introduced because it conveys the idea that the paths, or choices, are all out there. The chosen method of analysis may be reasonable given the data and assumptions, but had the data been different, there may be other equally reasonable analyses, which may alter the conclusions. But unfortunately, very little information about these behind-the-scenes decisions is evident in the reported results.
Particle physics research has attempted to address the issues associated with the P value by using a much more stringent standard than those used in other sciences. The groundbreaking discovery of the Higgs boson adhered to this standard as well. However, this threshold is itself a response to statistical anomalies having been paraded as new discoveries under a formerly acceptable standard.
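The article does not name the particle physics standard, but it is usually quoted as “five sigma”. As a rough sketch, assuming a one-sided test on a statistic that follows a standard normal distribution, the corresponding P value can be computed as follows; the function name is illustrative.

```python
from statistics import NormalDist


def one_sided_p(sigmas: float) -> float:
    """Upper-tail probability beyond `sigmas` standard deviations."""
    # By symmetry this equals 1 - cdf(sigmas), but avoids subtracting
    # two nearly equal numbers.
    return NormalDist().cdf(-sigmas)


p_five_sigma = one_sided_p(5.0)
print(f"P at five sigma: {p_five_sigma:.2e}")
print(f"about 1 in {1 / p_five_sigma:,.0f}")
```

Under these assumptions the threshold works out to roughly one in 3.5 million, several orders of magnitude stricter than 0.05.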
What can I do about it?
In today’s competitive world, there is more pressure than ever before on young researchers to find novel and broad questions to pursue. In the journey to find one’s place in the prolific world of research, one can give in to the tendency to look for patterns, meet the impact-factor yardsticks for professional excellence, imbibe widely echoed sentiments, and play the media game.
To prevent P-hacking, Isvaran believes that one must remind oneself about the original motivation and study design. “Let me clearly separate out the exciting unplanned exploratory findings from the planned confirmatory tests that I designed the study for.” There is no reason why fascinating post facto findings should not be reported (so long as they are clearly labelled as such), along with the findings from the original design.
Another measure against losing one’s way in the garden of forking paths is declaring the research plan in advance. Pre-registration of a study in a hub such as the Open Science Framework publicly specifies the study in such a way that all its aspects are accounted for well before the project is underway. This prevents hidden changes from being made to the original approach.
“You want to be aware that there are alternate explanations for any correlation, and give them equal weightage in your thinking,” says Pulla. “The simulation approach is powerful, especially when using large datasets and advanced techniques. It is worth applying your models to datasets for which you already know the answers. It makes you more aware of the caveats in the method you have used, and more tentative in your approach to interpretation.”
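A minimal version of the exercise Pulla describes is to simulate data from a relationship you choose yourself and then see how well your method recovers it. The sketch below assumes a straight-line relationship with normally distributed noise and uses ordinary least squares as the “model”; the true values, noise level and function names are all invented for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares (slope, intercept) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(true_slope, true_intercept, noise_sd, n, seed):
    """Generate n noisy points from a line whose parameters we know."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [true_intercept + true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Known answer: slope 2.0, intercept 1.0.
xs, ys = simulate(true_slope=2.0, true_intercept=1.0, noise_sd=1.0, n=200, seed=42)
slope, intercept = fit_line(xs, ys)
print(f"true slope 2.0, estimated {slope:.2f}")
print(f"true intercept 1.0, estimated {intercept:.2f}")
```

Rerunning the simulation with a smaller sample, more noise or a true slope of zero shows how far the estimates can wander from the truth, which is the kind of caveat that is hard to see when working only with real data.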
It is also the responsibility of mentors to make students aware of cognitive biases, pitfalls of a particular decision, and to emphasise the importance of content over publication in a well-known journal, says Isvaran. The onus is on those more experienced to share their knowledge. But each of us must work toward a better understanding of all our scientific tools. When the integrity of your work is in question, ignorance is culpability.
I gratefully acknowledge the invaluable contributions of Dr. Kavita Isvaran, Sandeep Pulla, Dr. NV Joshi and Dr. Hari Sridhar to this article.
Upasana studies genetic diversity in Asian elephants at the Centre for Ecological Sciences