Introduction
Many clinicians consider the P value as an almost magical number that determines whether treatment effects exist or not. Is that a correct understanding?
To grasp the conceptual meaning of the P value, consider comparing two treatments, A and B, and finding that A is twice as effective as B. Does it mean that treatment A is better in reality? We cannot be sure from that information alone. It may be that treatment A is truly better than treatment B (i.e., true positive). However, it may also be that by chance we have collected a sample in which more people respond to treatment A, making it appear more effective, when in reality it is as effective as treatment B (i.e., false positive).
How do we discern whether the first or the second scenario is true? The P value can help us with that. Conceptually, the P value can be thought of as the probability of observing results at least this extreme (A is at least twice as effective as B) by chance if in reality there is no difference between A and B. It is therefore, conceptually, the probability of having a false-positive finding (also called type I or alpha error).
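As a rough illustration of this definition, the sketch below computes a one-sided P value for a hypothetical trial in which treatment A works in 8 of 10 patients and treatment B in 4 of 10 (A is twice as effective as B). The trial numbers and the function name `one_sided_exact_p` are assumptions made for this example, and Fisher's exact test is only one of several tests that could be used here.

```python
from math import comb


def one_sided_exact_p(resp_a, n_a, resp_b, n_b):
    """One-sided Fisher's exact test.

    Returns the probability, assuming no real difference between arms,
    of seeing at least `resp_a` responders in arm A given the total
    number of responders observed in both arms.
    """
    total_resp = resp_a + resp_b
    n_total = n_a + n_b
    ways_all = comb(n_total, n_a)  # ways to choose who lands in arm A
    p = 0.0
    # Sum over every outcome at least as favorable to A as the observed one.
    for k in range(resp_a, min(n_a, total_resp) + 1):
        p += comb(total_resp, k) * comb(n_total - total_resp, n_a - k) / ways_all
    return p


# Hypothetical trial (assumed numbers): A works in 8/10, B in 4/10.
p_value = one_sided_exact_p(8, 10, 4, 10)
print(f"one-sided P = {p_value:.3f}")  # prints: one-sided P = 0.085
```

In this assumed small trial, a twofold difference would arise by chance about 8.5% of the time even if the treatments were equally effective.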
An arbitrary definition
If the P value is less than 5% (P < .05), there is less than a 5% probability that we would observe the above results if in reality treatment A and treatment B were equally effective. Since this probability is small, the convention is to reject the idea that both treatments are equally effective and declare that treatment A is indeed more effective.
The P value is thus a probability, and “statistical significance” depends simply on 5% being considered the cutoff for a probability low enough to make chance an unlikely explanation for the observed results. As you can see, this is an arbitrary cutoff; it could have been 4% or 6%, and the concept would not have changed.1
Power
Thus, simply looking at the P value itself is insufficient. We need to interpret it in light of other information.2 Before doing that, we need to introduce a new related statistical concept, that of “power.” The power of a study can be conceptually understood as the ability to detect a difference if there truly is one. If there is a difference in reality between treatments A and B, then the power of a study is the ability to detect that difference.
Two main factors influence power: the effect size (that is, the difference between A and B) and the sample size. If the effect size is large, then we can detect it even with small samples. For example, if treatment A is effective in 100% of cases, and treatment B in only 10% of cases, then the difference will be clear even with a small number of patients. Conversely, if the effect size is small, then we need a very large sample size to detect that difference. For example, if treatment A is effective in 20% of cases, and treatment B is effective in 22% of cases, the difference between them can be observed only if we enroll a very large number of patients. A large sample size increases the power of a study. This has important implications for the interpretation of the P value.
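The relationship between effect size, sample size, and power described above can be sketched numerically. The function below is a minimal approximation, assuming a two-sided two-proportion z-test at a 5% cutoff and equal arm sizes; the name `approx_power` and the sample sizes tried are choices made for this example, and the normal approximation is crude for very small samples.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumes equal arm sizes and at least one true response rate
    strictly between 0 and 1 (otherwise se_alt would be zero).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_a + p_b) / 2
    # Standard error if there were no difference (pooled rate).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    # Standard error under the assumed true rates.
    se_alt = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_arm)
    return std_normal.cdf((abs(p_a - p_b) - z_crit * se_null) / se_alt)


# The two scenarios from the text, at several assumed sample sizes.
for n in (10, 100, 1000, 10000):
    large = approx_power(1.00, 0.10, n)
    small = approx_power(0.20, 0.22, n)
    print(f"n per arm = {n:>5}: 100% vs 10% -> {large:.2f}, 20% vs 22% -> {small:.2f}")
```

Under these assumptions the large effect is detected almost surely with 10 patients per arm, whereas the 2 percentage point difference needs on the order of 10,000 patients per arm to reach power above 90%.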
How (not) to interpret the P value
Many clinicians do not consider other factors when interpreting the P value, and assume that the dichotomization of results as “significant” and “nonsignificant” accurately reflects reality.3
Authors may say something like the following: “Treatment A was effective in 50% of patients, and treatment B was effective in 20% of the patients, but there was no difference between them (P = .059).” They declare this as “no difference” because there is no “statistically significant difference” if P = .059. However, this does not mean that there is no difference.
First, if the convention for the cutoff value for significance were another arbitrary value, say .06, then this would have been a statistically significant finding.
Second, we should pay attention to the magnitude of the P value when interpreting the results. As per the definition above, the P value is simply the probability of a false-positive result. However, these probabilities may exceed 5% to varying degrees. For example, a probability of false positive of 80% (P = .80) is very different from a probability of about 6% (P = .059), even though, technically, both are “nonsignificant.” A P value of .059 can be interpreted to mean that there is possibly some “signal” of real difference in the data. It may be that the study above was not sufficiently powered to see the difference of 30 percentage points between the treatments as statistically significant; had the sample size been larger and thus provided greater power, then the finding could have been significant. Instead of reporting that there is no difference, it would be better to say that these results are suggestive of a difference, but that there was not enough power to detect it. Alternatively, P = .059 can be considered as “marginally nonsignificant” to qualitatively differentiate it from larger values, say P = .80, which are clearly nonsignificant.
Third, a key distinction is that between clinical and statistical significance. In the example above, even though the result was not statistically significant (P = .059), a difference of 30 percentage points seems clinically important. The difference between clinical and statistical significance can perhaps be better illustrated with the opposite, and more common, mistake. As mentioned, a large sample size increases power, and thus the ability to detect even minor differences. For example, if a study enrolls 100,000 participants in each arm, then even a difference of 0.2 percentage points between treatments A and B can be statistically significant. However, this difference is clinically irrelevant. Thus, when researchers report “statistically significant” results, careful attention must be paid to the clinical significance of those results. The purpose of the studies is to uncover reality, not to be technical about conventions.
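The large-sample example can be checked with a short calculation. The sketch below uses a pooled two-proportion z-test, which is an assumed choice of test; the text does not state the underlying response rates, so the event counts are hypothetical. Whether a 0.2 percentage point difference reaches significance with 100,000 participants per arm depends on those baseline rates.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p(events_a, n_a, events_b, n_b):
    """Two-sided P value from a pooled two-proportion z-test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


n = 100_000  # participants per arm, as in the text

# Assumed low baseline rates: 1.2% vs 1.0% (difference of 0.2 points).
p_low_baseline = two_proportion_p(1_200, n, 1_000, n)

# Assumed higher baseline rates: 20.2% vs 20.0% (same 0.2-point difference).
p_high_baseline = two_proportion_p(20_200, n, 20_000, n)

print(f"1.2% vs 1.0%:   P = {p_low_baseline:.5f}")
print(f"20.2% vs 20.0%: P = {p_high_baseline:.3f}")
```

With the assumed low baseline rates the tiny difference is highly significant, while at rates around 20% the same absolute difference is not. In both cases the clinical relevance of a 0.2 percentage point difference is a separate question from the P value.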
Multiple testing and P value
Finally, another almost universally ignored problem in clinical research papers is that of multiple testing. It is not uncommon to read papers in which the authors present results for 20 different and independent hypothesis tests, and when one of them has a P value less than .05 they declare it as a significant finding. However, this is clearly mistaken. The more tests are made, the higher the probability of false positives. Imagine having 20 balls, only one of which is red. If you pick a random ball only once, you have a 5% probability of picking the red one. If, however, you try it 10 different times, returning the ball each time, the probability of picking the red ball at least once is higher (approximately 40%). Similarly, if we perform only one test, then the probability of a false positive is 5%; however, if we perform many tests, then the probability of at least one false positive is higher than 5%.
There are three main ways to deal with this problem. The first is to have only one main outcome, declaring statistical significance for only that outcome, and to consider the other outcomes as exploratory. The second is to report on multiple findings and correct for multiple testing. The third is to report on multiple findings, but mention explicitly in the paper that they have not been corrected for multiple testing and therefore the findings may be significant by chance.
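The arithmetic behind the ball analogy can be written out directly. Assuming the tests are independent and each uses a 5% cutoff, the probability of at least one false positive among k tests is 1 - (1 - 0.05)^k; the function name below is chosen for this example.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among `n_tests`
    independent tests, each carried out at significance level `alpha`."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 10, 20):
    print(f"{k:>2} tests: {family_wise_error_rate(k):.1%}")
# prints:
#  1 tests: 5.0%
# 10 tests: 40.1%
# 20 tests: 64.2%
```

With 20 independent tests, the chance of at least one spuriously "significant" result is therefore closer to two in three than to 5%.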
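The text does not name a specific correction method. As one common and deliberately simple option, the sketch below applies the Bonferroni correction, which multiplies each P value by the number of tests performed (capped at 1) before comparing it with the cutoff. The P values shown are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for multiple testing.

    Returns a list of (adjusted_p, is_significant) pairs, where
    adjusted_p = min(1, p * number_of_tests).
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((adjusted, adjusted < alpha))
    return results


# Hypothetical P values from three outcomes reported in one paper.
raw = [0.04, 0.001, 0.50]
for p, (adj, significant) in zip(raw, bonferroni(raw)):
    print(f"raw P = {p:.3f} -> adjusted P = {adj:.3f}, significant: {significant}")
```

In this example the outcome with a raw P of .04 no longer counts as significant after correction, while the outcome with a raw P of .001 still does. Bonferroni is conservative; other corrections exist and may be preferable depending on the study design.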
Conclusion
In summary, the P value is the probability of a false-positive finding, and the cutoff of .05 is arbitrary. Instead of dichotomizing results as “significant” and “nonsignificant” purely based on whether the P value is more or less than .05, a more qualitative approach that takes into account the magnitude of the P value and the sample size should be considered, and multiple testing should be taken into account when declaring significant findings.
Dr. Jovani is a therapeutic endoscopy fellow, division of gastroenterology and hepatology, Johns Hopkins Hospital, Baltimore.
References
1. Guyatt G et al. CMAJ. 1995;152:27-32.
2. Guller U and DeLong ER. J Am Coll Surg. 2004;198:441-58.
3. Greenland S et al. Eur J Epidemiol. 2016;31:337-50.