Evaluating a paper: Take care not to be confounded
In an earlier article, we looked at the meaning of the P value.1 This time we will look at another crucial statistical concept: that of confounding.
Confounding, as the name implies, is the recognition that crude associations may not reflect reality, but may instead be the result of outside factors. To illustrate, imagine that you want to study whether smoking increases the risk of death (in statistical terms, smoking is the exposure, and death is the outcome). You follow 5,000 people who smoke and 5,000 people who do not smoke for 10 years. At the end of the follow-up you find that about 40% of nonsmokers died, compared with only 10% of smokers. What do you conclude? At face value it would seem that smoking prevents death. However, before reaching this conclusion you might want to look at other factors. A look at the dataset shows that the average baseline age among nonsmokers was 60 years, whereas among smokers it was 40 years. Could this be the cause of the results? You repeat the analysis based on strata of age (i.e., you compare smokers who were aged 60-70 years at baseline with nonsmokers who were aged 60-70 years, smokers who were aged 50-60 years with nonsmokers who were aged 50-60 years, and so on). What you find is that, within each category of age, the percentage of deaths among smokers was higher. Hence, you now reach the opposite conclusion, namely that smoking does increase the risk of death.
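The reversal described above can be reproduced with a short sketch. All counts below are illustrative numbers invented to match the percentages in the example (roughly 40% crude mortality among nonsmokers and 10% among smokers, yet higher mortality among smokers within each age band):

```python
# Illustrative (made-up) counts showing how age can reverse a crude
# association, as in the smoking example above.
# Format: age stratum -> (deaths, total).
nonsmokers = {"40-50": (30, 1000), "60-70": (1950, 4000)}
smokers    = {"40-50": (200, 4500), "60-70": (300, 500)}

def rate(deaths, total):
    return deaths / total

# Crude (unadjusted) death rates: pool all ages together.
crude_non = rate(*map(sum, zip(*nonsmokers.values())))
crude_smo = rate(*map(sum, zip(*smokers.values())))
print(f"crude: nonsmokers {crude_non:.0%}, smokers {crude_smo:.0%}")

# Stratified rates: compare within each age band instead.
for stratum in nonsmokers:
    r_non = rate(*nonsmokers[stratum])
    r_smo = rate(*smokers[stratum])
    print(f"{stratum}: nonsmokers {r_non:.1%}, smokers {r_smo:.1%}")
```

The crude comparison favors smokers only because most smokers sit in the low-risk 40-50 stratum; within every stratum the smokers fare worse.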
What happened? Why the different result? The answer is that, in this case, age was a confounder. What we initially thought was the effect of smoking was, in reality, at least in part, the effect of age. Overall, more deaths occurred among nonsmokers in the first analysis because they were older at baseline. When we compare people of similar age who differ only in smoking status, then the difference in mortality between them is not because of age (they have the same age) but because of smoking. Thus, in the second analysis we took age into account, or, in statistical terms, we adjusted for age, whereas the first analysis was, in statistical terms, an unadjusted or crude analysis. We should always be wary of studies that present only crude results, because they might be biased/misleading.2
In the example above, age is not the only factor that might influence mortality. Alcohol or drug use, cancer or heart disease, body mass index, or physical activity can also influence death, independently of smoking. How can we adjust for all these factors? We cannot do stratified analyses as we did above, because the strata would be too many. The solution is to do a multivariable regression analysis. This is a statistical tool to adjust for multiple factors (or variables) at the same time. When we adjust for all these factors, we are comparing the effect of smoking in people who are the same with regard to all these factors but who differ on smoking status. In statistical terms, we study the effect of smoking, keeping everything else constant. In this way we “isolate” the effect of smoking on death by taking into account all other factors, or, in statistical terms, we study the effect of smoking independently of other factors.
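A full multivariable regression is beyond a short sketch, but the core idea of adjustment, comparing within levels of the confounder and then averaging with a common set of weights, can be shown with direct standardization. The counts are the same illustrative numbers as in the smoking example; the weighting scheme (age distribution of the combined cohort) is one conventional choice:

```python
# A minimal sketch of "adjusting" for age by direct standardization:
# compute each group's death rate within every age stratum, then average
# those stratum rates with the SAME weights for both groups (here, the
# age distribution of the whole cohort). Counts are illustrative.
nonsmokers = {"40-50": (30, 1000), "60-70": (1950, 4000)}
smokers    = {"40-50": (200, 4500), "60-70": (300, 500)}

# Weights: share of the combined cohort in each age band.
totals = {s: nonsmokers[s][1] + smokers[s][1] for s in nonsmokers}
grand = sum(totals.values())
weights = {s: n / grand for s, n in totals.items()}

def adjusted_rate(group):
    # Weighted average of within-stratum death rates.
    return sum(weights[s] * d / n for s, (d, n) in group.items())

print(f"age-adjusted: nonsmokers {adjusted_rate(nonsmokers):.1%}, "
      f"smokers {adjusted_rate(smokers):.1%}")
```

After adjustment the smokers' rate exceeds the nonsmokers' rate, reversing the crude comparison, which is exactly what "keeping age constant" buys us.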
How many factors should be included in a multivariable analysis? As a general rule, the more the better, to reduce confounding. However, the number of variables to include in a regression model is limited by the sample size. The general rule of thumb is that, for every 10 events (for dichotomous outcomes) or 10 people (for continuous outcomes), we can add one variable to the model. If we add more variables than that, then in statistical terms the model becomes overfitted (i.e., it gives results that are specific to that dataset, but may not be applicable to other datasets). Overfitted models can be as biased/misleading as crude models.3
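The rule of thumb above reduces to simple arithmetic; the helper below is a hypothetical convenience function, not a standard library routine:

```python
# The "10 events per variable" rule of thumb, as stated above: a rough
# upper bound on how many covariates a regression model can support for
# a dichotomous outcome.
def max_covariates(n_events, events_per_variable=10):
    """Rough cap on the number of model variables."""
    return n_events // events_per_variable

# E.g., a cohort with 500 deaths supports roughly 50 variables,
# while one with only 37 deaths supports about 3.
print(max_covariates(500))
print(max_covariates(37))
```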
What are we to do about other factors that may affect mortality independently of smoking (e.g., diet), but which are not found in our dataset? Unfortunately, nothing. Since we do not have that information, we cannot adjust for it. In this case, diet is in statistical terms an unmeasured confounder. Unfortunately, in all observational studies there is always at least some degree of unmeasured confounding, because there may be many factors that can influence the outcome (and the exposure) which are not part of the dataset. While some statistical tools have been developed to estimate unmeasured confounding, and therefore interpret the results in its light, unmeasured confounding remains one of the major limitations of observational studies.4
Randomized, controlled trials (RCTs), on the other hand, do not have this problem, at least in theory. With properly designed RCTs, all confounders, both measured and unmeasured, will be balanced between the two groups. For example, imagine an RCT where patients are randomized to take drug A or drug B. Because patients are randomly allocated to one group or the other, it is assumed that all other factors are also randomly distributed. Hence, the two groups should be equal to each other with respect to all other factors except our active intervention, namely the type of drug they are taking (A or B). For this reason, in RCTs there is no need to adjust for multiple factors with a multivariable regression analysis, and crude unadjusted results can be presented as unbiased.
There is, however, a caveat. What happens if a patient who was randomized to take drug A takes drug B instead? Should she still be counted in the analysis under drug A (as randomized) or under drug B (as she took it)? The usual practice is to perform both analyses and present both results. The first is the intention-to-treat (ITT) analysis, and the second is the per-protocol analysis (PPA). The advantage of the ITT is that it keeps the strength of randomization, namely the balancing of confounders, and therefore can present unbiased results. The advantage of the PPA is that it measures what was actually done in reality. However, in this case there is a departure from the original randomization, and hence there is the possibility of introducing confounding, because now patients are not randomly allocated to one treatment or the other. The larger the departure from randomization, the more probable the introduction of bias/confounding. For example, what if patients with more severe disease took drug A, even though they were randomized to take drug B? That will influence the outcome. For this reason, outcomes of the ITT analysis are considered the main results of RCTs, because PPA results can be confounded.
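The difference between the two analyses is just a difference in how patients are grouped. The sketch below tallies success rates on a handful of hypothetical trial records, once by the arm assigned at randomization (ITT) and once by the drug actually taken (PPA):

```python
# Hypothetical trial records: (randomized_to, actually_took, success).
patients = [
    ("A", "A", True), ("A", "A", False), ("A", "B", True),
    ("B", "B", False), ("B", "B", True), ("B", "A", False),
]

def success_rate(records, arm, analysis):
    """Success rate in `arm` under 'ITT' (by assignment) or 'PPA' (by drug taken)."""
    group = [ok for rand, took, ok in records
             if (rand if analysis == "ITT" else took) == arm]
    return sum(group) / len(group)

for analysis in ("ITT", "PPA"):
    print(analysis,
          f"A: {success_rate(patients, 'A', analysis):.0%}",
          f"B: {success_rate(patients, 'B', analysis):.0%}")
```

With these invented records the two crossover patients are enough to flip which arm looks better, which is the confounding risk the paragraph above warns about.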
In summary, when reading studies, do not simply accept the results as they are presented, but rather ask yourself: “Could they be confounded by other factors, and therefore be unreliable? What steps did the authors take to reduce confounding? If they presented only crude analyses, and this was not justified by an RCT design, do they recognize it as a major limitation?” There are many nuances in every paper that can be appreciated only through a careful reading of the methods section. Hopefully, this article can shed some light on these issues and help readers not to be confounded.
References
1. The P value: What to make of it? A simple guide for the uninitiated. GI and Hepatology News. 2019 Sep 23. https://www.mdedge.com/gihepnews/article/208601/mixed-topics/p-value-what-make-it-simple-guide-uninitiated
2. VanderWeele TJ et al. Ann Stat. 2013 Feb;41(1):196-220.
3. Concato J et al. Ann Intern Med. 1993 Feb 1;118(3):201-10.
4. VanderWeele TJ et al. Ann Intern Med. 2017 Aug 15;167(4):268-74.
Dr. Jovani is a therapeutic endoscopy fellow in the division of gastroenterology and hepatology at Johns Hopkins Hospital, Baltimore.
The P value: What to make of it? A simple guide for the uninitiated
Introduction
Many clinicians consider the P value as an almost magical number that determines whether treatment effects exist or not. Is that a correct understanding?
In order to grasp the conceptual meaning of the P value, consider comparing two treatments, A and B, and finding that A is twice as effective as B. Does it mean that treatment A is better in reality? We cannot be sure from that information alone. It may be that treatment A is truly better than treatment B (i.e., true positive). However, it may also be that by chance we have collected a sample in which more people respond to treatment A, making it appear as more effective, when in reality it is equally effective as treatment B (i.e., false positive). How do we discern whether the first or the second scenario is true? The P value can help us with that. Conceptually, the P value can be thought of as the probability of observing these results (A is twice as effective as B) by chance if in reality there is no difference between A and B. It is therefore the probability of having a false-positive finding (also called type I or alpha error).
An arbitrary definition
If the P value is less than 5% (P less than .05) that means that there is less than a 5% probability that we would observe the above results if in reality treatment A and treatment B were equally effective. Since this probability is very small, the convention is to reject the idea that both treatments are equally effective and declare that treatment A is indeed more effective.
The P value is thus a probability, and “statistical significance” depends simply on 5% being considered a sufficiently low probability to make chance an unlikely explanation for the observed results. As you can see, this is an arbitrary cutoff; it could have been 4% or 6%, and the concept would not have changed.1
Power
Thus, simply looking at the P value itself is insufficient. We need to interpret it in light of other information.2 Before doing that, we need to introduce a new related statistical concept, that of “power.” The power of a study can be conceptually understood as the ability to detect a difference if there truly is one. If there is a difference in reality between treatments A and B, then the power of a study is the ability to detect that difference.
Two factors influence power: the effect size (that is, the difference between A and B) and the sample size. If the effect size is large, then even with small samples we can detect it. For example, if treatment A was effective in 100% of the cases, and treatment B only in 10% of cases, then the difference will be clear even with a small number of patients. Conversely, if the effect size is small, then we would need a very large sample size to detect that difference. For example, if treatment A is effective in 20% of cases, and treatment B is effective in 22% of cases, the difference between them could be observed only if we enrolled a very large number of patients. A large sample size increases the power of a study. This has important implications for the interpretation of the P value.
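Both scenarios in the paragraph above can be simulated directly: generate many trials at a given effect size and sample size, and count how often the test reaches P < .05. The z-test and trial counts below are illustrative choices, not the only way to estimate power:

```python
# Estimate power by simulation: the fraction of simulated trials in
# which a two-proportion z-test reaches P < .05. Numbers mirror the
# examples above (100% vs 10%, and 22% vs 20%).
import math
import random

random.seed(1)  # reproducible simulation

def p_value(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    if se == 0:            # degenerate case: identical extreme counts
        return 1.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def power(rate_a, rate_b, n, trials=500):
    hits = 0
    for _ in range(trials):
        xa = sum(random.random() < rate_a for _ in range(n))
        xb = sum(random.random() < rate_b for _ in range(n))
        if p_value(xa, n, xb, n) < 0.05:
            hits += 1
    return hits / trials

# Large effect (100% vs 10%): detected almost always, even with n = 20.
big = power(1.0, 0.10, 20)
# Small effect (22% vs 20%): rarely detected with n = 200 per arm.
small = power(0.22, 0.20, 200)
print(f"power, large effect, n=20:  {big:.0%}")
print(f"power, small effect, n=200: {small:.0%}")
```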
How (not) to interpret the P value
Many clinicians do not consider other factors when interpreting the P value, and assume that the dichotomization of results as “significant” and “nonsignificant” accurately reflects reality.3
Authors may say something like the following: “Treatment A was effective in 50% of patients, and treatment B was effective in 20% of the patients, but there was no difference between them (P = .059).” The reason why they declare this as “no difference” is because there is no “statistically significant difference” if P = .059. However, this does not mean that there is no difference.
First, if the convention for the cutoff value for significance was another arbitrary value, say 0.06, then this would have been a statistically significant finding.
Second, we should pay attention to the magnitude of the P value when interpreting the results. As per the definition above, the P value is simply the probability of a false-positive result. However, these probabilities may exceed 5% to varying degrees. For example, a false-positive probability of 80% (P = .80) is very different from a probability of 6% (P = .059), even though, technically, both are “nonsignificant.” A P value of .059 can be interpreted to mean that there is possibly some “signal” of real difference in the data. It may be that the study above was underpowered to detect the 30-percentage-point difference between the treatments as statistically significant; had the sample size been larger and thus provided greater power, then the finding could have been significant. Instead of reporting that there is no difference, it would be better to say that these results are suggestive of a difference, but that there was not enough power to detect it. Alternatively, P = .059 can be considered “marginally nonsignificant” to qualitatively differentiate it from larger values, say P = .80, which are clearly nonsignificant.
Third, a key distinction is that between clinical and statistical significance. In the example above, even though the study was not statistically significant (P = .059), a difference of 30 percentage points seems clinically important. The difference between clinical and statistical significance can perhaps be better illustrated with the opposite, and more common, mistake. As mentioned, a large sample size increases power, thus the ability to detect even minor differences. For example, if a study enrolls 100,000 participants in each arm, then even a difference of 0.2% between treatments A and B can be statistically significant. However, this difference is clinically irrelevant. Thus, when researchers report “statistically significant” results, careful attention must be paid to the clinical significance of those results. The purpose of studies is to uncover reality, not to be technical about conventions.
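The huge-sample effect can be made concrete with a z-test on invented counts. Whether a 0.2-percentage-point gap crosses P < .05 at a given sample size also depends on the baseline rate (smaller baselines give smaller standard errors); with a baseline around 1%, the gap is highly significant at 100,000 patients per arm:

```python
# Illustrative numbers: 1.2% vs 1.0% response with 100,000 patients per
# arm. The difference is clinically trivial but statistically strong.
import math

n = 100_000                      # patients per arm (illustrative)
x1, x2 = 1_200, 1_000            # 1.2% vs 1.0% responders
p1, p2 = x1 / n, x2 / n
pool = (x1 + x2) / (2 * n)
se = math.sqrt(pool * (1 - pool) * 2 / n)
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
print(f"difference = {p1 - p2:.1%}, z = {z:.2f}, P = {p_value:.2g}")
```

P comes out far below .001 here, yet few clinicians would change practice over a 0.2-point absolute difference.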
Multiple testing and P value
Finally, another almost universally ignored problem in clinical research papers is that of multiple testing. It is not uncommon to read papers in which the authors present results for 20 different and independent hypothesis tests, and when one of them has a P value less than .05 they declare it a significant finding. However, this is clearly mistaken. The more tests are made, the higher the probability of false positives. Imagine having 20 balls, only one of which is red. If you pick a random ball only once, you have a 5% probability of picking the red one. If, however, you try 10 different times, the probability of picking the red ball at least once is higher (approximately 40%). Similarly, if we perform only one test, then the probability of a false positive is 5%; however, if we perform many tests, then the probability of at least one false positive is higher than 5%.
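The ball-picking intuition has a closed form: with independent tests at a 5% threshold, the chance of at least one false positive across k tests is 1 − 0.95^k. The helper name below is ours, not a standard function:

```python
# Familywise false-positive probability across k independent tests,
# each run at significance level alpha.
def familywise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 10, 20):
    print(f"{k:2d} tests -> P(at least one false positive) = "
          f"{familywise_error(k):.0%}")
```

Ten tests already push the chance of a spurious "finding" to about 40%, matching the ball example, and twenty tests push it past 60%.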
There are three main ways to deal with this problem. The first is to designate a single primary outcome, declare statistical significance only for that outcome, and treat the other outcomes as exploratory. The second is to report on multiple findings and correct for multiple testing. The third is to report on multiple findings, but state explicitly in the paper that no correction for multiple testing was made and that the findings may therefore be significant by chance.
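The second option, correcting for multiple testing, is simplest with the Bonferroni method: divide the significance threshold by the number of tests. The P values below are hypothetical:

```python
# Bonferroni correction: with k tests, require P < alpha / k instead of
# P < alpha. The P values here are hypothetical illustrations.
def bonferroni_significant(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [p for p in p_values if p < threshold]

p_values = [0.001, 0.012, 0.030, 0.047, 0.20]  # 5 hypothetical tests
# Uncorrected, four of the five fall below .05; after Bonferroni
# (threshold .05 / 5 = .01), only one survives.
print(bonferroni_significant(p_values))
```

Bonferroni is conservative (it can miss real effects), which is one reason a single pre-specified primary outcome remains the cleanest solution.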
Conclusion
In summary, the P value is the probability of a false-positive finding, and the cutoff of .05 is arbitrary. Instead of dichotomizing results as “significant” and “nonsignificant” purely based on whether the P value is more or less than .05, a more qualitative approach that takes into account the magnitude of the P value and the sample size should be considered, and multiple testing should be taken into account when declaring significant findings.
Dr. Jovani is a therapeutic endoscopy fellow, division of gastroenterology and hepatology, Johns Hopkins Hospital, Baltimore.
References
1. Guyatt G et al. CMAJ. 1995;152:27-32.
2. Guller U and DeLong ER. J Am Coll Surg. 2004;198:441-58.
3. Greenland S et al. Eur J Epidemiol. 2016;31:337-50.
Introduction
Many clinicians consider the P value as an almost magical number that determines whether treatment effects exist or not. Is that a correct understanding?
In order to grasp the conceptual meaning of the P value, consider comparing two treatments, A and B, and finding that A is twice as effective as B. Does it mean that treatment A is better in reality? We cannot be sure from that information alone. It may be that treatment A is truly better than treatment B (i.e., true positive). However, it may also be that by chance we have collected a sample in which more people respond to treatment A, making it appear as more effective, when in reality it is equally effective as treatment B (i.e., false positive).How do we discern whether the first or the second scenario is true? The P value can help us with that. Conceptually, the P value can be thought of as the probability of observing these results (A is twice as effective as B) by chance if in reality there is no difference between A and B. It is therefore the probability of having a false-positive finding (also called type I or alpha error).
An arbitrary definition
If the P value is less than 5% (P less than .05), that means there is less than a 5% probability that we would observe the above results if treatment A and treatment B were in reality equally effective. Because this probability is very small, the convention is to reject the idea that both treatments are equally effective and declare that treatment A is indeed more effective.
The P value is thus a probability, and “statistical significance” depends simply on 5% being considered a sufficiently low probability to make chance an unlikely explanation for the observed results. As you can see, this is an arbitrary cutoff; it could have been 4% or 6%, and the concept would not have changed.1
Power
Thus, simply looking at the P value itself is insufficient. We need to interpret it in light of other information.2 Before doing that, we need to introduce a new related statistical concept, that of “power.” The power of a study can be conceptually understood as the ability to detect a difference if there truly is one. If there is a difference in reality between treatments A and B, then the power of a study is the ability to detect that difference.
Two factors influence power: the effect size (that is, the difference between A and B) and the sample size. If the effect size is large, then even with small samples we can detect it. For example, if treatment A was effective in 100% of the cases, and treatment B only in 10% of cases, then the difference will be clear even with a small number of patients. Conversely, if the effect size is small, then we would need a very large sample size to detect that difference. For example, if treatment A is effective in 20% of cases, and treatment B is effective in 22% of cases, the difference between them could be observed only if we enrolled a very large number of patients. A large sample size increases the power of a study. This has important implications for the interpretation of the P value.
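The interplay of effect size and sample size can be illustrated with a rough Monte Carlo sketch. The response rates below are the illustrative numbers from the text; the pooled two-proportion z-test, arm sizes, and simulation settings are our own assumptions, not a prescribed method.

```python
# Monte Carlo sketch of statistical power for comparing two response rates.
import math, random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

def estimated_power(rate_a, rate_b, n_per_arm, sims=2000, alpha=0.05, seed=1):
    """Fraction of simulated trials that reach P < alpha."""
    random.seed(seed)
    hits = 0
    for _ in range(sims):
        x1 = sum(random.random() < rate_a for _ in range(n_per_arm))
        x2 = sum(random.random() < rate_b for _ in range(n_per_arm))
        if two_prop_p_value(x1, n_per_arm, x2, n_per_arm) < alpha:
            hits += 1
    return hits / sims

# Large effect (100% vs. 10%): high power even with 20 patients per arm.
print(estimated_power(1.0, 0.10, 20))
# Small effect (20% vs. 22%): power stays low even with 500 patients per arm.
print(estimated_power(0.20, 0.22, 500))
```

The first scenario detects the difference almost every time; the second rarely does, showing why a small effect demands a far larger sample.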
How (not) to interpret the P value
Many clinicians do not consider other factors when interpreting the P value, and assume that the dichotomization of results as “significant” and “nonsignificant” accurately reflects reality.3
Authors may say something like the following: “Treatment A was effective in 50% of patients, and treatment B was effective in 20% of patients, but there was no difference between them (P = .059).” The reason they declare “no difference” is that there is no “statistically significant difference” if P = .059. However, this does not mean that there is no difference.
First, if the conventional cutoff for significance were another arbitrary value, say .06, then this would have been a statistically significant finding.
Second, we should pay attention to the magnitude of the P value when interpreting results. Per the definition above, the P value is simply the probability of a false-positive result. However, these probabilities may exceed 5% to varying degrees. For example, a false-positive probability of 80% (P = .80) is very different from one of 6% (P = .059), even though, technically, both are “nonsignificant.” A P value of .059 can be interpreted to mean that there is possibly some “signal” of a real difference in the data. It may be that the study above was underpowered to detect the 30-percentage-point difference between the treatments as statistically significant; had the sample size been larger, and the power therefore greater, the finding could have been significant. Instead of reporting that there is no difference, it would be better to say that these results are suggestive of a difference but that there was not enough power to detect it. Alternatively, P = .059 can be considered “marginally nonsignificant” to qualitatively differentiate it from larger values, say P = .80, which are clearly nonsignificant.
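To see how the same observed difference can move from “nonsignificant” to “significant” purely through sample size, consider the sketch below. The patient counts are invented, and a pooled two-proportion z-test is assumed for illustration.

```python
# The same observed difference (50% vs. 20% response) gives a smaller
# P value when the sample is larger. All counts below are hypothetical.
import math

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# 50% vs. 20% with 10 patients per arm: not significant (P ~ .16).
print(two_prop_p_value(5, 10, 2, 10))
# The same 50% vs. 20% with 40 patients per arm: significant (P ~ .005).
print(two_prop_p_value(20, 40, 8, 40))
```

Identical observed proportions, four times the patients: the P value drops well below .05, which is the sense in which a “marginally nonsignificant” result may simply reflect insufficient power.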
Third, a key distinction is that between clinical and statistical significance. In the example above, even though the result was not statistically significant (P = .059), a difference of 30 percentage points seems clinically important. The difference between clinical and statistical significance can perhaps be better illustrated with the opposite, and more common, mistake. As mentioned, a large sample size increases power, and thus the ability to detect even minor differences. For example, if a study enrolls 100,000 participants in each arm, then even a difference of 0.2% between treatments A and B may be statistically significant. However, this difference is clinically irrelevant. Thus, when researchers report “statistically significant” results, careful attention must be paid to the clinical significance of those results. The purpose of studies is to uncover reality, not to be technical about conventions.
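A sketch of this opposite mistake follows. All counts are hypothetical, and whether a 0.2-percentage-point difference reaches significance also depends on the baseline event rate; low assumed rates of 1.0% vs. 1.2% are used here so that the huge sample does produce a tiny P value.

```python
# Statistical-but-not-clinical significance: with 100,000 participants per
# arm and assumed event rates of 1.0% vs. 1.2%, a 0.2-percentage-point
# difference is statistically significant yet clinically trivial.
import math

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# 1,000 vs. 1,200 events out of 100,000 per arm: P well below .05.
p = two_prop_p_value(1000, 100_000, 1200, 100_000)
print(p)
```

The enormous sample makes a clinically negligible difference look statistically impressive, which is why the magnitude of the effect must be judged separately from the P value.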
Multiple testing and P value
Finally, another almost universally ignored problem in clinical research papers is that of multiple testing. It is not uncommon to read papers in which the authors present results for 20 different and independent hypothesis tests and, when one of them has a P value less than .05, declare it a significant finding. However, this is clearly mistaken. The more tests we perform, the higher the probability of a false positive. Imagine having 20 balls, only one of which is red. If you pick a random ball only once, you have a 5% probability of picking the red one. If, however, you draw (with replacement) 10 different times, the probability of picking the red ball at least once is higher (approximately 40%). Similarly, if we perform only one test, the probability of a false positive is 5%; if we perform many tests, the probability of at least one false positive is higher than 5%.
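The arithmetic behind this inflation is simple: for k independent tests, each at alpha = .05, the probability of at least one false positive is 1 − 0.95^k, as the short sketch below shows.

```python
# Why multiple tests inflate false positives: with k independent tests at
# alpha = .05, the chance of at least one false positive is 1 - 0.95**k.
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

print(round(familywise_error(1), 3))   # 0.05  -- a single test
print(round(familywise_error(10), 3))  # 0.401 -- the red-ball-in-10-tries example
print(round(familywise_error(20), 3))  # 0.642 -- 20 independent tests
```

With 20 independent tests, a “significant” result is more likely than not to appear even when no real effect exists anywhere.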
There are three main ways to deal with this problem. The first is to have only one main outcome, declaring statistical significance for that outcome alone and considering the other outcomes exploratory. The second is to report on multiple findings and correct for multiple testing. The third is to report on multiple findings but state explicitly in the paper that they have not been corrected for multiple testing and that the findings may therefore be significant by chance.
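As an illustration of the second approach, one common correction (the Bonferroni adjustment, used here as an assumed example; the p-values are invented) simply divides the significance threshold by the number of tests:

```python
# A minimal sketch of the Bonferroni correction for multiple testing:
# each p-value is compared against alpha / k instead of alpha.
def bonferroni_significant(p_values, alpha=0.05):
    """Return which of k tests remain significant after Bonferroni correction."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

# 20 hypothetical p-values; with a corrected threshold of .05 / 20 = .0025,
# only the smallest survives, even though four are below the naive .05 cutoff.
p_values = [0.001] + [0.03] * 3 + [0.20] * 16
print(bonferroni_significant(p_values))
```

The correction is conservative, but it keeps the overall probability of any false positive near the nominal 5%.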
Conclusion
In summary, the P value is the probability of a false-positive finding, and the cutoff of .05 is arbitrary. Instead of dichotomizing results as “significant” and “nonsignificant” purely based on whether the P value is more or less than .05, a more qualitative approach that takes into account the magnitude of the P value and the sample size should be considered, and multiple testing should be taken into account when declaring significant findings.
Dr. Jovani is a therapeutic endoscopy fellow, division of gastroenterology and hepatology, Johns Hopkins Hospital, Baltimore.
References
1. Guyatt G et al. CMAJ. 1995;152:27-32.
2. Guller U and DeLong ER. J Am Coll Surg. 2004;198:441-58.
3. Greenland S et al. Eur J Epidemiol. 2016;31:337-50.