Post-hoc power – Alone in the universe

Prologue

First, it is important to describe what post-hoc power really means. Post-hoc power is the power of a test calculated from the effect size observed and the sample size used in the study. For this reason, post-hoc power is often labeled observed power.

If the effect size, parameter estimates and sample size of an experiment are known, the p-value of the test statistic can be calculated. Hence, post-hoc power and the p-value are in a 1:1 relationship: each one determines the other. The magic of post-hoc power is described really nicely by Zad Chow in his blog post [1].
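To make the 1:1 relationship concrete, here is a minimal sketch that computes observed power from nothing but the p-value. It assumes a two-sided z-test (a test statistic that is standard normal under the null) and a significance level of 0.05; the function name `observed_power_from_p` is only illustrative and is not taken from any of the cited papers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def observed_power_from_p(p_value: float, alpha: float = 0.05) -> float:
    """Observed (post-hoc) power of a two-sided z-test, computed from p alone.

    Assumption: the test statistic is standard normal under the null.
    """
    # Absolute z statistic that produces this two-sided p-value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)
    # Critical value for the chosen significance level.
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of rejecting the null if the true standardized effect
    # were exactly equal to the observed one.
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05, 0.20, 0.50, 0.90):
        print(f"p = {p:<5}  observed power = {observed_power_from_p(p):.3f}")
```

No sample size and no effect size enter the function: under these assumptions the p-value is all that is needed. At p = 0.05 the observed power is about 50%, and it only gets smaller as the p-value grows.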

Post-hoc, or observed, power probably crept into the scientific literature in the 1980s. Some journals started to promote the reporting of post-hoc power in their publications. In some journals, the instructions for authors demanded that post-hoc power be reported whenever a nonsignificant result was reported. Yuan and Maxwell (2005) wrote:

For example, the instructions to authors for Animal Behavior state that “Where a significance test based on a small sample size yields a nonsignificant outcome, the power of the test should normally be quoted in addition to the P value” (2001, vi).

Ref [2]

The current instructions of Animal Behaviour (AB) state:

Providing a value for power based on a priori tests is preferred. Values of observed power are not appropriate.

Ref [3]

It seems that AB wants to distance itself from some poorly defined instructions of the past.

Based on a short and non-systematic literature review, it seems that post-hoc power has very rarely been promoted in the actual peer-reviewed literature. SD Smith, an ophthalmologist and MD/MPH, wrote in 2008:

In cases in which statistical significance is not achieved, a post hoc power calculation based upon the actual data (rather than the assumptions made before data collection) and a particular clinically meaningful treatment effect can also be useful to estimate the probability that type II error has occurred. If this probability is large, then additional data are likely to be needed to clarify whether the treatment is efficacious.

Ref [4]

One factor that has presumably promoted the use of post-hoc power is the statistical software SPSS. When selecting output options in SPSS, for example for a general linear model, there is an option to include observed power, which is the same as post-hoc power.

Observed power reported in the SPSS output is discussed in many references:

Similarly, Pallant (2001) states “Some of the SPSS programs also provide an indication of the power of the test that was conducted, taking into account effect size and sample size. If you obtain a non-significant result and are using quite a small sample size, you need to check these power values” (p 173).

Ref [2]

The same principles are outlined in a book published just last year:

Ref [5]

Debunking post-hoc power

As one might expect, there is a vast body of literature about post-hoc power written by experts and professionals in statistics. Goodman and Berlin wrote as early as 1994:

For interpretation of observed results, the concept of power has no place, and confidence intervals, likelihood, or Bayesian methods should be used instead.

Ref [6]

Probably the most influential paper on the topic was published in 2001 by Hoenig and Heisey. They state:

Observed power can never fulfill the goals of its advocates because the observed significance level of a test (“p value”) also determines the observed power; for any test the observed power is a 1:1 function of the p value

Ref [7]

Power calculations tell us how well we might be able to characterize nature in the future given a particular state and statistical study design, but they cannot use information in the data to tell us about the likely states of nature

Ref [7]

Similar principles have been outlined countless times since. Lenth (2007) states:

The tables in this article demonstrate clearly that PHP is just a re-expression of the P value; and in fact, once one gets past 20 degrees of freedom or so (for the denominator), PHP does not even depend much on sample size for a given type of test. Thus, as a retrospective measure of the results of the current study, PHP is just elaboration, not new information.

Ref [8]

If that is not clear enough, let’s take a final quote from O’Keefe (2010):

So where post hoc power refers to “the power of the test assuming a population effect size exactly equal to the observed sample effect size,” such power figures do not provide much helpful information.

Ref [9]

It is evident that a power calculation using the observed effect is of absolutely no use once the study data are available. My take is that the proponents of post-hoc power must have some misconception or misunderstanding of the basic concepts of hypothesis testing or statistical power.

Modern history of post-hoc power

In April 2018, a surgical perspective was published in the Annals of Surgery. Annals is the second highest ranking journal in the field of surgery with an IF of 9.476 as of 2018. This perspective was titled “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science” [10] and it was published by a group of surgeons from the respected Massachusetts General Hospital in Boston, USA.

In this perspective, the authors discuss a very important topic, namely type 2 error in surgical trials. A type 2 error means that no significant difference was observed when in fact there is a true difference between the groups. The authors rightly state:

The belief that the absence of difference is synonymous with evidence of equivalence is erroneous.

Ref [10]

What follows can only be labeled a terrible effort to revive the infamous post-hoc power.

But, as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study.

Ref [10]

Indeed, these authors suggest that future studies should disclose post-hoc power whenever a nonsignificant result is reported. They also state:

For example, instead of merely reporting a finding of “no difference,” future surgeon-scientists should report a negative result with a disclosure of power, “no difference with a power of X”.

Ref [10]

This is very similar to what many journals promoted some 20 years ago. The proposal drew numerous responses from around the world. My colleague Olli Helminen and I responded within a few weeks. We just restated what had already been written before:

Post-hoc power is nothing more than a report of P value in a different way and therefore provides no answer to type 2 error. Nonsignificant studies always have low observed power.

Ref [11]

Other responses, by Andrew Gelman and by a group of surgeons from Utrecht University Hospital, followed very soon [12,13]. The authors of the proposal responded to the criticism, although it took some time. The title of their response was very bold: “Post Hoc Power: A Surgeon’s First Assistant in Interpreting “Negative” Studies” [14]. Just based on the title, it was quite obvious what the authors thought about the criticism. They stated:

We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. As such, we believe that post hoc power is not wrong, but instead a necessary first assistant in interpreting results.

Ref [14]

They continue:

In fact, others have reported that in some circumstances, post hoc power calculations are a useful supplement to P values (9).

Ref [14]

They cite the paper by O’Keefe, in which O’Keefe clearly writes:

But where after-the-fact power analyses are based on population effect sizes of independent interest (as opposed to a population effect size exactly equal to whatever happened to be found in the sample at hand), they can potentially be useful.

Ref [9]
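The distinction O’Keefe draws can be sketched in a few lines. The example below computes power for a population effect chosen in advance, independently of the data. It assumes a two-sided z-test comparing two equally sized groups with a known common standard deviation; the function name and the numbers in the usage note are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def power_for_effect_of_interest(
    delta: float, sd: float, n_per_group: int, alpha: float = 0.05
) -> float:
    """Power of a two-sided, two-sample z-test to detect a difference `delta`.

    Assumptions: equal group sizes and a known common standard deviation `sd`.
    `delta` is an effect of independent interest (for example a minimal
    clinically important difference), not the effect seen in the sample.
    """
    # Standard error of the difference between the two group means.
    se = sd * sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Where the test statistic is centered if the true difference is delta.
    shift = delta / se
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)
```

For instance, `power_for_effect_of_interest(delta=0.5, sd=1.0, n_per_group=64)` gives the power to detect a half-standard-deviation difference with 64 participants per group. Nothing observed in the study enters this calculation, so the answer is the same whatever p-value the study happened to produce. That is what separates it from power based on the observed effect.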

It seems that the authors had completely mixed up the different power concepts. Another statistical gem in their response was the following:

Given the uniqueness of surgical science, we assert that reporting both P value and power improves communication of the limitation of results: the P value is reflexively associated with the risk of type 1 error, while the power alerts the general audience of the risk of type 2 error.

Ref [14]

I have no words. Anyway, my colleague and I, as well as Andrew Gelman, responded again [15,16]. Regarding the citation above, we wrote:

Bababekov and Chang continue with ‘‘In fact, others have reported that in some circumstances, post-hoc power calculations are a useful supplement to p-values’’ and cite the paper by O’Keefe as we did. (5) We suggest the authors to read and assimilate associated literature before making such claims. In the cited paper, (5) O’Keefe writes, ‘‘In any case, the larger point is that after-the-fact power analyses can sometimes be a useful supplement to p-values and confidence intervals, but only when based on population effect magnitudes of independent interest.’’ Bababekov and Chang do not seem to grasp the idea of after-the-fact power.

Ref [15]

The authors responded again, after almost half a year. We sent at least two queries to the editor asking about the authors’ response. They finally responded, but by this time the discussion had gone off track. They concluded in their response:

We hope that this healthy exchange of ideas will bring out others with communication expertise to join forces in exploring additional solutions beyond the study design stage.

Ref [17]

It was futile to continue the discussion.

Story continues

In April 2019, the investigators from MGH published something really controversial. Their study, published in the Journal of Surgical Research, was titled “Is the Power Threshold of 0.8 Applicable to Surgical Science?—Empowering the Underpowered Study” [18]. They searched for surgical RCTs and observational studies, took those with a nominally nonsignificant result, and reported the post-hoc power of these studies.

The post hoc power for the primary outcome of each article was calculated given the observed effect size for each study.

Ref [18]

Besides using and reporting results obtained with a flawed and useless statistic, their study has one strange issue. They selected only studies with nonsignificant findings and calculated post-hoc power for them. As has been clearly shown, because of the 1:1 relationship between the p-value and post-hoc power, all studies with a p-value greater than 0.05 have a post-hoc power of less than 50%. They report, however, that some 6 out of 69 studies with nonsignificant findings had a post-hoc power higher than 50%.
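The ceiling on post-hoc power for nonsignificant results (p above 0.05) can also be checked empirically. The following is a rough simulation sketch, not a reconstruction of the analysis in [18]: it assumes two-group studies with normally distributed outcomes, a known standard deviation of 1 and a two-sided z-test, and all names and parameter values are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1 - ALPHA / 2)


def simulate_study(rng: random.Random, n_per_group: int, true_diff: float):
    """Return (p_value, observed_power) for one simulated two-group study.

    Assumptions: normal outcomes with known sd = 1, analysed with a z-test.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
    se = sqrt(2 / n_per_group)  # standard error of the mean difference
    z = abs(fmean(treated) - fmean(control)) / se
    p_value = 2 * (1 - STD_NORMAL.cdf(z))
    # Power computed as if the true effect equalled the observed one.
    observed_power = STD_NORMAL.cdf(z - Z_CRIT) + STD_NORMAL.cdf(-z - Z_CRIT)
    return p_value, observed_power


def max_power_among_nonsignificant(
    n_studies: int = 1000, n_per_group: int = 30, true_diff: float = 0.3, seed: int = 1
):
    """Count the nonsignificant studies and return their largest observed power."""
    rng = random.Random(seed)
    results = [simulate_study(rng, n_per_group, true_diff) for _ in range(n_studies)]
    nonsignificant = [power for p, power in results if p > ALPHA]
    return len(nonsignificant), max(nonsignificant)


if __name__ == "__main__":
    count, highest = max_power_among_nonsignificant()
    print(f"{count} nonsignificant studies, highest observed power = {highest:.3f}")
```

However the true difference and the sample size are set, the highest observed power among the nonsignificant studies should stay at or below roughly 50% under these assumptions, because a p-value above 0.05 puts the observed z statistic below the critical value.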

The response from the research community was immediate. The study was widely discussed on Twitter [19], in discussion forums [20] and on blogs [21]. Dozens of comments were submitted to PubPeer [22]. Numerous experts and professionals in statistics demanded a retraction of the article.

A few months after the article was published on the JSR website, it was noted that the PDF version of the article had been changed without any correction note.

Soon afterwards an explanation was given, and the supplementary material is now available again.

The authors have not clearly responded to any of the criticism raised. Retraction Watch wrote a story [23] about all this, in which the authors responded in some way.

Senior author David Chang feels “online trolls” are presenting one-sided arguments, which do not justify retraction.

Chang believes statisticians do not appreciate the practical context in which his paper was written.

Ref [23]

We will see whether labeling critics as “trolls” continues in scientific communication.

Closing remarks

It seems that statistical review of scientific papers has been poor in other fields too, even in major medical journals. Just this year, the following was stated in a paper published in JAMA Internal Medicine:

We conducted a post hoc power calculation comparing 30-day mortality between the high-bioavailability and low-bioavailability groups in our cohort but had only 11% power to detect the difference that we observed with our study sample size. All studies to date, including ours, remain underpowered to determine whether the bioavailability of oral agents is less important for uncomplicated Enterobacteriaceae bacteremia after an appropriate clinical response has been observed and source control has been achieved.

Ref [24]

At least one response commenting on this issue has been submitted to the journal website [25].

At this moment, there is a perspective in the Annals of Surgery that recommends adding disclosure of post-hoc power to the CONSORT and STROBE guidelines, and an article that investigates post-hoc power in the surgical literature and thereby promotes its use. Both the perspective and the article are just nonsense. Understandably, the criticism, especially from the statistical community, has been massive. The near future will show how things turn out.

Edit 5/7/2019: Minor problem in ref numbering corrected.

References

  1. http://lesslikely.com/statistics/observed-power-magic
  2. Yuan & Maxwell (2005),
    https://journals.sagepub.com/doi/10.3102/10769986030002141
  3. Animal Behaviour. Guide for Authors.
    https://www.elsevier.com/journals/animal-behaviour/0003-3472/guide-for-authors
  4. Smith (2008).
    https://www.aaojournal.org/article/S0161-6420(07)01361-9/fulltext
  5. George, Mallery (2018). Book. Direct link to chapter.
  6. Goodman & Berlin (1994).
    https://annals.org/aim/article-abstract/707593/use-predicted-confidence-intervals-when-planning-experiments-misuse-power-when?volume=121&issue=3&page=200
  7. Hoenig & Heisey (2001).
    https://www.tandfonline.com/doi/abs/10.1198/000313001300339897
  8. Lenth (2007).
    https://stat.uiowa.edu/sites/stat.uiowa.edu/files/techrep/tr378.pdf
  9. O´Keefe (2010).
    https://www.tandfonline.com/doi/abs/10.1198/000313001300339897
  10. Bababekov et al. (2018). https://journals.lww.com/annalsofsurgery/Fulltext/2018/04000/A_Proposal_to_Mitigate_the_Consequences_of_Type_2.6.aspx
  11. Helminen & Reito (2018). https://journals.lww.com/annalsofsurgery/Fulltext/2019/01000/Letter_to_Editor__A_Proposal_to_Mitigate_the.47.aspx
  12. Gelman (2019). https://journals.lww.com/annalsofsurgery/Fulltext/2019/01000/Don_t_Calculate_Post_hoc_Power_Using_Observed.46.aspx
  13. Plate et al. (2019). https://journals.lww.com/annalsofsurgery/Fulltext/2019/01000/Post_Hoc_Power_Calculation__Observing_the_Expected.48.aspx
  14. Bababekov & Chang (2019). https://journals.lww.com/annalsofsurgery/Fulltext/2019/01000/Post_Hoc_Power__A_Surgeon_s_First_Assistant_in.49.aspx
  15. Helminen & Reito (2019). https://journals.lww.com/annalsofsurgery/Citation/publishahead/Comment_on__Post_Hoc_Power__A_Surgeons_s_First.95171.aspx
  16. Gelman (2019). https://journals.lww.com/annalsofsurgery/Citation/publishahead/Post_hoc_Power_Using_Observed_Estimate_of_Effect.95343.aspx
  17. Bababekov et al. (2019). https://journals.lww.com/annalsofsurgery/Citation/publishahead/Response_to_the_comment_on__Misinterpretation_of.95142.aspx
  18. Bababekov et al. (2019). https://www.journalofsurgicalresearch.com/article/S0022-4804(19)30185-4/fulltext
  19. https://plu.mx/plum/a/?doi=10.1016/j.jss.2019.03.062&theme=plum-jbs-theme&hideUsage=true
  20. https://discourse.datamethods.org/t/observed-power-and-other-power-issues/731
  21. https://statmodeling.stat.columbia.edu/2019/05/04/post-hoc-power-pubpeer-dumpster-fire/
  22. https://pubpeer.com/publications/4399282A80691D9421B497E8316CF6
  23. https://retractionwatch.com/2019/06/19/statisticians-clamor-for-retraction-of-paper-by-harvard-researchers-they-say-uses-a-nonsense-statistic/
  24. Tamma et al. (2019). https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2720756?resultClick=1
  25. Timbrook & Burnham (2019). https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2734932
