Thursday, 23.03.2017

Comments by Leslie Morey

In reading Dr. de Leon’s interesting introduction to statistical concepts in clinical pharmacology, I was struck by some of the differences between the application of these concepts and the approach typical of my work in measurement in clinical psychology.  I thus offer a few comments in the hope of helping to elaborate some of the important statistical points raised by Dr. de Leon.

My first comment involves the use of “number needed to treat” (NNT) as an indicator of treatment effect.  This concept was introduced to provide a clinically meaningful indicator of effect size translated into units representing patients treated, which has particular advantages for estimating cost/benefit ratios of particular treatments.  However, in my work, I would rarely emphasize the use of NNT, in large part because of drawbacks associated with dichotomous scoring (implying “cured” vs. “not cured”) of treatment improvement; such a dichotomy is rarely appropriate in mental health treatment because it discards a large amount of information about patient change.  To illustrate some of the problems with a dichotomizing approach, consider the “50% reduction on the Hamilton Rating Scale for Depression (HAM-D)” guideline often used as a benchmark of depression treatment response.  First, the standard error of measurement of the HAM-D (which can be estimated using various reliability estimates, such as inter-rater reliability) is large enough that patients with, for example, a 51% reduction on the HAM-D and those with a 48% reduction are not meaningfully different, yet one is classed as a successful treatment and the other is not.  Additionally, consider two patients who might be classified with “severe depression” at baseline (Zimmerman et al., 2013):  Patient A with a baseline HAM-D score of 32 and Patient B with a baseline of 24.  If, at treatment termination, Patient A presents a score of 16 and Patient B scores 14, then Patient A is considered to have responded to treatment (a 50% reduction) while Patient B has not (a reduction of about 42%), despite the fact that Patient A continues to demonstrate greater depressive symptomatology than Patient B.
Furthermore, the decision to use a “50% reduction” is only one option; choosing it over other cutoffs is somewhat arbitrary, yet different cutoffs lead to different NNT estimates (as shown in slide 61 of the presentation).  Note that the drawbacks associated with the dichotomization of treatment response apply to the odds ratio as well.
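
The two-patient example can be checked with a few lines of Python.  This is only a minimal sketch: the function names are illustrative, and the 40% alternative cutoff is a hypothetical value chosen to show how the classification shifts, not a published guideline.

```python
def percent_reduction(baseline, final):
    """Percentage drop in HAM-D score from baseline to end of treatment."""
    return 100.0 * (baseline - final) / baseline


def is_responder(baseline, final, threshold=50.0):
    """Dichotomous label: True if the reduction meets or exceeds the cutoff."""
    return percent_reduction(baseline, final) >= threshold


# (baseline, final) HAM-D scores taken from the example in the text.
patients = {"A": (32, 16), "B": (24, 14)}

for name, (baseline, final) in patients.items():
    reduction = percent_reduction(baseline, final)
    print(
        f"Patient {name}: final score {final}, "
        f"reduction {reduction:.1f}%, "
        f"responder at 50%: {is_responder(baseline, final)}, "
        f"responder at 40% (hypothetical cutoff): "
        f"{is_responder(baseline, final, threshold=40.0)}"
    )
```

Patient A meets the 50% cutoff while ending treatment with the higher score, and Patient B changes from “non-responder” to “responder” if the cutoff is moved to 40%, which is the information loss and arbitrariness described above.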

A different set of issues with NNT arises, not from the assumption of dichotomous treatment response, but from the scaling of the metric.  The way that NNT is scaled means that it cannot take values between -1 and 1; furthermore, the null hypothesis of “no treatment effect” corresponds not to -1 or 1 but to positive or negative infinity.  As such, the use of confidence intervals with this metric is problematic.  For example, Kraemer and Kupfer (2006), discussing the relationship between the success rate with active treatment (sT) and control (sC), point out that “When sT and sC are similar, the difference in success rates in different studies might wobble back and forth across zero and thus NNT between positive and negative infinity. Because of this instability, to obtain confidence intervals or to use NNT in meta-analysis might produce peculiar results and is not advisable.”  Thus, the interpretations of NNT and statistical significance recommended by Dr. de Leon on, for example, slide 39 of the presentation, are potentially problematic because they are based upon confidence intervals of NNT that are themselves problematic.  My recommendation would be to use the typical statistical test suitable for the analysis in question (e.g., a chi-square test) to determine statistical significance, and to use NNT not as a significance test but as an expression of effect size that may have particular utility for cost/benefit analyses.
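
The instability that Kraemer and Kupfer describe can be illustrated numerically.  The sketch below assumes the usual definition NNT = 1 / (sT - sC) and, as a simplifying assumption, a large-sample (Wald) interval for the difference in success rates; the patient counts are hypothetical and the function names are illustrative.

```python
import math


def nnt(s_t, s_c):
    """Number needed to treat: reciprocal of the difference in success rates."""
    diff = s_t - s_c
    if diff == 0:
        return math.inf  # no treatment effect maps to infinity, not to 1 or -1
    return 1.0 / diff


def risk_difference_ci(x_t, n_t, x_c, n_c, z=1.96):
    """Approximate 95% Wald interval for sT - sC (a simplifying assumption)."""
    s_t, s_c = x_t / n_t, x_c / n_c
    se = math.sqrt(s_t * (1 - s_t) / n_t + s_c * (1 - s_c) / n_c)
    diff = s_t - s_c
    return diff - z * se, diff + z * se


# Hypothetical trial: 30 of 100 respond on treatment, 25 of 100 on control.
lo, hi = risk_difference_ci(30, 100, 25, 100)
print(f"NNT point estimate: {nnt(0.30, 0.25):.1f}")
print(f"Interval for sT - sC: ({lo:.3f}, {hi:.3f})")
# Naively inverting the two limits gives numbers that do not bracket the estimate.
print(f"Reciprocals of the limits: {1 / lo:.1f} and {1 / hi:.1f}")
```

With these hypothetical counts the difference in success rates is 0.05 (an NNT of 20), but its interval spans zero, so inverting the limits gives roughly -13.6 and 5.8.  Those two values do not bracket the point estimate of 20, because the plausible NNT values run from about 5.8 up to infinity and from negative infinity up to about -13.6.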

My second set of comments addresses what Dr. de Leon describes as the “standardized mean difference” (SMD), which is a metric that I use very frequently as a measure of effect size and which I believe has numerous advantages over NNT for mental health applications.  First, I would note that many in the field use the term “Cohen’s d” (e.g., as described in Cohen, 1988) rather than SMD, and readers should be aware that these refer to the same metric and that both terms are likely to be encountered quite often in the treatment literature.  I would also encourage readers to familiarize themselves with the meaning of this statistic; it represents:



$$ d = \frac{\text{Mean (treatment)} - \text{Mean (control)}}{\text{Pooled standard deviation}} $$


Simply put, d (or SMD) is the difference between the treatment group mean and the control group mean (typically at post-treatment, assuming random assignment), divided by the pooled standard deviation; in other words, it is the difference in group means expressed in standard deviation units.  SMD is purely a measure of effect size and cannot be used in isolation to determine statistical significance.  The CI around SMD, however, is affected by the sample size, and thus including the CI supplies the information needed for significance testing.  It is worth noting that the CIs for SMD that Dr. de Leon includes in his presentation (e.g., slide 77) thus reflect the sample size of the study: as the sample size gets larger, the CI around the SMD becomes narrower.  As such, statistical significance is a combined function of effect size (SMD) and sample size (influencing our confidence in replicability); large effects can be statistically significant with relatively small samples, while very small effects can be statistically significant with very large samples.
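
The relationship between SMD, sample size, and the width of the CI can be sketched in Python.  The group means, standard deviations, and sample sizes below are hypothetical, the function names are illustrative, and the interval uses a common large-sample approximation to the standard error of d, which is an assumption here rather than the method used in the presentation.

```python
import math


def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def approx_ci(d, n_t, n_c, z=1.96):
    """Approximate 95% CI for d using a large-sample standard error."""
    se = math.sqrt((n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c)))
    return d - z * se, d + z * se


# Hypothetical improvement scores: mean 13 (treatment) vs. 10 (control), SD 6.
for n_per_group in (20, 200):
    d = cohens_d(13, 10, 6, 6, n_per_group, n_per_group)
    lo, hi = approx_ci(d, n_per_group, n_per_group)
    print(f"n = {n_per_group} per group: d = {d:.2f}, CI = ({lo:.2f}, {hi:.2f})")
```

The effect size is 0.5 in both cases, but with 20 patients per group the approximate interval (about -0.13 to 1.13) includes zero, whereas with 200 per group it narrows to about 0.30 to 0.70 and excludes zero.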

One advantage of recognizing that SMD and Cohen’s d are the same metric is that Cohen offered some oft-repeated conventions for this effect size, under which a value of 0.2 to 0.3 might be considered a “small” effect, around 0.5 a “medium” effect, and 0.8 or above a “large” effect (Cohen, 1988; 1992).  However, it should be noted that Cohen himself pointed out that these rules of thumb are problematic if applied blindly.  As a further consideration, note that a “meta-analysis of meta-analyses” conducted by Lipsey and Wilson (1993) found that many established treatments (including medical, psychological, and educational interventions) demonstrate effects that are “small” by these conventions, yet nonetheless reflect best practices.

In summary, Dr. de Leon provides an excellent overview of many important issues and concepts that are needed to be a sophisticated consumer of the treatment literature.  Hopefully my comments have helped to provide additional context for interpreting NNT and SMD as metrics of treatment effect.



Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd Ed.). New York: Academic Press.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry, 59(11), 990-996.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.

Zimmerman, M., Martinez, J. H., Young, D., Chelminski, I., & Dalrymple, K. (2013). Severity classification on the Hamilton depression rating scale. Journal of Affective Disorders, 150(2), 384–388.


Leslie Morey
January 7, 2016