Thomas A. Ban
Neuropsychopharmacology in Historical Perspective.

Education in the field in the Post-Psychopharmacology Era

Collated 15


Per Bech: Clinical Psychometrics. Oxford: John Wiley & Sons Ltd; 2012. (202 pages)


Per Bech                    July 11, 2013                         review

Martin Katz                August 1, 2013                       comment

Per Bech                    August 29, 2013                     reply to Katz

Donald Klein              October 31, 2013                   comment

Martin Katz                December 1, 2013                 comment on Klein’s comment

Per Bech                    January 14, 2014                   reply to Klein’s comment

Don Klein                   March 27, 2014                      response to Bech’s reply and  

                                                                              reply to Katz’s comment

Martin Katz                April 17, 2014                        question to Klein

Donald Klein              April 24, 2014                        answer to Katz’s question

Per Bech                   May 29, 2014                          comment on  Klein’s answer

Martin Katz               June 5, 2014                           response to Klein’s answer

Hector Warnes          November 3, 2016                   comment

Per Bech                   January 26, 2017                     reply to Warnes’ commentary

Aitor Castillo             Dec 22, 2016                           comment

Per Bech                   February 9, 2017                     reply to Castillo’s comment


Review by Per Bech

        The Danish original of this monograph was published by Munksgaard, Copenhagen, in 2011 with the title: Klinisk psykometri.

INFORMATION ON CONTENTS: Clinical Psychometrics is divided into 10 parts, including the “Introduction,” “Summary and perspectives” (Part 9), and the last part (Part 10) with the title “Who is carrying Einstein’s baton?” Part 1 deals with “classical psychometrics” (Kraepelin, Spearman, Hotelling, Eysenck); Part 2 with “modern psychiatry - DSM-IV/ICD-10”; Part 3 with “modern dimensional psychometrics” (Fisher, Rasch, Siegel, Mokken); and Part 4 with “modern psychometrics – item categories” (Likert, Overall, Cohen).  Of the remaining four parts, three discuss “the clinical consequences of IRT (item response theory) analysis,” and one (Part 8) addresses the possibility of using “questionnaires” as “blood tests.” Of the three parts that deal with the “clinical consequences of IRT analysis,” one (Part 5) is dedicated to the “pharmacopsychometric triangle,” another (Part 6) to “health related quality of life,” and the third to the “concept of stress.” The volume is complemented by a “glossary,” “appendices,” “references” and an “index.”

AUTHOR’S STATEMENT: The central concept of this book is the “Pharmacopsychometric Triangle” in which (A) covers the desired clinical effect of a drug, (B) the unwanted, or side effects produced by the drug, and (C) the patient-reported quality of life as a balance between (A) and (B), covering the “mental dimension of health.”

        Measurement-based care, as evaluated by the “Pharmacopsychometric Triangle,” is discussed in this book with reference to brief, clinically and psychometrically valid scales whose total scores are “sufficient statistics.” “Effect-size” (response) and “number needed to treat” (remission) are two other important statistics.
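
        As a minimal sketch of the two statistics just named, the following Python fragment computes a standardized effect size (Cohen's d with a pooled standard deviation) and a number needed to treat from remission rates. The function names and all trial figures are hypothetical and serve only as illustration; they are not taken from the book.

```python
import math


def cohens_d(mean_drug, mean_placebo, sd_drug, sd_placebo, n_drug, n_placebo):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (n_drug - 1) * sd_drug ** 2 + (n_placebo - 1) * sd_placebo ** 2
    ) / (n_drug + n_placebo - 2)
    return (mean_drug - mean_placebo) / math.sqrt(pooled_var)


def number_needed_to_treat(remission_drug, remission_placebo):
    """Reciprocal of the absolute difference in remission rates."""
    return 1.0 / (remission_drug - remission_placebo)


# Hypothetical trial: mean reduction in total score from baseline to endpoint.
effect_size = cohens_d(10.0, 7.0, 6.0, 6.0, 100, 100)

# Hypothetical remission rates: 40% on drug, 25% on placebo.
nnt = number_needed_to_treat(0.40, 0.25)

print(f"effect size = {effect_size:.2f}, NNT = {math.ceil(nnt)}")
```

        By convention the number needed to treat is rounded up to the next whole patient, which is why the sketch applies math.ceil before printing.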

        From a purely psychometric point of view, it is the “item response theory model” and not “factor analysis” that is recommended for showing that the summed total score of a brief scale is a “sufficient statistic.” The item response theory model describes the difficulty of each individual item, so that the items of a scale can be rank-ordered and then added to give a total score.
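
        What “sufficient statistic” means can be sketched numerically. The fragment below assumes the simplest item response model, the dichotomous Rasch model, in which the probability of endorsing an item depends only on the difference between the patient's severity and the item's difficulty. The item difficulties and severity values are hypothetical. The sketch checks that, once the total score is known, the probability of a particular response pattern no longer depends on severity.

```python
import itertools
import math


def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b at severity theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pattern_prob(pattern, theta, difficulties):
    """Probability of a full 0/1 response pattern at severity theta."""
    p = 1.0
    for x, b in zip(pattern, difficulties):
        q = rasch_prob(theta, b)
        p *= q if x == 1 else 1.0 - q
    return p


def conditional_pattern_prob(pattern, theta, difficulties):
    """Probability of the pattern given its total score."""
    total = sum(pattern)
    same_total = [
        y
        for y in itertools.product((0, 1), repeat=len(difficulties))
        if sum(y) == total
    ]
    denom = sum(pattern_prob(y, theta, difficulties) for y in same_total)
    return pattern_prob(pattern, theta, difficulties) / denom


difficulties = [-1.0, 0.0, 1.5]  # hypothetical item difficulties
pattern = (1, 0, 1)              # one pattern with total score 2

mild = conditional_pattern_prob(pattern, -0.5, difficulties)
severe = conditional_pattern_prob(pattern, 2.0, difficulties)
print(mild, severe)  # the two values agree up to rounding error
```

        Because the conditional probability is the same for a mildly and a severely ill patient, the pattern carries no information about severity beyond the total score. That is the sense in which the total is sufficient under this model.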

        This is the first book that identifies, by the “item response theory model,” the valid scales for measuring the effects of “antipsychotic,” “antimanic,” “antidepressant” and “antianxiety” activity of drugs within the “Pharmacopsychometric Triangle.” The author was first, in the late 1970s, to introduce the “item response theory model” in clinical psychiatry.

        In the Appendix of this book, Max Hamilton’s lecture from 1977 on the clinical validity of depression scales is presented. It was one of Hamilton’s most important presentations and has never been published internationally. The Appendix also includes illustrations of both the “item response theory model” and “principal component analysis.”

July 11, 2013


Martin M. Katz’s comment

        There are several somewhat unusual aspects to Per Bech’s book  Clinical Psychometrics. First, despite the great need for a historical treatment of how the relatively new science, neuropsychopharmacology, developed quantified methods for psychopathology and the capacity to measure treatment-induced change, no one has come forth to do this important job. Bech not only provides the historical perspective but he manages by surveying recent research to sort out the various rating and other psychological methods that have been developed over several decades, highlighting the continuing controversies that exist in regard to measurement strategy and technical details that underlie method development.  We expect a psychologist to write this type of book.  It is unusual, of course, that Bech, as a psychiatrist, has fortunately most of the skills to carry off this very complicated task.

        This is not a book, however, that psychiatrists will rush to buy.  They are not generally comfortable with quantifying their clinical judgments and have rather little exposure to any training in this area.  Contrary to the general belief that psychologists have paved the way for the construction and acceptance of rating methods in clinical research, Bech presents another view.  He identifies Kraepelin and Hamilton, two of the most prominent psychiatrists on the world scene, as the leaders here.  By making a case for that conclusion, he might inadvertently enlist a great many psychiatrists in the further development of this field.  Bech actually balanced this view in the text by also describing the prominence of Galton, Spearman, Eysenck, the contributions of Maurice Lorr and John Overall and several other psychologists. To fully appreciate what is covered, e.g., which scales are currently available and what they are capable of measuring, we note the clarity with which he presents this information and his particular perspective on the right kind of strategy and associated technologies for constructing these instruments.

        Bech classifies test development into two periods, the “classical” and “modern”. In describing factor analysis he contrasts supporters of the two-factor approach with those who rely on rotations and thereby uncover a multi-factorial structure of psychopathology.  Beyond that he cites limitations of factor analysis, generally, pointing out that it cannot be used to validate phenomena, and more importantly, is not designed to develop methods, but only to provide classification of variables.  He appears to be convinced that the age of factor analysis is over and that the field should move on to the use of the "item response" model.  He sees the latter method as better suited to solving the problems in this field.  I am not sure here, however, that his glossary definitions of “validity,” which stress clinical significance and unidimensionality, correspond to the commonly accepted psychometric definition; i.e., the simpler notion that validity is the extent to which a method measures what it purports to measure.  I, therefore, think I understand his stance on the number of factors, but take issue with his conclusion.  He, like Max Hamilton and Pierre Pichot, appears committed to brief scales and the two-factor approach.  Those on the other side of the issue conceive of each of the disorders as multifaceted and utilize factor analysis to uncover their dimensional structures.

        Thus, the factor analysts view it as a data reduction method aimed at uncovering the two or more components that can most parsimoniously explain what the method is actually measuring.  Further, when the disorder is conceived to be multidimensional, it is then necessary to identify each of the components, and from the factor analysis results, create ways of quantifying them.  Currently, that is done through principal components analysis and rotation.  Bech presents thoughtful views on these matters but does not do justice to the multifactor approach.  A historical example of the contrasting lines of thinking here is where he focuses on the Hamilton Depression Scale (Ham-D) and the Brief Psychiatric Rating Scale (BPRS), but provides limited information on their predecessors, the Wittenborn Psychiatric Scales and Lorr’s Inpatient Multidimensional Psychiatric Scales (IMPS), both multifactorial scales.  In these two cases, the authors’ targets were the facets of psychopathology and the importance of developing a set of items for each of these facets.  The basic psychometric principle followed was that more reliable and valid measures of the components, e.g., “anxiety,” can be achieved by having the judges rate a set of observed behaviors that reflect that component, than by having an observer rate a more complex, global concept such as “anxiety.”  It was Lorr, as Bech points out, who wrote and compiled the 63 items and determined the factor structure of psychopathology.  Overall and Gorham used Lorr’s factors to craft global definitions based on interpretation of his factor items, in order to create their 16 “global” items for the original BPRS.  We are aware of how well the BPRS, used in hundreds of studies, worked these many years, particularly in the evaluation of change in overall severity of the disorder in drug trials.  But when it comes to reliably and validly measuring the dimensions of psychopathology, equally important in the science, the IMPS is a more effective instrument and applicable to a wider range of problems in clinical research.

        This was perhaps the only shortcoming I could find in this otherwise balanced and clear-headed judgment of the major issues in our field.  For psychiatry, Bech highlights, in reviewing the history of the rating scales, that Kraepelin constructed his own rating method, to be followed by scales developed and modified by Max Hamilton and Pierre Pichot, all three attempting to create a functioning science for psychiatry.  Their focus on the importance of scales will no doubt surprise psychiatrists and may prove a positive influence on their approach to them in clinical practice.  The history Bech presents is inspiring.  Not only does he elevate rating scales in the minds of researchers and clinicians, but he also succeeds, following the philosophers Jaspers and Wittgenstein, in restoring respect for the phenomenologic approach to characterizing the nature of psychopathology.

        I heartily recommend this book as a text for Clinical Methods courses for psychologists and psychiatrists.  I view Per Bech’s effort as filling a significant gap in the practice of current clinical research and an important contribution to the science of psychopathology.


August 1, 2013


Per Bech’s reply to Martin M. Katz’s comment

        In the scientific game, the dialogue between the author of a submitted manuscript and the journal’s referees is quite essential.

        Traditionally, however, the dialogue between the author of a book and an invited reviewer of this book is considered outside the scope of this game, and quite inappropriate. The progressive aspect of the INHN is to break down this tradition, so as to increase the interest in scientific games in the name of research.

        Martin Katz’ review of my Clinical Psychometrics has exactly captured my reason for publishing this book, namely by its reference to Kraepelin, Hamilton and Pichot to awaken the minds of clinical psychiatrists to the uses of rating scales.  Generally, psychiatrists are, as pointed out by Martin Katz, uncomfortable with the quantifying of their clinical judgments and they often have rather little exposure to any training with these scales.  A shortcoming of my book, as stated by Martin Katz, is the treatment of factor analysis.  With reference to Ockham's razor, i.e., the principle of simplicity (the law of parsimony), I have preferred to focus on the first two components when interpreting the results from a principal component analysis.  My book is essentially concerned with the psychometric validation procedure which demonstrates whether items in a rating scale can objectively measure dimensions of clinical severity (e.g. degrees of schizophrenicity, degrees of depressiveness or degrees of neuroticism).  In this connection, item response models are the psychometric validation of measurement to be used when demonstrating these dimensions of severity, i.e., that the total score of a rating scale is a sufficient statistic.

        The importance of factor analysis or principal component analysis does not lie with the measurement issue but, as stated by Martin Katz, in identifying the multi-facets of, for example, depression.  In patients with treatment-resistant depression, I have actually used principal component analysis and identified a principal component which encompasses concentration problems, fatigability, lassitude and sleep problems.  Furthermore, I employed item response theory models to measure the severity of this neuropsychiatric or neuropsychological syndrome.  The importance of factor analysis is its ability to explore for new dimensions, but the clinical relevance is outside the explorative nature of factor analysis.

        Martin Katz has for many years, as a psychologist, worked very closely with psychiatrists in the field of psychopharmacology, especially in depression.  I appreciate his very balanced review of my book.  It has really been my goal to “fill a significant gap in the practice of current clinical research.”


August 29, 2013


Donald F. Klein’s comment

        Per Bech's remarkable book has been outlined by its author, commented on by Martin Katz and replied to by Bech, who emphasizes the value of continuing critical dialog.  These remarks continue this thread.

        Clinical Psychometrics floods this reviewer with many contextual memories.  When, in the 1950s, the paradigm-destroying antipsychotic effects of chlorpromazine were first noted, they incited a storm of disbelief.  There were many independent replications of antipsychotic benefit; however, to verify scientifically that these observations were not clinical fabrications, the quite recent technology of the randomized, double-blind, clinical trial was employed.

        However, massive criticism, mostly by objectivity-averse psychoanalysts, argued for objective diagnostic and clinical change measures, probably in the dim hope that objectivity was impossible.  At that time, discerning objective manifestations of psychiatric disease was impossible.  In current psychiatry, objective measures are still ambivalently regarded, as shown by their absence in DSM 5, despite NIMH’s fevered search for biomarkers.

        However, if independent raters agreed with each other, then there had to be something observable out there, that allowed more than chance agreement.  Rater agreement (reliability) then served as a surrogate for objective description.  However, ill-defined accusations of lack of validity without specification of the multiplicity of validity criteria, served to derogate systematic observation.

        Bech, using pithy summaries, explains the foundational observational and analytical work of Wundt, Kraepelin, Spearman, Galton, Pearson, Fisher, Eysenck, Hamilton and Pichot, among others.

        Strikingly, Bech argues that the ubiquitous factor analysis does not provide appropriate measures of change or a foundation for diagnosis.  This critically challenges much current work, as well as the NIMH sponsored Research Domain Criteria (RDoC) manifesto for dimensional primacy via multivariate analysis.

        The more “modern” (since the 1970s!!) psychometric developments sparked by Rasch, Guttman and others are generally labeled Item Response Theory (IRT).  Bech holds that these produce the only appropriate severity measures.

        Guttman defined a hierarchy aimed at producing a unidimensional severity scale, based on the proportion of subjects endorsing each item.  Since items endorsed by most subjects are easy (less pathological), whereas rarely endorsed items are difficult (very pathological), if an item of specified severity is endorsed, then all easier items should also be endorsed.  Each potentially useful item is mathematically evaluated to see if it consistently takes its place in such a hierarchy.  Items that are endorsed by the few, but not by the many, just don’t fit, although they may be useful for other purposes.  Change is determined by differences in Guttman defined severity.  This exposition seems quite clear, even if the mathematics is well beyond me.
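
        The hierarchy described above can be sketched in a few lines of Python. The fragment orders items from most to least frequently endorsed, compares each subject's answers with the ideal pattern for the same total score (the easiest items endorsed first), and reports a coefficient of reproducibility. This is one common convention for counting scaling errors, assumed here for illustration; the function name and the response data are hypothetical.

```python
def guttman_reproducibility(responses):
    """Share of responses consistent with a perfect Guttman hierarchy.

    responses: list of rows, one per subject, each a list of 0/1 item answers.
    """
    n_items = len(responses[0])
    # Endorsement count per item; frequently endorsed items are the "easy" ones.
    counts = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -counts[j])  # easiest first

    errors = 0
    for row in responses:
        total = sum(row)
        ordered = [row[j] for j in order]
        # Ideal pattern: the `total` easiest items endorsed, the rest not.
        ideal = [1] * total + [0] * (n_items - total)
        errors += sum(o != i for o, i in zip(ordered, ideal))

    return 1.0 - errors / (len(responses) * n_items)


# Hypothetical answers of five subjects to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],  # endorses a difficult item without the easier ones
]
reproducibility = guttman_reproducibility(responses)
print(round(reproducibility, 2))
```

        In this example only the last subject departs from the hierarchy, contributing two deviations out of fifteen responses, so the coefficient is about 0.87. A set of answers that follows the hierarchy perfectly yields exactly 1.0.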

        This fundamental Rasch analysis is unique in that its item pool is initially selected by expert psychiatrists, as reflective of a particular syndrome.  Rasch analysis produces a severity scale, not a diagnostic scale.  Bech holds that such a scale sufficiently describes an individual’s degree of severity by its total.  This is not the case for familiar, but multi-dimensional, indices such as the Hamilton 17 item scale.

        Factor analyses depend upon the rule of thumb selection of the number of factors that then are rotated (by various methods) to differing definitions of simple structure.  Bech holds that these procedures do not flow from a logical basis that allows firm deductions or sampling inferences.  This defect is affirmed by the lack of factor replication across various samples.

        Bech also argues that the use of factor analysis differs between American and British traditions.  The mathematics of factor and principal component analyses yields a principal factor, marked by consistently positive loadings, and a second orthogonal factor with both positive and negative loadings.  The British tradition uses only the contrast evident in the second factor.  ["In contrast, an American approach rapidly emerged in which factor analysis was used to identify as many factors as possible."]  Bech argues that these factors, even if “rotated to simplicity,” cannot be represented by a simple total, since they contain heterogeneous items with regard to both severity and group discrimination.  This impairs their use both as change and diagnostic measures.

        In a clinical trial, some of the items loading a supposedly simple factor may significantly contrast drug with placebo, whereas other items from the same scale do not come close. Therefore, a factorial scale score that sums its items attenuates the distinction between drug and placebo.  This had been noted in a widely unnoticed 1963 paper (Klein DF and Fink M.  Multiple item factors as change measures in psychopharmacology. Psychopharmacologia 1963; 4: 43-52.)
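
        A small numerical sketch may make this attenuation concrete. It is an illustration under simplifying assumptions, not a reconstruction of the 1963 analysis: every item is taken to have unit within-group variance and the same correlation with every other item, and only some of the items shift under the drug. All figures and names are hypothetical.

```python
import math


def sum_score_effect_size(n_items, n_responsive, item_shift, inter_item_corr):
    """Effect size of a summed score when only some items separate drug from placebo.

    Assumes unit within-group variance per item and one common inter-item
    correlation (simplifying assumptions made for this sketch).
    """
    mean_difference = n_responsive * item_shift
    sum_variance = n_items * (1.0 + (n_items - 1) * inter_item_corr)
    return mean_difference / math.sqrt(sum_variance)


# Ten-item factor scale in which only three items respond to the drug.
full_scale = sum_score_effect_size(10, 3, 0.5, 0.2)

# The same three responsive items summed on their own.
responsive_only = sum_score_effect_size(3, 3, 0.5, 0.2)

print(f"full scale: {full_scale:.2f}, responsive items only: {responsive_only:.2f}")
```

        Under these assumptions the seven unresponsive items add variance to the total without adding any drug-placebo difference, so the effect size of the full scale is well below that of the three responsive items taken alone.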

        Katz has reasonably suggested a “multi-vantaged” approach to patient evaluation.  In particular, evaluations are amplified by video recordings that can be “blindly” assessed, by multiple experts, without knowledge of treatment or time of observation.  In addition to the methodological gains, such recordings allow a more fine-grained evaluation of the patient’s physical appearance, verbal flow, affective manifestations, change over time, etc.

        Where Katz seems to part company with Bech is his reliance on scales produced by multiple factor analysis as well as depending on multiple statistical analyses, without correction for multiplicity.  Katz argues (and I agree) that specific tests of antecedently supported and hypothesized effects do not require a “family wise” significance level correction.  However, such specifically stated antecedent hypotheses are not apparent (to me) for many of the claimed findings.

        At one time, long past, NIMH supported methodological advances in psychopharmacology that often benefited from designs using concurrent placebo control groups.  Such clinical trials sufficed, both for demonstrating that specific drug activity existed and gaining FDA approval for marketing.  However, this group average outcome difference does not determine which patients actually require medication for a positive response exceeding their counterfactual response while on placebo.  This parallel group design obscures understanding of this critical issue.

        Both Bech and Katz have addressed this problem.  It was recently suggested that the intensive clinical trials design promulgated by Chassan may be necessary to solve this problem [Klein DF (2011): Causal Thinking for Objective Psychiatric Diagnostic Criteria. In: Shrout PE, Keyes K, Ornstein K (Eds.) Causality and Psychopathology, New York City: Oxford University Press, pp 321-337].

        A discussion of this specific issue, in the dynamic framework for controversy provided by INHN, would be most worthwhile.


October 31, 2013


Martin M. Katz’s comment on Donald F. Klein’s comment

        Don Klein’s comments on Per Bech’s book are helpful in advancing understanding of the author’s main points and in focusing on some important issues regarding measurement and the manner in which we currently approach the evaluation of new drugs for the mental disorders.  He raised at least three issues that currently obstruct long-term solutions to problems in clinical psychopharmacology, specifically in the conduct of clinical trials.  He also alluded to points in which in my review of Bech’s book I differed with the author.  I believe he agreed with my position that the multiple factor solution is the more effective use of factor analysis in extracting information about drug actions.  At the same time, he is uncomfortable with the actual content of the factor measures, alluding to both Bech’s reservations about factor analysis, generally, his own prior work and to the procedures applied in my “multivantaged” approach.  To further clarify my position, I respond to the issues raised as follows:

1. It is important in all method development to be clear about what, specifically, is to be measured.  I believe the critical problem in current trials research is its almost exclusive focus on evaluation of change in the severity of the overall disorder, to the exclusion of evaluating drug effects on the major components of the disorder.  When I proposed the multiple factor solution, it was because we first sought measures of the facets of psychopathology, measures that were essential to assess not only the severity of the overall disorder, but also the intensity of each of the major components of that multidimensional disorder.  In that method work then, the target was valid measures of these facets, not of “change” due to a specific drug or drug class.  The componential approach not only affected how the rating scales were developed, but also highlighted the need to go beyond rating scales to achieve more “objective” measures.  Klein refers, e.g., to the new DSM 5, to illustrate how “ambivalently” the concept of objectivity continues to be treated by the profession.  We, therefore, expanded the measurement approach through use of other psychological methods, e.g., self-report inventories, measures of expression (through video) and psychomotor performance, in order to achieve more valid measures of the components.

2. Klein points to the use of factor measures in drug trials in which the items in a given factor are differentially sensitive to change.  He sees this, rightfully, as obscuring the drug-placebo differences, implying that the factor should only include items that are change sensitive.  If the intention in creating the factor is to develop a “change” method specific to the measurement of the effects of that drug or like drugs, then one might want to confine the factor items only to those that have been demonstrated to be change-sensitive.  The method developed is then targeted to be more sensitive than existing methods to the actions of that drug or like drugs.  If, however, a new drug is tested that has different effects than the established ones, this particular method will be of limited use in detecting those effects.  The prime intention when working in this sphere has, therefore, been to create measures of the facets of psychopathology of a given disorder or set of disorders, e.g., “anxiety,” that can be applied in the measurement of any type of treatment intervention.  The position is that the “multivantaged” approach is what is most needed now in clinical trials of new agents and in clinical studies in psychopharmacology, generally.  I believe that, with my colleagues Alan Frazer and Charles Bowden, we have clearly demonstrated the advantages of that approach, its capacity, e.g., to provide information on the nature, timing and sequence of actions of diverse antidepressants, in a series of studies.  The results of these studies are summarized in my recent book, Depression and Drugs (Springer, NY, 2013).

3. On the third issue, Klein cites the limitation of drug-placebo comparisons by calling attention to the fact that finding a drug effect in a class of disorders does not help with the prediction for any given patient.  Separating the placebo from the drug effect in a patient is an important problem, particularly for clinicians trying to treat treatment-resistant patients.  Klein raises this to initiate discussion of more precise methods for accomplishing that aim.  The prediction literature on that problem for the depressive disorders indicates that, for the most part, we do not have at treatment outset, any reliable specific symptom predictors of drug response.  The work of Szegedi et al. (Early improvement in the first two weeks. J Clin Psychiatry 2009; 70: 344-353), Stassen et al. (Delayed onset of action of antidepressant drugs? Eur Psychiatry 1997; 12: 166-176) and Katz et al. (The componential approach. J Clin Psychopharmacology 2011; 31: 253-254) show that no improvement in the patient within the first two weeks of drug treatment in severity of the overall disorder or in levels of anxiety or hostility results in >90% of patients failing to respond positively at outcome, whereas 70% of those having a positive treatment outcome will have shown significant improvement (>20%) by the end of the 2nd week of treatment.  These relatively new findings do not solve the problem Don Klein raises, but in utilizing “early response” to treatment as a predictor, it is an approach that can help reopen the issue.


December 12, 2013


Per Bech’s reply to Donald F. Klein’s comment

        When reviewing my Clinical Psychometrics, Donald F. Klein recalls the massive criticism put forth by psychoanalysts against measurement-based therapies. With reference to the randomized double-blind trials introduced in the 1950s in clinical medicine, the psychoanalysts found it a meaningless procedure to use rating scales in psychiatry; adding up very different symptoms to give a total score was considered impossible.

        When the Danish statistician Georg Rasch introduced his Item Response Theory (IRT) model in the 1960s, he used the term “specific objectivity” as a general scientific principle in trials of antidepressants when comparing patients from baseline to endpoint by rating scales that fulfilled his criteria of unidimensionality. As outlined by Klein, the Rasch model for specific objectivity is based on Guttman’s model of scalability, which implies that scorings on lower prevalence items presuppose scorings on higher prevalence items.

        Klein refers to his “widely unnoticed” paper from 1963, in which he demonstrates the great discrepancy between global judgment of change and factor-analytically derived rating scales in placebo-controlled clinical trials of antidepressants or antipsychotics. This is actually a problem of transferability, which is the degree to which a scale continues to measure the same thing psychologically across the different rating occasions during a clinical trial. Responsiveness to change is not a separate dimension, but an aspect of validity for which factor analysis is not able to test. However, because item difficulty is a parameter in the Rasch model, the same difference between two levels of depressive states will be given in the Rasch confirmed rating scales whether the individual item covers mild, moderate or severe depression. This is crucial for measuring changes in placebo-controlled trials of antidepressants or antipsychotics.
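
        The invariance claimed in the preceding paragraph can be sketched for the dichotomous Rasch model, where item difficulty enters as a parameter subtracted from the patient's severity. On the log-odds scale, the difference between two severity levels then comes out the same whichever item is used to observe it. The severity and difficulty values below are hypothetical and chosen only for illustration.

```python
import math


def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b at severity theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


baseline, endpoint = 1.2, -0.3  # hypothetical severity before and after treatment

differences = []
for difficulty in (-2.0, 0.0, 2.0):  # items covering mild, moderate, severe states
    change = logit(rasch_prob(baseline, difficulty)) - logit(
        rasch_prob(endpoint, difficulty)
    )
    differences.append(change)

print([round(d, 6) for d in differences])
```

        Each entry equals the severity difference of 1.5 up to rounding error, regardless of whether the item is a mild, moderate or severe one.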

        It is, on the other hand, important to point out that Rasch himself was always very careful to examine the nature of the items that did not fulfill his model of measurement. Klein’s chapter from 2011 on causal thinking for objective psychiatric diagnostic criteria actually includes the Rasch reasoning in clinical psychometrics. We need to have a clinically based observation about the dimension we are examining before the psychometric analysis is performed. This holds both for dimensions of depression severity, as in Klein’s 1963 paper, and for predictors of clinical response. The sub-syndrome of panic attacks within anxiety disorder as a predictor of the response to imipramine is such an example (Klein DF, Psychopharmacology 1964; 5: 397-408). Another is the sub-syndrome of atypical depression within major depression. In this case, increased appetite and hypersomnia are symptoms that are both excluded from the Rasch model of depression severity, but both have predictive validity when showing the superiority of phenelzine over imipramine.

        This subsyndromal distinction of atypical depression has not been captured in the antidepressant trials performed over the past decades by the industry because the goal of these placebo-controlled trials is primarily to obtain FDA marketing approval. As concluded by Klein, the group average outcomes on more or less validated rating scales in these FDA oriented trials do not determine which patients actually require medication for a positive response. We are forced, by the growing number of patients with treatment-resistant depression, to counter this development through an early recognition of specific sub-syndromes. It is to be hoped that this specific issue will be discussed in more detail in this INHN framework.


January 16, 2014


Donald F. Klein’s response to Per Bech’s reply and reply to Katz’s comment

        Katz's comments are useful in clarifying issues. For instance, he states regarding factor analysis, “it was because we first sought measures of the facets of psychopathology.”

        I do not think that factor analysis can effectively resolve mixtures. That has been a major problem for statistical diagnosis from Lazarsfeld to Meehl. I refer to this problem in my first text. Bech appears to agree, “We are forced more and more (to) early recognition of specific sub-syndromes.”

        Bech also states, “Responsiveness to change is not a separate dimension, but an aspect of validity which factor analysis is not able to test for.  However, because item difficulty is a parameter in the Rasch model, the same difference between two levels of depressive states will be given in the Rasch confirmed rating scales whether the individual item covers mild, moderate or severe depression.”

        I would appreciate it if Bech could refer me to studies where differences in Rasch scores provided effective comparative measures. A comparison to standard techniques, such as ANCOVA, would be valuable.

        Katz agrees that, “Separating the placebo from the drug effect in a patient is an important problem” and that currently we cannot distinguish patients who require medication from those who got better while on placebo.  However, his suggestion, “utilizing ‘early response’ to treatment as a predictor… an approach that can help reopen the issue,” seems to have the same problem with mixtures as factor analysis.

        I would appreciate knowing the views of Katz and Bech about “intensive analysis” as such an approach. If it succeeds in isolating patients who require a medication to maintain gains, it seems a step towards homogeneity. Using a number of medications that seem to differ in their proposed mechanisms of action might further elicit subsyndromes -- although it may require very large samples.

        Katz states, correctly, that scales loaded with items that respond differentially to drug A and placebo, might fail in a study of drug B. However, if therapeutic drug action requires a normalizing interaction with the dysfunction underlying the manifest disorder -- then if on this loaded scale, drug A works but drug B does not -- but drug B has been shown effective, using a different scale -- I believe this amounts to a mixture reduction.


March 27, 2014


Martin M. Katz’s question to Donald F. Klein

        In trying to respond to your critique regarding whether factor analysis or the Rasch approach can resolve the “mixture” problem, I find it unclear about what meaning of “mixture” you are using in this context. Are you asking, e.g., whether the wide range of  symptoms that we observe in depression is the result of a mixture of the underlying syndromes of major depressive and generalized anxiety disorders, as against in the other case, the results of the interaction of independent dimensions uncovered through factor analysis?

        Also, on a related issue, what do you mean by “intensive analysis”?

        It would be useful if you could clarify these concepts so that I can try to provide an intelligible reply. One problem in regard to discussing the mixture issue may be the several meanings we encounter in psychometrics for factor analysis. When using Hotelling's principal components, I would restate that of the factor analytic techniques involved, principal components is characterized as a strictly mathematical approach, based on deriving dimensions generated by the intercorrelations of the factored variables, with investigators confined to minimal interpretation, i.e., interpreting the meaning underlying the most highly “loaded” variables of an extracted component. Factor analysis, in general, in psychometrics can, however, take several forms, several of the techniques relying more heavily on the investigator's choice of the form and on his interpretations at several stages of the procedure. So that the role of factor analysis in relation to the mixture problem may differ as a function of the specific factor analytic approach referred to.


April 17, 2014


Donald F. Klein’s answer to Martin M. Katz’s question

        Marty Katz sensibly raises a central problem in scientific discussion. A word may derive its precise meaning from a particular mathematical or well-defined psychological context. However, in verbal discussion there can be semantic slippage so that terms are misused because in a different context they are now inappropriate.

        Katz gives examples, “One problem in regard to discussing the mixture issue may be the several meanings we encounter in psychometrics for factor analysis. When using Hotelling's principal components, I would restate that of the factor analytic techniques involved, principal components is characterized as a strictly mathematical approach, based on deriving dimensions generated by the intercorrelations of the factored variables, with investigators confined to minimal interpretation, i.e., interpreting the meaning underlying the most highly ‘loaded’ variables of an extracted component.”

        However, each loaded variable is a composite of correlated variables, each with a somewhat ambiguous label. Labeling the composite is not due to “minimal interpretation.” Rather, it affords ample grounds for disagreement and misunderstanding.

        Katz continues, “Factor analysis, in general, in psychometrics can, however, take several forms, several of the techniques relying more heavily on the investigator's choice of the form and on his interpretations at several stages of the procedure.”

        “I find it unclear about what meaning of ‘mixture’ you are using in this context. Are you asking, e.g., whether the wide range of  symptoms that we observe in depression, is the result of a mixture of the underlying syndromes of major depressive and generalized anxiety disorders, as against in the other case, the results of the interaction of independent dimensions, uncovered through factor analysis?”

        Also, a good example of communication difficulty, Katz clearly raises the mixture issue, “whether the wide range of symptoms… are due to a mixture of the underlying syndromes… as compared to… the results of the interaction of independent dimensions, uncovered through factor analysis?”

        I do not understand this last clause. Can dimensions be independent, but nevertheless have interactions?  How can we resolve this? My general conclusion is that a complex verbal statement is best illuminated by a simple concrete example. I believe Katz is arguing that some form of factor analysis would produce results equivalent to a model of latent categories. An example would help.

        Katz asks what is meant by intensive analysis. This fits very well with the mixture model discussion. The term “mixture” is well defined within modern statistical analysis. Muthen, in his online notes, states: “Mplus Class Notes: Analyzing Data: Latent Class and Other Mixture Models. Mixture models are measurement models that use observed variables as indicators of one or more latent categorical (diagnostic) variables. One way to think about mixture models is that one is attempting to identify subsets or ‘classes’ of observations within the observed data. The latent variable (classes) is categorical, but the indicators may be either categorical or continuous.”

        It is often unclear how to model the relationship of outcome to baseline data. For instance, in the 1950s NIMH and the VA hoped that multiple regression analysis might find different treatment relevant diagnoses within an overall diagnosis by using outcome as a validity criterion. Unfortunately, these promising investigations failed on replication and the approach was abandoned.

        Perhaps this was due to the heterogeneity of treatment outcome. This remained unclear in such studies. For instance, a study might find that 60% of medication-treated patients remitted, whereas only 30% of those on placebo did so. Given statistical significance, this was sharp evidence, sufficient for the FDA, that the medication was causally effective. However, identifying the responders who required medication for benefit had not been solved.
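
        The arithmetic behind this example can be made explicit. The short Python sketch below is an illustration only. It rests on an assumption that is not stated in the text, namely that every patient who would remit on placebo would also remit on medication, and the function name is hypothetical.

```python
from fractions import Fraction


def specific_responder_share(drug_rate, placebo_rate):
    """Share of medication-treated remitters who needed the drug.

    Assumption (not from the text): anyone who would remit on placebo
    would also remit on medication, so only the excess over the placebo
    rate is attributable to the drug itself.
    """
    if not 0 <= placebo_rate <= drug_rate <= 1:
        raise ValueError("expected 0 <= placebo_rate <= drug_rate <= 1")
    if drug_rate == 0:
        raise ValueError("no remitters on medication")
    return (drug_rate - placebo_rate) / drug_rate


# Rates from the example in the text: 60% on medication, 30% on placebo.
share = specific_responder_share(Fraction(60, 100), Fraction(30, 100))
print(share)  # 1/2
```

        Under that assumption only about half of the remitters on medication required it, and the group comparison alone cannot say which half.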

        In 1967 J. B. Chassan extensively discussed the issue of how to identify drug responders in “Research Design in Clinical Psychology and Psychiatry” (The Century Psychology Series). However, this concern fell out of fashion, probably because the FDA-sufficient successes of the parallel-group, extensive design made it seem trivial.

        Chassan’s ideas were revived and extended (Klein D.F.: Causal Thinking for Objective Psychiatric Diagnostic Criteria: A Programmatic Approach in Therapeutic Context, in the monograph, Causality and Psychopathology: Finding the Determinants of Disorders and their Cures, Eds. Patrick Shrout, Katherine Keyes,  Katherine Ornstein, American Psychopathological Association, 2010).

        Chassan recommended “intensive design,” that is, repeated periods of intervening and non-intervening, judging whether benefit synchronized with intervention. This concept suggests a different clinical trial design: openly treat all relevant patients with the study medication program, titrating for optimal dose. Patients who clearly did not respond to treatment are set aside. Responders would be divided randomly into two double-blind groups, either to be weaned onto placebo or to remain on medication. All would be closely followed, double blind, for defined signs of worsening. Sufficient worsening would restart medication. Those who both worsened on placebo substitution and then improved on blind medication retreatment are very likely specific drug responders. In contrast, those switched to placebo who continued to do well would probably not be specific medication responders.

        A higher worsening rate among those switched to placebo than those maintained on medication would be clear evidence of medication efficacy, quite comparable to the inference established by the parallel groups, extensive design.

        But better, the intensive design dissects the initial latent mixture into three response specific categories: likely medication specific responders, likely non-specific responders and non-responders. Each group’s meaningful outcome homogeneity, as well as increased heterogeneity between groups, may illuminate the drug’s specific benefit on pathophysiology.
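
        The sorting logic of the intensive design described above can be sketched in a few lines of Python. This is a minimal illustration, not a trial protocol: the class, field and function names are hypothetical, the four-patient cohort is invented for the example, and the "undetermined" label for cases the text does not classify is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    """One patient's course through the design; all field names are hypothetical."""
    responded_open_label: bool             # clear response during open treatment
    arm: Optional[str] = None              # "placebo" or "medication" after randomization
    worsened: bool = False                 # met the defined signs of worsening
    improved_on_retreatment: bool = False  # improved after blind medication restart


def classify(p: Patient) -> str:
    """Assign one of the three response categories, or 'undetermined'."""
    if not p.responded_open_label:
        return "non-responder"
    if p.arm == "placebo":
        if p.worsened and p.improved_on_retreatment:
            return "likely medication-specific responder"
        if not p.worsened:
            return "likely non-specific responder"
    # Assumption: responders kept on medication, and those who worsened on
    # placebo without recovering on retreatment, are left unclassified.
    return "undetermined"


def worsening_rate(patients, arm):
    """Proportion of randomized responders in one arm who worsened."""
    group = [p for p in patients if p.responded_open_label and p.arm == arm]
    return sum(p.worsened for p in group) / len(group) if group else float("nan")


# Invented four-patient cohort, for illustration only.
cohort = [
    Patient(False),
    Patient(True, "placebo", worsened=True, improved_on_retreatment=True),
    Patient(True, "placebo"),
    Patient(True, "medication"),
]
labels = [classify(p) for p in cohort]
```

        Comparing `worsening_rate(cohort, "placebo")` with `worsening_rate(cohort, "medication")` corresponds to the efficacy inference, while `classify` corresponds to the dissection of the latent mixture.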


April 24, 2014


Per Bech’s comment on Donald F. Klein’s answer

        Two very important issues are raised by Donald F. Klein in our dialogue based on my Clinical Psychometrics, namely the recognition of sub-syndromes in major depression and the dimension of severity on which the clinical effect is measured in trials of antidepressants.

        The research question concerning sub-syndromes is: “On what basis may the experienced psychiatrist say that this person has a type of depressive illness for which a specific treatment is needed?” The research question about the measurement of clinical effect is: “Which symptoms may the experienced psychiatrist assemble when making a global assessment of depression severity?”

        We have previously answered the second question (Bech, Gram et al. 1975) and identified the following Hamilton items used by experienced psychiatrists: depressed mood, work and interest, general somatics (fatigability), psychic anxiety, guilt feelings and psychomotor retardation (HAM-D6). Using Rasch analysis, we showed that this rank order was maintained from week to week in trials of antidepressants (Bech, Allerup et al. 1984; Licht, Qvitzau et al. 2005; Bech, Allerup et al. 2014).

        In our re-analysis of the STAR*D study, we showed that the remission rate for Level 1 on citalopram with the HAM-D6 was 45% (HAM-D6 < 4) versus 36% on the HAM-D17 (HAM-D17 < 7), P < .01 (Ostergaard, Bech et al. 2014). On Level 2 in the STAR*D study, using HAM-D6 but not using HAM-D17, we showed that bupropion was significantly superior to buspirone as citalopram augmentation in non-responders from Level 1 (Bech, Fava et al. 2011). When demonstrating dose-response relationship of antidepressants, we found HAM-D6 superior to HAM-D17 (Bech 2010).
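
        As a minimal sketch of how the two remission criteria quoted above can disagree for the same patient, the Python fragment below sums the six HAM-D6 items and applies the cut-offs HAM-D6 < 4 and HAM-D17 < 7. The function names and the patient ratings are hypothetical and serve only to illustrate the thresholds; they are not data from STAR*D.

```python
# The six HAM-D6 items as listed in the text.
HAM_D6_ITEMS = (
    "depressed mood",
    "work and interest",
    "general somatics",
    "psychic anxiety",
    "guilt feelings",
    "psychomotor retardation",
)


def ham_d6_total(item_scores):
    """Sum the six HAM-D6 item scores from a mapping of item name to score."""
    return sum(item_scores[name] for name in HAM_D6_ITEMS)


def remission_ham_d6(item_scores):
    """Remission criterion quoted in the text: HAM-D6 total below 4."""
    return ham_d6_total(item_scores) < 4


def remission_ham_d17(total_score):
    """Remission criterion quoted in the text: HAM-D17 total below 7."""
    return total_score < 7


# Hypothetical ratings for one patient (illustration only).
ratings = {
    "depressed mood": 1,
    "work and interest": 1,
    "general somatics": 1,
    "psychic anxiety": 0,
    "guilt feelings": 0,
    "psychomotor retardation": 0,
}
hypothetical_ham_d17_total = 8  # the six items above plus the other eleven

print(remission_ham_d6(ratings), remission_ham_d17(hypothetical_ham_d17_total))
```

        In this invented case the patient counts as remitted on the HAM-D6 but not on the HAM-D17, because residual scores on items outside the six core items keep the longer scale above its cut-off.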

        Concerning the other research question on sub-syndromes, use of factor analysis is appropriate to classify the sub-types without any basic measurement operation. Thus, the universe of symptoms behind DSM-5 major depression can indeed be combined in many different ways (Ostergaard, Jensen et al. Dec 2011). Sub-syndromes such as atypical depression (hyperphagia and hypersomnia), apathetic depression (tiredness, lack of interests, concentration problems, insomnia) have been identified by principal component analyses. In such sub-syndromes the Rasch model’s requirement of rank ordering or item difficulty is beyond the scope of the psychometric analysis. Here it is the confirmative validity of the items that is in focus.



Bech P. Is the antidepressive effect of second-generation antidepressants a myth? Psychological Medicine 2010; 40: 181-186.

Bech P, Allerup P, Larsen  ER, Csillag C, Licht RW. The Hamilton Depression Scale (HAM-D) and the Montgomery-Asberg Depression Scale (MADRS). A psychometric re-analysis of the European Genome-Based Therapeutic Drugs for Depression Study using Rasch analysis. Psychiatry Res. 2014;217(3):226-32.

Bech P, Allerup P, Reisby N, Gram LF. Assessment of symptom change from improvement curves on the Hamilton depression scale in trials with antidepressants. Psychopharmacology 1984; 84: 276-81.

Bech P, Fava M, Trivedi MH, Wisniewski SR, Rush AJ. Outcomes on the pharmacopsychometric triangle in bupropion-SR vs. buspirone augmentation of citalopram in the STAR*D trial. Acta Psychiatrica Scandinavica 2011; 125: 342-8.

Bech P, Gram LF, Dein E, Jacobsen O, Vitger J,  Bolwig TG. Quantitative rating of depressive states. Acta Psychiatrica Scandinavica 1975; 51: 161-70.

Licht RW, Qvitzau S, Allerup  P, Bech P. Validation of the Bech-Rafaelsen Melancholia Scale and the Hamilton Depression Scale in patients with major depression; is the total score a valid measure of illness severity? Acta Psychiatrica Scandinavica, 2005; 111: 144-9.

Ostergaard SD, Jensen SOW, Bech P. The heterogeneity of the depressive syndrome: When numbers get serious. Acta Psychiatrica Scandinavica 2011; 124: 495-6.

Ostergaard SD, Bech P, Trivedi M, Wisniewski S, Rush J, Fava M. Brief unidimensional melancholia rating scales are highly sensitive to the effect of citalopram and may have biological validity: Implications for the Research Domain Criteria (RDoC). Journal of Affective Disorders 2014; 163: 18-24.


May 29, 2014


Martin M. Katz’s response to Donald F. Klein’s answer

        Don Klein cites a valid concern about “semantic slippage” when moving from one context to another with various statistical approaches. So, he believes that despite the selection of the most mathematically based factor analysis technique, principal components, there is “ample grounds for disagreement” about the extent of interpretation involved. Although it can be true that “each loaded variable is a composite of correlated variables, each with… an ambiguous label,” it is also true that with certain techniques, the labels or items involved can be unambiguous and straightforward in content.

        In support of my earlier statement that interpretation was minimal with the principal components procedure, I was referring to such examples generated from observational and self-reported mood inventories as “depressed mood-motor retardation.” That title was for a component from our own work, that had in its high loading clusters such items as “looks sad,” “reports feeling down,” “blue,” “motor movements slowed down,” etc., where the additional variables in the component add reliability but no further conceptual complexity to the component. Nevertheless, the dimensions derived with principal components can get somewhat more complicated in concept, so he has a basis for requiring more attention to the degree of interpretation involved in any example, even of this type.

        He then questions in regard to the mixture issue, “Can dimensions be independent but nevertheless have interactions?” To answer this query, one has to step back and examine how the “dimension” is derived. It is originally composed of parts that are shown to be highly linked, with each part having a similar pattern of relationships with other variables that may be part of other dimensions. For example, despite forming the parts of the “anxiety-agitation-somatization” dimension in our work, we note that each part has its own pattern of relationships with variables that make up the composition of other independent dimensions, e.g., anxiety, in itself, a component of psychopathology across almost all mental disorders, is known from many studies to correlate significantly (>0.50) with “depressed mood” and with “hostility” (>0.40), items representative of other dimensions. The opportunities for interaction of key parts of different independent dimensions are, therefore, multiple. That is what we found in our studies and was elaborated on in the Depression and Drugs book.

        The interactions in those studies were clear and led to the “opposed emotional states” hypothesis. We believe that the interactions of these states helped to explain, in great part, the psychological turmoil and general stress undergone by the patient. Note that there was no attempt with the principal components analysis to “produce results equivalent to a model of latent categories.”  The aim in that study was not to uncover new “diagnoses,” new subcategories of illness, but to identify and describe the dimensions of psychopathology that structure the “major depressive disorder.”

        Klein provides an interesting discussion of Chassan’s intensive research design. It reminds us that earlier there were alternative approaches to the currently established model for clinical trials.  It is a much more satisfying approach to drug evaluation for the experienced investigator than the mechanical quality associated with the current established model, which relies less on the expert, more on the trained rater. This alternative approach was not taken up by many and is now rarely used because of the intense monitoring and the expertise required of the clinical investigators in the conduct of such studies. He also notes that we were still unable to predict response to any of the drug classes, i.e., which patients respond to which drugs. Despite its scientific advantages, the expense to conduct the intensive trial makes the current established model look more feasible and more modest in its overall costs. Others have advanced ideas to improve the current model.

        The Depression and Drugs book provides another alternative, also applied in earlier trials.  The “componential” model of antidepressant clinical trials includes the use of the established trial’s Hamilton Depression Rating method for evaluating overall “efficacy” but goes further to profile the specific clinical and psychological actions of the experimental drug. The latter step, which requires little additional expense, greatly expands the amount of information that can be retrieved from the study of a new treatment, and makes possible the uncovering of actions that, although not applicable to the target disorder, may be applicable in the treatment of mental disorders other than depression, e.g., anxiety or phobic disorders. The “intensive design” has a distinct place in the clinical evaluation of new drugs. It still, however, does not achieve what is even more essential when carrying out a major drug trial, that is, the uncovering and quantifying of the specific clinical and psychological actions of the new drug, something that none of the current approaches, including the established model endorsed by the FDA, make a serious attempt to accomplish.


June 5, 2014


Hector Warnes’ Commentary

        This is a 202-page book with 10 chapters, a glossary, 33 appendices, 192 references and an index. Each chapter is devoid of mathematical formulas, has a strong clinical orientation, an amazing clarity and historical perspective, with countless figures and tables which are highly enlightening and erudite.  On the first page the author cites Karl Jaspers, Aubrey Lewis and Max Hamilton and dedicates the book to Ole Rafaelsen and Erling Dein.  Professor Per Bech expresses his gratitude to Peter Allerup, Professor of Theoretical Psychometrics at Aarhus University, to Lone Lindberg, his research coordinator, to Ove Aaskoven, his statistical assistant, and several others.

        Per Bech received his medical degree from the University of Copenhagen in 1969.  In 1972 he received a gold medal from Aarhus University for his thesis on cannabis and psychometric tests that included time experience, reaction time and simulated car driving. His doctoral thesis on the validity of rating scales in depression and mania was completed in 1981 at the University of Copenhagen. From 1992 to 2008 he was Professor of Psychiatry at Odense University and since 2008 has held the position of Professor of Clinical Psychometrics at the University of Copenhagen. He is also chief psychiatrist and director of research at the Mental Health Centre North Zealand in Hillerod.

        In the preface the author outlines the Wundt and Kraepelin-inspired Pharmacopsychometric Triangle which consists of three parts: A) changes observed in the clinical effects of a drug administered to patients, with each psychometric scale designed to test a particular cluster of symptoms; B) adverse or side-effects; and C) patients' reported quality of life.

        The author separates clinical psychometrics into two eras. The first covers the period from 1879 to 1945, starting with Wilhelm Wundt, who founded psychometrics in 1879 along with his two pupils, Kraepelin and Spearman. While Kraepelin attended Wundt’s lectures and his laboratory practices, an American doctor, psychologist and philosopher was also in attendance: William James, who did not practice medicine in spite of having an MD degree from Harvard Medical School (1872-1879).  James wrote his doctoral thesis on astereognosia and taught comparative anatomy and physiology at the school for many years.

        The modern era of clinical psychometrics was launched with Eysenck, Hamilton and Pichot after 1945, but Professor Per Bech also wanted to acknowledge the contributions of Francis Galton, who founded a London psychometric laboratory in 1884, along with two of his disciples Pearson and Fisher.  Bech also recognized Rasch, Siegel and Mokken who were responsible for the development of psychometric analyses.

        According to Bech, Kant was well aware that behind the phenomenon of pure reason there was another hidden reality and established the division of das Ding für uns versus das Ding an sich or appearance versus the hidden reality. The first led to phenomenology or psychopathology because it is based on events or symptoms as we perceive them in context, time and space when measuring them (quantity, quality, relation and modality). Professor Bech insists that the more experience the psychiatrist has the better he is able to distinguish traits, states, symptoms and gestures. Eventually, we should discover “the unknown” underlying Ding an sich which for Professor Bech points to biological factors.

        Wittgenstein and Quine were considered by Bech as neo-Kantians in so far as they proposed the quantification of endophenotypes in order to sort out the hidden reality. Citing the exhaustive neuroimaging studies carried out by Nancy Andreasen, the author concludes that, according to her findings, schizophrenia affects many different regions of the brain that cannot be visualized, i.e., das Ding an sich.

        Figure 1.3 in Chapter 1 shows us the earliest symptom check list (sorting cards) devised by Kraepelin.  Use of these cards led him to conclude that there are symptom clusters (“shared phenomenology”) which persist over time and that in 80% of patients the clusters were different for dementia praecox and manic-depressive illness. Kraepelin even tested the drugs available at the time (morphine, barbital and chloral hydrate) and found that the results were extremely poor in these two major psychoses.

        Later in the chapter, Bech devotes a section to Eysenck, a prominent psychologist at the Maudsley, who was inspired by Jung’s typology (extroversion-introversion) and Freud’s soul-searching studies of neuroticism.  He introduced a Neuroticism Scale (Fig. 1.4) and an Extraversion Scale (Fig. 1.5) which appear to bear similarities to Spielberger’s trait anxiety scale (related to personality traits).

        Personality traits are consistent behavior across situations to be differentiated from other personality models, such as the psychoanalytic model.  The psychodynamic formulation or interpretations between psychoanalysts show hardly any inter-rater reliability.  Further, any psychodynamic formulation of a case when compared with psychopathological measurements does not measure up to tests of reliability. The contextual or situationism (post-traumatic stress disorders) and the interactionism (a circular set of interactions between two people that invariably influence the response of the other) or, in general, the position of the observer, his theoretical biases, his experiences and, not to be dismissed, his Proteus inclinations, should not be overlooked.

        We must keep in mind that psychometrics is not only the use of rating scales but also involves testing the theory behind its findings, its consensual validity and reliability, and factor analytic studies. In fact, psychometric scales were used by Fechner. He was able to measure the quantification of the stimuli and the degree of the psychological reaction to them including words, symbolic stimuli and even subliminal stimuli. We can see that Jung’s “word association test” has been influenced by Fechner’s psychophysics. With time, there was a shift from the subject's introspective observation of his internal states to the more behavioristic stimulus - overt response paradigm.

        Hamilton was prominent in psychopharmacology in the boom time of the 50s and became instrumental in the development of rating scales which are still in use today.  He also conducted research following scientific methods such as placebo control, random assignment of patients and double-blind trials. Hamilton, influenced by Eysenck’s and Spearman’s factor analysis (a factor is one of the bases for structuring the experimental design), was able to differentiate between somatic and psychic anxiety symptoms.

        Pierre Pichot studied psychometrics (in the faculty founded by Alfred Binet) at the Sorbonne immediately after getting his MD degree in 1947 and worked under Professor Jean Delay.  Pichot tested Overall and Gorham’s Brief Psychiatric Rating Scale (BPRS) and pointed out that out of 60 symptoms 18 were sensitive to change during chlorpromazine therapy in psychotic patients and imipramine therapy in depressive patients. In the BPRS there were three subscales: one for mania, one for depression and the other for schizophrenia.

        Professor Bech emphasized the point that classical psychometrics in psychiatry has mainly been influenced by Kraepelin, Hamilton and Pichot, three outstanding psychiatrists. He further noticed that in using the Rorschach test the coefficient of reliability or Kappa coefficient is around 0.50, yet to be clinically meaningful it must be around 0.80.

        Georg Rasch, a Danish Professor of Statistics and Mathematics, wrote his thesis entitled “On Matrix Calculus and its application in Differential Equations.” The psychometric model developed by Rasch and inspired by his studies at Fisher’s London Institute became the basis of modern psychometrics. On page 35, Bech, based on Rasch’s postulates, outlined the invariant structure of the six depression symptoms: lowered mood, loss of interest and tiredness followed by anxious mood, guilt feelings and psychomotor retardation. On page 37, Bech, citing Rasch, writes: “If we want to know something about a quantity, then we have to observe something that depends on that quantity, something that changes if the quantity varies materially. In that case we have a sufficient statistic.” It must be pointed out that other studies have shown cross-cultural differences in this prevalence rate. In some, somatic symptoms predominate, in others guilt feelings and in others suicidal behavior.

        In Chapter 7 Professor Bech offers us an insightful view on Hans Selye’s stress experiments (biological stress models that predict illness behavior), particularly ratings at the work environment of patients, e.g., being listened to, search for meaning, achievements, relevant information, social support, recognition, degree of demands and conflicts (Fig. 7.2). Professor Bech also elaborates on Beck’s cognitive model of depression which is indeed creeping into our society and goes mostly unreported: negative view of the future (hopelessness), negative view of the past (guilt feelings and/or worthlessness) and negative view of the present (helplessness) time orientation which are considered to be endophenotypes.

        Heinz Lehmann describes the necessary or invariant core of depression which up to today is unsurpassed and is pointed out by Rasch (cited above): 1) reduction of interest (apathy); 2) reduction of capacity to enjoy (anhedonia); and 3) reduction of energy (asthenia), vital core symptoms to be set apart from the sufficient factors (hopelessness, guilt, somatization, etc.) (p. 801).  It goes without saying that this triad should be present in the absence of organicity.  I understand that the division of organic and functional has been questioned.

        For Lehmann, “the ideal rating scale should be constructed on the basis of both clinical experience and statistical analysis” and, most important, “it must be validated - that is proved that the scale really measures what it claims to measure - and its reliability, both between different raters (interrater) and at different points in time (test-retest) must be demonstrated” (p. 806). I would add that, regarding the points of time, we should not be satisfied with weekly assessments but should seek long-term assessment of validity. Lehmann points out that a quantitative measurement of the severity of a disorder, the identification of special patterns or clusters of symptoms and finally an attempt to isolate personality characteristics for the prediction of risk and the treatment response are critical. The latter has developed further in the last decade.

        Professor Bech could have written a more extensive glossary for didactic purposes. The word “clinimetrics” was introduced by Alvan R. Feinstein, Professor of Medicine and Epidemiology. High clinical validity (face validity) means that its questions correspond with the depression symptoms of the DSM-IV.

        The Appendices are outstanding indeed. They add considerable information about the minute and continuous research on Hamilton’s scales and their analogues, the Montgomery-Åsberg Depression Rating Scale (MADRS), the Bech-Rafaelsen Melancholia Scale (MES), the Major Depression Inventory (MDI), accompanied by a critical statement of the missing items in this inventory, the Bech-Rafaelsen Mania Scale (MAS), the BPRS, the Newcastle Diagnostic Depression Scale and the modified PRISE questionnaire for side effects of antidepressants and etiological considerations in major depression through use of the Clinical Interview for Depression and Related Syndromes (F-1 to F-16, CIDRS) (from pages 170 to 175).

        I shall try to complement this Review with comments not addressed by Professor Bech but cited from the Encyclopedia of Psychology written by Eysenck, Arnold and Meili.  On page 958 of the Encyclopedia the authors write: “A strict definition of scaling must be based on measurement theory”…, in other words, “Measurement consists in transforming an empirical system onto a numerically relational system.” Another way of expressing this is to establish a one-to-one mapping of a relational system to another for the purpose of establishing its validity (in turn, its validity is drawn from a priori axioms). There are one-dimensional and multi-dimensional methods of scaling and in my mind a critical component is the difference between the stimulus centered scaling and the reaction centered scaling which covers both judging individuals and their judgement” (p. 959).

        Professor Bech, like most psychiatrists, has set himself apart from factor analysis, mathematics, matrix algebra and statistics, not denying its scientific basis or its long-term usefulness. From a rather pragmatic point of view, a clinical assessment of a patient during an hour would tell us phenomenologically far more than a 10-point scale which takes about 10 minutes to complete. We are at times complacent when the result of treatment confirms our presumptive diagnoses, which should not be taken for granted, because there are many variables at play in the outcome of treatment, and when the contrary occurs the word iatrogenia is rarely mentioned.

        The fifth edition of the DSM, headed by David Kupfer, was launched in 2013 after 20 years of stagnation and the realization that, with few exceptions, the genetic, molecular, metabolic and cellular bases of mental illnesses were largely unknown, unlike the impressive advances in other medical fields such as cardiology. Kupfer’s aim was to change from the categorical to a dimensional spectrum of mental disorders. An important breakthrough was the research conducted by Jordan Smoller. His team, which studied the genomes of 33,000 patients diagnosed with five different mental disorders, was able to isolate four chromosomal loci associated with all five: autism, attention deficit disorder with or without hyperactivity, bipolar disorder, depression and schizophrenia.

        Measuring symptoms and signs is not an easy undertaking unless they are correlated with biometrics and validated illnesses. In measuring a psychopathological cluster of symptoms, one has to evaluate the patient’s ability to communicate with the doctor and with himself (insight versus self-deception), as well as trust versus mistrust. Otherwise some scales would be inaccurate in their intended measurements.

        I must congratulate Professor Bech for a highly readable publication.  It is well researched, with multi-dimensional and integrative perspectives, and shows him to be a great clinician, academician and researcher.



Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. The Lancet 2013; 381 (9875): 1371-9.

Eysenck HJ, Arnold W, Meili R.  Encyclopedia of Psychology. New York: Continuum Press; 1982.

Kupfer DJ, Regier DA. Neuroscience, clinical evidence and the future of psychiatric classification in DSM-5. Am J Psychiatry 2011; 168 (7): 672-4.

Lehmann HE. Affective disorders: Clinical features. In: Kaplan HI, Sadock BJ, editors.  Comprehensive Textbook of Psychiatry. Volume 4. Fourth edition. Baltimore: Williams and Wilkins; 1985.


November 3, 2016


Per Bech’s reply to Hector Warnes’s commentary

        Hector Warnes has made a very accurate and comprehensive review of the contents of my book, much better than my own. I am very impressed by the background knowledge he shows concerning my heroes like Eysenck, Hamilton, Pichot and Overall. Of them, it was Pierre Pichot who really understood my use of the Rasch item response analysis and who told me that I was the first to use a Rasch rating scale model. This aspect has also been captured by Hector Warnes, who emphasizes that this model can only be applied once clinical validity has been obtained for the scale.

        I agree fully with Hector Warnes that I could have written more extensively about Feinstein’s concept of clinimetrics. In his 1987 monograph, Feinstein refers to the clinical term “improvement after treatment” as an example of a concept with clinimetric importance which no biological marker can measure.

        I also see a clear limitation in my “rather pragmatic view” on very brief, easy-to-use rating scales. There are, as stated by Hector Warnes, many variables at play in outcome ratings of treatment. However, in this situation I would like to recall my meeting with John Overall at the annual NCDEU meeting in 1987. He chaired a session on his Brief Psychiatric Rating Scale (BPRS) where I presented a 10-item BPRS schizophrenia subscale tested by the Rasch model. After the session, John Overall told me that a brief rating scale should not have more than 18 items and that, having selected the 10 items relevant for schizophrenia, I still needed some extra “contact-inducing” items to open the interview with the patient, as well as items to finish the interview.

        Actually, in my comment on the Heinz Lehmann biography by Barry Blackwell (2015) I mentioned that I have access to Lehmann’s videotaped interviews with schizophrenic patients using the BPRS (Bech 2016). These interviews last approximately 25 minutes.

        In the tapes, Heinz Lehmann shows how to conduct a flexible, non-structured interview in which his “contact-inducing” items are about sleep and appetite; then come the specific schizophrenia-relevant items, and he finalizes the interview by asking about suicidal thoughts.

        This is what Hector Warnes is so correctly looking for, i.e., even a short rating scale has to be administered in an interview of at least 25 minutes.

        When evaluating “improvement after treatment” we have to use rating scales referring to the underlying change over time on the latent dimension being measured, i.e., the antipsychotic effect on the BPRS subscale. This clinimetric issue was not included in the DSM-5 because David Kupfer was not successful in adopting the dimensional approach. I agree with Hector Warnes that such a breakthrough in the genetic field, as shown by Jordan Smoller, soon will help us identify the biological markers we need in clinical psychiatry. However, in the intervening time, we have to rely much more on the dimensional approach inherent in the clinically valid rating scales.



Bech P. Comment on Barry Blackwell’s Heinz Edgar Lehmann. February 4, 2016.

Blackwell B. Heinz Edgar Lehmann. November 5, 2015.

Feinstein AR. Clinimetrics. New Haven: Yale University Press; 1987.


January 26, 2017


Aitor Castillo’s comment

        From the outset, this 213-page book captures the interest of the reader. The edition is clean, precise, clear and orderly. As soon as I opened it, I said to myself:  “This is a book I have to read.”

        At the very beginning, being a clinical psychopharmacologist, I was afraid of not being capable of understanding these psychometric issues. However, the author reassures us: “(the psychometric procedures) are presented for readers without any requirement of particular mathematical-statistic knowledge” or, in other words, “these models have been amended for readers without mathematical knowledge.”

        By the way, I wonder how Per Bech was able to say so much in just one and a half pages in the introduction section. And I would add that I enhanced my neuroplasticity reading about so many bright people who contributed to the development of such an important area like Clinical Psychometrics.

        I think that in order to take full advantage of this book, the reader needs to be familiar with the concepts and practice of psychopharmacological clinical trials and, of course, to have the necessary insight about the phenomenological and clinical aspects of mental illness.

        I want to concentrate, in a very specific way, on chapter five, “The Clinical Consequence of IRT Analyses: the Pharmacopsychometric Triangle,” because it appears as very useful and evidence-forming to me. The data derived from the review of significant clinical drug trials are of great value to the clinician.

        Through this review, we learn that an effect size higher than 0.40 was only achieved with 10 mg donepezil when the Mini Mental State Examination was used. When the Alzheimer’s Disease Assessment Scale was used, 5 mg and 10 mg donepezil achieved effect sizes of 0.47 and 0.58, respectively.

        Regarding antipsychotic medications, haloperidol doses of 4, 8 and 16 mg are effective with an effect size of 0.50, 0.73 and 0.55, respectively. At the same time, it is reassuring that even such a relatively low dose as 4 mg of haloperidol causes considerable Parkinsonian symptoms and the highest dose of 16 mg causes very severe side effects without any signs of remission of depressive symptoms and consequently no increase in quality of life.

        Of the utmost importance are the numerous references to historical issues regarding the development and studies of some psychopharmacological drugs and their applications to the treatment of mental disorders. For example, the study performed at the Psychiatric Department of the Danish Rigshospitalet showed that severely manic patients could respond after 6 days of treatment with a fixed dose of 10 mg haloperidol and that the patients with the highest plasma concentration showed the best response. In another study at the University Hospital of Geneva, manic women responded after 14 days on an olanzapine dose of 20 mg and, again, the patients with the highest plasma concentration had the most pronounced effect.

        The author wisely says that it is vital to use the HAM-D6 in dose-response relationship studies, taking into account that many side effects are listed as depressive symptoms in the HAM-D or MADRS.

        Thus, regarding antidepressive medications, it is quite interesting to know that 10 mg escitalopram was an inadequate dose in patients with a marked degree of depression (≥30 at baseline on the MADRS), while both 40 mg of citalopram and 20 mg escitalopram achieved an effect size greater than 0.40.

        A landmark study including the WHO-5 quality of life scale noted that 50 mg desvenlafaxine led to FDA approval. However, the effect size only reaches 0.40 on the HAM-D6 for this dose.  Interestingly enough, the 100 mg desvenlafaxine reached an effect size above 0.40 on the HAM-D17, HAM-D6 and the WHO-5.

        Given the fact that anxiety disorders are among the most prevalent mental disorders, it is necessary to have some insight into the anxiolytic drugs. Fortunately, the author emphasized that 150 mg pregabalin is an inadequate dose, with a HAM-A14 effect size of 0.31, and only 0.20 on the valid HAM-A6. Pregabalin doses between 200 mg and 450 mg gave a HAM-A14 effect size of 0.56 and a HAM-A6 effect size of 0.49. Higher doses did not result in larger effect sizes. On the other hand, the alprazolam effect size is about 0.35 on the HAM-A14 and HAM-A6. Along the same lines, 75 mg venlafaxine showed an effect size of 0.40 on the HAM-A6 and 0.31 on the HAM-A14.

        Finally, a dose of lithium resulting in plasma concentrations between 0.8 and 1.2 mmol/l is most effective for an acute antimanic effect. For antidepressant augmentation, a concentration between 0.3 and 0.5 mmol/l is most effective.

        However, for long-term mood stabilization a concentration between 0.5 and 0.8 mmol/l is most appropriate. In this concentration range, lithium has no sedative effect on the functions relevant to car driving.

        Coming back to a more general perspective, the book also provides a very informative glossary section and an extremely useful appendix including many psychometric scales.

        My conclusion is that Bech’s book is a must-read for clinicians who want to attain a comprehensive knowledge of psychopharmacological research and are looking forward to working in this fascinating area of Psychiatry. In this context, clinical psychometrics does an immeasurable service.


December 22, 2016


Per Bech’s reply to Aitor Castillo’s commentary

        As a clinical psychopharmacologist, Aitor Castillo has, in contrast to many of the other reviewers of my book on clinical psychometrics, focused especially on the “Pharmacopsychometric Triangle.” This part of the book was the most central element and I am therefore very pleased to see how Aitor Castillo has metabolized the triangle.

        In psychometric terms we use “responsiveness” of a rating scale or questionnaire to indicate its ability to measure a relevant change when comparing active medicine against placebo in randomized, controlled clinical trials. Clinically we use “response” of a psychopharmacological drug to indicate the symptom reduction in the condition being treated from baseline to endpoint when compared to placebo in such a randomized, controlled clinical trial.

        To express “responsiveness” or “response” we use the new statistics of effect size, as emphasized by Aitor Castillo. Statistical significance as expressed by a p value cannot, as stated by Cumming (1), provide the answer to “how much?”, which is the question asked when measuring a clinically significant “responsiveness” by a rating scale or a questionnaire. Therefore, when using the “Pharmacopsychometric Triangle” we have to use effect size statistics when comparing (A) the amount of reduction of clinical symptoms with (B) the amount of undesired side effects induced by the psychopharmacological drug, and when (C) calculating the benefit of the treatment in the patient’s self-reported measure of improved well-being.
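        As an illustration only, and not taken from the book, the kind of effect size referred to here can be sketched as a standardized mean difference: the difference between the mean baseline-to-endpoint reduction on active drug and on placebo, divided by the pooled standard deviation. The function name, the scores and the use of 0.40 as a cut-off (the value cited in Aitor Castillo's comment) are assumptions made for this sketch; a given trial may compute its effect sizes differently.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_effect_size(active, placebo):
    """Standardized mean difference (Cohen's d) using the pooled SD."""
    n1, n2 = len(active), len(placebo)
    s1, s2 = stdev(active), stdev(placebo)  # sample standard deviations
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(active) - mean(placebo)) / pooled_sd


# Hypothetical baseline-to-endpoint reductions in a rating scale total score
# (larger value = more improvement); invented numbers, not trial data.
active_change = [10, 4, 7, 12, 3, 9, 6, 5]
placebo_change = [8, 2, 5, 10, 1, 7, 4, 3]

# Assumed cut-off for a clinically relevant effect, as cited in the comment.
CLINICALLY_RELEVANT = 0.40

d = standardized_effect_size(active_change, placebo_change)
print(f"effect size = {d:.2f}; clinically relevant: {d >= CLINICALLY_RELEVANT}")
```

        The same calculation could be applied separately to each corner of the triangle (symptom reduction, side effects, self-reported well-being), so that the three are expressed on a common scale.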

        With reference to the “Pharmacopsychometric Triangle” it is, therefore, essential that rating scales measuring the desired clinical symptom reduction do not contain symptoms covering the undesired side-effects of psychopharmacological drugs as correctly emphasized by Aitor Castillo who refers to dose-response relation trials, for example, of antidepressive medicine.

        My recent book on measurement-based care (2) actually has the “Pharmacopsychometric Triangle” as its platform. It is an attempt to apply, in the practical, routine treatment plan with psychopharmacological medication, the short rating scales or questionnaires which have been found valid in randomized, placebo-controlled clinical trials taking the “Pharmacopsychometric Triangle” into account.


Bech P. Measurement-based care in mental disorders. Springer Briefs in Psychology. New York: Springer Verlag; 2016.

Cumming G. Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. London: Routledge; 2012.


February 9, 2017

                   June 17, 2019

January 9, 2020