International Journal for Quality in Health Care Advance Access originally published online on June 14, 2007
International Journal for Quality in Health Care 2007 19(4):183-186; doi:10.1093/intqhc/mzm024
Editorial
Publishing quality measures: how it works and when it does not?
E-mail: hamblinsusa{at}yahoo.com
Efforts to measure the quality of healthcare are longstanding, but over the past 10 years many industrialized countries have increasingly sought both to measure quality and to publish the resulting data. This may simply be part of a broad trend in western societies towards greater openness and accountability, one that seems particularly evident in human services such as health and education. However, publication is also expected to encourage improvements in the quality of care.
Improved quality is expected to happen in one of two ways. First, having performance information easily available may encourage greater consumerism among patients, so that they seek out the best providers, a premise that underlies the stated policy of both the US and UK governments. As yet, the evidence for this happening is scanty and equivocal [1, 2]. Second, the threat that patients may start to use presumably reliable information to choose providers, or even a natural competitiveness among clinicians, may stimulate providers to improve their services regardless of patients' responses. This second reason for publication gains little official recognition, reflecting an element of subterfuge in a policy that has the happy characteristic of having two chances of working. If patients will not respond en masse to the information, then providers might.
The belief that publication will stimulate providers to improve depends upon exploiting a range of explicit and implicit incentives. Proponents expect that rewards and sanctions for individual providers will follow the publication of information about quality, either organically (e.g. the reporting itself confers kudos or censure on the provider) or through their explicit application by purchasers or regulators in response to the data. Schemes as diverse as Star Ratings in England [3], the Quality and Outcomes Framework throughout the UK [4], the Wisconsin Quality Counts initiative [2], and the Quality Benchmarking and Dialog program in Germany [5], while reflecting their local conditions and using different incentives, are all examples of what I term incentivized measurement schemes.
However, provider responses to incentivized measurement schemes to date have been mixed, and unintended consequences have been noted. One reason may be that the design of the schemes did not consider the complex range of incentives that the measurement and reporting of quality can activate. At least some of the reported unintended consequences may arise because it is unclear precisely which incentive a scheme activates, so that the wrong one is activated.
Table 1 proposes a typology of incentives that may be activated by measurement schemes. These range from the intrinsic motivation of medical professionals to do good and to do it well, to the more explicit incentives seen in pay-for-performance schemes that direct a provider to do x and get y. In fact, most schemes use a mix of these incentives. For example, English Star Ratings used a mixture of direct incentives (£1 m for high-performing hospitals) and implicit system incentives (the kudos of working for a high-performing organization and the censure of working for a poor one). Another widely studied example, the publication of cardiac mortality data in New York State, has the same implicit incentives as Star Ratings, but supplements them with the indirect incentive of the potential to increase market share through consumers' responses to the published data [6].
This sort of typology is in itself quite straightforward, but complexity arises because these incentives are felt by organizations and individuals that behave in very different ways. Reflecting upon the English experience of central targets, Bevan and Hood [7] suggested the classification of organizations and individuals set out in Table 2. One might allow that the truly unpredictable rational maniac is a rare beast, yet concerns about the sort of gaming that denies care to the sickest patients, distorts clinical priorities and encourages creativity in recording performance have been expressed about schemes in the USA and the UK regularly enough to suggest that encouraging reactive gaming is a serious risk. Further, if such behaviour is not identified and discouraged, it will be seen to be rewarded, and a race to the bottom ensues in which a majority of organizations see it as being in their interests to act as reactive gamers. As a consequence, the measurements themselves lose credibility.
Examples of unintended consequences abound. The publication of mortality rates for cardiac surgeons in New York State has been linked with discouraging surgeons from treating the sickest patients [6, 8]. In the UK, targets for the time within which Accident and Emergency (A&E) Department attendees must be admitted or discharged have been linked with the use of medical admission units as a method of conveniently re-labelling patients to ensure that they are no longer in the emergency room when their time runs out [9]. A survey by the British Medical Association in 2003 found that in more than half of English hospitals A&E consultants believed that clinical priorities were distorted to achieve this waiting target [10].
Arguably as damaging as direct negative consequences to patient care is the encouragement of a culture of dishonesty in the recording of data. Examples range from the subtle to the almost comically obvious. The former includes up-coding, that is, altering recording practice to make individual cases look as complex as possible, which both maximizes income and creates advantages in risk-adjustment [11]. More straightforward falsification of data can be seen in the measurement of the time ambulances took to respond to potentially life-threatening emergencies in England. In about one-third of services, responses just in excess of the 8 min target were regularly corrected to be just inside the target time, creating clearly impossible distributions of response time [12].
Often these problems are exacerbated by weaknesses in the measures themselves. Deficiencies in the reliability, validity and scope of measures create scepticism about their value and encourage perverse responses. For example, even with case mix adjustment, measures are often unreliable for small populations. Hofer et al. [13] estimated that measures of the quality of diabetes care were so unreliable that simply removing the three patients with the highest HbA1c measures from a physician's panel of patients would have dramatically improved the performance of apparent outliers. Such large effects from comparatively small changes show both the problems of making judgements from an unreliable measure and why the temptation to game is so powerful. Another study reviewed six different performance measures in the UK [14], each of which purported to measure the same underlying health care constructs, but found little correlation between results, leading the authors to question the validity of the measures' construction. Finally, measurements that concentrate on only one part of a system of care may force rigid management-to-measure responses that have negative effects elsewhere in the system, creating situations in which overall quality of care or outcome actually declines despite an apparently useful measure being achieved [15].
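One simple way to see why "corrected" response times produce an impossible distribution is to compare how many responses fall just inside the target with how many fall just outside it. The sketch below is illustrative only: the function name `heaping_ratio`, the 30-second window and the synthetic times are assumptions made for this example, not the method or data used in the report cited above [12].

```python
def heaping_ratio(times_sec, target_sec=480, window_sec=30):
    """Ratio of responses just inside the target to those just outside it.

    With honest recording the two neighbouring windows should hold broadly
    similar counts, so the ratio should be close to 1. A ratio well above 1
    suggests times are being moved across the threshold.
    """
    just_inside = sum(1 for t in times_sec
                      if target_sec - window_sec < t <= target_sec)
    just_outside = sum(1 for t in times_sec
                       if target_sec < t <= target_sec + window_sec)
    if just_outside == 0:
        # No responses just outside: infinite ratio if any sit just inside.
        return float("inf") if just_inside else 1.0
    return just_inside / just_outside


# Synthetic (hypothetical) response times in seconds, evenly spread
# around the 8 min (480 s) target.
honest = list(range(300, 660, 2))

# Hypothetical "correction": anything up to 20 s late is recorded as 479 s.
corrected = [479 if 480 < t <= 500 else t for t in honest]

print(heaping_ratio(honest))
print(heaping_ratio(corrected))
```

A check of this kind looks at the shape of the distribution around the threshold rather than only at the proportion of responses meeting it, which is why it can expose edits that a single pass-rate figure would hide.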
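The sensitivity of small-panel measures can be illustrated with a few lines of arithmetic. The sketch below uses an invented panel of 20 HbA1c values (not data from Hofer et al. [13]) and a hypothetical helper, `mean_without_top`, to show how far a panel average can move when only three patients are removed.

```python
from statistics import mean


def mean_without_top(values, k=3):
    """Mean of the panel after dropping the k highest values."""
    if k < 0 or k >= len(values):
        raise ValueError("k must be between 0 and len(values) - 1")
    kept = sorted(values)[:len(values) - k]
    return mean(kept)


# Hypothetical panel: 17 patients with HbA1c of 7.5% and 3 with 12.0%.
panel = [7.5] * 17 + [12.0] * 3

full_mean = mean(panel)                    # average over all 20 patients
trimmed_mean = mean_without_top(panel, 3)  # average after removing 3 patients

print(full_mean, trimmed_mean)
```

In this invented panel the average falls by roughly two-thirds of a percentage point once three patients are dropped, which gives a sense of how a physician with a small panel could move from apparent outlier to unremarkable without any change in the care delivered.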
There are two distinct types of solutions to these problems. First, the articulation of explicit typologies of incentives and organizational types may help in thinking through exactly how a scheme might work. What mix of incentives, explicit and implicit, is activated by the scheme? How might each type of organization respond? What are the loopholes and flaws that a reactive gamer could exploit? What is the worst-case scenario? The second set of solutions lies in the technology of measurement. For example, single measures and measures that have a single threshold point of success or failure are both more susceptible to gaming and falsification and more likely to create perverse effects elsewhere in the system. Conversely, networks of related measures and measures which consider more than one point of a distribution may minimize the capacity for gaming. Similarly, some degree of uncertainty in the measurements (where the overall objective is known by those being assessed, but the precise measure of success is not) may reduce the incentive to game at the margins.
Incentivized measurement schemes are a promising strategy for encouraging improvement in healthcare. But they are not a panacea, and they risk encouraging unintended consequences that lead only to the impression, rather than the reality, of improvement. Designing them requires consideration of the provider's incentives to respond to the measurement and publication of quality, and no little technical expertise. Bringing this discipline to bear increases the likelihood of designing a successful scheme and avoiding several well-documented pitfalls.
2006/07 Harkness Fellow,
Group Health Cooperative of Puget Sound
Acknowledgements
Support for the research that underpins this editorial was provided by The Commonwealth Fund. The views presented here are those of the author and should not be attributed to The Commonwealth Fund or its directors, officers, or staff.
References
1. Marshall M, Shekelle P, Leatherman S, et al. The public release of performance data: what do we expect to gain? A review of the evidence. JAMA (2000) 283:1866-74.
2. Hibbard J, Stockard J, Tusler M. Hospital performance reports: impact on quality, market share and reputation. Health Aff (Millwood) (2005) 24:1150-60.
3. Mannion R, Davies H, Marshall M. Impact of star performance ratings in English acute hospital trusts. J Health Serv Res Policy (2005) 10:18-24.
4. Doran T, Fullwood C, Gravelle H, et al. Pay for performance programs in family practices in the UK. N Engl J Med (2006) 355:375-84.
5. Veit C. National Quality Benchmarking in Germany: the Structured Dialog, The Commonwealth Fund 2005 International Symposium on Health Care Policy. http://www.allhealth.org/briefingmaterials/VeitC_2005-10-31-391.PDF (16 April 2007, date last accessed).
6. Chassin MR, Hannan EL, De Buono BA. Benefits and hazards of reporting medical outcomes publicly. N Engl J Med (1996) 334:394-8.
7. Bevan G, Hood C. Governance by Targets and Terror: Synecdoche, Gaming and Audit, Westminster Economic Forum. http://www.publicservices.ac.uk/materials/Governance%20by%20Targets.ppt (16 April 2007, date last accessed).
8. Omoigui N, Miller D, Brown K, et al. Outmigration for coronary bypass surgery in an era of public dissemination of clinical outcomes. Circulation (1996) 93:27-33.
9. Mayhew L, Smith D. Using Queuing Theory to Analyse Completion Times in Accident and Emergency Departments in the Light of the Government 4 hour Target (2006) City of London: Cass Business School. Actuarial Research Paper No. 177, ISBN 978-1-095752-06-5.
10. British Medical Association. BMA survey of A&E waiting times 2003. http://www.bma.org.uk/ap.nsf/AttachmentsByTitle/PDFAEsurvey/$FILE/AEsurvey.pdf (16 April 2007, date last accessed).
11. Carter G, Newhouse J, Young S. How much change in the case mix index is DRG creep? J Health Econ (1990) 9:411-28.
12. Commission for Health Improvement. What CHI has found in: ambulance trusts. http://www.healthcarecommission.org.uk/_db/_documents/04000052.pdf (16 April 2007, date last accessed).
13. Hofer TP, Hayward RA, Greenfield S, et al. The unreliability of individual physician "report cards" for assessing the costs and quality of care of a chronic disease. JAMA (1999) 281:2098-105.
14. Brown C, Lilford R. Cross sectional study of performance indicators for English primary care trusts: testing construct validity and identifying explanatory variables. BMC Health Serv Res (2006) 28:81.
15. Fee C, Weber E. Identification of 90% of patients ultimately diagnosed with community-acquired pneumonia within four hours of emergency department arrival may not be feasible. Ann Emerg Med (2007) 49:553-559.