
Regression models for analyzing costs and their determinants in health care: an introductory review

Dario Gregori , Michele Petrinco , Simona Bo , Alessandro Desideri , Franco Merletti , Eva Pagano
DOI: http://dx.doi.org/10.1093/intqhc/mzr010 331-341 First published online: 19 April 2011


Objectives This article aims to describe the various approaches in multivariable modelling of healthcare costs data and to synthesize the respective criticisms as proposed in the literature.

Methods We present regression methods suitable for the analysis of healthcare costs and then apply them to an experimental setting in cardiovascular treatment (COSTAMI study) and an observational setting in diabetes hospital care.

Results We show how methods can produce different results depending on the degree of matching between the underlying assumptions of each method and the specific characteristics of the healthcare problem.

Conclusions The matching of healthcare cost models to the analytic objectives and characteristics of the data available to a study requires caution. The study results and interpretation can be heavily dependent on the choice of model with a real risk of spurious results and conclusions.

  • cost analysis
  • skewness
  • zero-cost
  • censoring


The analysis of costs in clinical and public health care has become a standard part of both experimental and epidemiological research. This is motivated by the growing interest in controlling public expenditure, with a view to adopting interventions or treatments on the basis of their cost-effectiveness [1]. The healthcare system is obviously very concerned with cost evaluation and control, and predictive models are needed to understand how costs behave as a function of given patients' or centres' characteristics. Recent papers have examined healthcare delivery from the perspective of interest here, namely the understanding of how costs are related to a set of given covariates [2, 3]. The relationship of costs to the underlying features of healthcare delivery can be investigated in very traditional fields, like drug utilization [4], or with respect to more qualitative outcomes, like quality of life or anxiety and depression scores [5]. The conceptualization of the cost process for a single patient or for a healthcare institution as a stochastic phenomenon [6] is perhaps the major explanation for the growth of a vast body of research on appropriate statistical methods and models in this area. Put simply, cost analysis is mainly devoted to:

  1. finding appropriate methods to get the most accurate estimate of the mean costs of treating the disease (clinical setting);

  2. testing for differences among two or more groups of treated patients (experimental setting); and

  3. identifying the patients'/structure characteristics influencing costs and getting an estimate of expected costs, at a fixed point in time during follow-up or disease progression, for specific types of patients (cost profiling in observational settings).

Statistical methods appropriate for comparing new treatments or interventions in terms of their cost-effectiveness have been largely investigated [7–44] and agreed upon [45, 46]. The development of methods suitable for modelling patients' costs in terms of clinical or socio-demographic covariates has been more recent and less conclusive. This is mainly due to a lack of interest in cost-prediction exercises, in spite of their great importance for planning and optimization purposes [47–50].

This article aims to describe some of the main criticisms raised about multivariable modelling of healthcare cost data and to synthesize the respective methodological approaches proposed in the literature. The described methods are applied to two datasets, an experimental and an observational one, referring respectively to the cost of healthcare delivery in acute myocardial infarction and to a cohort of diabetic patients facing repeated hospitalizations.

The diabetes cohort and the COSTAMI study

The ‘diabetes cohort’ is a retrospective study analysing the repeated hospitalization in a cohort of diabetic patients [51, 52].

A cohort of all type 2 diabetic patients, resident in the Piedmont Region, attending the Diabetic Clinic of the San Giovanni Battista Hospital in Turin (Piedmont Region, Italy) during 1995 and alive on 1 January 1996 was identified (n = 3892). Mortality and hospitalization follow-up was carried out up to 30 June 2000. The patients were included in the study if they had at least one hospitalization in the subsequent years of follow-up. The final dataset included 2550 patients. A total of 4816 ordinary and 2183 daily hospitalizations were observed during the 4.5 years of follow-up. About 13% of the sample (342 patients) died during the follow-up. Demographic data (age, gender) and clinical data relative to the year 1995 (years of diabetes and number of co-morbidities) were utilized for the present analysis. The costs (in Euros) for daily and ordinary hospitalizations were calculated referring to the Italian DRGs (Diagnosis Related Groups) system. These data were analysed in an earlier paper [52], where all details on the cohort can be found, and are here used as support for some methodological comments.

The COSTAMI (COST of strategies After Myocardial Infarction) study [53, 54] compared an early discharge strategy based on stress echocardiography (EDS) with standard care based on clinical evaluation (CE) and post-discharge exercise electrocardiography (ECG). Patients were randomized either to an early discharge strategy (Day 3–4) after a stress EDS or to CE with a traditional discharge after 7–9 days of observation. Direct medical costs were collected for the 458 enrolled patients during a 1-year follow-up. The main analysis [53] was done fitting a model for costs as a function of the strategy, adjusted for the presence of known factors increasing the likelihood of events in the follow-up: gender, age, ejection fraction (EF), presence of antero-lateral ECG modifications at admission and presence of diabetes. The distribution and effects of all the variables, as well as further details about the general structure of the trial and detailed information on the observed variables are illustrated in the main paper [53].

Models for costs

The estimation of total medical cost is not straightforward, particularly when the goal of the analysis is to relate costs to a specific pattern of covariates. Indeed, accurate cost estimation is problematic when cost records are incomplete, because censoring can lead to biased estimates of costs unless appropriately accounted for in the analysis [55].

The main problems in analysing costs are the following (for a short definition of the terms used, please refer to Table 1):

  1. an asymmetry of the distribution (Fig. 1), due to a minority of subjects with high medical cost compared with the rest of the population;

  2. a possible large mass of observations with zero-cost;

  3. a common presence of censored observations that do not satisfy the condition of independent, non-informative censoring: this condition is needed because the individuals still under observation must be representative of the population at risk in each group, otherwise the observed failure rate in each group will be biased; and

  4. frequent violation of the assumption of proportional hazards, particularly when costs are accumulated at different rates [56]. This is clearly seen in the COSTAMI data accumulation process (Fig. 2).

Table 1

Glossary of the main statistical terms used in the paper

Skewness: Skewness is a measure of the asymmetry of the distribution of a variable. It can be positive, negative or zero. A negative skewness (a positive one indicates the opposite) means that the tail on the left side of the distribution is longer than that on the right, and the bulk of the values (including the median) lie to the right of the mean. A zero value indicates that the values are spread relatively evenly on both sides of the mean, as is characteristic of the normal distribution
Heteroscedasticity: Heteroscedasticity occurs when the error variance is not homogeneous. It results in inefficient estimators and biased standard errors, rendering t-tests and confidence intervals unreliable
Censoring: Censoring occurs when the value of an observation is only partially known. A typical example is survival, which is usually known only up to the end of the follow-up period, commonly shorter than the actual survival time if no final event occurs. In this case the survival time is said to be (right) censored
Zero-costs: Zero-costs indicate that, according to the definition of cost adopted in the study (for example, hospitalization costs), no costs have been recorded for a patient (e.g. because no hospitalizations occurred), and the cost variable is therefore given a value of zero
Figure 1

Cost distribution for the diabetes cohort, with zero-cost patients (left plot) and without (right plot).

Figure 2

Cost accumulation over the 1-year follow-up of the COSTAMI trial.

With reference to these characteristics and particularly to the presence of censoring, several studies in the literature [55, 57] have proposed using survival models such as the Kaplan–Meier and the Cox regression model, based on the conceptual similarity between costs and time, both being continuous non-decreasing variables. However, the assumptions behind the survival models listed above are often violated in cost estimations.

The very definition of censoring is not homogeneous in the literature on regression models for cost data (Table 2). Apart from the misleading approach of ignoring censoring when it is present, two main definitions are adopted. In the first, costs are accumulated up to a pre-specified point in time, and the death of the patient is seen as a failure to observe the full cost accumulation. In the second, more standard definition, censoring occurs when patients end the follow-up without events, thus interrupting their cost accumulation [55].

Table 2

Definitions of censoring as appearing in the literature regarding cost data

Analysis | Censoring definition | Caveats
Administrative | Cost till death [76] | Only dead patients have complete follow-up history; cost and survival are closely related
Loss at follow-up | Cost till death | Only dead patients have complete follow-up history; possible informative censoring
Death censoring | Cost up to a pre-specified time [55] | Only patients alive at the end of follow-up are uncensored; informative censoring
No-censoring (actual data) | Observed costs | Downward bias in cost estimation

The three main issues (skewness, zero-costs and censoring) highlighted above regarding models for cost data can be used to classify the methods and models proposed to deal with them (Table 3).

Studies with skewness-only problem

Without censoring and zero-costs, the issue in modelling healthcare costs is basically an exercise of model fitting for highly skewed data. In cost analysis, things are complicated by the fact that additivity is a property of the model that is commonly considered essential: costs are indeed added together to build up the total expenditure. In this sense, the mean cost is a more meaningful measure than other more robust alternatives, like the median or the geometric mean cost. Nevertheless, the estimate of the mean cost, possibly as a function of a set of covariates, must be as robust as possible to outliers or asymmetries in the data distribution. In our diabetes data, the effect of skewness is readily appreciated. As shown in Fig. 1, the cost distribution is very skewed: the estimate of its centre is 3913€ when based on the median, but it is dragged to the right by almost 80% (up to 7013€) when based on the mean. This is something to be borne in mind when considering the perhaps simplest model, the linear model, fitted either via ordinary least squares (OLS) or maximum likelihood (ML), which assumes the following form for the costs:

ci = β0 + β1 xi1 + … + βp xip + εi,  i = 1, …, n,

where ci is the cost observed for patient i, xi1, …, xip are the p covariates and β1, …, βp the corresponding regression coefficients, estimated with the least-squares method. The residuals εi are commonly assumed to be normally distributed. To reduce skewness in the residuals, several transformations of the response variable have been proposed in the literature, and these are commonly able to achieve a reasonable normalization even in the presence of highly skewed data.
Notice that the approach of transforming the costs requires in any case a back-transformation at the moment of interpreting the results, causing several additional problems [58], partially avoided using approaches like the 'smearing' estimator [59].
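Both effects can be illustrated with a small simulation (a minimal sketch on artificial log-normal costs, not the diabetes data; all parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed costs (log-normal), mimicking the pattern in Fig. 1
costs = rng.lognormal(mean=8.0, sigma=1.0, size=5000)

print(np.median(costs))   # the median sits well to the left ...
print(costs.mean())       # ... of the mean, dragged right by the long tail

# Naive back-transformation of a log-scale mean underestimates E[c] ...
log_mean = np.log(costs).mean()
naive = np.exp(log_mean)

# ... while Duan's smearing estimator rescales by the mean exponentiated residual
resid = np.log(costs) - log_mean
smearing = np.exp(log_mean) * np.exp(resid).mean()

print(naive, smearing)
```

With only an intercept, the smearing correction recovers the sample mean exactly; in a regression with covariates it corrects, under homoscedastic errors, the systematic underestimation produced by the naive back-transformation.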

In the case of the Box–Cox transformation [60], one of the most used uniparametric transforms, the costs are transformed, for a parameter λ, as

ci(λ) = (ci^λ − 1)/λ  if λ ≠ 0,   ci(λ) = log(ci)  if λ = 0.

Although very powerful in producing a symmetric distribution, as shown in Fig. 3 where several choices of the λ parameter are fitted, normality of the transformed costs is still assumed. In addition, when the back-transformation is performed, the resulting bias can be proven to be a function of the variance function. Thus, if heteroscedasticity is present, additional efficiency and inference problems arise on the transformed scale.

Figure 3

Box–Cox transform varying λ (see the section Studies with skewness-only problem) for the diabetes cohort.
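The normalizing effect of the transform can be sketched as follows, assuming SciPy is available (`scipy.stats.boxcox` chooses λ by maximum likelihood when none is supplied; the data are simulated, not the cohort's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
costs = rng.lognormal(mean=8.0, sigma=1.2, size=2000)  # skewed, strictly positive

# Box-Cox: c(lam) = (c**lam - 1)/lam for lam != 0, log(c) for lam == 0;
# scipy picks lam by maximum likelihood when no value is given
transformed, lam_hat = stats.boxcox(costs)

print(stats.skew(costs))        # strongly positive on the original scale
print(stats.skew(transformed))  # near zero after the transform
print(lam_hat)                  # close to 0 here, i.e. near the log transform
```

For log-normal data the fitted λ lands near zero, confirming the log as the special case of the family.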

A particular case of the Box–Cox transform (λ = 0) is the lognormal model, where the transformation is log(ci). Suppose we have two treatments j = 1, 2, with log-costs normally distributed with means γj and variances σj². A test of the null hypothesis H0: γ1 − γ2 = 0 is actually a test on the geometric means. This is clearly seen by observing that the expected cost on the original scale is E(cj) = exp(γj + σj²/2), so that equality of the mean costs, exp(γ1 + σ1²/2) = exp(γ2 + σ2²/2), implies H0: γ1 − γ2 = 0 only if σ1² = σ2². Again, this amounts to a change in the meaning of the test: the test of interest (the equality of mean costs on the original scale) is equivalent to the test performed on the transformed scale only in the case of homogeneous variances in the treatment groups.
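The point can be verified numerically: two simulated groups with the same geometric mean but different log-scale variances are indistinguishable on the log scale, yet their arithmetic mean costs clearly differ (illustrative simulation, hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two "treatment" groups with the SAME geometric mean (log-mean 8)
# but different log-scale variances
g0 = rng.lognormal(mean=8.0, sigma=0.5, size=n)
g1 = rng.lognormal(mean=8.0, sigma=1.5, size=n)

# Equal on the log scale (the lognormal model's null hypothesis) ...
print(np.log(g0).mean(), np.log(g1).mean())   # both near 8

# ... yet arithmetic means differ, since E[c] = exp(mu + sigma**2 / 2)
print(g0.mean(), g1.mean())
```

A t-test on the log scale would therefore declare the treatments equivalent even though one of them is, on average, far more expensive.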

To understand how the interpretation of coefficients changes, consider the simple case of a linear model relating costs to the number of co-morbidities, c = β0 + β1 N + ε, where c are the costs and N the number of co-morbidities. In this case, the estimates of the coefficients are 2155.15 (SE 1220.02) for the intercept and 2946.98 (SE 474.93) for the slope. These coefficients are directly interpretable as 'mean effects'. This is no longer true when the model is written on the log scale, log(c) = β0 + β1 N + ε, because in this case the slope coefficient is not an average effect on costs but a multiplicative effect on the geometric mean.

A common solution to the issue of outliers is the so-called threshold model, where the probability of having costs in excess of a given threshold is modelled, commonly using a logistic function. Two different cut-off points in the cost distribution are generally used, the median q2 and the third quartile q3, leading to the model

logit P(c > q) = β0 + β1 x1 + … + βp xp,  with q = q2 or q3.

The model estimates nothing but the probability of having a cost greater than the median (or third quartile) as a function of the observed covariates. Although it does not require normality and can also work for heavily skewed distributions, it does not give an estimate of the mean costs, and the conclusions are sensitive to the threshold chosen, which is sample dependent when based on percentiles. Obviously, no information is provided on how the covariates affect the rest of the cost distribution. Estimating the simple model logit P(c > 3913) = β0 + β1 N, where 3913 is the median cost and N the number of co-morbidities, yields an estimated equation from which, for a person with two co-morbidities, the probability of exceeding the median cost can be computed. A vastly better solution, proven by simulation to be highly preferable [61], is based on the generalized linear model (GLM) [62], which solves the open issue of the response transformation, namely that in general E[g(c)] ≠ g(E[c]), by transforming the expectation instead [63]:

g(E[c|x]) = β0 + β1 x1 + … + βp xp,

where the distribution of the response is usually taken to be Gamma and the link function g() is usually taken (i) as the identity I() for additive effects and (ii) as the log() for multiplicative models, avoiding in the latter case the bias in the back-transformation [64]. This approach is highly flexible and can accommodate a variety of distributions while maintaining linearity of the effects on the response scale. In addition, the choice of transformation can be completely parameterized [65], moving toward a data-driven modelling strategy.
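As a sketch of the log-link Gamma GLM: with variance function V(μ) = μ² the IRLS working weights are constant, so the fit reduces to iterated OLS on a working response (simulated data with hypothetical coefficients; in practice a packaged GLM routine would be used):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# One covariate, e.g. number of co-morbidities (hypothetical data)
x = rng.integers(0, 4, size=n).astype(float)
X = np.column_stack([np.ones(n), x])

# Gamma-distributed costs with log-linear mean: E[c|x] = exp(7 + 0.4*x)
beta_true = np.array([7.0, 0.4])
mu = np.exp(X @ beta_true)
shape = 2.0
costs = rng.gamma(shape, mu / shape)

# IRLS for a Gamma GLM with log link: since V(mu) = mu**2 and d(mu)/d(eta) = mu,
# the working weights are constant and each step is OLS on the working response
beta = np.array([np.log(costs.mean()), 0.0])
for _ in range(25):
    eta = X @ beta
    mu_hat = np.exp(eta)
    z = eta + (costs - mu_hat) / mu_hat   # working response
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)

print(beta)  # close to (7, 0.4); exp(beta[1]) is the multiplicative effect
```

The slope is read multiplicatively on the original cost scale, exp(0.4) ≈ 1.49 per additional co-morbidity, with no back-transformation bias to correct.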

Studies with zero-costs

The issue of a possibly large mass of observations with zero costs involves both model inadequacies and conceptual problems. From the model side, the application of standard models, in particular OLS-based ones, can lead to negative predicted costs. From the conceptual point of view, it is highly questionable that two populations (one with zero and the other with positive costs) behave in the same way with respect to the covariates.

A common solution, widely adopted in the applied literature to address the 'model' issue, is to add a positive constant k to the costs, therefore modelling log(ci + k), usually in an OLS framework. Apart from the big advantage of being easy to implement, this solution has several drawbacks: poor behaviour, arbitrariness in the choice of k, an aggravated back-transformation problem [66], and no accounting for the different behaviour of patients with zero costs.

To address the latter issue, an elegant approach is based on the concept of a latent variable: an underlying variable z is introduced, modelled as z = β0 + xβ + ε, but observable only as a binary outcome y = 1 if z > 0 and y = 0 if z ≤ 0. To add flexibility to the modelling strategy, the Tobit model [67] does not require the observable variable to have a binary form, modelling instead the observed variable c = max(0, z), where z = xβ + u, with u|x ∼ Normal(0, s²). The Tobit model uses ML to estimate both β and s. Table 4 shows the estimates for the diabetes cohort.
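The Tobit likelihood is simple enough to maximize directly: uncensored observations contribute a normal density and zero observations the probability mass P(z ≤ 0). A minimal sketch on simulated data (not the diabetes cohort), assuming SciPy:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
n = 3000

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true, s_true = np.array([0.5, 1.0]), 1.0

# Latent variable z, observed as c = max(0, z): a point mass at zero
z = X @ beta_true + rng.normal(scale=s_true, size=n)
c = np.maximum(z, 0.0)

def neg_loglik(params):
    # params = (beta..., log_s); log_s keeps the scale parameter positive
    beta, log_s = params[:-1], params[-1]
    s = np.exp(log_s)
    xb = X @ beta
    ll_pos = stats.norm.logpdf(c, loc=xb, scale=s)   # uncensored contribution
    ll_zero = stats.norm.logcdf(-xb / s)             # P(z <= 0) at zero
    return -np.sum(np.where(c > 0, ll_pos, ll_zero))

res = optimize.minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_hat, s_hat = res.x[:2], np.exp(res.x[2])
print(beta_hat, s_hat)  # near (0.5, 1.0) and 1.0
```

Note that `beta_hat` recovers the effect on the latent z, which, as discussed below, is not the effect on the observed costs.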

Table 3

Models used in cost-analyses

Model | Skewness | Zero-cost | Censoring | Mean estimation
Original scale models
 OLS (ci) | | | | X
 Tobit/adjusted Tobit | | X | | X
 GLM (gamma, log-gamma) | X | O | | X
Transformed response
 OLS log(ci + k) | O | O | |
 Threshold logit models | X | X | |
Survival models
 Parametric (Weibull) | X | | X | X
 Semiparametric (Cox proportional hazards) | X | | X | O
 Non-parametric additive regression [80] | X | | X | X
Mixed models | X | X | | X
Weighted regression [73, 86] | X | | X | X
  • X = condition satisfied, O = condition partially satisfied, blank = condition not satisfied.

It is important to realize that β estimates the effect of x on z, the latent variable, not on the observed cost process c. Unless the latent variable z is itself of interest, the coefficients cannot be interpreted directly, because E(c|x) = Φ(xβ/s) xβ + s φ(xβ/s), so that the effect on the observed costs, ∂E(c|x)/∂xj = βj Φ(xβ/s), depends on all the covariates through the scaling factor Φ(xβ/s) and is very hard to interpret. In addition, Tobit models have been shown to be very sensitive to departures from normality or homoscedasticity [68], which can be easily detected by a simple residual plot, like the one shown in Fig. 4, where normality appears to hold.

Figure 4

Residuals plot for the Tobit model for diabetes cohort (Table 4).

Table 4

Tobit fit to diabetes cohort

Variable | Value | Std. Error | z | P
Gender (M vs. F) | −62.85 | 424.3257 | −0.148 | 0.88
Years of diabetes | 50.48 | 25.0192 | 2.018 | 0.04
Number of co-morbidities ≥ 1 | 2134.09 | 605.1603 | 3.526 | <0.001

A class of models that explicitly takes into account the different nature of the two populations, one with positive and the other with zero costs, is the so-called mixed (or two-part) model [69], where the conditional expectation is partitioned as

E(c|x) = P(c > 0|x) · E(c|c > 0, x).

The expectation is thus split into two parts: the first models the probability of any use or expenditure, is based on the full sample and is usually specified as a logit or probit GLM; the second models the actual level of expenditure conditional on c > 0, usually via an appropriate model in the linear or GLM class. When the second part is fitted on the log scale, estimates of mean costs are obtained using the smearing estimator, multiplying the back-transformed prediction by the mean of the exponentiated residuals [66]. The different effects of the same covariates on the two model parts are clearly seen in the diabetes cohort, where the first part has been modelled through a standard logit model and the second via a linear OLS model (Table 5).

Table 5

Mixed model fit to diabetes cohort

Variable | Value | Std. Error | t-value | P-value
Logit model
 Years of diabetes | 0.02 | 0.00 | 5.84 | <0.001
 Number of co-morbidities ≥ 1 | 0.69 | 0.11 | 6.45 | <0.001
OLS model
 Years of diabetes | 49.83 | 24.24 | 2.06 | 0.02
 Number of co-morbidities ≥ 1 | 2596.41 | 566.67 | 4.58 | <0.001

The marginal effect on costs of a given covariate xh can be calculated by taking the partial derivative of the marginal expectation:

∂E(c)/∂xh = [∂P(c > 0)/∂xh] E(c|c > 0) + P(c > 0) [∂E(c|c > 0)/∂xh].

In the diabetes cohort, P(c > 0) = 0.54 and E(c|c > 0) = 7509.82, so that for the covariate 'Years of diabetes' the effect on the marginal expectation can be computed recalling (Table 5) that βlogit = 0.025 and βOLS = 49.83, obtaining that costs increase on average by €208 for each additional year of diabetes history. Note that two-part models can suffer from the back-transformation problem if the model part for the continuous response is based on a transformed response, e.g. a log transform [70]. To avoid this, substituting a Gamma model for the OLS part is a valid option [71].
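A two-part fit and the marginal-effect formula above can be sketched on simulated data (all coefficients are hypothetical; the logit part is fitted by Newton–Raphson, the positive part by OLS, and the logit derivative is evaluated at the average probability):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

x = rng.normal(size=n)                 # a standardized covariate (hypothetical)
X = np.column_stack([np.ones(n), x])

# Part 1 generates "any cost at all", part 2 the level of positive costs
p_any = 1 / (1 + np.exp(-(0.2 + 0.6 * x)))
any_cost = rng.random(n) < p_any
costs = np.where(any_cost, 8000 + 1500 * x + rng.normal(0, 800, n), 0.0)

# Part 1: logistic regression fitted by Newton-Raphson
b_logit = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ b_logit)))
    W = p * (1 - p)
    b_logit += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (any_cost - p))

# Part 2: OLS on the positive costs only
pos = costs > 0
b_ols, *_ = np.linalg.lstsq(X[pos], costs[pos], rcond=None)

# Marginal effect of x on E[c] = P(c > 0) * E[c | c > 0]:
# dE/dx = dP/dx * E[c|c>0] + P * dE[c|c>0]/dx,
# with dP/dx approximated by b_logit[1] * p * (1 - p) at the average p
p_bar = pos.mean()
me = b_logit[1] * p_bar * (1 - p_bar) * costs[pos].mean() + p_bar * b_ols[1]
print(b_logit, b_ols, me)
```

The two terms separate the covariate's effect on the propensity to incur any cost from its effect on the amount spent, mirroring the decomposition in the formula above.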

Figure 5

Aalen estimates of cumulative effects of the EF (left plot) and gender (right plot, female vs. male) on overall costs up to 1 year after discharge, in the COSTAMI trial. Dotted lines are 95% confidence intervals.

Studies with censored data

To adjust for censoring, the basic idea is to weight the observed costs by the inverse of the probability of remaining uncensored, mimicking the Horvitz–Thompson estimator, in which each observation is weighted by the inverse of its probability of being observed [72].

Thus, the basic estimator is [73]

μ̂ = (1/n) Σi δi Mi(Ti) / K̂(Ti),

where δi is the indicator of a complete (uncensored) observation, Mi(t) is the cumulative cost of subject i up to time t, Ti is the observed follow-up time and K̂() is the Kaplan–Meier estimate of the probability of remaining uncensored.

Bang and Tsiatis proposed an improved version that recovers part of the cost history lost to censoring, partitioning the follow-up period into K intervals, as in Lin [74], and estimating the accumulated costs and censoring weights within each interval before summing. The two estimators have been shown to be equivalent [75, 76]. Improved versions of the basic estimators have been proposed in recent work using bootstrap approximations of the confidence interval, which have been shown to have much better coverage accuracy than the normal approximation in the presence of a heavily skewed distribution [77]. When censoring of medical costs is light (<25%), the bootstrap confidence interval based on the simple weighted estimator is preferred for its simplicity and good coverage accuracy. For heavily censored cost data (censoring rate >30%) with larger sample sizes (n > 200), the bootstrap confidence interval based on the partitioned estimator has superior performance in terms of both efficiency and coverage accuracy [77]. Clearly, the adopted definition of censoring (Table 2) affects the estimates and their precision. In the diabetes cohort (Table 6), mean costs are highly overestimated when Dudley's definition is used. Conversely, ignoring censoring is also a source of bias, underestimating the actual average costs (Table 6).
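The simple weighted estimator can be sketched on simulated data: a Kaplan–Meier estimate of the censoring distribution supplies the weights, and an administrative limit keeps them bounded (the exponential/uniform rates and the cost accrual of 1000 per unit time are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
L = 3.0                              # administrative limit: target mean cost up to L

# Death times T, censoring times C, costs accrued at 1000 per unit time
T = rng.exponential(1.0, n)
C = rng.uniform(0.0, 4.0, n)         # censoring support extends past L
Tstar = np.minimum(T, L)             # cost accumulation stops at death or at L
X = np.minimum(Tstar, C)             # observed follow-up
delta = (Tstar <= C).astype(float)   # 1 = complete cost history observed
M = 1000.0 * X                       # observed accumulated cost

naive = M.mean()                     # biased downward: ignores censoring

# Kaplan-Meier estimate of the censoring survival K(t) = P(C > t),
# taken left-continuously (just before each subject's follow-up time)
order = np.argsort(X)
Xs, ds, Ms = X[order], delta[order], M[order]
at_risk = np.arange(n, 0, -1)
factors = 1.0 - (1.0 - ds) / at_risk            # < 1 only at censoring events
K_after = np.cumprod(factors)
K_before = np.concatenate(([1.0], K_after[:-1]))

# Simple inverse-probability-of-censoring weighted estimator of mean cost
weighted = np.sum(ds * Ms / K_before) / n

true_mean = 1000.0 * (1.0 - np.exp(-L))         # E[1000 * min(T, L)] for Exp(1)
print(naive, weighted, true_mean)
```

The naive mean of the observed costs falls well below the true mean, while the weighted estimator recovers it; the partitioned refinement would additionally reuse the cost history of censored subjects within each interval.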

Table 6

Estimates of the mean costs for the diabetes cohort according to the various definitions of censoring adopted (Table 2)

Estimator | Mean estimate | SE
Lin estimate (administrative censoring) | 5856 | 249
Cox estimate (death censoring at 4 years) | 33896 | 1249
No-censoring estimatea | 4488.18 | 129.44
  • aGLM with Gamma distribution and log link function.

A popular approach to modelling censored cost data has been to fit survival models on the costs, on the basis of the well-known idea of shifting the application of survival techniques from a 'time' to a 'cost' framework, exploiting the actual similarities between the two domains: (i) non-negative values and (ii) non-decreasing behaviour. After the first paper introducing the use of the Cox model [55], several criticisms were raised pointing out the failure of such a model in accounting for non-proportionality in the cost-accumulation process in the presence of censoring [56]. Non-proportionality occurs when the risk of observing costs greater than any given value does not change proportionally with the covariates' values. In any case, when using survival models in cost analysis we define the survival function of costs as S(c) = P(C > c), the probability that the accumulated cost C exceeds a given value c.

The parametric survival model assuming a Weibull distribution and the Cox regression model relate the hazard at each cumulative cost c to the covariates:

h(c|xi) = h0(c) exp(xi β),

where h(c|xi) is the hazard rate at cost c for an individual i with covariate vector xi, and h0(c) is the baseline hazard function. In the Weibull regression model the baseline h0(c) is assumed to follow a Weibull form, whereas in the Cox model no assumptions are made about the baseline function. The interpretation varies accordingly: in the case of the diabetes data, the estimate of the coefficient for the number of co-morbidities (0.4829, SE 0.0051) indicates that the risk of accumulating further costs increases by a factor of exp(0.4829) ≈ 1.62 for each additional co-morbidity.
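The multiplicative reading of the coefficient can be verified directly from the reported estimate and standard error:

```python
import math

# Cox/Weibull coefficients act multiplicatively on the cost "hazard";
# coefficient and SE for the number of co-morbidities, as reported above
coef, se = 0.4829, 0.0051

hr = math.exp(coef)                 # hazard ratio per additional co-morbidity
ci_low = math.exp(coef - 1.96 * se)
ci_high = math.exp(coef + 1.96 * se)

print(round(hr, 2))                 # 1.62
print(round(ci_low, 2), round(ci_high, 2))
```

Exponentiating the confidence limits on the coefficient scale gives the confidence interval for the hazard ratio itself.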

More flexible models have been introduced, such as the non-parametric additive regression approach, where the risk of having an excess of costs (the hazard function) is not modelled using a fixed functional form. One of the most popular models in this class, the Aalen model [78, 79], assumes that the variables act additively on the risk function. In this case, the model gives an estimate of how the risk of having costs greater than any given value increases or decreases in association with the covariates. In the Aalen model the hazard function can be expressed as

h(c|xi) = β0(c) + β1(c) xi1 + … + βp(c) xip,

where the hazard rate is a linear combination of the covariates and the weights βj(c) are functions estimated from the data. The slope of each cumulative regression function Bj(c) = ∫0^c βj(u) du shows the weight of the corresponding covariate on the hazard function, with the costs on the x-axis; when a covariate has no effect, the corresponding function should be a roughly straight line near zero [80, 81] (see Fig. 5). In such models the interpretation of the regression parameters changes and is no longer related to a (possibly transformed) response, as can be seen with reference to the diabetes cohort (Table 7): the parameters represent the effect of a given covariate on the probability of interrupting the cost-accumulation process after a given cost c has been reached. Note that in the diabetes cohort the standard Grambsch–Therneau test for proportionality also fails.
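A minimal sketch of Aalen's estimator on the cost scale, for a single binary covariate with constant additive hazards (simulated, uncensored data; a real analysis would use a dedicated implementation handling censoring):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500

# Hypothetical data: a binary covariate (e.g. >= 1 co-morbidity) adds 0.5
# to a baseline cost-hazard of 0.5, i.e. h(c|x) = 0.5 + 0.5*x
x = rng.integers(0, 2, n).astype(float)
cost = rng.exponential(1.0 / (0.5 + 0.5 * x))   # total costs, no censoring here

X = np.column_stack([np.ones(n), x])
order = np.argsort(cost)
cs, Xs = cost[order], X[order]

# Aalen's estimator: at each observed "event" (end of cost accumulation),
# increment the cumulative coefficients B(c) by the least-squares solution
# of the event indicator regressed on the at-risk design matrix
B = np.zeros((n, 2))
cum = np.zeros(2)
for i in range(n):
    Xr = Xs[i:]                       # subjects whose total cost is >= cs[i]
    e = np.zeros(len(Xr))
    e[0] = 1.0                        # indicator of the subject with the event
    inc, *_ = np.linalg.lstsq(Xr, e, rcond=None)
    cum = cum + inc
    B[i] = cum

# With constant hazards the cumulative coefficients grow linearly:
# B0(c) ~ 0.5*c (baseline) and B1(c) ~ 0.5*c (covariate effect)
i1 = np.searchsorted(cs, 1.0) - 1
print(B[i1])   # both entries should sit near 0.5 at cost 1.0
```

Plotting the columns of `B` against `cs` reproduces the kind of cumulative-effect curves shown in Fig. 5: a covariate with no effect traces a roughly flat line near zero.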

Table 7

Effects on cost-hazard for different covariates (95% confidence intervals) in the diabetes cohort

Covariate | Estimate (95% CI)a | Estimate (95% CI)a
Age | 1.32 (1.27–1.39) | 1.19 (1.14–1.23)
Gender (M vs. F) | 1.11 (1.04–1.19) | 1.06 (1.00–1.14)
Years | 1.20 (1.14–1.28) | 1.16 (1.10–1.23)
Number of co-morbidities ≥ 1 | 1.64 (1.47–1.82) | 1.47 (1.33–1.61)
  • Grambsch–Therneau test for proportionality 11.7, P = 0.01. Coefficients have been re-parameterized to show a direct effect of the covariates on costs. aCoefficients estimated via a Cox model both considering censoring and ignoring it.

Finally, survival models can also be embedded in a two-part model to account at the same time for zero-costs and censoring [69].


Several regression models for costs have been proposed in recent years, and now the menu of available approaches is much broader. However, there is no unique model that is able to deal with all the problems that can arise in the analysis of cost data. Therefore, the final choice depends on the type and design of the study.

On the other hand, adopting an approach based on heavy modelling of the response variable, with the aim of reducing bias in the cost estimates, is also questionable, because it tends to produce results that are likely to be model dependent. Recent work has addressed this issue from the perspective of study design [82, 83], reducing heteroscedasticity in particular through more careful study design and planning. Nevertheless, the strongest tool against over-modelling is the proper use of model diagnostics, both as exploratory methods for understanding the degree and direction of skewness, and as a way of checking whether the fitted model is efficient and robust in statistical terms.

Our paper shows that several strategies can be taken to approach the issue of modelling costs. The evidence provided is of course anecdotal, being based on a simple set of two studies, but we believe these examples are general enough to allow an easy translation of our considerations to other situations and problems. We offer the following general recommendations. In the case of no censoring and no zero-costs, a vast body of literature now favours the log-gamma GLM model [61]: it retains all the benefits of the log or Box–Cox transformation while avoiding both the troubles in interpreting coefficients and the back-transformation issues. As a different way of avoiding such problems, survival models and the Cox model have been widely discussed in the literature; unfortunately, they have been proven to be biased in the presence of even small departures from the underlying assumptions. Survival models are, however, still of interest within a best-fitting strategy, as shown by some recent works which, using an additive approach to cost data modelling, achieve a fit comparable to the gamma model with increased flexibility in modelling extreme costs [80, 81, 84].

Regarding the zero-cost issue, the two-part or even three-part model is perhaps the most informative, even if the issue of best fit for each part of the model is crucial: inefficient estimates can also heavily affect the estimated marginal effects of covariates. Without entering into the huge amount of work done on the appropriate treatment of censored data, the Lin-based estimators outperform the simpler solutions based on survival techniques.

In any case, the paper clearly shows that alternatives exist to the very simple linear regression model [85]. Given that model's weaknesses, the biases it may introduce and its inability to cope efficiently with the characteristic challenges posed by cost variables, its still widespread use in the applied literature on cost estimation in health care is hardly justified.


The study was financially supported by an unrestricted grant of the Compagnia di San Paolo (Torino).

